11-11-2014, 06:28 PM | #1 | |
Chief Bohemian Misfit
Posts: 571
Karma: 462964
Join Date: May 2013
Device: iPad, ADE
|
Using regex for more elegant hyphenation and word wrap
Wow, I only just learned today what "regex" means -- I've seen it here and there in different programs, but never had a clue what exactly it was for until now (duh) -- and what world of possibilities it might open up for me in simplifying a couple of things that I've been very laboriously doing "manually" so far. I've been reading up all afternoon on regex, though, and I'm still confused on how to go about doing what I want to do, so I hope someone out there can help me come up with the right regex expressions to use.
Basically there are two separate things that I've been doing in order to make my books a little more "elegant," typographically.
PROBLEM #1 - More Selective Hyphenation
I'm primarily an iPad user (forgive me), but I hate the way that it automatically hyphenates words willy-nilly all over the place, even shorter words that don't need it. What I did to counter that was initially turn hyphenation off in my book completely, by adding this in my styles (wherever I wanted hyphenation to be turned off)...
Code:
-webkit-hyphens:none; -epub-hyphens:none; -moz-hyphens:none; adobe-hyphenate: none; hyphens:none;
Code:
.hyph { hyphens:auto; -webkit-hyphens:auto; -epub-hyphens:auto; -moz-hyphens:auto; adobe-hyphenate: auto; }
Code:
<p>Here's a paragraph with an <span class="hyph">unreasonably</span> long word.</p>
What I'd like to search for is something to the effect of this...
[space] + [a word with at least 8 characters] + [a space OR any number of alphanumeric characters]
...and then for the replace function I want to wrap <span class="hyph"></span> around the 8+ character word and -- if it's not too ridiculous a thing to ask -- ALSO any punctuation marks that might come right after it. If a space comes right after the word, though, then just close the span right after the word. If that latter part is getting too weird, then wrapping it around just the word would be fine, too.
The point of searching for a [space] before the word is that if, say, it's a long word at the very beginning of a paragraph (<p>), then obviously that doesn't need to be hyphenated (unless the first word happened to be "supercalifragilisticexpialidocious" or something).
Does that make sense, what I'm trying to do here? I'm having some problems grasping this regex stuff more generally, just for starters, but the biggest thing I can't figure out is how to search for words that are 8 characters or longer (and ignore all shorter words).
PROBLEM #2 - Selectively Preventing Word Wrap
Another "typographically-annoying" thing is whenever a line happens to end with the first word of a new sentence (or of a phrase after a punctuation mark) and that word is a single-letter word -- which, as far as I can come up with, would be "I" or "A" or "a", or, in rarer instances, "O". Here's a made-up example of an especially annoying paragraph...
Quote:
[any punctuation mark] + [space] + "I" + [space OR punctuation mark + a space] + [word of 5 characters or less, but not longer]
...and then replace that by wrapping the "I" and the following word (if it's 5 characters or less) with <span class="nowrap"></span>, where the "nowrap" class is...
Code:
.nowrap { white-space: nowrap; }
I hope you all don't think I'm crazy for nit-picking over hyphenation and word wrap like this, but, well, maybe I actually am crazy. Nevertheless, I've been doing this "manually" all along so far, and wow, what an enormous time saver it would be if I could come up with a regex that could do this with a simple search & replace instead! I spent the whole afternoon trying to figure this out, though, and I just can't seem to come up with how to do it, or what expressions I would use. Can anyone help?
Last edited by Psymon; 11-11-2014 at 06:31 PM. |
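For readers who want to experiment with the Problem #2 pattern outside an editor, here is a minimal sketch in Python of one possible reading of it. Everything named here (`NOWRAP`, `add_nowrap`, the exact set of punctuation marks) is an assumption made for illustration, not something from the post, and the pattern is meant to be run on paragraph text rather than on a whole XHTML file.

```python
import re

# Hypothetical pattern, one reading of the description above:
# [punctuation] [space] [I/A/a/O] [optional punctuation] [space] [word of 1-5 characters]
NOWRAP = re.compile(
    r"(?<=[.,;:!?] )"        # must come right after a punctuation mark and a space
    r"([IAaO][,;:!?]? "      # the one-letter word, optional punctuation, then a space
    r"[\w'\u2019]{1,5})"     # the following word, at most 5 characters (apostrophes allowed)
    r"(?![\w'\u2019])"       # the word must end here, so longer words are skipped
)

def add_nowrap(text):
    """Wrap 'I went', 'a bit', etc. in a nowrap span (illustrative helper)."""
    return NOWRAP.sub(r'<span class="nowrap">\1</span>', text)

print(add_nowrap("He left. I went home, a bit tired. A considerable delay followed."))
```

The lookbehind keeps the punctuation and space outside the span, so only the one-letter word and its short neighbour are wrapped. "A considerable" is left alone because the second word is longer than 5 characters.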
|
11-11-2014, 07:45 PM | #2 | ||||||
Wizard
Posts: 2,297
Karma: 12126329
Join Date: Jul 2012
Device: Kobo Forma, Nook
|
Quote:
The ONLY spot I can MAYBE see disabling hyphenation being useful, is if you wanted to disable it in headings. Besides that, it is not recommended. I DEFINITELY don't recommend disabling it everywhere, and ENABLING it on certain words (if anything, you would do the exact opposite).
Side Note: This reminds me a lot of the soft-hyphenation talk. There is even that Calibre Plugin, "Hyphenate This!", which is dedicated towards adding in soft-hyphens everywhere under the sun: https://www.mobileread.com/forums/sho...d.php?t=208534
Here is more talk on the soft-hyphen problem: https://www.mobileread.com/forums/sho...d.php?t=230358
Best to leave hyphenation as a user choice, which they can either enable/disable in their reader.
Other Side Note: Here is Wikipedia's page on Widows/Orphans (italics mine): https://en.wikipedia.org/wiki/Widows_and_orphans
Quote:
Quote:
IF, and that is a big IF, you wanted that much control over the look, you might as well just go "Fixed Format". (Which brings along its own host of problems). Quote:
http://www.regular-expressions.info/tutorial.html
To do the above, you would want something along these lines:
Search: (\b\w{8,}\b)
Replace: <span class="hyph">\1</span>
\b = a "Word Boundary", you can read up on that here: http://www.regular-expressions.info/wordboundaries.html
\w = any "Word Character", you can read up on that here: http://www.regular-expressions.info/shorthand.html
{8,} = 8 or more of the preceding item (here, Word Characters)
So, in English, this says "Find a Word Boundary, then any 8 or more Word Characters in a row, followed by another Word Boundary". Since the entire thing is surrounded by parentheses, this says, stick this entire thing in capture group \1. Then take everything in \1, and "wrap that entire thing with <span class="hyph"></span>".
Quote:
Look, at a certain point, you have to accept that reflowable ebooks ARE NOT PRINT.
#1: Give up trying to make them print.
#2: The EPUB standards are just not there to support a lot of the complex typographical decisions.
If you want to do all of that typographical nitpicking in EPUB, you will have to go Fixed Format, OR, just create a PDF using whatever tools (LaTeX, Quark, InDesign, etc. etc.).
Side Note: For example, in French Typography, there seems to be this weird rule of "the last line of a paragraph should not be shorter than the double of the indentation of the next paragraph.": https://tex.stackexchange.com/questi...ne/28361#28361
These sorts of weird conventions are just NOT POSSIBLE in EPUB.
Other Side Note: Sometimes, I really wonder how typographers survive on the Internet. Their eyeballs/brains must be going crazy from websites not following typography rules, and users reading at all different font sizes + device/monitor sizes. Where they want pixel/mm perfect typography, the rest of us want resizable/reflowable/customizable.
Quote:
Oh yeah, Regex is a super time saver. Now I can do stuff in a few clicks that used to take me many hours (for example, fixing up Indexes, catching typos in page numbers, adding en dashes between numbers, etc. etc.). Last edited by Tex2002ans; 11-11-2014 at 08:20 PM. |
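The search/replace pair suggested in this post, `(\b\w{8,}\b)` with `<span class="hyph">\1</span>`, can be tried out in Python's `re` module, which uses the same `\b`, `\w`, `{8,}` and `\1` syntax. The sketch below is only an illustration: the function names are made up, and the second pattern is a hypothetical variant (not from the thread) that follows the original request more closely by requiring a space before the word and pulling trailing punctuation inside the span. Neither pattern knows anything about markup, so on a full XHTML file they could also match long tag names or attribute values; the sample strings here are plain text.

```python
import re

# The pattern from the post: a run of 8 or more word characters between word boundaries.
LONG_WORD = re.compile(r"(\b\w{8,}\b)")

def wrap_long_words(text):
    """Wrap every 8+ character word in a hyph span."""
    return LONG_WORD.sub(r'<span class="hyph">\1</span>', text)

# Hypothetical variant: only words preceded by a space, plus any punctuation
# that follows the word directly.
AFTER_SPACE = re.compile(r"(?<= )(\w{8,}[.,;:!?]*)")

def wrap_after_space(text):
    """Like wrap_long_words, but skips a word that is not preceded by a space."""
    return AFTER_SPACE.sub(r'<span class="hyph">\1</span>', text)

print(wrap_long_words("Here's a paragraph with an unreasonably long word."))
print(wrap_after_space("Unreasonably long words, unfortunately, get hyphenated."))
```

In the second call, "Unreasonably" is left alone because nothing but the start of the string comes before it, while "unfortunately," and "hyphenated." are wrapped together with their punctuation.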
||||||
|
11-11-2014, 08:25 PM | #3 |
Grand Sorcerer
Posts: 27,546
Karma: 193191846
Join Date: Jan 2010
Device: Nexus 7, Kindle Fire HD
|
I agree entirely with the previous post. If you want to control hyphenation for a personal document on a specific device for your own use, that's one thing. Have at it. But I wouldn't recommend trying to force something that should be a user/device preference. For one thing, you have no way of knowing where "problematic" lines might create unwanted whitespace when you don't know what fonts/fontsizes readers are using.
|
11-11-2014, 09:06 PM | #4 |
Chief Bohemian Misfit
Posts: 571
Karma: 462964
Join Date: May 2013
Device: iPad, ADE
|
For what it's worth, with regard to the hyphenation thing I've been doing with my previous (and in-progress) books, in doing it the way I outlined, and testing it out on the iPad in all font sizes, both in portrait and landscape orientation, everything turned out remarkably well, way better than just leaving things to the "default" (i.e. by not having done what I did at all). Not only is the text more visually/aesthetically pleasing to look at, without a zillion needless hyphenations all over the place, but there's not a single instance of "large white spaces" anywhere. As a result of all that, not only is the book more visually "pleasurable" just to look at (from a design perspective) but overall ease-of-readability is improved, too (because there are fewer hyphenated words all over the place).
I can appreciate both your concerns -- if what you're concerned about was actually happening -- but in fact the exact opposite of what you're concerned about is the end result of doing what I did. |
11-11-2014, 09:28 PM | #5 |
Grand Sorcerer
Posts: 27,546
Karma: 193191846
Join Date: Jan 2010
Device: Nexus 7, Kindle Fire HD
|
|
|
11-11-2014, 09:31 PM | #6 |
Chief Bohemian Misfit
Posts: 571
Karma: 462964
Join Date: May 2013
Device: iPad, ADE
|
Why would that be "troubling"?
|
11-11-2014, 10:13 PM | #7 |
Grand Sorcerer
Posts: 27,546
Karma: 193191846
Join Date: Jan 2010
Device: Nexus 7, Kindle Fire HD
|
I don't believe anyone's qualified to determine what might be "aesthetically pleasing" for anyone other than themselves. Other than some basic formatting and maybe a few simple flourishes/frills, I believe ebook creators should stay (mostly) out of the way of users and their preferred settings on their preferred readers (with regard to basic default body text formatting). They (readers) should be free to make the decisions about what pleases them aesthetically, rather than having it dictated.
I realize not everyone feels that way--and that's fine. I'm not going to harp on it other than what I've already said. |
11-11-2014, 10:20 PM | #8 | |
Chief Bohemian Misfit
Posts: 571
Karma: 462964
Join Date: May 2013
Device: iPad, ADE
|
Quote:
Of course, that latter, as you say, is a judgement call -- but as I said, I don't know too many people (if any) who think that more hyphenation is better, just because you (or the software) "can." |
|
11-11-2014, 10:23 PM | #9 |
Wizard
Posts: 2,297
Karma: 12126329
Join Date: Jul 2012
Device: Kobo Forma, Nook
|
What about non-iPad devices, what about iPad #X (future device)?
What about iPhone? I could see hyphenation enabling/disabling causing a much worse problem when you don't have much left/right space available to fit many characters. What about someone who reads on an iPhone with huge margins on the edges? What about someone that reads with no margins? What about someone who chooses their own fonts? What about someone who wants to read in Marvin instead? *Insert huge list of changes/questions here*.
What if they want to convert this EPUB to Format X and read it in Program XYZ? Your hyphenation code will most likely not be transferred over at all. (And at worst, might get in the way/break something else... for example, soft-hyphens, while they "look nice", break search functionality in many devices).
You should stay as out of the way of the user/reader as possible, and give only very general guidance with your CSS to tell the device how to treat the book.
It goes back to the ol' argument between "specific" versus "broad" fixes. Your problem is with the CURRENT iBooks hyphenation algorithm. So you add in all of these SPECIFIC manual tweaks, to try to make it look "more aesthetically pleasing"... but the real fix should be at the READER/DEVICE level: complaining to iBooks to update/tweak their hyphenation algorithm!
Or heck, what if I did want to read WITHOUT hyphenation? Your code will get in my way. (And I DO disable any and all hyphenation when I read, because I want to catch actual typos/errors, and not see the auto soft-hyphens all over the place). What if I wanted to read left-aligned text, with no hyphenation (maybe it reminds me of MobileRead posts!)? You enabling/disabling hyphenation with individual spans would make me angry.
Last edited by Tex2002ans; 11-11-2014 at 10:37 PM. |
11-11-2014, 10:29 PM | #10 | |
Grand Sorcerer
Posts: 27,546
Karma: 193191846
Join Date: Jan 2010
Device: Nexus 7, Kindle Fire HD
|
Quote:
|
|
11-11-2014, 10:39 PM | #11 | |
Chief Bohemian Misfit
Posts: 571
Karma: 462964
Join Date: May 2013
Device: iPad, ADE
|
Quote:
Well... have you got a brilliant regex script for me that will get rid of every instance of <span class="hyph"></span> without getting rid of the words in-between, and without getting rid of any other spans? Then I'll just allow hyphenation everywhere and let the software do whatever it wants (which it does), willy-nilly all over the place. Don't get me wrong, I'll take your (plural) advice, but I do actually think this totally, totally sucks. Of course, I also felt it sucked the other way, too, which compelled me to add in all that extra coding in the first place. :/ |
|
11-11-2014, 10:48 PM | #12 |
Grand Sorcerer
Posts: 12,160
Karma: 73448616
Join Date: Nov 2007
Location: Toronto
Device: Nexus 7, Clara, Touch, Tolino EPOS
|
Try:
Search: <span class="hyph">(.+?)</span>
Replace: \1
Or just revert to a backup you had made of that ePub before making those changes. |
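The same undo step can be sketched in Python, assuming the book's markup is available as a string. The function name `unwrap_hyph` is made up for this example. Two limits worth knowing: `.` does not match a line break by default, so a span that runs across lines is left untouched, and the lazy `(.+?)` stops at the first `</span>` it sees, so a `hyph` span with another span nested inside it would be unwrapped incorrectly.

```python
import re

# The search from the post: a hyph span and (lazily) whatever it contains.
HYPH_SPAN = re.compile(r'<span class="hyph">(.+?)</span>')

def unwrap_hyph(xhtml):
    """Remove hyph spans but keep the text inside them; other spans stay as they are."""
    return HYPH_SPAN.sub(r"\1", xhtml)

sample = (
    '<p>An <span class="hyph">unreasonably</span> long word, '
    '<span class="nowrap">I am</span> sure.</p>'
)
print(unwrap_hyph(sample))
```

Because the class name is spelled out in the pattern, spans with any other class (such as `nowrap` in the sample) are not affected.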
11-11-2014, 11:56 PM | #13 | |
Wizard
Posts: 2,297
Karma: 12126329
Join Date: Jul 2012
Device: Kobo Forma, Nook
|
Quote:
All these readers do just like your typical Word Processor... only change the spacing between WORDS.
Side Note: Well, kerning can be done at the font level, but meh, someone can always choose a new font, and your work goes out the window. Much of this micro-typography is being implemented in CSS3, but meh, I don't know how well it is going to work, or be supported by reading devices.
Or let us take the hyphenation algorithm itself: the number of hyphens in a single paragraph should really be minimized as much as possible (and heaven forbid, two lines in a row MUST NOT end with hyphens). Here, you can see a comparison in justification/hyphenation between Word/InDesign/LaTeX: https://tex.stackexchange.com/questi...esetting-ligat
Then if you REALLY want to get more into the typography rules of hyphenation, there are rules such as "a person's last name SHOULD NOT be hyphenated"... so you have to start wrapping all of those in <span class="donthyphenate">LastName</span>. You should also avoid doing lots of things, because they are "more aesthetically pleasing".... but next thing you know, you have a huge mess of code like InDesign or Word outputs!
As I said, best to leave it up to the device/reader level, rather than get super nitpicky. Too many variables for you to worry about. This is easy if you know the EXACT page size, and the EXACT font, and the EXACT font size (like if you are designing a print book)... but you start changing any of those variables, and most of your hard work goes out the window (or gets in the way/causes problems elsewhere).
Anyway, not stopping you from wanting to do all that hyphenation/nitpickiness.... but please, just don't globally disable/enable hyphenation via CSS. Use it sparingly (as was mentioned, maybe only disable hyphens in a <h1>, <h2>, <h3>).
Last edited by Tex2002ans; 11-12-2014 at 12:15 AM. |
|
11-12-2014, 08:03 AM | #14 | |
Wizard
Posts: 1,539
Karma: 6613969
Join Date: Mar 2013
Location: Rosario - Santa Fe - Argentina
Device: Kindle 4 NT
|
Quote:
Regards |
|
11-12-2014, 10:44 AM | #15 | |
Well trained by Cats
Posts: 29,782
Karma: 54830978
Join Date: Aug 2009
Location: The Central Coast of California
Device: Kobo Libra2,Kobo Aura2v1, K4NT(Fixed: New Bat.), Galaxy Tab A
|
Quote:
But a forced (upon you) cure is (IMHO) never (and having no dash is really terrible).
Take the K4 (I have one). 40px of wasted screen, as MINIMUM l/r margin space, is not pleasing to me. (I have a fragile hack in place that sets it to 10.)
I love my books to look nice, not a stark, Joe Friday style: "Just the words, Ma'am. Just the words." But I also understand that some of the tricks-o-the-trade Typesetters (Cold or Hot) used when letters stayed put no longer apply when the page can flow.
IMHO Fixed layout should be reserved for those extra special cases where free flowing text makes a hash out of the meaning of the work.
I have also seen the mess MRSDK can make using Widows and Orphans. The cure can backfire. IMHO Avoid using force.
<RANT> Apple is NOT the metric for E-Books. Just say i-won't! force the Apple way on others. There are still more other brand-models in use as reading devices than there are from the big(-headed) Apple. </RANT> |
|
Tags |
hyphenation, regex, search & replace, word wrap |
|