MobileRead Forums - View Single Post

Tex2002ans · 06-26-2014, 03:23 AM

I do a lot of PDF -> EPUB conversions, so there are A TON of inconsistent formatting errors just like this. In my case, the best bet is to pull all of the punctuation out of the edges of the italics, and then add it back on a case-by-case basis if need be.

These are the two sets of Regex that I personally use.

Regex #1:

Search: <i>([‘“\(])
Replace: \1<i>

<i> can be replaced with <span class="italics"> if needed.

The bracketed set currently has a "left single quote" + "left double quote" + "left parenthesis". You can replace that set with whatever characters you want moved outside of the italics. Perhaps you might want to toss a space in there as well.
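For anyone who would rather script this than run it in an editor, here is a minimal Python sketch of Regex #1. It assumes the italics are marked with a plain <i> tag; the function name is just made up for illustration.

```python
import re

# Assumption: italics use a plain <i> tag. Swap in <span class="italics">
# in both the pattern and the replacement if that is what your files use.
OPENING = re.compile(r"<i>([‘“(])")

def pull_opening_punctuation(html: str) -> str:
    """Move one opening quote or parenthesis from just inside <i> to just before it."""
    return OPENING.sub(r"\1<i>", html)

print(pull_opening_punctuation("<p><i>“Moby-Dick</i>”</p>"))
```

Each call only moves one character per tag, so stacked punctuation (a double quote followed by a single quote, say) needs more than one pass, same as hitting Replace All again.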

Regex #2:

Search: ([;’”,\)\]\.])</i>
Replace: </i>\1

</i> can be replaced with </span> if needed.

The bracketed set currently has a "semicolon" + "right single quote" + "right double quote" + "comma" + "right parenthesis" + "right bracket" + "period". You can replace that set with whatever characters you want moved outside of the italics. Perhaps you might want to toss a space and a colon in there as well.

Note: I tend to avoid sticking a colon in this Regex. I would take care of those manually later. I personally find a hell of a lot more false positives on the colon than all the other punctuation.
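A rough Python equivalent of Regex #2, again assuming a plain </i> closing tag (the function name is hypothetical). The colon is left out of the set on purpose, per the note above.

```python
import re

# Assumption: italics close with </i>. Use </span> instead if your files
# mark italics with a span.
CLOSING = re.compile(r"([;’”,)\].])</i>")

def pull_closing_punctuation(html: str) -> str:
    """Move one trailing punctuation mark from just inside </i> to just after it."""
    return CLOSING.sub(r"</i>\1", html)

print(pull_closing_punctuation("<p>See <i>Walden,</i> chapter 2</p>"))
```

Keep in mind that the right single quote is also the apostrophe, so something like a trailing possessive inside italics will get pulled out too.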

Warnings on Regex #2:

If you use a lot of OTHER spans, this search/replace may pull out a lot of punctuation you don't want to move. You may just want to decide on a case-by-case basis instead of Replace All.

Side Note: In all of my work, I have everything stripped down to just bold and italics, THEN I run the Regex. I add in more complex classes at a much later step, so I personally have zero extraneous spans lying around.

If you use a lot of named entities ("&nbsp;"), Regex #2 will break them ("&nbsp</i>;"). You may just want to run this Search/Replace afterwards:

Search: &nbsp</i>;
Replace: &nbsp;</i>

Or, depending on how many of these you have in your source material, you may just want to remove the semicolon from Regex #2.
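Here is a small Python sketch of the breakage and the repair, assuming <i> tags and the &nbsp; entity (function names are made up). It only handles &nbsp;, so any other named entity sitting right before a closing tag would need its own repair pattern.

```python
import re

# Assumption: italics close with </i>, and &nbsp; is the entity at risk.
CLOSING = re.compile(r"([;’”,)\].])</i>")
BROKEN_NBSP = re.compile(r"&nbsp</i>;")

def pull_closing_punctuation(html: str) -> str:
    """Regex #2: move one trailing punctuation mark outside </i>."""
    return CLOSING.sub(r"</i>\1", html)

def repair_nbsp(html: str) -> str:
    """Put the semicolon back on any &nbsp; that Regex #2 split apart."""
    return BROKEN_NBSP.sub("&nbsp;</i>", html)

broken = pull_closing_punctuation("<i>word&nbsp;</i>")
print(broken)               # the entity has lost its semicolon
print(repair_nbsp(broken))  # entity restored inside the italics
```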

How I use it:

I tend to just run these two Regex as Replace All multiple times (it helps me sometimes spot errors depending on how many times I have to Replace).

If I Replace All more than 4 times, and am STILL replacing characters, I know that there must be some sort of formatting I have to keep an eye out for. (For example, in some TOCs or tables, rows of periods run straight across the page.)

Example:

If I Replace All six times, it may lead me to this:

Code:

<td><span class="italics">1.04....</span>......</td>

Which I can then just easily fix up.
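The "keep hitting Replace All and count the passes" habit can be sketched in Python like this. It assumes plain <i> tags, and the function name, the max_passes cap, and the "more than 4" check are only there for illustration.

```python
import re

# Assumption: italics use plain <i>...</i> tags.
OPENING = re.compile(r"<i>([‘“(])")
CLOSING = re.compile(r"([;’”,)\].])</i>")

def replace_all_until_stable(html: str, max_passes: int = 20):
    """Run both regexes repeatedly; return the text and how many passes changed it."""
    passes = 0
    while passes < max_passes:
        html, n_open = OPENING.subn(r"\1<i>", html)
        html, n_close = CLOSING.subn(r"</i>\1", html)
        if n_open + n_close == 0:
            break  # nothing left to move
        passes += 1
    return html, passes

cleaned, passes = replace_all_until_stable("<td><i>1.04......</i>....</td>")
if passes > 4:
    print(f"{passes} passes needed, worth a manual look: {cleaned}")
```

A high pass count does not mean the result is wrong, only that there is a long run of punctuation somewhere (like the row of periods in a TOC) that deserves a look.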

Quote:

Originally Posted by roger64

Questions
How to detect an orphaned parenthesis?
How to detect an unmatched pair? (I mean both straight and italic)

Once you get everything stripped down and cleaned up using the Regex above, THEN it should be pretty easy to use a multitude of Regex to reinsert the punctuation into the italics if needed.

For example:

Search: [\(]<i>([^<]+)</i>[\)]
Replace: <i>(\1)</i>

would catch this:

Code:

(<i>Example words inside parenthesis</i>)

and change it into this:

Code:

<i>(Example words inside parenthesis)</i>
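The same reinsertion step as a short Python sketch, assuming plain <i> tags (the function name is hypothetical). Because of the [^<]+ part, it only fires when the italics contain no other tags, and only when the parentheses hug the italics on both sides.

```python
import re

# Assumption: italics use plain <i>...</i> tags with no nested markup inside.
WRAPPED = re.compile(r"\(<i>([^<]+)</i>\)")

def pull_parentheses_inside(html: str) -> str:
    """Turn (<i>text</i>) into <i>(text)</i>."""
    return WRAPPED.sub(r"<i>(\1)</i>", html)

print(pull_parentheses_inside("(<i>Example words inside parenthesis</i>)"))
```

Anything like (<i>partly italic</i> and then plain text) is left alone, which is usually what you want.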