MobileRead Forums > E-Book Formats > ePub

Old 06-26-2014, 01:31 AM   #1
roger64
Wizard
Posts: 1,409
Karma: 846401
Join Date: Jan 2009
Device: KoboGlo
About parentheses

Hi

Most of the time (there are some exceptions), parentheses should come as a matching pair, with an opening and a closing one. "Matched" also means that we should see two straight ones, or sometimes two italic ones, but never a straight one on one side and an italic one on the other (or the reverse).

As I do not know how to fix this mistake in a word processor, it ends up in the EPUB. It could be part of a larger problem, because it potentially affects not only round brackets (parentheses) but also square, curly and angle brackets.

Here is a text excerpt with some mistakes of this kind I'd like to know how to detect and/or correct. The culprit can be found three times after the word "Soir".

Code:
Charles Baudinat (de <i>France-Soir)</i>, Vladimir Bentz à Berlin, Danièle Berthemet, Sabine Cayrol, Henriette Chandet (du fin fond de <i>Paris-Soir),</i> Jacques Chapus (<i>France Soir), </i>Jacques-Olivier Chatard à Londres,
The correct form should be:

Code:
 (de <i>France-Soir</i>),
Note: Instead of <i> and </i> as here, we often find a span:
<span class="italic"> and </span>


Questions
How to detect an orphaned parenthesis?
How to detect an unmatched pair? (I mean both straight and italic)

Last edited by roger64; 06-26-2014 at 02:27 AM.
Old 06-26-2014, 02:23 AM   #2
Tex2002ans
Evangelist
Posts: 445
Karma: 360271
Join Date: Jul 2012
Device: Nook
I tend to work from a lot of PDF -> EPUB, so there are A TON of inconsistent formatting errors just like this... so in my case, the best bet is to just pull out all of the punctuation on the edges of the italics, and then add them on a case-by-case basis if need be.

These are the two sets of Regex that I personally use.

Regex #1:

Search: <span class="italics">([‘“\(])
Replace: \1<span class="italics">

The tag <span class="italics"> (highlighted in red in the original post) can be replaced with <i> if needed.

The character class [‘“\(] (highlighted in blue) currently has a "left single quote" + "left double quote" + "left parenthesis". You can replace that section with whatever characters you want moved outside of the italics. Perhaps you might want to toss a space in there as well.

Regex #2:

Search: ([;’”,\)\]\.])</span>
Replace: </span>\1

The tag </span> (highlighted in red in the original post) can be replaced with </i> if needed.

The character class [;’”,\)\]\.] (highlighted in blue) currently has a "semicolon" + "right single quote" + "right double quote" + "comma" + "right parenthesis" + "right bracket" + "period". You can replace that section with whatever characters you want moved outside of the italics. Perhaps you might want to toss a space in there as well, plus a colon.

Note: I tend to avoid sticking a colon in this Regex. I would take care of those manually later. I personally find a hell of a lot more false positives on the colon than all the other punctuation.
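For anyone who wants to try the two searches outside an editor, here is a minimal Python sketch of Regex #1 and Regex #2 exactly as written above. The function name pull_punctuation_out and its parameters are made up for illustration, and using Python's re module is an assumption (the post itself only talks about search/replace in an editor), so treat it as a sketch rather than a tested workflow.

```python
import re

def pull_punctuation_out(html, open_tag='<span class="italics">', close_tag='</span>'):
    """Apply Regex #1 and Regex #2 from the post, one pass each.

    open_tag / close_tag are hypothetical parameters so the same
    patterns can be reused with <i> ... </i>.
    """
    # Regex #1: opening punctuation right after the opening tag moves in front of it.
    html = re.sub(re.escape(open_tag) + r'([‘“\(])',
                  lambda m: m.group(1) + open_tag, html)
    # Regex #2: closing punctuation right before the closing tag moves behind it.
    html = re.sub(r'([;’”,\)\]\.])' + re.escape(close_tag),
                  lambda m: close_tag + m.group(1), html)
    return html

sample = 'Charles Baudinat (de <i>France-Soir)</i>, Vladimir Bentz'
fixed = pull_punctuation_out(sample, '<i>', '</i>')
print(fixed)  # Charles Baudinat (de <i>France-Soir</i>), Vladimir Bentz
```

A single call only moves one character per tag, which is why the post runs Replace All several times.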

Warnings on Regex #2:

If you use a lot of OTHER spans, this search/replace may pull out a lot of punctuation you don't want to move. You may just want to decide on a case-by-case basis instead of Replace All.

Side Note: In all of my work, I have everything stripped down to just bold and italics, THEN I run the Regex. I add in more complex classes at a much later step, so I personally have zero extraneous spans lying around.

If you use a lot of named entities ("&nbsp;"), Regex #2 will break them ("&nbsp</span>;"). You may just want to run this Search/Replace afterwards:

Search: nbsp</span>;
Replace: nbsp;</span>

Or depending on how many of these you have in your source material, you may just want to remove the semicolon from the Regex #2.
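The entity problem and the repair step can be reproduced in a few lines of Python. This is only a sketch: move_out_once and repair_nbsp are hypothetical helper names, and the sample string is invented for the demonstration.

```python
import re

# Regex #2 from the post.
CLOSE_PUNCT = re.compile(r'([;’”,\)\]\.])</span>')

def move_out_once(html):
    """One Replace All pass of Regex #2."""
    return CLOSE_PUNCT.sub(r'</span>\1', html)

def repair_nbsp(html):
    """The follow-up Search/Replace from the post, done as a literal replace."""
    return html.replace('nbsp</span>;', 'nbsp;</span>')

original = '<span class="italics">vol.&nbsp;</span>'
broken = move_out_once(original)    # the entity's semicolon is pulled out of the span
repaired = repair_nbsp(broken)      # the semicolon goes back where it belongs
print(broken)
print(repaired)
```

Dropping the semicolon from the character class, as suggested above, avoids the breakage in the first place, at the cost of leaving real semicolons inside the italics.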

How I use it:

I tend to just run these two Regex as Replace All multiple times (it helps me sometimes spot errors depending on how many times I have to Replace).

If I Replace All more than 4 times, and am STILL replacing characters, I know that there must be some sort of formatting I have to keep an eye out for. (For example, in some TOC or tables, periods go straight across the page.)

Example:

If I Replace All six times, it may lead me to this:

Code:
<td><span class="italics">1.04....</span>......</td>
Which I can then just easily fix up.
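The "count how many times I had to Replace All" habit can be sketched in Python with re.subn, which reports how many replacements a pass made. The function name replace_all_until_stable and the max_passes cap are assumptions for illustration, and the table row is a made-up input chosen so that six passes reproduce the example above.

```python
import re

# Regex #2 from the post, compiled once.
CLOSE_PUNCT = re.compile(r'([;’”,\)\]\.])</span>')

def replace_all_until_stable(html, max_passes=20):
    """Repeat Replace All until a pass changes nothing or max_passes is reached.

    Returns the text and the number of passes that actually replaced something.
    """
    passes = 0
    while passes < max_passes:
        html, count = CLOSE_PUNCT.subn(r'</span>\1', html)
        if count == 0:
            break
        passes += 1
    return html, passes

# A TOC-style row with ten dot leaders inside the italics (hypothetical input).
toc_row = '<td><span class="italics">1.04' + '.' * 10 + '</span></td>'
partial, passes = replace_all_until_stable(toc_row, max_passes=6)
print(passes, partial)
if passes > 4:
    print('More than 4 passes: look for dot leaders or similar formatting.')
```

Stopping after six passes leaves four periods inside the span and six outside, which is the state shown in the example; letting it run to the end would pull all ten out.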

Quote:
Originally Posted by roger64 View Post
Questions
How to detect an orphaned parenthesis?
How to detect an unmatched pair? (I mean both straight and italic)
Once you get everything stripped down and cleaned up using the Regex above, THEN it should be pretty easy to use a multitude of Regex to reinsert the punctuation into the italics if needed.

For example:

Search: [\(]<i>([^<]+)</i>[\)]
Replace: <i>(\1)</i>

would catch this:

Code:
(<i>Example words inside parenthesis</i>)
and change it into this:

Code:
<i>(Example words inside parenthesis)</i>
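In Python, the reinsertion step looks like the sketch below. It also includes a detection-only check for the quoted question about unmatched pairs; that part is NOT from the post. The two patterns OPEN_OUT_CLOSE_IN and OPEN_IN_CLOSE_OUT and both function names are hypothetical, they only handle <i> tags, and they assume no nested tags or nested parentheses in between.

```python
import re

def reinsert_parens(html):
    """The search/replace from the post: (<i>text</i>) becomes <i>(text)</i>."""
    return re.sub(r'[\(]<i>([^<]+)</i>[\)]', r'<i>(\1)</i>', html)

# Hypothetical detection patterns (assumptions, not part of the post):
# "(" outside the italics but ")" inside, e.g. (de <i>France-Soir)</i>
OPEN_OUT_CLOSE_IN = re.compile(r'\([^<()]*<i>[^<()]*\)[^<()]*</i>')
# "(" inside the italics but ")" outside, e.g. <i>(France-Soir</i>)
OPEN_IN_CLOSE_OUT = re.compile(r'<i>[^<()]*\([^<()]*</i>[^<()]*\)')

def find_mismatched_parens(html):
    """Return every snippet where one parenthesis is italic and its partner is not."""
    hits = []
    for pattern in (OPEN_OUT_CLOSE_IN, OPEN_IN_CLOSE_OUT):
        hits.extend(m.group(0) for m in pattern.finditer(html))
    return hits

excerpt = ('Charles Baudinat (de <i>France-Soir)</i>, Henriette Chandet '
           '(du fin fond de <i>Paris-Soir),</i> Jacques Chapus '
           '(<i>France Soir), </i>Jacques-Olivier Chatard')
for hit in find_mismatched_parens(excerpt):
    print(hit)
```

On the excerpt from the first post this flags the three cases after "Soir" and reports nothing for the corrected form, so it can be used as a review list before deciding what to change.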

Last edited by Tex2002ans; 06-26-2014 at 02:40 AM.
Old 06-26-2014, 02:48 AM   #3
roger64
@Tex2002ans

WOW! Thanks a lot for this!!

I dreamt about it. Tex did it!

I'll use it for sure (I hope the regexes are compatible with the Calibre editor?)
Old 06-26-2014, 03:08 AM   #4
Tex2002ans
Quote:
Originally Posted by roger64 View Post
I'll use it for sure (I hope the regexes are compatible with the Calibre editor?)
No idea, but I use those two Regexes in Sigil on nearly every single book. So they are truly tested (220+ books later, they have not let me down).

Sigil auto-cleans up the spaces before the closing tags, so that is why I don't actually include them in the Regex. I am not too sure what Calibre does when it cleans the code.
Old 06-26-2014, 03:22 AM   #5
roger64
OK.

One of the good things about the Calibre editor is that it systematically gets rid of the unloved &nbsp; and replaces it with the literal no-break space character (U+00A0).
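The same idea can be sketched in Python for files handled outside an editor. This is not a description of how Calibre does it; the function name entities_to_nbsp and the list of entity spellings covered are assumptions.

```python
import re

# Named, decimal and hexadecimal spellings of the no-break space entity.
NBSP_ENTITY = re.compile(r'&(?:nbsp|#160|#x[aA]0);')

def entities_to_nbsp(html):
    """Replace no-break space entities with the literal U+00A0 character."""
    return NBSP_ENTITY.sub('\u00a0', html)

converted = entities_to_nbsp('<span class="italics">vol.&nbsp;</span>3')
print(repr(converted))
```

Once the entity is gone there is no trailing semicolon before </span>, so a search like ([;’”,\)\]\.])</span> has nothing to break at that spot.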

Last edited by roger64; 06-26-2014 at 04:02 AM.
Old 06-26-2014, 05:33 AM   #6
mzmm
Groupie
Posts: 156
Karma: 86115
Join Date: Feb 2012
Device: iPad, Kindle Touch, Sony PRS-T1
Quote:
Originally Posted by Tex2002ans View Post
Search: <span class="italics">([‘“\(])
Replace: \1<span class="italics">

Search: ([;’”,\)\]\.])</span>
Replace: </span>\1

Search: [\(]<i>([^<]+)</i>[\)]
Replace: <i>(\1)</i>
just thought i'd throw in that you don't need to escape most metacharacters inside a character class, so you could rewrite


([‘“\(])
([‘“(])

([;’”,\)\]\.])
([];’”,).]) <-- a ] placed first in the class is taken as a literal character, so it does not close the class

[\(]<i>([^<]+)</i>[\)]
[(]<i>([^<]+)</i>[)]
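A quick way to convince yourself that the escaped and unescaped classes behave the same is to test them character by character. The sketch below uses Python's re module, which is an assumption; other engines may treat a leading ] differently (in JavaScript, for instance, [] is an empty class), so check your own tool before relying on it.

```python
import re

# Escaped class from the original Regex #2 and the unescaped rewrite.
ESCAPED = re.compile(r'[;’”,\)\]\.]')
UNESCAPED = re.compile(r'[];’”,).]')

# Characters that should match, plus a few that should not: ( [ a and a backslash.
candidates = ';’”,)].([a\\'
same = all(bool(ESCAPED.fullmatch(c)) == bool(UNESCAPED.fullmatch(c))
           for c in candidates)
print(same)

# The same comparison for the class from Regex #1.
same_open = all(bool(re.fullmatch(r'[‘“\(]', c)) == bool(re.fullmatch(r'[‘“(]', c))
                for c in '‘“()a')
print(same_open)
```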
Old 07-01-2014, 09:09 PM   #7
Tex2002ans
Quote:
Originally Posted by mzmm View Post
just thought i'd throw in that you don't need to escape most metacharacters inside a character class, so you could rewrite

[...]
Thanks a lot for the info.

That Regex was just one of the things I created WAYYYY back when I first started figuring out Regex, and since it continued to work so well, I just didn't mess with it. And better to be safe with escapes than sorry!

In the past few days I actually stumbled across a few cases of left and right brackets, '[' and ']', which might have to be added to Regex #1 and Regex #2.

ALSO, there is the odd case I forgot to mention of the wrong punctuation being italicized (a QUITE common OCR error). For example,

Quote:
<p>Stigler, George. 1961. “The Economics of Information.<span class="italics">” Journal of Political Economy</span> 69.</p>
As you can see here, the RIGHT double quote is included in the italics, but isn't in my Regex #1.

I typically tackle these on a case-by-case basis at a later date (sometimes I can spot other errors when this occurs). For example, quite often a quotation mark can be the wrong way around, OR, the "smart punctuation" algorithm went haywire, and an anomaly occurred.
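Cases like this can at least be listed for manual review. The sketch below is an assumption-laden illustration, not part of the workflow described here: the two patterns, the labels and the function name flag_suspect_boundaries are all hypothetical. It flags a closing mark sitting right after an opening italics tag, or an opening mark sitting right before a closing tag.

```python
import re

# Closing punctuation immediately after the opening tag (wrong way around).
CLOSER_AFTER_OPEN_TAG = re.compile(r'<span class="italics">([’”)\]])')
# Opening punctuation immediately before the closing tag (wrong way around).
OPENER_BEFORE_CLOSE_TAG = re.compile(r'([‘“(\[])</span>')

def flag_suspect_boundaries(html):
    """Return (label, character, offset) for each suspicious tag boundary."""
    suspects = []
    for label, pattern in (('closer after opening tag', CLOSER_AFTER_OPEN_TAG),
                           ('opener before closing tag', OPENER_BEFORE_CLOSE_TAG)):
        for m in pattern.finditer(html):
            suspects.append((label, m.group(1), m.start()))
    return suspects

line = ('<p>Stigler, George. 1961. “The Economics of Information.'
        '<span class="italics">” Journal of Political Economy</span> 69.</p>')
for suspect in flag_suspect_boundaries(line):
    print(suspect)
```

It only reports locations and changes nothing, which fits the case-by-case approach: a flagged spot may be a misplaced quote, a quote facing the wrong way, or a smart-punctuation glitch.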
All times are GMT -4.

