About parentheses

roger64 · 06-26-2014, 02:31 AM

Hi

Most of the time (there are some exceptions), parentheses should come as a matching pair, with an opening and a closing one. "Matched" also means that we should see two straight ones, or sometimes two italic ones, but we should not have a straight one on one side and an italic one on the other (or vice versa).

As I do not know how to solve this mistake with a word processor, it lands within the EPUB. It could be part of a larger problem, because it potentially concerns not only round brackets (parentheses) but maybe square, curly and angle brackets too.

Here is a text excerpt with some mistakes of this kind, which I'd like to know how to detect and/or correct. The culprit can be found three times after the word "Soir".

Code:

Charles Baudinat (de <i>France-Soir)</i>, Vladimir Bentz à Berlin, Danièle Berthemet, Sabine Cayrol, Henriette Chandet (du fin fond de <i>Paris-Soir),</i> Jacques Chapus (<i>France Soir), </i>Jacques-Olivier Chatard à Londres,

The correct form should be:

Code:

 (de <i>France-Soir</i>),

Nota: Instead of <i> and </i> like here, we often find a span:
<span class="italic"> and </span>

Questions
How to detect an orphaned parenthesis?
How to detect an unmatched pair? (I mean both straight and italic)

Tex2002ans · 06-26-2014, 03:23 AM

I tend to work from a lot of PDF -> EPUB, so there are A TON of inconsistent formatting errors just like this... so in my case, the best bet is to just pull out all of the punctuation on the edges of the italics, and then add them on a case-by-case basis if need be.

These are the two sets of Regex that I personally use.

Regex #1:

Search: <i>([‘“\(])
Replace: \1<i>

The <i> can be replaced with <span class="italics"> if needed.

The character class [‘“\(] currently has a "left single quote" + "left double quote" + "left parenthesis". You can replace that section with whatever characters you want outside of the italics. Perhaps you might want to toss a space in there as well.

Regex #2:

Search: ([;’”,\)\]\.])</i>
Replace: </i>\1

The </i> can be replaced with </span> if needed.

The character class [;’”,\)\]\.] currently has a "semicolon" + "right single quote" + "right double quote" + "comma" + "right parenthesis" + "right bracket" + "period". You can replace that section with whatever characters you want outside of the italics. Perhaps you might want to toss a space in there as well + a colon.

Note: I tend to avoid sticking a colon in this Regex. I would take care of those manually later. I personally find a hell of a lot more false positives on the colon than all the other punctuation.
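
A minimal Python sketch of the two Search/Replace steps, for anyone who wants to try them outside Sigil. It assumes Regex #1 is anchored on the opening <i> tag and Regex #2 on the closing </i> tag (that tag placement is an assumption), and the function name pull_punctuation_out is hypothetical. It is applied here to part of roger64's excerpt.

```python
import re

# Assumed forms of Regex #1 and Regex #2, anchored on <i> and </i>.
OPENING = re.compile(r"<i>([‘“\(])")
CLOSING = re.compile(r"([;’”,\)\]\.])</i>")


def pull_punctuation_out(html: str) -> str:
    """Move punctuation on the edges of italics outside the tags (one pass)."""
    html = OPENING.sub(r"\1<i>", html)   # opening punctuation goes before <i>
    html = CLOSING.sub(r"</i>\1", html)  # closing punctuation goes after </i>
    return html


sample = "Charles Baudinat (de <i>France-Soir)</i>, Vladimir Bentz à Berlin"
print(pull_punctuation_out(sample))
```

Each pass moves only one character per tag, so a case like "Paris-Soir),</i>" needs two passes. Python does not strip the space in "Soir), </i>" the way Sigil's cleanup does, so that case would need the space removed first.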

Warnings on Regex #2:

If you use a lot of OTHER spans, this search/replace may pull out a lot of punctuation you don't want to move. You may just want to decide on a case-by-case basis instead of Replace All.

Side Note: In all of my work, I have everything stripped down to just bold and italics, THEN I run the Regex. I add in more complex classes at a much later step, so I personally have zero extraneous spans lying around.

If you use a lot of named entities ("&nbsp;"), Regex #2 will break them ("&nbsp</i>;"). You may just want to run this Search/Replace afterwards:

Search: &nbsp</i>;
Replace: &nbsp;</i>

Or depending on how many of these you have in your source material, you may just want to remove the semicolon from the Regex #2.
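
The entity breakage and its repair can be reproduced with a short Python sketch. It assumes Regex #2 is anchored on the closing </i> tag (an assumption), and repair_entities is a hypothetical helper name.

```python
import re

# Assumed form of Regex #2: closing punctuation directly before </i>.
CLOSING = re.compile(r"([;’”,\)\]\.])</i>")


def repair_entities(html: str) -> str:
    """Put the semicolon back onto an &nbsp entity that Regex #2 split."""
    return html.replace("&nbsp</i>;", "&nbsp;</i>")


# The semicolon of the entity is mistaken for punctuation and moved out.
broken = CLOSING.sub(r"</i>\1", "<i>Paris&nbsp;</i>Soir")
print(broken)
print(repair_entities(broken))
```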

How I use it:

I tend to just run these two Regex as Replace All multiple times (it helps me sometimes spot errors depending on how many times I have to Replace).

If I Replace All more than 4 times, and am STILL replacing characters, I know that there must be some sort of formatting I have to keep an eye out for. (Example, in some TOC or tables, periods go straight across the page).

Example:

If I Replace All six times, it may lead me to this:

Code:

<td><span class="italics">1.04....</span>......</td>

Which I can then just easily fix up.
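
The "count how many times Replace All still changes something" habit can be sketched in Python as follows. It assumes Regex #2 is anchored on </i> (an assumption; a span would work the same way), and replace_all_until_stable and its max_passes parameter are hypothetical names. The sample row is made up for illustration.

```python
import re

# Assumed form of Regex #2: closing punctuation directly before </i>.
CLOSING = re.compile(r"([;’”,\)\]\.])</i>")


def replace_all_until_stable(html: str, max_passes: int = 20):
    """Repeat Replace All; return the text and how many passes changed it."""
    passes = 0
    while passes < max_passes:
        html, count = CLOSING.subn(r"</i>\1", html)
        if count == 0:
            break  # nothing left to move
        passes += 1
    return html, passes


# A table-of-contents style row with a run of dots inside the italics.
text, passes = replace_all_until_stable("<td><i>1.04......</i>....</td>")
print(passes, text)
if passes > 4:
    print("Many passes needed: look for dot leaders or similar formatting")
```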

Quote:

Originally Posted by roger64

Questions
How to detect an orphaned parenthesis?
How to detect an unmatched pair? (I mean both straight and italic)

Once you get everything stripped down and cleaned up using the Regex above, THEN it should be pretty easy to use a multitude of Regex to reinsert the punctuation into the italics if needed.

For example:

Search: [\(]<i>([^<]+)</i>[\)]
Replace: <i>(\1)</i>

would catch this:

Code:

(<i>Example words inside parenthesis</i>)

and change it into this:

Code:

<i>(Example words inside parenthesis)</i>
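
The same reinsertion step as a small Python sketch. The tag placement in the pattern is inferred from the before/after example, and pull_parentheses_in is a hypothetical name.

```python
import re

# Parentheses that wrap exactly one italic run, with nothing else in between.
WRAPPED = re.compile(r"[\(]<i>([^<]+)</i>[\)]")


def pull_parentheses_in(html: str) -> str:
    """Move a pair of parentheses hugging an italic run inside the tags."""
    return WRAPPED.sub(r"<i>(\1)</i>", html)


print(pull_parentheses_in("(<i>Example words inside parenthesis</i>)"))
# Left alone, because "de " sits between the parenthesis and the tag:
print(pull_parentheses_in("(de <i>France-Soir</i>),"))
```

Because the pattern requires the parentheses to touch the tags on both sides, roger64's correct form "(de <i>France-Soir</i>)," is not altered.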

roger64 · 06-26-2014, 03:48 AM

@Tex2002ans

WOW! Thanks a lot for this!!

I dreamt about it. Tex did it!

I'll use it for sure (I hope the regexes are compatible with the Calibre editor?)

Tex2002ans · 06-26-2014, 04:08 AM

Quote:

Originally Posted by roger64

I'll use it for sure (I hope the regexes are compatible with the Calibre editor?)

No idea, but I use those two Regexes in Sigil on nearly every single book. So they are truly tested (220+ books later, they have not let me down).

Sigil auto-cleans up the spaces before the closing tags, so that is why I don't actually include them in the Regex. I am not too sure what Calibre does when it cleans the code.

roger64 · 06-26-2014, 04:22 AM

OK.

One of the good things about the Calibre editor is that it systematically gets rid of the unloved &nbsp; to replace it with its UTF-8 equivalent.

mzmm · 06-26-2014, 06:33 AM

Quote:

Originally Posted by Tex2002ans

Search: <i>([‘“\(])
Replace: \1<i>

Search: ([;’”,\)\]\.])</i>
Replace: </i>\1

Search: [\(]<i>([^<]+)</i>[\)]
Replace: <i>(\1)</i>

just thought i'd throw in that you don't need to escape most metacharacters inside a character class, so you could rewrite

([‘“\(])
([‘“(])

([;’”,\)\]\.])
([];’”,).]) <-- a ] placed first in the class is taken as a literal, so it needs no escape

[\(]([^<]+)[\)]
[(]([^<]+)[)]
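
A quick way to check that the escaped and unescaped character classes behave the same is to run both over a sample string. This is a minimal sketch using Python's re module, where a leading ] inside a class is also treated as a literal (some other engines, such as JavaScript's, do not follow that rule). The helper name same_matches and the sample string are made up for illustration.

```python
import re


def same_matches(escaped: str, bare: str, sample: str) -> bool:
    """True if both patterns find exactly the same characters in sample."""
    return re.findall(escaped, sample) == re.findall(bare, sample)


# One of each character from both classes, plus some that should not match.
sample = "‘“(;’”,)].[ab: "

print(same_matches(r"([‘“\(])", r"([‘“(])", sample))
print(same_matches(r"([;’”,\)\]\.])", r"([];’”,).])", sample))
```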

Tex2002ans · 07-01-2014, 10:09 PM

Quote:

Originally Posted by mzmm

just thought i'd throw in that you don't need to escape most metacharacters inside a character class, so you could rewrite

[...]

Thanks a lot for the info.

That Regex was just one of the things I created WAYYYY back when I first started figuring out Regex, and since it continued to work so well, I just didn't mess with it. And better to be safe with escapes than sorry!

I actually stumbled across a few cases in the past few days of left and right brackets '[' ']', which might have to be added to Regex #1 and Regex #2.

ALSO, there is the odd case I forgot to mention of the wrong punctuation being italicized (QUITE common OCR error). For example,

Quote:

Stigler, George. 1961. “The Economics of Information.<i>” Journal of Political Economy</i> 69.

As you can see here, the RIGHT double quote is included in the italics, but isn't in my Regex #1.

I typically tackle these on a case-by-case basis at a later date (sometimes I can spot other errors when this occurs). For example, quite often a quotation mark can be the wrong way around, OR, the "smart punctuation" algorithm went haywire, and an anomaly occurred.
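
Since these cases are handled one by one, a script can at least list them for review. The following Python sketch flags closing-type punctuation that sits right after an opening <i>. The pattern, the helper name find_suspects, and the placement of the tags in the sample line are all assumptions made for illustration.

```python
import re

# Closing-type punctuation directly after an opening <i> is suspicious.
SUSPECT = re.compile(r"<i>\s*([’”,;\.\)\]])")


def find_suspects(html: str):
    """Return (position, character) pairs to check by hand; changes nothing."""
    return [(m.start(), m.group(1)) for m in SUSPECT.finditer(html)]


line = "“The Economics of Information.<i>” Journal of Political Economy</i> 69."
for position, char in find_suspects(line):
    print(position, char)
```

It only reports positions, so whether the quote mark is the wrong way around or just in the wrong place is still a manual decision.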
