#1
Wizard
Posts: 2,624
Karma: 3120635
Join Date: Jan 2009
Device: Kindle PW3 (wifi)
About parentheses
Hi
Most of the time (there are some exceptions), parentheses should come as a matching pair, with an opening and a closing one. "Matched" also means that both should be straight, or sometimes both italic, but we should not have a straight one on one side and an italic one on the other (or the other way round).
As I do not know how to fix this mistake in a word processor, it ends up in the EPUB. It could be part of a larger problem, because it potentially concerns not only round brackets (parentheses) but maybe square, curly and angle brackets too.
Here is a text excerpt with some mistakes of this kind that I'd like to know how to detect and/or correct. The culprit can be found three times, after the word "Soir".
Code:
Charles Baudinat (de <i>France-Soir)</i>, Vladimir Bentz à Berlin, Danièle Berthemet, Sabine Cayrol, Henriette Chandet (du fin fond de <i>Paris-Soir),</i> Jacques Chapus (<i>France Soir), </i>Jacques-Olivier Chatard à Londres,
Code:
(de <i>France-Soir</i>), <span class="italic"> and </span>
Questions
How to detect an orphaned parenthesis?
How to detect an unmatched pair? (I mean both straight and italic)
Last edited by roger64; 06-26-2014 at 02:27 AM.
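One possible way to approach the two questions above, sketched in Python rather than in an editor's search box. This is only an illustration under assumptions: the helper names are hypothetical, the markup is assumed to use plain <i>...</i> tags without nesting, and only round brackets are checked.

```python
import re

# Assumption: italics are plain, non-nested <i>...</i> runs.
ITALIC = re.compile(r"<i>(.*?)</i>", re.DOTALL)


def find_mismatched_italics(xhtml):
    """Return italic runs whose own parentheses do not balance,
    i.e. one bracket of a pair is italic and the other is straight."""
    hits = []
    for m in ITALIC.finditer(xhtml):
        inner = m.group(1)
        if inner.count("(") != inner.count(")"):
            hits.append(m.group(0))
    return hits


def orphaned_parens(xhtml):
    """Return positions (in the tag-stripped text) of parentheses
    that have no partner at all."""
    plain = re.sub(r"<[^>]+>", "", xhtml)
    open_stack, orphans = [], []
    for pos, ch in enumerate(plain):
        if ch == "(":
            open_stack.append(pos)
        elif ch == ")":
            if open_stack:
                open_stack.pop()
            else:
                orphans.append(pos)  # closing bracket with no opener
    return sorted(orphans + open_stack)  # leftovers are unclosed openers


sample = (
    "Charles Baudinat (de <i>France-Soir)</i>, Vladimir Bentz à Berlin, "
    "Henriette Chandet (du fin fond de <i>Paris-Soir),</i> "
    "Jacques Chapus (<i>France Soir), </i>Jacques-Olivier Chatard"
)

mismatched = find_mismatched_italics(sample)
for hit in mismatched:
    print(hit)
```

On the excerpt from the post, this flags the three italic runs that end with a closing parenthesis, while orphaned_parens(sample) finds nothing, because every bracket does have a partner once the tags are ignored.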
#2
Wizard
Posts: 2,306
Karma: 13057279
Join Date: Jul 2012
Device: Kobo Forma, Nook
I tend to work from a lot of PDF -> EPUB, so there are A TON of inconsistent formatting errors just like this... so in my case, the best bet is to just pull out all of the punctuation on the edges of the italics, and then add them on a case-by-case basis if need be.
These are the two sets of Regex that I personally use.
Regex #1:
Search: <span class="italics">([‘“\(])
Replace: \1<span class="italics">
The opening tag, <span class="italics">, can be replaced with <i> if needed.
The character class, [‘“\(], currently has a "left single quote" + "left double quote" + "left parenthesis". You can replace that section with whatever characters you want outside of the italics. Perhaps you might want to toss a space in there as well.
Regex #2:
Search: ([;’”,\)\]\.])</span>
Replace: </span>\1
The closing tag, </span>, can be replaced with </i> if needed.
The character class, [;’”,\)\]\.], currently has a "semicolon" + "right single quote" + "right double quote" + "comma" + "right parenthesis" + "right bracket" + "period". You can replace that section with whatever characters you want outside of the italics. Perhaps you might want to toss a space in there as well + a colon.
Note: I tend to avoid sticking a colon in this Regex. I would take care of those manually later. I personally find a hell of a lot more false positives on the colon than all the other punctuation.
Warnings on Regex #2:
If you use a lot of OTHER spans, this search/replace may pull out a lot of punctuation you don't want to move. You may just want to decide on a case-by-case basis instead of Replace All.
Side Note: In all of my work, I have everything stripped down to just bold and italics, THEN I run the Regex. I add in more complex classes at a much later step, so I personally have zero extraneous spans lying around.
If you use a lot of named entities ("&nbsp;"), Regex #2 will break them ("&nbsp</span>;"). You may just want to run this Search/Replace afterwards:
Search: nbsp</span>;
Replace: nbsp;</span>
Or depending on how many of these you have in your source material, you may just want to remove the semicolon from Regex #2.
How I use it:
I tend to just run these two Regex as Replace All multiple times (it helps me sometimes spot errors depending on how many times I have to Replace).
If I Replace All more than 4 times, and am STILL replacing characters, I know that there must be some sort of formatting I have to keep an eye out for. (Example: in some TOC or tables, periods go straight across the page.)
Example: If I Replace All six times, it may lead me to this:
Code:
<td><span class="italics">1.04....</span>......</td>
For example, this:
Search: [\(]<i>([^<]+)</i>[\)]
Replace: <i>(\1)</i>
would catch this:
Code:
(<i>Example words inside parenthesis</i>)
and turn it into this:
Code:
<i>(Example words inside parenthesis)</i>
Last edited by Tex2002ans; 06-26-2014 at 02:40 AM.
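The "Replace All several times and count the passes" routine described above can be sketched in Python. The two patterns are copied from the post; everything else (the function name, the max_passes safety limit, the sample strings) is an assumption made for illustration, and Python's re module is not the same engine as the one in Sigil or Calibre, so treat this as a rough sketch only.

```python
import re

# Regex #1 and Regex #2 from the post, unchanged.
OPENING = re.compile(r'<span class="italics">([‘“\(])')
CLOSING = re.compile(r'([;’”,\)\]\.])</span>')


def pull_punctuation_out(xhtml, max_passes=20):
    """Apply both replacements until nothing changes.

    Returns (new_text, passes), where passes counts the rounds that
    actually changed something. max_passes is a hypothetical safety limit.
    """
    passes = 0
    while passes < max_passes:
        new = OPENING.sub(r'\1<span class="italics">', xhtml)
        new = CLOSING.sub(r'</span>\1', new)
        if new == xhtml:  # nothing left to move
            break
        xhtml = new
        passes += 1
    return xhtml, passes


fixed, passes = pull_punctuation_out(
    '(de <span class="italics">France-Soir),</span> etc.'
)
print(passes, fixed)

toc, toc_passes = pull_punctuation_out(
    '<td><span class="italics">1.04..........</span></td>'
)
if toc_passes > 4:  # threshold taken from the post
    print("Many passes needed; check this formatting by hand:", toc)
```

Each round moves one character per tag, so a high pass count points at runs of punctuation such as the dotted leaders in a TOC, which matches the warning sign described in the post.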
#3
Wizard
Posts: 2,624
Karma: 3120635
Join Date: Jan 2009
Device: Kindle PW3 (wifi)
@Tex2002ans
WOW! Thanks a lot for this!! I dreamt about it. Tex did it! I'll use it for sure (I hope the regexes are compatible with the Calibre editor?)
#4
Wizard
Posts: 2,306
Karma: 13057279
Join Date: Jul 2012
Device: Kobo Forma, Nook
Sigil auto-cleans up the spaces before the closing tags, so that is why I don't actually include them in the Regex. I am not too sure what Calibre does when it cleans the code.
#5
Wizard
Posts: 2,624
Karma: 3120635
Join Date: Jan 2009
Device: Kindle PW3 (wifi)
OK.
One of the good things about the Calibre editor is that it systematically gets rid of the unloved &nbsp; and replaces it with its UTF-8 equivalent.
Last edited by roger64; 06-26-2014 at 04:02 AM.
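For readers who want the same effect outside the editor, here is a minimal sketch. It assumes the entity meant here is &nbsp; (the one discussed earlier in the thread) and that its equivalent is the no-break space character U+00A0. The function name is hypothetical, and this is not Calibre's own code.

```python
def replace_nbsp_entity(xhtml):
    """Swap the named entity for the literal no-break space (U+00A0).

    Only &nbsp; is touched, so entities that XHTML needs
    (&amp;, &lt;, &gt;) stay escaped.
    """
    return xhtml.replace("&nbsp;", "\u00a0")


cleaned = replace_nbsp_entity("<p>12&nbsp;km &amp; more</p>")
print(repr(cleaned))
```

Running this before Regex #2 would also sidestep the broken-entity problem mentioned earlier, since no "nbsp;" with a trailing semicolon is left for the regex to split.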
#6
Groupie
Posts: 171
Karma: 86271
Join Date: Feb 2012
Device: iPad, Kindle Touch, Sony PRS-T1
([‘“\(]) can be written as ([‘“(])
([;’”,\)\]\.]) can be written as ([];’”,).]) <-- a ] placed right after the opening [ is not treated as the closing bracket
[\(]<i>([^<]+)</i>[\)] can be written as [(]<i>([^<]+)</i>[)]
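A quick way to convince yourself that the escaped and unescaped forms behave the same is to run both over a sample and compare the match positions. The sketch below does this with Python's re module, which is an assumption on my part: the editors discussed in this thread may use a different engine, although the character-class rules shown here are common to most of them. SAMPLE and same_matches are hypothetical names.

```python
import re

# (escaped form, unescaped form) pairs taken from the post above.
PAIRS = [
    (r'([‘“\(])', r'([‘“(])'),
    (r'([;’”,\)\]\.])', r'([];’”,).])'),
    (r'[\(]<i>([^<]+)</i>[\)]', r'[(]<i>([^<]+)</i>[)]'),
]

# Made-up sample containing every character the classes mention.
SAMPLE = 'x (<i>France-Soir</i>) ‘a’ “b” [c]; d, e. (f)'


def same_matches(escaped, plain, text=SAMPLE):
    """True if both patterns match at exactly the same positions."""
    return ([m.span() for m in re.finditer(escaped, text)]
            == [m.span() for m in re.finditer(plain, text)])


results = [same_matches(a, b) for a, b in PAIRS]
print(results)
```

Inside a character class, characters such as ( ) and . lose their special meaning, which is why the backslashes are optional there; the ] is the one that needs care, hence its move to the front of the class.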
#7
Wizard
Posts: 2,306
Karma: 13057279
Join Date: Jul 2012
Device: Kobo Forma, Nook
That Regex was just one of the things I created WAYYYY back when I first started figuring out Regex, and since it continued to work so well, I just didn't mess with it. And better to be safe with escapes than sorry!
I actually stumbled across a few cases in the past few days of left and right brackets, '[' and ']', which might have to be added to Regex #1 and Regex #2.
ALSO, there is the odd case I forgot to mention of the wrong punctuation being italicized (QUITE a common OCR error).
I typically tackle these on a case-by-case basis at a later date (sometimes I can spot other errors when this occurs). For example, quite often a quotation mark can be the wrong way around, OR the "smart punctuation" algorithm went haywire and an anomaly occurred.