04-26-2016, 05:44 PM   #4
Tex2002ans
Question: Is there an actual space before the final closing </p>? And can it actually be relied upon?

In my experience, I wouldn't trust this at all, and would check each one on a case-by-case basis. I definitely wouldn't rely completely on a Replace All.

Regex Solutions

I would handle this specific cleanup in a few passes.

First, make sure that you SAVE A COPY before you do anything. Then make sure you don't press Replace All unless you know exactly what you are doing (and have tested a few to make sure the Regex is working properly). Even then, make sure you do a code comparison of the Before/After to make sure you didn't delete key parts of the text.
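For the Before/After comparison, here is a minimal sketch using Python's standard difflib. This is only one possible way to do it; the function name show_changes and the sample strings are made up for illustration, and any diff tool would do the same job.

```python
import difflib

def show_changes(before: str, after: str) -> list[str]:
    """Return unified-diff lines between the saved copy and the cleaned text."""
    return list(difflib.unified_diff(
        before.splitlines(), after.splitlines(),
        fromfile="before", tofile="after", lineterm=""))

# Hypothetical sample: a page number sitting between two halves of a paragraph.
before = ('<p class="calibre2">Some text 12</p>\n'
          '<p class="calibre2">continues here.</p>')
after = '<p class="calibre2">Some text continues here.</p>'

for line in show_changes(before, after):
    print(line)
```

Lines starting with "-" were removed and lines starting with "+" were added, so any deleted text that was not a page number should stand out.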

Before Examples

I would just do a simple Search and Replace to strip out all:

<p class="calibre2"></p>

and

<p class="calibre2"/>
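If you would rather script this step than use the editor's Search and Replace, a rough Python sketch might look like the following. It assumes the empty paragraphs appear exactly as written above (same class, no extra whitespace); the function name is hypothetical.

```python
# The two empty-paragraph forms named in the post, matched as literal text.
EMPTY_PARAGRAPHS = ('<p class="calibre2"></p>', '<p class="calibre2"/>')

def strip_empty_paragraphs(html: str) -> str:
    """Remove every literal occurrence of the two empty-paragraph forms."""
    for marker in EMPTY_PARAGRAPHS:
        html = html.replace(marker, "")
    return html

sample = ('<p class="calibre2">Text</p>'
          '<p class="calibre2"></p>'
          '<p class="calibre2"/>')
print(strip_empty_paragraphs(sample))
```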

Example #1-3

If you run the above Search/Replaces, then example #1-3 can be condensed into this:

Search: [0-9]+</p>\s+<p class="calibre2">
Replace: *BLANK OR A SPACE*

Note: In these examples, Red denotes the part of the Regex that matches the page numbers ([0-9]+), and Blue denotes the whitespace part (\s+).

Note: In English, the Red portion, [0-9]+, says "look for 1 or more digits in a row".

The Blue portion, \s+, says "look for 1 or more whitespace characters".

Note: There can be legitimate usages of numbers (for example, years/dates/ages). Be careful.
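To test the Regex before committing to anything, here is a small Python sketch that lists the matches first and only then replaces. The pattern is the one from the Search above; the function names and the sample text are assumptions for illustration.

```python
import re

# The search pattern from the post: digits, end of paragraph,
# whitespace, start of the next paragraph.
PAGE_NUMBER = re.compile(r'[0-9]+</p>\s+<p class="calibre2">')

def preview_matches(html: str) -> list[str]:
    """List every match so each one can be checked before replacing."""
    return [m.group(0) for m in PAGE_NUMBER.finditer(html)]

def join_paragraphs(html: str, replacement: str = "") -> str:
    """Replace each match with blank (the default) or a space."""
    return PAGE_NUMBER.sub(replacement, html)

sample = ('<p class="calibre2">The storm grew 12</p>\n'
          '<p class="calibre2">worse overnight.</p>')
print(preview_matches(sample))
print(join_paragraphs(sample))
```

A paragraph that legitimately ends in a year (for example "born in 1912") is matched in exactly the same way, which is why the preview step matters.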

Example #4

Search: [IXVL]+</p>\s+<p class="calibre2">
Replace: *BLANK OR A SPACE*

Note: In English, Red ([IXVL]+) says "look for 1 or more of the characters 'I', 'X', 'V', 'L' in a row". This should match Roman numerals like "IX", "XIII", "XXIV".

Note: "I" is used very often in English, so be careful.

Note: Make sure you have the "Case-sensitive" button turned on.
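A quick Python sketch of why both warnings matter. Python's re is case-sensitive by default, which corresponds to having the "Case-sensitive" button on; the sample strings and the function name are made up for illustration.

```python
import re

PATTERN = r'[IXVL]+</p>\s+<p class="calibre2">'
ROMAN_PAGE = re.compile(PATTERN)  # case-sensitive by default

def find_roman_candidates(html: str) -> list[str]:
    """Return every match so each one can be reviewed by hand."""
    return [m.group(0) for m in ROMAN_PAGE.finditer(html)]

numeral = ('<p class="calibre2">end of the preface XIV</p>\n'
           '<p class="calibre2">and so on.</p>')
pronoun = ('<p class="calibre2">So did I</p>\n'
           '<p class="calibre2">Next paragraph.</p>')
lower = ('<p class="calibre2">see page xiv</p>\n'
         '<p class="calibre2">Next.</p>')

print(find_roman_candidates(numeral))  # the page number is matched
print(find_roman_candidates(pronoun))  # the pronoun "I" is matched too
print(find_roman_candidates(lower))    # lowercase text is left alone
```

With case-insensitive matching turned on, the lowercase sample would match as well, along with any paragraph ending in a lowercase i, x, v, or l.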

Example #5

Search: \[[0-9]+\]
Replace: *BLANK OR A SPACE*

Note: In English, Red says "look for a left bracket" (\[) + "look for 1 or more digits in a row" ([0-9]+) + "look for a right bracket" (\]).
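The same Search, sketched in Python under the assumption that the bracketed numbers are plain text like "[12]". The function name and the sample sentences are hypothetical.

```python
import re

# Left bracket, one or more digits, right bracket.
BRACKET_NUMBER = re.compile(r'\[[0-9]+\]')

def strip_bracket_numbers(text: str, replacement: str = "") -> str:
    """Replace each bracketed number with blank (the default) or a space."""
    return BRACKET_NUMBER.sub(replacement, text)

print(strip_bracket_numbers('as shown earlier[12] in the text'))
print(strip_bracket_numbers('see [a] and []'))  # no digits inside, so untouched
```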

After Examples

Beyond that point, you stated that hyphens should be removed... I would strongly recommend against this. Each one of these has to be checked on a case-by-case basis. The hyphen may actually be a hard hyphen (for example, the word "all-purpose" might have been broken across pages right at its hyphen).

For checking hyphens at the end of paragraphs, I personally run this regex:

Search: -</p>\s+<p>
Replace: *BLANK*

It shouldn't be too bad manually correcting these. In reality, you only have to check a handful of hyphens that were at the end of pages.
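Since each hyphen needs a human decision, a script can at least collect the candidates for review. The sketch below is a variation on the Search above, not the same Regex: the two capture groups for the surrounding word fragments are my own addition, and the function name and sample text are made up.

```python
import re

# The hyphen Search from the post, with capture groups added (an assumption)
# so the word fragments on both sides are shown for review.
HYPHEN_BREAK = re.compile(r'(\w+)-</p>\s+<p>(\w+)')

def list_hyphen_breaks(html: str) -> list[tuple[str, str]]:
    """Return (left fragment, right fragment) pairs without changing anything."""
    return HYPHEN_BREAK.findall(html)

sample = ('<p>It was an all-</p>\n'
          '<p>purpose tool for hy-</p>\n'
          '<p>phenation.</p>')
for left, right in list_hyphen_breaks(sample):
    print(f"{left}-{right}  or  {left}{right} ?")
```

Nothing is replaced here on purpose: "all-purpose" should keep its hyphen while "hyphenation" should lose it, and only a reader can tell the two apart.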

I would highly recommend learning at least the basics of Regex:

http://www.regular-expressions.info/quickstart.html

There is also a huge "Regex examples" thread in the Sigil section of the forums:

https://www.mobileread.com/forums/sho...d.php?t=167971

These examples you posted are relatively easy.

Side Note: Thanks for saving your example images as PNG. Vastly superior compared to people who post screenshots as JPG.

Last edited by Tex2002ans; 04-26-2016 at 06:12 PM.