MobileRead Forums - View Single Post - Delete paragraphs in scanned books (S & R with regexes)

Tex2002ans · 04-26-2016, 05:44 PM

Question: Is there an actual space before the final closing </p>? And can it actually be relied upon?

In my experience, I wouldn't trust this with a ten-foot pole, and would check each one on a case-by-case basis. I definitely wouldn't rely completely on a Replace All.

Regex Solutions

I would handle this specific cleanup in a few passes.

First, make sure that you SAVE A COPY before you do anything. Then make sure you don't press Replace All unless you know exactly what you are doing (and have tested a few to make sure the Regex is working properly). Even then, make sure you do a code comparison of the Before/After to make sure you didn't delete key parts of the text.
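The same save-a-copy, then compare Before/After routine can also be scripted outside the editor. This is only a rough sketch, assuming the chapter is a UTF-8 text file on disk; the function name replace_with_backup and the ".bak" suffix are made up for illustration and are not part of Sigil or calibre.

```python
import difflib
import re
import shutil
from pathlib import Path


def replace_with_backup(path, pattern, replacement):
    """Back up `path`, apply one regex replacement, and return a unified diff."""
    source = Path(path)
    backup = source.with_name(source.name + ".bak")  # hypothetical naming scheme
    shutil.copy2(source, backup)  # SAVE A COPY before touching anything

    before = source.read_text(encoding="utf-8")
    after = re.sub(pattern, replacement, before)
    source.write_text(after, encoding="utf-8")

    # Before/After comparison, so any text that vanished is easy to spot.
    diff = difflib.unified_diff(
        before.splitlines(),
        after.splitlines(),
        fromfile=str(backup),
        tofile=str(source),
        lineterm="",
    )
    return "\n".join(diff)
```

Reading the returned diff line by line plays the same role as the code comparison: every line starting with "-" is something the replacement removed.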

Before Examples

I would just do a simple Search and Replace to strip out all:

<p class="calibre2"></p>

and

<p class="calibre2"/>

Example #1-3

If you run the above Search/Replaces, then example #1-3 can be condensed into this:

Search: [0-9]+</p>\s+<p class="calibre2">
Replace: *BLANK OR A SPACE*

Note: In these examples, Red denotes the Regex that matches the page numbers (here, [0-9]+).

Note: In English, the Red portion says "look for 1 or more digits in a row".

The Blue portion (\s+) says "look for 1 or more whitespace characters".

Note: There can be legitimate usages of numbers (for example, years/dates/ages). Be careful.
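For anyone who wants to try the two passes outside the editor first, here is a small Python sketch. It assumes the paragraphs use class="calibre2" exactly as in the examples; the function names and the sample text are invented for illustration. The preview function is my own addition, meant to make the "check before Replace All" step easier.

```python
import re

# The two empty-paragraph forms stripped in the "Before Examples" pass.
EMPTY_PARAGRAPHS = ('<p class="calibre2"></p>', '<p class="calibre2"/>')

# The Search expression from Example #1-3.
PAGE_NUMBER = re.compile(r'[0-9]+</p>\s+<p class="calibre2">')


def strip_empty_paragraphs(html):
    """First pass: plain (non-regex) removal of the empty paragraphs."""
    for tag in EMPTY_PARAGRAPHS:
        html = html.replace(tag, "")
    return html


def preview_matches(html, context=25):
    """List each match with some surrounding text, for manual review."""
    cleaned = strip_empty_paragraphs(html)
    return [
        cleaned[max(m.start() - context, 0):m.end() + context]
        for m in PAGE_NUMBER.finditer(cleaned)
    ]


def join_across_page_numbers(html, replacement=""):
    """Second pass: drop the page number and rejoin the split paragraph."""
    return PAGE_NUMBER.sub(replacement, strip_empty_paragraphs(html))


sample = (
    '<p class="calibre2">the old house stood 12</p>\n'
    '<p class="calibre2"></p>\n'
    '<p class="calibre2">on the hill.</p>'
)
print(join_across_page_numbers(sample))
```

The order matters: if the empty paragraph were still there, the expression would match up to its opening tag and leave a stray closing tag behind. A paragraph that legitimately ends in a year or an age is matched too, which is why previewing each hit is worth the time.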

Example #4

Search: [IXVL]+</p>\s+<p class="calibre2">
Replace: *BLANK OR A SPACE*

Note: In English, Red says "look for 1 or more of the characters 'I', 'X', 'V', or 'L' in a row". This should match roman numerals like "IX", "XIII", "XXIV". Numerals that contain "C", "D", or "M" will not be fully matched.

Note: "I" is used very often in English, so be careful.

Note: Make sure you have the "Case-sensitive" button turned on.
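Here is a quick sketch of why both warnings matter, written in Python, where regex matching is case-sensitive unless re.IGNORECASE is passed (roughly the equivalent of having the button turned on). The capture group and the sample text are my own additions for illustration.

```python
import re

# Example #4's expression, with a capture group added around the numeral
# so the matched text can be listed and reviewed.
ROMAN_PAGE_NUMBER = re.compile(r'([IXVL]+)</p>\s+<p class="calibre2">')


def roman_candidates(html):
    """Return the text that would be removed as a roman page number."""
    return [m.group(1) for m in ROMAN_PAGE_NUMBER.finditer(html)]


sample = (
    'end of the preface XIV</p>\n<p class="calibre2">'
    'and so did I</p>\n<p class="calibre2">'
    'lowercase xiv</p>\n<p class="calibre2">stays untouched'
)
print(roman_candidates(sample))
```

The list contains the real page number "XIV" and also the pronoun "I" from the end of the second paragraph, so each hit still needs a look. The lowercase "xiv" is skipped only because matching is case-sensitive.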

Example #5

Search: \[[0-9]+\]
Replace: *BLANK OR A SPACE*

Note: In English, Red says "look for a left bracket" + "look for 1 or more numbers in a row" + "look for a right bracket".
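The same expression works unchanged in Python's re module. A minimal sketch, with an invented function name and sample text:

```python
import re

# \[ and \] are escaped so they mean literal brackets rather than
# the start and end of a character class.
BRACKETED_NUMBER = re.compile(r"\[[0-9]+\]")


def strip_bracketed_numbers(text, replacement=""):
    """Remove markers such as [45]; brackets without digits are left alone."""
    return BRACKETED_NUMBER.sub(replacement, text)


print(strip_bracketed_numbers("the road[45] went on"))
```

If the book also uses bracketed numbers for footnote references, those would be matched as well, so this one deserves the same preview-first treatment.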

After Examples

Beyond that point, you stated that hyphens should be removed... I would strongly recommend against this. Each one of these has to be checked on a case-by-case basis. The hyphen may actually be a hard hyphen (for example, a word like "all-purpose" might have been broken across pages).

For checking hyphens at the end of paragraphs, I personally run this regex:

Search: -</p>\s+<p>
Replace: *BLANK*

It shouldn't be too bad manually correcting these. In reality, you only have to check a handful of hyphens that were at the end of pages.
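To make the manual check quicker, the hits can be listed instead of replaced. This is just a sketch: the (\w+) groups on either side are my own extension of the expression above, added so that each word fragment is visible, and the sample text is invented.

```python
import re

# The hyphen expression, extended with capture groups for the word
# fragment before the hyphen and the one that opens the next paragraph.
HYPHEN_BREAK = re.compile(r"(\w+)-</p>\s+<p>(\w+)")


def hyphen_candidates(html):
    """Return (left, right) fragments for every paragraph-ending hyphen."""
    return [(m.group(1), m.group(2)) for m in HYPHEN_BREAK.finditer(html)]


sample = (
    "an all-</p>\n<p>purpose tool and a mag-</p>\n<p>nificent view"
)
for left, right in hyphen_candidates(sample):
    print(f"{left}-{right}  or  {left}{right} ?")
```

Here "all-purpose" must keep its hyphen while "magnificent" must lose it, and the regex cannot tell the two apart, which is the reason a blind Replace All is a bad idea.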

I would highly recommend learning at least the basics of Regex:

http://www.regular-expressions.info/quickstart.html

There is also a huge "Regex examples" thread in the Sigil section of the forums:

https://www.mobileread.com/forums/sho...d.php?t=167971

These examples you posted are relatively easy.

Side Note: Thanks for saving your example images as PNG. Vastly superior compared to people who post screenshots as JPG.

Last edited by Tex2002ans; 04-26-2016 at 06:12 PM.