Delete paragraphs in scanned books (S & R with regexes)

chaot · 04-26-2016, 12:22 PM

Scanned books often show unwanted paragraph breaks around the original book page numbers when viewed on screen or on an e-reader.

The patterns vary, but they appear en masse, so eliminating them using S & R with regexes would be advantageous.
The marked syntax (red) should be deleted. Note that the book page numbers always differ (of course).

Where are our great regex masters!?

Some examples from different books:

Example 1

Code:

keine Anzeichen für körperliche Mängel zu erkennen. </p>

  <p class="calibre2">Normal? Der US-Geheimdienst OSS (Office of Strategic 169</p>

  <p class="calibre2"></p>

  <p class="calibre2">Studies, Vorläufer der CIA), oder genauer, der von ihm

Example 2 (note the hyphen, which should also be deleted)

Code:

derartigen Mangel hingewiesen hätten, aber die ärztlichen Feststel-170</p>

  <p class="calibre2"></p>

  <p class="calibre2">lungen lauteten nach dem Krieg nicht anders als

Example 3

Code:

die natürlich ihre Blöße nicht deckten, denn es war </p>

  <p class="calibre2">17</p>

  <p class="calibre2"></p>

  <p class="calibre2">keiner anwesend (außer mir), der nicht mindestens seine

Example 3a

Code:

das viel zu herb und zu modisch für sie ist, irgendein <b class="calibre3">19</b></p>

  <p class="calibre2"></p>

  <p class="calibre2">Zeug, das, glaube ich, Taiga heißt, noch in der Wohnung

Example 4 (note the Roman rather than Arabic numerals!)

Code:

bewundernden Kommentare von westlichen Besuchern in Maos China, XVI </p>

  <p class="calibre2"></p>

  <p class="calibre2">dass Chinesen außerordentliche Menschen seien, die es

Example 5

Code:

ihr Büro war für die [306] Sicherheit eines Parkabschnitts zuständig.

Interna: Ex1&2 Bedürftig (AHdAb), Ex3&3a Böll (AeC), Ex4&5 Chang (WS)

Divingduck · 04-26-2016, 12:53 PM

I guess you didn't touch the conversion parameters before you started converting, so everything is on the default setup. That is not the best choice for PDF conversion.

You need to set a better line unwrapping factor for PDF input files. The default is 0.45; try something between 0.15 and 0.25.

chaot · 04-26-2016, 01:19 PM

Sorry, this has nothing to do with PDF; these are all excerpts from what are now EPUBs, which were previously scanned from real books.

Tex2002ans · 04-26-2016, 05:44 PM

Question: Is there an actual space before the final closing </p>? And can it actually be relied upon?

In my experience, I wouldn't trust this with a ten foot pole, and would have to check each one on a case-by-case basis. I definitely wouldn't completely rely on a Replace All.

Regex Solutions

I would handle this specific cleanup in a few passes.

First, make sure that you SAVE A COPY before you do anything. Then make sure you don't press Replace All unless you know exactly what you are doing (and have tested a few to make sure the Regex is working properly). Even then, make sure you do a code comparison of the Before/After to make sure you didn't delete key parts of the text.

Before Examples

I would just do a simple Search and Replace to strip out all:

<p class="calibre2"></p>

and

<p class="calibre2"/>

Example #1-3

If you run the above Search/Replaces, then example #1-3 can be condensed into this:

Search: [0-9]+</p>\s+<p class="calibre2">
Replace: *BLANK OR A SPACE*

Note: In these examples, Red denotes the Regex that matches the page numbers.

Note: In English, the Red portion ([0-9]+) says "look for 1 or more digits in a row".

The Blue portion (\s+) says "look for 1 or more whitespace characters".

Note: There can be legitimate usages of numbers (for example, years/dates/ages). Be careful.
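The two passes above (strip the empty paragraphs, then remove the page number together with the paragraph break) can also be tried outside the editor. The following is only a minimal sketch in Python: it uses the standard `re` module rather than the editor's own regex engine, the helper name `strip_page_numbers` is made up for illustration, and it assumes the paragraph class is `calibre2` as in Example 1.

```python
import re

# Pass 1 targets: the empty paragraphs left behind around the page break.
EMPTY_PARAGRAPHS = ('<p class="calibre2"></p>', '<p class="calibre2"/>')

# Pass 2: a run of digits directly before </p>, the whitespace between the
# paragraphs, and the opening tag of the next paragraph.
PAGE_NUMBER_BREAK = re.compile(r'[0-9]+</p>\s+<p class="calibre2">')


def strip_page_numbers(html):
    """Hypothetical helper: join paragraphs that a page number split."""
    for empty in EMPTY_PARAGRAPHS:
        html = html.replace(empty, '')
    return PAGE_NUMBER_BREAK.sub('', html)


sample = ('Office of Strategic 169</p>\n\n'
          '  <p class="calibre2"></p>\n\n'
          '  <p class="calibre2">Studies, Vorläufer der CIA')
print(strip_page_numbers(sample))
```

A number in running text (a year, for instance) is left alone because the pattern requires `</p>` right after the digits. A paragraph that legitimately ends in a number would still be hit, so the result needs reviewing.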

Example #4

Search: [IXVL]+</p>\s+<p class="calibre2">
Replace: *BLANK OR A SPACE*

Note: In English, Red says "look for 1 or more of the letters 'I', 'X', 'V', 'L' in a row". This should match Roman numerals like "IX", "XIII", "XXIV".

Note: "I" is used very often in English, so be careful.

Note: Make sure you have the "Case-sensitive" button turned on.
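To see both the case sensitivity and the "I" problem in action, here is a small sketch in Python (standard `re` module, which is case-sensitive unless told otherwise). The optional `\s*` before `</p>` is an added assumption, not part of the search above; it is there only because Example 4 shows a space between the numeral and the closing tag.

```python
import re

# [IXVL]+ as in the search above; the \s* before </p> is an added assumption
# to cope with the trailing space seen in Example 4 ("XVI </p>").
ROMAN_BREAK = re.compile(r'[IXVL]+\s*</p>\s+<p class="calibre2">')

page_break = 'in Maos China, XVI </p>\n\n  <p class="calibre2">dass Chinesen'
joined = ROMAN_BREAK.sub('', page_break)
print(joined)

# Lowercase text is left alone because the match is case-sensitive.
lowercase = 'siehe xvi</p>\n<p class="calibre2">Weiter'
untouched = ROMAN_BREAK.sub('', lowercase)
print(untouched == lowercase)

# False positive: a paragraph that really ends in the pronoun "I".
pronoun = 'So did I</p>\n<p class="calibre2">Next paragraph'
damaged = ROMAN_BREAK.sub('', pronoun)
print(damaged)
```

The last case is the reason for stepping through the matches one by one instead of pressing Replace All.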

Example #5

Search: \[[0-9]+\]
Replace: *BLANK OR A SPACE*

Note: In English, Red says "look for a left bracket" + "look for 1 or more numbers in a row" + "look for a right bracket".

After Examples

Beyond that point, you stated that hyphens should be removed... I would strongly recommend against this. Each one of these has to be checked on a case-by-case basis. The hyphen may actually be a hard hyphen (for example, the word "all-purpose" might have been broken across pages).

For checking hyphens at the end of paragraphs, I personally run this regex:

Search: -</p>\s+<p>
Replace: *BLANK*

It shouldn't be too bad manually correcting these. In reality, you only have to check a handful of hyphens that were at the end of pages.
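Since each hyphen has to be judged by a human, one option is to list the candidates instead of replacing them. The sketch below does that in Python; the helper name `hyphen_candidates` is hypothetical, and `<p[^>]*>` is an assumption so that paragraphs carrying a class attribute are found as well.

```python
import re

# A word fragment, a hyphen at the very end of a paragraph, and the first
# word of the next paragraph. <p[^>]*> is an assumption so that paragraphs
# with a class attribute are found as well.
HYPHEN_BREAK = re.compile(r'(\w+)-</p>\s+<p[^>]*>(\w+)')


def hyphen_candidates(html):
    """Hypothetical helper: list (before, after) pairs for manual review."""
    return [(m.group(1), m.group(2)) for m in HYPHEN_BREAK.finditer(html)]


sample = ('die ärztlichen Feststel-</p>\n\n'
          '  <p class="calibre2">lungen lauteten nach dem Krieg\n'
          'an all-</p>\n<p>purpose tool')
for before, after in hyphen_candidates(sample):
    print(before + '-' + after)
```

In this sample the first pair should be joined without the hyphen and the second one keeps it, which is exactly the decision a Replace All cannot make.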

I would highly recommend learning at least the basics of Regex:

http://www.regular-expressions.info/quickstart.html

There is also a huge "Regex examples" thread in the Sigil section of the forums:

https://www.mobileread.com/forums/sho...d.php?t=167971

These examples you posted are relatively easy.

Side Note: Thanks for saving your example images as PNG. Vastly superior compared to people who post screenshots as JPG.

chaot · 04-27-2016, 01:40 PM

@Tex2002ans, thank you very much!

That looks like a lot of work, and I hope you will be able and willing to help in case of other related questions.

Quote:

Originally Posted by Tex2002ans

Question: Is there an actual space before the final closing </p>? And can it actually be relied upon?

At the moment I can't answer that; I need to investigate first.

Quote:

In my experience, I wouldn't trust this with a ten foot pole, and would have to check each one on a case-by-case basis. I definitely wouldn't completely rely on a Replace All.

Nothing beats experience! The examples shown concern only 3 books; hundreds or thousands of others are waiting. So, even if it is very time-consuming, I will follow your advice to be careful with Replace All.
Note: Adding an Example 3a in Post #1 (same book as Example 3)

I often lack Internet access, so I can't tackle the whole catalog of problems at once; that means proceeding selectively. Simple things first.

Quote:

Regex Solutions

I would handle this specific cleanup in a few passes.

First, make sure that you SAVE A COPY before you do anything. Then make sure you don't press Replace All unless you know exactly what you are doing (and have tested a few to make sure the Regex is working properly). Even then, make sure you do a code comparison of the Before/After to make sure you didn't delete key parts of the text.

Code comparison, what kind of tool will do the job best? I got Beyond Compare, Meld and KDiff3.
Probably you mean key parts of the code!?

Quote:

Example #5

Search: \[[0-9]+\]
Replace: *BLANK OR A SPACE*

Note: In English, Red says "look for a left bracket" + "look for 1 or more numbers in a row" + "look for a right bracket".

Replace: *BLANK OR A SPACE* will transform

Code:

ihr Büro war für die [306] Sicherheit

into

Code:

ihr Büro war für die  Sicherheit [2 blank spaces]

respectively into

Code:

ihr Büro war für die    Sicherheit [4 blank spaces]

The regex here, however, should eliminate with [306] also a blank space.
Don't be angry, I'm relatively sure the solution (for the elimination of a space) is
to find anywhere - only I would like a little sense of achievement quick and now.

What's the difference in S&R between the settings Regex and Regex-Function?

Quote:

I would highly recommend learning at least the basics of Regex

Will be done! Great interest exists.

Quote:

There is also a huge "Regex examples" thread in the Sigil section of the forums:

https://www.mobileread.com/forums/sho...d.php?t=167971

Stupid question!? Are these regexes also suitable for calibre?
It might be worth creating something like a regex examples library out of all these examples - you know, cataloged and without the blah-blah.

Tex2002ans · 04-27-2016, 02:45 PM

Quote:

Originally Posted by chaot

Code comparison, what kind of tool will do the job best? I got Beyond Compare, Meld and KDiff3.

I personally use Beyond Compare (I found it was more accurate compared to the other programs I tested).

There is also Calibre's built-in compare: "File" -> "Compare to another book".

Quote:

Originally Posted by chaot

Probably you mean key parts of the code!?

The code (HTML tags) or the text (words/sentences). Both of these might get broken if you made a mistake when typing your Regex!

You might have made a typo and accidentally changed:

Code:

<p>This is a sample sentence. 192</p>

<p>This is a sample sentence too.</p>

into:

Code:

<p>This is a sample sentence.

This is a sample sentence too.</p>

or:

Code:

<p>This is a sampleThis is a sample sentence too.</p>

Sometimes it is very hard to spot the error, and you don't see it until hours later when it is too late (you already made hundreds of other changes and corrections).

I just did this a few days ago... I accidentally typed an extra period in my Regex, and the second character of many words was deleted ("Then" -> "Ten", "Suing" -> "Sing"). I didn't notice until later in the day that I had made the mistake, and I had to manually correct many of the words.
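For a quick Before/After check without a separate diff program, Python's standard `difflib` module can print a unified diff of the two versions. This is only a sketch: the helper name `show_changes` is hypothetical, and the two strings stand in for the contents of the saved copy and the edited file.

```python
import difflib


def show_changes(before, after):
    """Hypothetical helper: unified diff of the code before and after."""
    return list(difflib.unified_diff(
        before.splitlines(), after.splitlines(),
        fromfile='before', tofile='after', lineterm=''))


before = ('<p>This is a sample sentence. 192</p>\n'
          '<p>This is a sample sentence too.</p>')
after = ('<p>This is a sample sentence.</p>\n'
         '<p>This is a sample sentence too.</p>')

changes = show_changes(before, after)
for line in changes:
    print(line)
```

Lines starting with `-` exist only in the saved copy and lines starting with `+` only in the edited version, so a deleted word or tag shows up as a pair of such lines.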

Quote:

Originally Posted by chaot

Note: Adding an Example 3a in Post #1 (same book as Example 3)

Nothing is special about Example 3a.

Search: [0-9]+\s+

All that was added was the Blue code.

Note: If it were up to me, I would strip out all the crap/useless code FIRST... then I could treat Example 3a just like Example 3.
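One way to cover the bold variant from Example 3a and the plain variant with a single pattern is to make the inline tags optional. This is a sketch of that idea in Python, not the exact search shown in the post: the optional `<b ...>` and `</b>` groups and the `\s*` before `</p>` are assumptions derived from the Example 3a code.

```python
import re

# Optional <b ...> and </b> around the digits are an assumption based on
# Example 3a; everything else follows the Example 1-3 search.
BOLD_OR_PLAIN = re.compile(
    r'(?:<b\b[^>]*>)?[0-9]+(?:</b>)?\s*</p>\s+<p class="calibre2">')

example_3a = ('irgendein <b class="calibre3">19</b></p>\n\n'
              '  <p class="calibre2">Zeug, das, glaube ich, Taiga heißt')
bold_result = BOLD_OR_PLAIN.sub('', example_3a)
print(bold_result)

plain = 'denn es war 17</p>\n<p class="calibre2">keiner anwesend'
plain_result = BOLD_OR_PLAIN.sub('', plain)
print(plain_result)
```

Stripping the useless inline tags first, as suggested above, makes this extra complexity unnecessary.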

Quote:

Originally Posted by chaot

The regex here, however, should eliminate with [306] also a blank space.
Don't be angry, I'm relatively sure the solution (for the elimination of a space) is
to find anywhere - only I would like a little sense of achievement quick and now.

I personally just run "Prettify Code" and that fixes the multi-space issues.

You could also just add spaces in the Regex to match your specific book.

Like Example #5 can turn into:

Search: *SPACE*\[[0-9]+\]*SPACE*

Also, you can just do a normal Search/Replace after everything to manually fix the "lots of spaces in a row" problem:

Search: *SPACE**SPACE*
Replace: *SPACE*
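Both ways of dealing with the leftover spaces can be compared on the Example 5 sentence. The sketch below uses Python's standard `re` module; the ` {2,}` pattern is a small variation (an assumption on my part) that collapses any run of spaces in one go instead of repeating the two-spaces-to-one replacement.

```python
import re

text = 'ihr Büro war für die [306] Sicherheit eines Parkabschnitts'

# Variant 1: take one of the surrounding spaces along with the marker.
variant_1 = re.sub(r' \[[0-9]+\]', '', text)
print(variant_1)

# Variant 2: remove the marker alone, then collapse runs of spaces.
without_marker = re.sub(r'\[[0-9]+\]', '', text)
variant_2 = re.sub(r' {2,}', ' ', without_marker)
print(variant_2)
```

Collapsing spaces across a whole file also touches the indentation of pretty-printed code, so variant 1 is the more targeted choice.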

Quote:

Originally Posted by chaot

What's the difference in S&R between the settings Regex and Regex-Function?

https://manual.calibre-ebook.com/function_mode.html

I never used it before... but Regex-Function seems to allow you to use Python code for more powerful Search/Replace.
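As a rough illustration of what such a function could look like, here is a sketch that removes a bracketed page number and leaves exactly one space behind. The `replace(...)` signature is written the way I understand the manual page linked above to describe it, so please check it there before relying on it; the pattern and the `run` wrapper (which only emulates the call with plain `re.sub` outside calibre) are assumptions for the sake of a testable example.

```python
import re


def replace(match, number, file_name, metadata, dictionaries, data,
            functions, *args, **kwargs):
    # Keep exactly one space if the marker had whitespace on either side,
    # otherwise leave nothing behind.
    return ' ' if match.group(1) or match.group(2) else ''


# Outside calibre: emulate the call with plain re.sub for a quick test.
# The extra arguments are dummies; calibre would fill them in itself.
PATTERN = re.compile(r'(\s*)\[[0-9]+\](\s*)')


def run(text):
    return PATTERN.sub(
        lambda m: replace(m, 1, 'sample.html', None, None, None, None), text)


print(run('ihr Büro war für die [306] Sicherheit'))
```

Compared with a plain regex, the function can decide per match what to put back, which is what makes the spacing problem solvable in one pass.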

Quote:

Originally Posted by chaot

Stupid question!? Are these regexes also suitable for calibre?

Yes, I believe Sigil/Calibre use the same Regex Engine. At least, all of the Regexes I have tested work in both Sigil and Calibre.

Quote:

Originally Posted by chaot

It might be worth creating something like a regex examples library out of all these examples - you know, cataloged and without the blah-blah.

I don't believe there is a collection like that. Once you really learn the basics (by reading regular-expressions.info), you could really come up with all the Regex by yourself. That is what I do for the most part, I just create them on-the-fly as I need them... because each book's code comes with its own problems.

As you can see, a book might have:

a bold page number
an italic page number
a page number on its own line
a page number in the middle of text.
a bold+italic page number
[###]
(###)
###
###
###
###
[...]

It would make no sense to create a giant list of Regex for each of those... because they all follow the same basic rules!

I would just visit regular-expressions.info and follow along with the tutorials. It has lots of examples to learn from!

theducks · 04-27-2016, 03:12 PM

Quote:

Originally Posted by Tex2002ans

I don't believe there is a collection like that. Once you really learn the basics (by reading regular-expressions.info), you could really come up with all the Regex by yourself. That is what I do for the most part, I just create them on-the-fly as I need them... because each book's code comes with its own problems.

As you can see, a book might have:

a bold page number
an italic page number
a page number on its own line
a page number in the middle of text.
a bold+italic page number
[###]
(###)
###
###
###
###
[...]

It would make no sense to create a giant list of Regex for each of those... because they all follow the same basic rules!

I would just visit regular-expressions.info and follow along with the tutorials. It has lots of examples to learn from!

And all of the above in the same book (OCR of scan)

IMHO what is also important is the ORDER in which you fix them. If you don't get it right, the next fix (or join) will be more difficult.

I remove all Page Header types (Section/Title or Author) with a page number first (this is more than 1 template, as there are right/left side variations).
I believe the text paragraph that includes the page # is near the last thing I fix.
(I just look and do the needed REGEX now)

Learn basic REGEX.

Tex2002ans · 04-27-2016, 04:23 PM

Quote:

Originally Posted by theducks

IMHO what is also important is the ORDER in which you fix them. If you don't get it right, the next fix (or join) will be more difficult.

Yep! And this is why you should try to normalize/clean the code as much as possible FIRST.

For example, here is some hideous code right out of an InDesign EPUB:

Quote:

The point is, as we can readily see, the ability to foresee an event is not at all equivalent to agreeing to it. Yes, I can full well predict that if I move to the South Bronx, I’ll likely be victimized by street crime. But this is not at all the same thing as acquiescing in such nefarious activities. Yet, according to the “libertarian” argument we are considering, the two are indistinguishable.

First thing I do is go through the code and strip it down to this:

Quote:

The point is, as we can readily see, the ability to foresee an event is not at all equivalent to agreeing to it. Yes, I can full well predict that if I move to the South Bronx, I’ll likely be victimized by street crime. But this is not at all the same thing as acquiescing in such nefarious activities. Yet, according to the “libertarian” argument we are considering, the two are indistinguishable.

and then it makes it much easier to do later fixes.

Diap's Editing Toolbag is great for cleaning up code:

https://www.mobileread.com/forums/sho....php?p=2980740

It is also great for helping get rid of a ton of the useless classes, or changing certain tags into other tags.

Each book is different, so you can't just have a big list of "Regexes to clean page numbers" that you can run on Book A + Book B + [...] + Book Z.

And with Calibre conversion code on top of this... the calibre# classes are completely different in each EPUB:

calibre2 in Book A might be the page numbers
calibre2 in Book B might be italics
[...]
calibre2 in Book Z might be headings
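Because the class names mean something different in every book, it can help to survey a book before writing any search. The following Python sketch counts which classes are used by paragraphs that contain nothing but a number; the helper name `page_number_classes` and the sample markup are made up for illustration.

```python
import re
from collections import Counter

# Paragraphs whose entire content is a number are likely page numbers.
NUMBER_ONLY_PARAGRAPH = re.compile(r'<p class="([^"]+)">\s*[0-9]+\s*</p>')


def page_number_classes(html):
    """Hypothetical helper: count classes of number-only paragraphs."""
    return Counter(NUMBER_ONLY_PARAGRAPH.findall(html))


sample = ('<p class="calibre2">17</p>\n'
          '<p class="calibre5">Some body text from 1984.</p>\n'
          '<p class="calibre2">18</p>\n')
print(page_number_classes(sample).most_common())
```

The class that dominates the count is the one to put into the search for that particular book.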

Quote:

Originally Posted by theducks

And all of the above in the same book (OCR of scan)

[...]

I remove all Page Header types (Section/Title or Author) with a page number first (this is more than 1 template, as there are right/left side variations).

Headers/Footers in the actual text? Ouch. I haven't run across that one in quite a few years. What tools are being used to create that? I know Finereader does a pretty great job at ignoring Headers/Footers, and never exporting them in the first place.

theducks · 04-27-2016, 05:24 PM

Quote:

Originally Posted by Tex2002ans

Yep! And this is why you should try to normalize/clean the code as much as possible FIRST.

For example, here is some hideous code right out of an InDesign EPUB:

First thing I do is go through the code and strip it down to this:

and then it makes it much easier to do later fixes.

Diap's Editing Toolbag is great for cleaning up code:

https://www.mobileread.com/forums/sho....php?p=2980740

It is also great for helping get rid of a ton of the useless classes, or changing certain tags into other tags.

Each book is different, so you can't just have a big list of "Regexes to clean page numbers" that you can run on Book A + Book B + [...] + Book Z.

And with Calibre conversion code on top of this... the calibre# classes are completely different in each EPUB:

calibre2 in Book A might be the page numbers
calibre2 in Book B might be italics
[...]
calibre2 in Book Z might be headings

Headers/Footers in the actual text? Ouch. I haven't run across that one in quite a few years. What tools are being used to create that? I know Finereader does a pretty great job at ignoring Headers/Footers, and never exporting them in the first place.

Paperport, the FREE OCR that came with my scanner. What you scan is what it tries to OCR. 2-column source is a pain. Lucky me, I rarely see it.
It is for personal use, so I am not dropping big $ on a better OCR that would get little use.

Tex2002ans · 04-27-2016, 06:05 PM

Quote:

Originally Posted by theducks

Paperport, the FREE OCR that came with my scanner. What you scan is what it tries to OCR. 2-column source is a pain. Lucky me, I rarely see it.

Side Note: Hmmmmm... I have been writing a Scan Tailor tutorial. Maybe I could toss in some semi-related extra pre/postprocessing in the tutorial.

Depending on how much time you waste on having to clean up the headers/footers in the OCR, perhaps it might be best to preprocess those images (with Scan Tailor), and then crop the headers/footers right out, so that the OCR program can just focus on the body text:

Original Scan: Attachment 148279
Scan Tailor: Attachment 148280
Cropping: Attachment 148281

2 column source... I luckily rarely come across that either. Although I would probably do something similar (come up with Imagemagick way to split the pages in half). I may be contacting you via PM for some examples soon (or you could always contact me).

theducks · 04-27-2016, 06:10 PM

Quote:

Originally Posted by Tex2002ans

Side Note: Hmmmmm... I have been writing a Scan Tailor tutorial. Maybe I could toss in some semi-related extra pre/postprocessing in the tutorial.

Depending on how much time you waste on having to clean up the headers/footers in the OCR, perhaps it might be best to preprocess those images (with Scan Tailor), and then crop the headers/footers right out, so that the OCR program can just focus on the body text:

Original Scan: Attachment 148279
Scan Tailor: Attachment 148280
Cropping: Attachment 148281

2 column source... I luckily rarely come across that either. Although I would probably do something similar (come up with Imagemagick way to split the pages in half). I may be contacting you via PM for some examples soon (or you could always contact me).

Old Analog magazines are fun.

It is almost always magazines that have the 2-column problem.

eschwartz · 04-27-2016, 06:10 PM

Tex2002ans,

Sigil uses the PCRE library, whereas calibre uses Matthew Barnett's enhanced python regex module.

The difference is that PCRE supports a couple extensions the python module doesn't... but for the most part they provide the same features.

(You cannot capitalize captured text in calibre regex, but you can use a function replace instead. There are always multiple ways to fix the same problem.)
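A function replace that changes the case of captured text might look roughly like this. It is a sketch under assumptions: the `replace(...)` signature follows my reading of calibre's function mode documentation and should be verified there, the `<h2>` heading pattern is invented for the example, and the call is emulated with Python's standard `re.sub` so it can be tried outside calibre.

```python
import re


def replace(match, number, file_name, metadata, dictionaries, data,
            functions, *args, **kwargs):
    # Title-case the captured heading text; the tags are put back unchanged.
    return '<h2>' + match.group(1).title() + '</h2>'


HEADING = re.compile(r'<h2>([^<]+)</h2>')
sample = '<h2>erstes kapitel</h2>'
result = HEADING.sub(
    lambda m: replace(m, 1, 'sample.html', None, None, None, None), sample)
print(result)
```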

chaot · 04-28-2016, 01:29 PM

@Tex2002ans: Example # 5 works fine with Search: *SPACE*\[[0-9]+\]
It is strange: I knew that, it just had not occurred to me yesterday.
Are these the unmistakable signs!?

Some of you know that my access to the Internet is very limited, and for days or weeks nonexistent. During those times I read the very books that I optimize with your help.

Now I am taking time out again!

Please, don't get off the track too much. I have to read all that stuff and then understand it, you know!? And please, not so many foreign words, technical terms etc., and don't forget the samples, photos ... well, that's an old story. Some of you already do this very well. Names are not mentioned.

