MobileRead Forums - View Single Post - Delete paragraphs in scanned books (S & R with regexes)

theducks · 04-27-2016, 05:24 PM

Quote:

Originally Posted by Tex2002ans

Yep! And this is why you should try to normalize/clean the code as much as possible FIRST.

For example, here is some hideous code right out of an InDesign EPUB:

First thing I do is go through the code and strip it down to this:

and then it makes it much easier to do later fixes.

Diap's Editing Toolbag is great for cleaning up code:

https://www.mobileread.com/forums/sho....php?p=2980740

It is also great for helping get rid of a ton of the useless classes (<span class="no-style-override-5">), or changing certain tags into other tags (<span class="no-style-override-4"> -> <i>).
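
As a rough illustration of that kind of cleanup (not how Diap's Editing Toolbag itself works), here is a minimal Python sketch. The helper names `convert_span` and `unwrap_span` are made up for this example, and it assumes the spans are not nested inside one another and carry only a single `class` attribute, which will not hold for every book.

```python
import re


def _span_pattern(class_name: str) -> re.Pattern:
    # Matches one non-nested <span class="...">...</span> with exactly this class.
    return re.compile(
        r'<span class="' + re.escape(class_name) + r'">(.*?)</span>',
        re.DOTALL,
    )


def convert_span(html: str, class_name: str, new_tag: str) -> str:
    """Replace <span class="class_name">text</span> with <new_tag>text</new_tag>."""
    return _span_pattern(class_name).sub(
        lambda m: f"<{new_tag}>{m.group(1)}</{new_tag}>", html
    )


def unwrap_span(html: str, class_name: str) -> str:
    """Drop the span tags for a useless class but keep the text inside."""
    return _span_pattern(class_name).sub(lambda m: m.group(1), html)


sample = (
    '<p><span class="no-style-override-5">Plain text and </span>'
    '<span class="no-style-override-4">italic text</span></p>'
)

# Assumption for this sample only: override-4 means italics, override-5 is noise.
cleaned = convert_span(sample, "no-style-override-4", "i")
cleaned = unwrap_span(cleaned, "no-style-override-5")
print(cleaned)  # <p>Plain text and <i>italic text</i></p>
```

Which class means what has to be checked by eye for each book before running anything like this.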

Each book is different, so you can't just have a big list of "Regexes to clean page numbers" that you can run on Book A + Book B + [...] + Book Z.

And with Calibre conversion code on top of this... the calibre# classes are completely different in each EPUB:

calibre2 in Book A might be the page numbers
calibre2 in Book B might be italics
[...]
calibre2 in Book Z might be headings
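
One way to see what a class is actually doing in a given book, before writing any regex against it, is to list every class with a count and a sample of the text it wraps. The sketch below is only an assumption about how one might do that in Python; `survey_classes` and the two sample strings are hypothetical, and the regex is a quick survey tool, not a real HTML parser.

```python
import re
from collections import Counter


def survey_classes(html: str) -> dict:
    """Map each class name to (count, tag of first use, first text snippet)."""
    counts = Counter()
    first_seen = {}
    # Opening tag with a class attribute, followed by text up to the next tag.
    for m in re.finditer(r'<(\w+)[^>]*\bclass="([^"]+)"[^>]*>([^<]*)', html):
        tag, classes, text = m.groups()
        for cls in classes.split():
            counts[cls] += 1
            first_seen.setdefault(cls, (tag, text.strip()[:40]))
    return {cls: (counts[cls], *first_seen[cls]) for cls in counts}


# Made-up fragments: the same class name plays a different role in each book.
book_a = '<p class="calibre1">Text</p><span class="calibre2">123</span>'
book_b = '<p class="calibre1">Some <span class="calibre2">emphasis</span></p>'

for name, html in (("Book A", book_a), ("Book B", book_b)):
    for cls, (count, tag, snippet) in survey_classes(html).items():
        print(f"{name}: {cls} x{count} <{tag}> {snippet!r}")
```

In this made-up pair, calibre2 wraps what looks like a page number in one book and emphasized text in the other, so a regex written for Book A would damage Book B.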

Headers/Footers in the actual text? Ouch. I haven't run across that one in quite a few years. What tools are being used to create that? I know FineReader does a pretty great job at ignoring Headers/Footers, and never exporting them in the first place.

PaperPort, the FREE OCR that came with my scanner. What you scan is what it tries to OCR. Two-column source is a pain. Lucky me, I rarely see it.
This is for personal use, so I am not dropping big $ on a better OCR that would only get occasional use.