Quote:
Originally Posted by Tex2002ans
Yep! And this is why you should try to normalize/clean the code as much as possible FIRST.
For example, here is some hideous code right out of an InDesign EPUB:
First thing I do is go through the code and strip it down to this:
and then it makes it much easier to do later fixes.
Diap's Editing Toolbag is great for cleaning up code:
https://www.mobileread.com/forums/sho....php?p=2980740
It is also great for helping get rid of a ton of the useless classes (<span class="no-style-override-5">), or changing certain tags into other tags (<span class="no-style-override-4"> -> <i>).
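As a rough illustration of that kind of cleanup, here is a minimal Python sketch. It is not how Diap's Editing Toolbag works internally; the helper names `convert_span` and `unwrap_span` are made up for this example, and it assumes the spans are not nested inside each other and carry only a single `class` attribute.

```python
import re


def _span_pattern(class_name: str) -> re.Pattern:
    # Matches <span class="NAME">...</span>, non-greedy.
    # Assumption: spans with this class are never nested in each other.
    return re.compile(
        r'<span class="' + re.escape(class_name) + r'">(.*?)</span>',
        re.DOTALL,
    )


def convert_span(html: str, class_name: str, new_tag: str) -> str:
    """Turn <span class="class_name">text</span> into <new_tag>text</new_tag>."""
    return _span_pattern(class_name).sub(
        lambda m: f"<{new_tag}>{m.group(1)}</{new_tag}>", html
    )


def unwrap_span(html: str, class_name: str) -> str:
    """Remove a useless span but keep the text inside it."""
    return _span_pattern(class_name).sub(lambda m: m.group(1), html)


sample = (
    '<p>A <span class="no-style-override-4">title</span> and '
    '<span class="no-style-override-5">plain</span> text.</p>'
)
cleaned = convert_span(sample, "no-style-override-4", "i")
cleaned = unwrap_span(cleaned, "no-style-override-5")
print(cleaned)
```

For anything with nested or multi-attribute tags, a real HTML parser is a safer choice than a regex.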
Each book is different, so you can't just have a big list of "Regexes to clean page numbers" that you can run on Book A + Book B + [...] + Book Z.
And with Calibre conversion code on top of this... the calibre# classes are completely different in each EPUB:
- calibre2 in Book A might be the page numbers
- calibre2 in Book B might be italics
- [...]
- calibre2 in Book Z might be headings
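One way to picture this: the same class name needs a different rule in every book, so the rules have to be written per book after looking at the code. The sketch below is only an illustration; the book keys, the `BOOK_RULES` table, and the `apply_rules` helper are all hypothetical, and it assumes each tagged element has a single `class` attribute and is not nested inside an element of the same tag.

```python
import re

# Hypothetical per-book rules: what "calibre2" means has to be checked by hand.
BOOK_RULES = {
    "book_a": {"calibre2": "drop"},  # page numbers: delete the whole element
    "book_b": {"calibre2": "i"},     # italics: rename the tag to <i>
    "book_z": {"calibre2": "h2"},    # headings: rename the tag to <h2>
}


def apply_rules(html: str, rules: dict) -> str:
    """Apply one book's class -> action rules to a chunk of HTML."""
    for class_name, action in rules.items():
        # Matches <tag class="NAME">...</tag> for any tag name.
        pattern = re.compile(
            r'<(\w+) class="' + re.escape(class_name) + r'">(.*?)</\1>',
            re.DOTALL,
        )
        if action == "drop":
            html = pattern.sub("", html)
        else:
            html = pattern.sub(
                lambda m, tag=action: f"<{tag}>{m.group(2)}</{tag}>", html
            )
    return html


page = '<p>Text<span class="calibre2">12</span> more</p>'
print(apply_rules(page, BOOK_RULES["book_a"]))  # page number removed
print(apply_rules(page, BOOK_RULES["book_b"]))  # same class, now italics
```

Running the Book A rules on Book B would silently delete every italic phrase, which is why one shared regex list does not work.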
Headers/Footers in the actual text? Ouch. I haven't run across that one in quite a few years. What tools are being used to create that? I know FineReader does a pretty great job at ignoring Headers/Footers and never exports them in the first place.
|
Paperport, the FREE OCR that came with my scanner. Whatever you scan is what it
tries to OCR. Two-column source is a pain. Lucky me, I rarely see it.
Personal use, so I am not dropping big $ on a better OCR that would only get small-time usage.