04-27-2016, 05:24 PM   #9
theducks
Quote:
Originally Posted by Tex2002ans
Yep! And this is why you should try to normalize/clean the code as much as possible FIRST.

For example, here is some hideous code right out of an InDesign EPUB:



First thing I do is go through the code and strip it down to this:



and then it makes it much easier to do later fixes.

Diap's Editing Toolbag is great for cleaning up code:

https://www.mobileread.com/forums/sho....php?p=2980740

It is also great for helping get rid of a ton of the useless classes (<span class="no-style-override-5">), or changing certain tags into other tags (<span class="no-style-override-4"> -> <i>).
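As a rough illustration of that kind of cleanup (not how Diap's Editing Toolbag does it internally), here is a minimal Python sketch. The two class names are just the examples mentioned above, `clean_spans` is a made-up helper name, and it assumes the spans are not nested inside each other; real InDesign output would need checking by hand first.

```python
import re

# Class names taken from the examples above; they will differ in every book.
USELESS_CLASS = "no-style-override-5"
ITALIC_CLASS = "no-style-override-4"


def clean_spans(html: str) -> str:
    """Turn the italic span into <i> and unwrap the useless span.

    Assumption: neither span contains another <span>, so a lazy
    match up to the next </span> closes the right element.
    """
    italic = rf'<span class="{re.escape(ITALIC_CLASS)}">(.*?)</span>'
    useless = rf'<span class="{re.escape(USELESS_CLASS)}">(.*?)</span>'
    html = re.sub(italic, r"<i>\1</i>", html, flags=re.DOTALL)
    html = re.sub(useless, r"\1", html, flags=re.DOTALL)
    return html


sample = (
    '<p><span class="no-style-override-5">Hello</span> '
    '<span class="no-style-override-4">world</span></p>'
)
cleaned = clean_spans(sample)
print(cleaned)
```

If the spans can be nested, a regex like this will pair the wrong closing tag, so an HTML parser would be the safer route.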

Each book is different, so you can't just have a big list of "Regexes to clean page numbers" that you can run on Book A + Book B + [...] + Book Z.

And with Calibre conversion code on top of this... the calibre# classes are completely different in each EPUB:
  • calibre2 in Book A might be the page numbers
  • calibre2 in Book B might be italics
  • [...]
  • calibre2 in Book Z might be headings
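Since the same class name can mean different things in different books, one way to check before writing any regex is to list each class with a count and a couple of text samples. This is only a sketch: `class_samples` is a hypothetical helper, the HTML snippet is invented for the example, and the simple regex assumes double-quoted `class` attributes.

```python
import re
from collections import Counter


def class_samples(html: str, per_class: int = 2):
    """Count every class in the HTML and keep a few (tag, text) samples of each."""
    counts = Counter()
    samples = {}
    # Opening tag with a class attribute, plus the text that directly follows it.
    pattern = r'<(\w+)[^>]*\bclass="([^"]+)"[^>]*>([^<]*)'
    for match in re.finditer(pattern, html):
        tag, classes, text = match.groups()
        for cls in classes.split():
            counts[cls] += 1
            kept = samples.setdefault(cls, [])
            if text.strip() and len(kept) < per_class:
                kept.append((tag, text.strip()[:40]))
    return counts, samples


# Invented snippet: here calibre2 happens to hold page numbers.
book_a = (
    '<p class="calibre2">12</p>'
    '<p class="calibre1">Some text</p>'
    '<p class="calibre2">13</p>'
)
counts, samples = class_samples(book_a)
for cls, n in counts.most_common():
    print(cls, n, samples.get(cls, []))
```

Running this on each book separately shows what `calibre2` actually holds in that particular EPUB before anything gets deleted.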



Headers/Footers in the actual text? Ouch. I haven't run across that one in quite a few years. What tools are being used to create that? I know Finereader does a pretty great job at ignoring Headers/Footers, and never exporting them in the first place.
PaperPort, the FREE OCR that came with my scanner. Whatever you scan is what it tries to OCR. Two-column source is a pain. Lucky me, I rarely see it.
This is for personal use, so I am not dropping big $ on a better OCR that would only get occasional use.