04-27-2016, 04:23 PM   #8
Tex2002ans
Quote:
Originally Posted by theducks View Post
IMHO what is also important is the ORDER in which you fix them. If you don't get it right, the next fix (or join) will be more difficult.
Yep! And this is why you should try to normalize/clean the code as much as possible FIRST.

For example, here is some hideous code right out of an InDesign EPUB:

Quote:
<p class="body-text" xml:lang="en-us"><span class="no-style-override-5">The point is, as we can readily see, the ability to</span> <span class="no-style-override-4">foresee</span> <span class="no-style-override-5">an event is not at all equivalent to</span> <span class="no-style-override-4">agreeing</span> <span class="no-style-override-5">to it. Yes, I can full well</span> <span class="no-style-override-4">predict</span> <span class="no-style-override-5">that if I move to the South Bronx, I’ll likely be victimized by street crime. But this is not at</span> <span class="no-style-override-4">all</span> <span class="no-style-override-5">the same thing as</span> <span class="no-style-override-4">acquiescing</span> <span class="no-style-override-5">in such nefarious activities. Yet, according to the “libertarian” argument we are considering, the two are indistinguishable.</span></p>
First thing I do is go through the code and strip it down to this:

Quote:
<p>The point is, as we can readily see, the ability to <i>foresee</i> an event is not at all equivalent to <i>agreeing</i> to it. Yes, I can full well <i>predict</i> that if I move to the South Bronx, I’ll likely be victimized by street crime. But this is not at <i>all</i> the same thing as <i>acquiescing</i> in such nefarious activities. Yet, according to the “libertarian” argument we are considering, the two are indistinguishable.</p>
That makes later fixes much easier.

Diap's Editing Toolbag is great for cleaning up code:

https://www.mobileread.com/forums/sho....php?p=2980740

It is also great for helping get rid of a ton of the useless classes (<span class="no-style-override-5">), or changing certain tags into other tags (<span class="no-style-override-4"> -> <i>).
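
If you would rather script that cleanup than click through it, here is a rough Python sketch of the same two steps on the InDesign sample above. This is not how the Toolbag works internally; the function names and the `SPAN_RULES` table are made up for illustration, the class names only apply to this one book, and the regex assumes the spans are not nested.

```python
import re

# Hypothetical rules for THIS book only: class name -> replacement tag,
# or None to drop the <span> wrapper and keep just its text.
SPAN_RULES = {
    "no-style-override-4": "i",   # italics in this particular EPUB
    "no-style-override-5": None,  # useless wrapper around plain text
}

# Matches a single, non-nested <span class="...">...</span>.
SPAN_RE = re.compile(r'<span class="([^"]+)">(.*?)</span>', re.DOTALL)


def clean_spans(html, rules):
    """Unwrap or retag spans whose class appears in `rules`."""
    def replace(match):
        cls, inner = match.group(1), match.group(2)
        if cls not in rules:
            return match.group(0)  # unknown class: leave it alone
        tag = rules[cls]
        return inner if tag is None else f"<{tag}>{inner}</{tag}>"
    return SPAN_RE.sub(replace, html)


def strip_p_attributes(html):
    """Reduce every <p ...> to a bare <p>, as in the example above."""
    return re.sub(r"<p\b[^>]*>", "<p>", html)


sample = (
    '<p class="body-text" xml:lang="en-us">'
    '<span class="no-style-override-5">the ability to</span> '
    '<span class="no-style-override-4">foresee</span> '
    '<span class="no-style-override-5">an event</span></p>'
)
print(strip_p_attributes(clean_spans(sample, SPAN_RULES)))
# -> <p>the ability to <i>foresee</i> an event</p>
```

Stripping every attribute off `<p>` is only safe when none of the paragraph classes carry meaning, so check the stylesheet before running something like this on a whole book.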

Each book is different, so you can't just have a big list of "Regexes to clean page numbers" that you can run on Book A + Book B + [...] + Book Z.

And with Calibre conversion code on top of this... the calibre# classes are completely different in each EPUB:
  • calibre2 in Book A might be the page numbers
  • calibre2 in Book B might be italics
  • [...]
  • calibre2 in Book Z might be headings
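
Since the class names mean nothing by themselves, one way to approach a new book is to list which classes it actually uses before writing any regex. A minimal sketch, assuming simple double-quoted `class="..."` attributes; `class_inventory` and the two sample strings are made up for illustration:

```python
import re
from collections import Counter

# Captures the tag name and the class attribute of an opening tag.
CLASS_RE = re.compile(r'<(\w+)[^>]*\bclass="([^"]+)"[^>]*>')


def class_inventory(html):
    """Count how often each (tag, class) pair appears in the markup."""
    counts = Counter()
    for tag, classes in CLASS_RE.findall(html):
        for cls in classes.split():
            counts[(tag, cls)] += 1
    return counts


# Same class name, different meaning in two hypothetical books.
book_a = '<p class="calibre1">Text <span class="calibre2">12</span></p>'
book_b = '<p class="calibre1">A <span class="calibre2">word</span> or <span class="calibre2">two</span></p>'

for name, html in (("Book A", book_a), ("Book B", book_b)):
    print(name, class_inventory(html).most_common())
```

The counts only tell you where to look. You still have to open a few instances of each class and decide, per book, whether it is a page number, italics, or a heading.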

Quote:
Originally Posted by theducks View Post
And all of the above in the same book (OCR of scan)

[...]

I remove all Page Header types (Section/Title or Author) with a page number first (this takes more than one template, as there are right/left side variations).
Headers/Footers in the actual text? Ouch. I haven't run across that one in quite a few years. What tools are being used to create that? I know FineReader does a pretty great job of ignoring Headers/Footers and never exporting them in the first place.