04-16-2020, 09:31 PM   #19
Tex2002ans
Wizard
Posts: 2,306
Karma: 13057279
Join Date: Jul 2012
Device: Kobo Forma, Nook
Quote:
Originally Posted by Hitch
Are you BORED, snookums?
Just seeing if the theory plays out in other materials.

Could take a teensy weensy break from the madness to write up another beast.

You know I'm stuck over here in my little Economics/History/Non-Fiction bubble, and I love my footnotes! Once I see that word, I begin foaming at the mouth!

Quote:
Originally Posted by droopy
Hi Tex,
PM sent.
Thanks. I quickly scanned through droopy's 3 PDFs.

The PDFs don't actually use superscript numbers on the footnotes themselves.

The actual text uses the form:

Code:
Example sentence.<sup>1</sup>
but the footnotes at the bottom then use:

Code:
1. Example footnote.
separated by a blank gap between body-text/footnotes.
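
If you wanted to check that layout yourself after OCR, here is a rough Python sketch. It assumes each page has been exported as plain text with the markers kept as `<sup>N</sup>`, and that the footnote block is whatever sits after the last blank gap on the page. The names `split_page` and `unmatched_markers` are made up for this example; nothing here comes from Finereader or EPUB Tools.

```python
import re

MARKER = re.compile(r"<sup>(\d+)</sup>")  # in-text marker, e.g. <sup>1</sup>
ENTRY = re.compile(r"^(\d+)\.\s+(.*)$")   # footnote entry, e.g. "1. Example footnote."


def split_page(page_text):
    """Split one page into (body, footnotes) at the last blank-line gap.

    The trailing block counts as footnotes only if every line in it
    looks like a numbered entry; otherwise the whole page is body text.
    """
    body, gap, tail = page_text.rstrip().rpartition("\n\n")
    entries = [ENTRY.match(line.strip()) for line in tail.splitlines()]
    if not gap or not entries or not all(entries):
        return page_text.rstrip(), {}
    return body, {int(m.group(1)): m.group(2) for m in entries}


def unmatched_markers(body, footnotes):
    """Return marker numbers in the body that have no footnote entry."""
    return [int(n) for n in MARKER.findall(body) if int(n) not in footnotes]


page = (
    "Example sentence.<sup>1</sup> Another sentence.<sup>2</sup>\n"
    "\n"
    "1. Example footnote."
)
body, footnotes = split_page(page)
print(footnotes)
print(unmatched_markers(body, footnotes))
```

Footnotes that wrap onto a second line would fail the "every line is a numbered entry" check, so treat this as a way to flag pages for a manual look, not as a fix.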

And like I said earlier, Finereader does an okay job of telling body text and footnotes apart. In this specific case, it detected most footnotes okay (definitely looks better than Word's PDF Import in that regard).

* * *

And here is roughly the rest of the PM I sent droopy:

I generated 3 types of files:

1. [Finereader] - This is a DOCX generated straight from Finereader.

2. [Toxaris] - This is the [Finereader] DOCX, which I ran through Toxaris's fantastic "EPUB Tools".

Note: It tries its best to clean up a bunch of Finereader's hidden junk, and does some basic cleanup like combining broken paragraphs, etc.

Text highlighted in red marks paragraphs that may have been broken/merged incorrectly, so you can look at them more closely and fix them manually if needed.

3. EPUB - This was generated straight from EPUB Tools using the [Toxaris] DOCX.

Because this was all OCRed (and PDF sucks + the source files weren't the greatest), there ARE going to be the usual OCR issues creeping in there:
  • Text may be wrong (OCR is "99.9% accurate")
    • Some of these scans weren't the greatest either (crooked, still see page edges, etc.), so this introduces more error.
  • Formatting may be wrong
    • Italics missing, headings may not be headings, etc.
  • While many footnotes were detected properly, many weren't.
    • On top of that, the problem with PDF->DOCX "automated footnotes" is... the numbers may now be thrown way off. If 1-4 + 6-10 were detected fine... Word will only think there are "9 actual footnotes". 5 will be floating in the text, and 6-10 will now be off by 1.
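
To make that off-by-one concrete, here is a tiny Python sketch of the arithmetic. It assumes Word simply numbers whatever footnotes were detected sequentially from 1; `renumber_detected` is a made-up helper for illustration, not anything Word or Finereader exposes.

```python
def renumber_detected(detected):
    """Map each detected source footnote number to the number it would get
    if the detected footnotes are renumbered sequentially from 1."""
    return {src: new for new, src in enumerate(sorted(detected), start=1)}


source_numbers = range(1, 11)            # footnotes 1-10 in the PDF
detected = [1, 2, 3, 4, 6, 7, 8, 9, 10]  # footnote 5 was missed

mapping = renumber_detected(detected)
missed = [n for n in source_numbers if n not in mapping]
shifted = {src: new for src, new in mapping.items() if src != new}

print(len(mapping))  # 9
print(missed)        # [5]
print(shifted)       # {6: 5, 7: 6, 8: 7, 9: 8, 10: 9}
```

So one missed footnote shifts every footnote after it, which is why the numbers need a manual check against the original.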

So it's up to you... you could:
  • Do your cleanup in the EPUB
  • or work through that [Toxaris] DOCX and try to do your cleanup directly in Word.

But as has been discussed on MobileRead many, many times... PDFs are awful as input formats.

If you want perfectly clean ebooks, you have to get in there and do all the manual corrections. There just ain't no way around it.

Last edited by Tex2002ans; 04-16-2020 at 09:54 PM.