Old 11-11-2023, 04:26 PM   #12
Tex2002ans
Wizard
Tex2002ans ought to be getting tired of karma fortunes by now.
 
Posts: 2,306
Karma: 13057279
Join Date: Jul 2012
Device: Kobo Forma, Nook
Quote:
Originally Posted by jackie_w View Post
I'm not sure I agree with these generalisations. Admittedly it is a long time (10+ years) since I made a concerted effort to "convert" my own (relatively few) PDFs to good HTML but I was finally able to do it without losing italics/bold, headings and scenebreaks. Maybe I was just lucky but the text layer in my PDFs was excellent, i.e. no indication that the text was the result of auto-OCR.
If you're lucky enough to be working with fresh, digital documents... then they were perhaps spit directly out of Word/InDesign with the proper checkboxes pushed!

Nowadays, there's a bigger push for:
  • Tagged PDFs

which are a HUGE step in the right direction.

(This attaches important information—Heading/Paragraph + Bold/Italic + Headers/Footers/PageNumbers—into the PDF too, so tools like Text-to-Speech can step through the document and navigate correctly.)
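
(For example, and this is just a rough sketch on my part, using the pikepdf library and a made-up filename, not "the" official method, you can at least check whether a PDF claims to be Tagged before picking a toolchain:)

Code:
# Rough sketch: does this PDF claim to be "Tagged"?
# Assumes the pikepdf library (pip install pikepdf); "book.pdf" is a placeholder.
import pikepdf

with pikepdf.open("book.pdf") as pdf:
    mark_info = pdf.Root.get("/MarkInfo")
    is_tagged = bool(mark_info) and bool(mark_info.get("/Marked", False))
    print("Tagged PDF?", is_tagged)

(If that prints False, there's no structure tree to lean on at all, and you're back to pure layout-guessing.)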

Theoretically, I suspect that makes this sort of PDF->EPUB conversion easier... but you'd still probably be better off going through a known toolchain, instead of trying to unravel who-knows-what-unique-garbage-is-buried-in-that-PDF.

- - -

But again, PDF is a final OUTPUT format... it's an absolutely trash INPUT format—so should only be used as a very last resort.

And, if at all possible, it's best to go back to the original source document (DOCX/ODT, RTF, TXT, ...) and convert from there.

- - -

Quote:
Originally Posted by jackie_w View Post
Most PDFs I converted introduced some kind of new challenge that I hadn't seen previously.
Yep, each PDF would be like a completely new black box. :P

Something similar happens when people try to do EPUB->EPUB, unraveling all the spaghetti of HTML+CSS someone else created.

Most of the time, it's faster/easier to just go back to the drawing board and restart your conversion from scratch.

You can see some of that described in this post, where I explained to RbnJrg how I'd handle "surgically correcting" 20 ebooks in the same series.

- - -

Side Note: Since 2021, KevinH has built many of those theoretical features into Sigil!

They're advanced cleanup tools, but they give you EXTREMELY powerful ways to mass-fix HTML+CSS much more quickly.
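
To give a flavor of what "mass fix" means, here's a hypothetical standalone sketch (Python + regex, NOT Sigil's actual plugin API, and the class name is made up): collapse one flavor of junk markup across every file in an unzipped EPUB.

Code:
# Hypothetical "mass fix" pass; NOT Sigil's plugin API.
# Assumes the EPUB's XHTML was unzipped into ./OEBPS/; the class name is made up.
import re
from pathlib import Path

# e.g. turn <span class="calibre-italic">...</span> back into real <i>...</i>
pattern = re.compile(r'<span class="calibre-italic">(.*?)</span>', re.DOTALL)

for xhtml in Path("OEBPS").glob("*.xhtml"):
    text = xhtml.read_text(encoding="utf-8")
    fixed = pattern.sub(r"<i>\1</i>", text)
    if fixed != text:
        xhtml.write_text(fixed, encoding="utf-8")
        print("cleaned:", xhtml.name)

(Regex on HTML is fine for narrow, known patterns like that one; Sigil's tools just let you do the same sort of sweep across the whole book without leaving the editor.)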

- - -

Quote:
Originally Posted by jackie_w View Post
I didn't waste my time even trying to convert multi-column or non-fiction documents with many tables or footnotes. Too hard! Fiction novels only.
lol. And that's almost always where I dabble—Non-Fiction + all the hard stuff.

Right now, I'm working on an ebook with 1400 Endnotes!*

And the 2nd ebook has ~190 Figures! (Ugh, that amount of alt text generation... kill me now... lol.)

- - -

* Side Note: Speaking of Endnotes...

Does anybody here know how to wrestle Microsoft Word into outputting:
  • Endnotes per chapter.
  • An Endnotes section being placed NOT at the very very end of the document, but near-the-end.
    • So imagine it'll be at page 290/300, with 10 pages of author/publisher backmatter afterwards.

- - -

Quote:
Originally Posted by jackie_w View Post
I won't bore everyone with great detail but I found that the best method for Step 1 of the whole process was to find a utility which would extract the PDF text as an XML tree. This had the benefit of retaining a lot more info about each text snippet extracted, [...] identify correct reading order, paragraph starts, scenebreaks and those unwanted PDF headers/footers.
Hmmm... isn't that already what Calibre is doing when it's using its heuristics?

Then, you just have a spaghetti nest of Calibre-converted classes to mess with, but that would be infinitely easier than this custom PDF-exporting+parsing+converting stuff. (See the "surgical" thread/methods above.)
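
For the curious, here's roughly the per-snippet info that kind of XML extraction hands you. (A sketch using pdfminer.six, which may or may not be the utility jackie_w used; the filename is a placeholder.) Every text line comes with page coordinates + font, which is exactly what you need to guess headers/footers, paragraph starts, and scenebreaks:

Code:
# Sketch only: dump each text line with its position + font name.
# Assumes pdfminer.six (pip install pdfminer.six); "book.pdf" is a placeholder.
from pdfminer.high_level import extract_pages
from pdfminer.layout import LTChar, LTTextBox, LTTextLine

for page_num, page in enumerate(extract_pages("book.pdf"), start=1):
    for box in page:
        if not isinstance(box, LTTextBox):
            continue
        for line in box:
            if not isinstance(line, LTTextLine):
                continue
            chars = [c for c in line if isinstance(c, LTChar)]
            if not chars:
                continue
            x0, y0, x1, y1 = line.bbox
            print(page_num, round(y1), chars[0].fontname, line.get_text().strip())

From there, "same y-coordinate on every page" ≈ running header/footer, "bigger font + extra vertical gap" ≈ heading, and so on... which is basically what Calibre's heuristics are doing, just with you holding the dials.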

Quote:
Originally Posted by jackie_w View Post
P.S. There's no way I would offer this as a service-for-hire unless I could pick and choose the PDFs I was prepared to tackle. It was an interesting (and bloody-minded) personal project but not something I'd want to do on a regular basis.
lol. Of course. And then once you tell them how much $$$ your crazy method would take...

Instead, you can just:
  • Toss PDF right into Finereader.
    • It takes care of all that Formatting + Font Size + guessing for you.
  • It spits out relatively damn good DOCX/EPUB with consistent output.

And boom, with so much less work, you can now layer the book's unique quirks on top of THAT BASE document.

Easier to take that clean-but-not-quite-correct text and:
  • Manually correct the OCR errors.
  • + Layer your Headings / Chapter Breaks / Tables on top of that.

than to try wrangling all of that straight out of the raw PDF. (For that "layering" step, see the rough sketch at the very bottom of this post.)

Learn from me + Hitch—the pros—lol.

Trying to convert from PDF like that is a dark, dark path!
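
P.S. For that "layer your Headings / Chapter Breaks on top" step mentioned above, a hypothetical sketch (Python + regex again; the folder, filenames, and pattern are all made up, so adjust to whatever Finereader actually hands you): promote the obvious chapter-opener paragraphs to real headings.

Code:
# Hypothetical sketch: promote "Chapter N" paragraphs to real <h2> headings.
# Assumes Finereader-style XHTML output sitting in ./text/; everything here is a placeholder.
import re
from pathlib import Path

heading = re.compile(r"<p[^>]*>\s*(Chapter\s+\d+[^<]*?)\s*</p>", re.IGNORECASE)

for xhtml in Path("text").glob("*.xhtml"):
    src = xhtml.read_text(encoding="utf-8")
    out = heading.sub(r"<h2>\1</h2>", src)
    if out != src:
        xhtml.write_text(out, encoding="utf-8")
        print("headings added:", xhtml.name)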