Old 11-11-2023, 04:26 PM   #12
Tex2002ans
Wizard
Tex2002ans ought to be getting tired of karma fortunes by now.
 
Posts: 2,306
Karma: 13057279
Join Date: Jul 2012
Device: Kobo Forma, Nook
Quote:
Originally Posted by jackie_w View Post
I'm not sure I agree with these generalisations. Admittedly it is a long time (10+ years) since I made a concerted effort to "convert" my own (relatively few) PDFs to good HTML but I was finally able to do it without losing italics/bold, headings and scenebreaks. Maybe I was just lucky but the text layer in my PDFs was excellent, i.e. no indication that the text was the result of auto-OCR.
If you're lucky enough to be working with fresh, digital documents... then they were perhaps spit directly out of Word/InDesign with the proper checkboxes pushed!

Nowadays, there's a bigger push for:
  • Tagged PDFs

which are a HUGE step in the right direction.

(This attaches important information—Heading/Paragraph + Bold/Italic + Headers/Footers/PageNumbers—into the PDF too, so tools like Text-to-Speech can step through the document and navigate correctly.)
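
(For example, and this is just a rough sketch on my part, using the pikepdf library and a made-up filename, not "the" official method, you can at least check whether a PDF claims to be Tagged before picking a toolchain:)

Code:
# Rough sketch: does this PDF claim to be "Tagged"?
# Assumes the pikepdf library (pip install pikepdf); "book.pdf" is a placeholder.
import pikepdf

with pikepdf.open("book.pdf") as pdf:
    mark_info = pdf.Root.get("/MarkInfo")
    is_tagged = bool(mark_info) and bool(mark_info.get("/Marked", False))
    print("Tagged PDF?", is_tagged)

(If that prints False, there's no structure tree to lean on at all, and you're back to pure layout-guessing.)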

Theoretically, I suspect that makes this sort of PDF->EPUB conversion easier... but you'd still probably be better off going through a known toolchain, instead of trying to unravel who-knows-what-unique-garbage-is-buried-in-that-PDF.

- - -

But again, PDF is a final OUTPUT format... it's an absolutely trash INPUT format—so should only be used as a very last resort.

And, if at all possible, it's best to go back to the original source document (DOCX/ODT, RTF, TXT, ...) and convert from there.

- - -

Quote:
Originally Posted by jackie_w View Post
Most PDFs I converted introduced some kind of new challenge that I hadn't seen previously.
Yep, each PDF would be like a completely new black box. :P

Something similar happens when people try to do EPUB->EPUB, unraveling all the spaghetti of HTML+CSS someone else created.

Most of the time, it's faster/easier to just go back to the drawing board and restart your conversion from scratch.

You can see some of that described in this post, where I explained to RbnJrg how I'd handle "surgically correcting" 20 ebooks in the same series.

- - -

Side Note: Since 2021, KevinH has built many of those theoretical features into Sigil!

They're advanced cleanup tools, but they give you EXTREMELY powerful ways to mass-fix HTML+CSS much more quickly.
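
To give a flavor of what "mass fix" means, here's a hypothetical standalone sketch (Python + regex, NOT Sigil's actual plugin API, and the class name is made up): collapse one flavor of junk markup across every file in an unzipped EPUB.

Code:
# Hypothetical "mass fix" pass; NOT Sigil's plugin API.
# Assumes the EPUB's XHTML was unzipped into ./OEBPS/; the class name is made up.
import re
from pathlib import Path

# e.g. turn <span class="calibre-italic">...</span> back into real <i>...</i>
pattern = re.compile(r'<span class="calibre-italic">(.*?)</span>', re.DOTALL)

for xhtml in Path("OEBPS").glob("*.xhtml"):
    text = xhtml.read_text(encoding="utf-8")
    fixed = pattern.sub(r"<i>\1</i>", text)
    if fixed != text:
        xhtml.write_text(fixed, encoding="utf-8")
        print("cleaned:", xhtml.name)

(Regex on HTML is fine for narrow, known patterns like that one; Sigil's tools just let you do the same sort of sweep across the whole book without leaving the editor.)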

- - -

Quote:
Originally Posted by jackie_w View Post
I didn't waste my time even trying to convert multi-column or non-fiction documents with many tables or footnotes. Too hard! Fiction novels only.
lol. And that's almost always where I dabble—Non-Fiction + all the hard stuff.

Right now, I'm working on an ebook with 1400 Endnotes!*

And the 2nd ebook has ~190 Figures! (Ugh, that amount of alt text generation... kill me now... lol.)

- - -

* Side Note: Speaking of Endnotes...

Does anybody here know how to wrestle Microsoft Word into outputting:
  • Endnotes per chapter.
  • An Endnotes section being placed NOT at the very very end of the document, but near-the-end.
    • So imagine it'll be at page 290/300, with 10 pages of author/publisher backmatter afterwards.

- - -

Quote:
Originally Posted by jackie_w View Post
I won't bore everyone with great detail but I found that the best method for Step 1 of the whole process was to find a utility which would extract the PDF text as an XML tree. This had the benefit of retaining a lot more info about each text snippet extracted, [...] identify correct reading order, paragraph starts, scenebreaks and those unwanted PDF headers/footers.
Hmmm... isn't that already what Calibre is doing when it's using its heuristics?

Then, you just have a spaghetti nest of Calibre-converted classes to mess with, but that would be infinitely easier than this custom PDF-exporting+parsing+converting stuff. (See the "surgical" thread/methods above.)
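
For the curious, here's roughly the per-snippet info that kind of XML extraction hands you. (A sketch using pdfminer.six, which may or may not be the utility jackie_w used; the filename is a placeholder.) Every text line comes with page coordinates + font, which is exactly what you need to guess headers/footers, paragraph starts, and scenebreaks:

Code:
# Sketch only: dump each text line with its position + font name.
# Assumes pdfminer.six (pip install pdfminer.six); "book.pdf" is a placeholder.
from pdfminer.high_level import extract_pages
from pdfminer.layout import LTChar, LTTextBox, LTTextLine

for page_num, page in enumerate(extract_pages("book.pdf"), start=1):
    for box in page:
        if not isinstance(box, LTTextBox):
            continue
        for line in box:
            if not isinstance(line, LTTextLine):
                continue
            chars = [c for c in line if isinstance(c, LTChar)]
            if not chars:
                continue
            x0, y0, x1, y1 = line.bbox
            print(page_num, round(y1), chars[0].fontname, line.get_text().strip())

From there, "same y-coordinate on every page" ≈ running header/footer, "bigger font + extra vertical gap" ≈ heading, and so on... which is basically what Calibre's heuristics are doing, just with you holding the dials.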

Quote:
Originally Posted by jackie_w View Post
P.S. There's no way I would offer this as a service-for-hire unless I could pick and choose the PDFs I was prepared to tackle. It was an interesting (and bloody-minded) personal project but not something I'd want to do on a regular basis.
lol. Of course. And then once you tell them how much $$$ your crazy method would take...

Instead, you can just:
  • Toss PDF right into Finereader.
    • It takes care of all that Formatting + Font Size + guessing for you.
  • It spits out relatively damn good DOCX/EPUB with consistent output.

And boom, with so much less work, you can now layer the book's unique quirks on top of THAT BASE document.

Easier to take that clean-but-not-quite-correct text and:
  • Manually correct the OCR errors.
  • + Layer your Headings / Chapter Breaks / Tables on top of that.

than to try wrangling all of that straight out of the raw PDF. (For that "layering" step, see the rough sketch at the very bottom of this post.)

Learn from me + Hitch—the pros—lol.

Trying to convert from PDF like that is a dark, dark path!
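
P.S. For that "layer your Headings / Chapter Breaks on top" step mentioned above, a hypothetical sketch (Python + regex again; the folder, filenames, and pattern are all made up, so adjust to whatever Finereader actually hands you): promote the obvious chapter-opener paragraphs to real headings.

Code:
# Hypothetical sketch: promote "Chapter N" paragraphs to real <h2> headings.
# Assumes Finereader-style XHTML output sitting in ./text/; everything here is a placeholder.
import re
from pathlib import Path

heading = re.compile(r"<p[^>]*>\s*(Chapter\s+\d+[^<]*?)\s*</p>", re.IGNORECASE)

for xhtml in Path("text").glob("*.xhtml"):
    src = xhtml.read_text(encoding="utf-8")
    out = heading.sub(r"<h2>\1</h2>", src)
    if out != src:
        xhtml.write_text(out, encoding="utf-8")
        print("headings added:", xhtml.name)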