Quote:
Originally Posted by jackie_w
I'm not sure I agree with these generalisations. Admittedly it is a long time (10+ years) since I made a concerted effort to "convert" my own (relatively few) PDFs to good HTML but I was finally able to do it without losing italics/bold, headings and scenebreaks. Maybe I was just lucky but the text layer in my PDFs was excellent, i.e. no indication that the text was the result of auto-OCR.
|
If you're lucky enough to be working with fresh, digital documents... then they were perhaps spit directly out of Word/InDesign with the proper checkboxes pushed!
Nowadays, there's a bigger push for:
which are a HUGE step in the right direction.
(This attaches important information—Heading/Paragraph + Bold/Italic + Headers/Footers/PageNumbers—into the PDF too, so tools like Text-to-Speech can step through the document and navigate correctly.)
Theoretically, I suspect this sort of PDF->EPUB conversion would be easier... but you'd still probably be better off going through a known toolchain, instead of trying to unravel who-knows-what-unique-garbage-is-buried-in-that-PDF.
- - -
But again, PDF is a final OUTPUT format... it's an absolutely trash INPUT format, so it should only be used as a very last resort.
And, as always, if possible, it's best to go back to the original source document (DOCX/ODT, RTF, TXT, ...) and convert from there.
- - -
Quote:
Originally Posted by jackie_w
Most PDFs I converted introduced some kind of new challenge that I hadn't seen previously.
|
Yep, each PDF would be like a completely new black box. :P
Something similar happens when people try to do EPUB->EPUB, unraveling all the spaghetti of HTML+CSS someone created.
Most of the time, it's faster/easier to just go back to the drawing board and restart your conversion from scratch.
You can see some of that described in this post, where I explained to RbnJrg how I'd handle "surgically correcting" 20 ebooks in the same series:
- - -
Side Note: Since 2021, KevinH has implemented many of those theoretical features in Sigil!
They're advanced cleanup tools, but EXTREMELY powerful ways to mass fix HTML+CSS much more quickly.
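To give a rough idea of what a "mass fix" can look like (this is NOT how Sigil does it, just a minimal Python sketch), here's one way to remap converter-generated span classes to real tags. The class names (calibre3, calibre7) and the mapping are made up for illustration; every book's spaghetti is different, and the regex assumes the spans aren't nested.

```python
import re

# Hypothetical mapping: converter-generated class -> semantic tag.
# These class names are invented for illustration; real ones vary per book.
SPAN_CLASS_TO_TAG = {
    "calibre3": "i",
    "calibre7": "b",
}

# Matches a simple, non-nested <span class="...">...</span>.
SPAN_RE = re.compile(r'<span class="([^"]+)">(.*?)</span>', re.DOTALL)


def replace_span_classes(html, mapping=SPAN_CLASS_TO_TAG):
    """Swap <span class="X">...</span> for <tag>...</tag> when X is in mapping."""

    def swap(match):
        cls, inner = match.group(1), match.group(2)
        tag = mapping.get(cls)
        if tag is None:
            return match.group(0)  # unknown class: leave the span untouched
        return f"<{tag}>{inner}</{tag}>"

    return SPAN_RE.sub(swap, html)


sample = '<p class="calibre1">She <span class="calibre3">really</span> meant it.</p>'
print(replace_span_classes(sample))
```

Run the same function over every HTML file in the book and you've fixed the whole thing in one pass, instead of hand-editing chapter by chapter. For anything with nested spans, a real HTML parser would be the safer choice.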
- - -
Quote:
Originally Posted by jackie_w
I didn't waste my time even trying to convert multi-column or non-fiction documents with many tables or footnotes. Too hard! Fiction novels only. 
|
lol. And that's almost always where I dabble—Non-Fiction + all the hard stuff.
Right now, I'm working on an ebook with 1400 Endnotes!*
And the 2nd ebook has ~190 Figures! (Ugh, that amount of alt text generation... kill me now... lol.)
- - -
* Side Note: Speaking of Endnotes...
Does anybody here know how to wrestle Microsoft Word into outputting:
- Endnotes per chapter.
- An Endnotes section being placed NOT at the very very end of the document, but near-the-end.
- So imagine it'll be at page 290/300, with 10 pages of author/publisher backmatter afterwards
- - -
Quote:
Originally Posted by jackie_w
I won't bore everyone with great detail but I found that the best method for Step 1 of the whole process was to find a utility which would extract the PDF text as an XML tree. This had the benefit of retaining a lot more info about each text snippet extracted, [...] identify correct reading order, paragraph starts, scenebreaks and those unwanted PDF headers/footers.
|
Hmmm... isn't that already what Calibre is doing when it's using its heuristics?
Then, you just have a spaghetti nest of Calibre-converted classes to mess with, but that would be infinitely easier than this custom PDF-exporting+parsing+converting stuff. (See the "surgical" thread/methods above.)
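For anyone curious what that XML-tree approach might look like, here's a minimal Python sketch. Big assumptions: the page/text layout with top/left attributes is a hypothetical schema (not any specific tool's output), the pages are single-column, and a "running header/footer" is simply a line that shows up at the same vertical position on most pages.

```python
import re
import xml.etree.ElementTree as ET
from collections import Counter


def _key(node):
    # Replace digits so "12" and "13" (page numbers) compare as equal.
    text = (node.text or "").strip()
    return (node.get("top"), re.sub(r"\d+", "#", text))


def find_running_lines(xml_text, min_share=0.6):
    """Return (top, normalized_text) keys that recur on most pages."""
    pages = ET.fromstring(xml_text).findall("page")
    counts = Counter()
    for page in pages:
        # Count each key once per page, however often it appears there.
        counts.update({_key(node) for node in page.findall("text")})
    # Require at least 2 pages so a one-page file doesn't lose everything.
    threshold = max(2, min_share * len(pages))
    return {key for key, n in counts.items() if n >= threshold}


def body_text(xml_text, min_share=0.6):
    """Return text snippets in top-to-bottom order, minus running lines."""
    running = find_running_lines(xml_text, min_share)
    out = []
    for page in ET.fromstring(xml_text).findall("page"):
        # Naive reading order: top first, then left (single-column only).
        nodes = sorted(
            page.findall("text"),
            key=lambda n: (int(n.get("top")), int(n.get("left"))),
        )
        for node in nodes:
            text = (node.text or "").strip()
            if text and _key(node) not in running:
                out.append(text)
    return out


# Hypothetical sample in the assumed schema.
SAMPLE = """<doc>
  <page number="1">
    <text top="20" left="50">My Novel</text>
    <text top="100" left="50">It was a dark night.</text>
    <text top="700" left="300">1</text>
  </page>
  <page number="2">
    <text top="20" left="50">My Novel</text>
    <text top="100" left="50">The rain kept falling.</text>
    <text top="700" left="300">2</text>
  </page>
  <page number="3">
    <text top="20" left="50">My Novel</text>
    <text top="100" left="50">Nobody came.</text>
    <text top="700" left="300">3</text>
  </page>
</doc>"""

print(body_text(SAMPLE))
```

And that's only the headers/footers on a clean, single-column novel. Multi-column layouts, footnotes, and tables each need their own pile of heuristics, which is exactly why this gets painful fast.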
Quote:
Originally Posted by jackie_w
P.S. There's no way I would offer this as a service-for-hire unless I could pick and choose the PDFs I was prepared to tackle. It was an interesting (and bloody-minded) personal project but not something I'd want to do on a regular basis.
|
lol. Of course. And then once you tell them how much $$$ your crazy method would take...
Instead, you can just:
- Toss PDF right into Finereader.
- It takes care of all that Formatting + Font Size + guessing for you.
- It spits out relatively damn good DOCX/EPUB with consistent output.
And boom, with so much less work, now, you can layer the book's unique quirks on top of THAT BASE document.
Easier to take that clean-but-not-quite-correct text and:
- Manually correct the OCR errors.
- + Layer your Headings / Chapter Breaks / Tables on top of that.
than to:
- Create some custom PDF-conversion thing for every book.
- Spend hours manually mapping all the fonts/bold/italics/whatever.
- Disentangle every "treasure" of wrenches in that PDF that weren't found anywhere else...
- [...]
Learn from me + Hitch—the pros—lol.
Trying to convert from PDF like that is a dark, dark, path!