MobileRead Forums

crashnburn · 05-04-2013, 03:54 PM

To preserve this knowledge in its own topic, I am creating a fresh thread here. Please post replies and ideas relating to conversion in this thread.

Steps/ best practices for converting PDFs to ePUBs? Thoughts and Ideas

PS: I am posting additional details, thoughts, and questions relating to my post and your replies and suggestions. Please check my next reply and share any additional thoughts.

Quote:

Originally Posted by crashnburn

Is there a thread, location, or tutorial that outlines steps or best practices for converting PDFs to EPUBs?

I have some PDFs that I am reading and highlighting in GoodReader, and I'd like to convert them to EPUB so that I can read and highlight them in Marvin instead.

I am sure Faterson has expertise on this. I am wondering whether there is a thread, tutorial, or set of steps that is recommended reading.

Quote:

Originally Posted by Faterson

PS: If, despite the above warning, you decide to go ahead, I recommend not using Calibre for PDF conversion. I've had better results using the old Mobipocket Reader (killed by Amazon, as Stanza was). Here is the install file for Windows. The result of the automatic, one-click conversion will still be very poor, but that's just the way it is.

For optimal quality of conversion from PDF to EPUB, you need to sacrifice all those extra hours of manual work, and use top-quality OCR software such as FineReader. I create an HTML file in FineReader, then fine-tune that HTML file by manual coding (using the EditPlus plain-text editor for Windows) until its code is approved by W3C's validator. No fluff must be left inside the file -- the CSS must be minimalistic. Finally, I convert the HTML file to EPUB in Calibre, and that's it.
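As a rough illustration of the clean-up and conversion steps described above, here is a minimal Python sketch. It is not the poster's actual procedure. It assumes the OCR export carries presentational attributes such as `style` and `class` and wrapper tags such as `span` and `font`; that list is a guess for illustration, not a description of FineReader's real output. The helper names `strip_fluff` and `calibre_command` are hypothetical. `ebook-convert` is Calibre's command-line converter; the sketch only builds the command and does not run it.

```python
from html import escape
from html.parser import HTMLParser

# Assumed "fluff": presentational attributes and wrapper tags.
# This list is illustrative only, not taken from any real OCR export.
DROP_ATTRS = {"style", "class", "lang", "align"}
DROP_TAGS = {"span", "font"}


class FluffStripper(HTMLParser):
    """Rebuilds the markup while dropping the tags and attributes above.

    Comments and the doctype are not re-emitted, so add the doctype back
    by hand before running the result through a validator.
    """

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []

    def _render_attrs(self, attrs):
        return "".join(
            f' {name}="{escape(value or "", quote=True)}"'
            for name, value in attrs
            if name not in DROP_ATTRS
        )

    def handle_starttag(self, tag, attrs):
        if tag not in DROP_TAGS:
            self.out.append(f"<{tag}{self._render_attrs(attrs)}>")

    def handle_startendtag(self, tag, attrs):
        # Self-closing tags such as <br/> keep their self-closing form.
        if tag not in DROP_TAGS:
            self.out.append(f"<{tag}{self._render_attrs(attrs)} />")

    def handle_endtag(self, tag):
        if tag not in DROP_TAGS:
            self.out.append(f"</{tag}>")

    def handle_data(self, data):
        # Character references were decoded by the parser; re-escape them.
        self.out.append(escape(data, quote=False))


def strip_fluff(html_text):
    """Return html_text without the assumed presentational markup."""
    parser = FluffStripper()
    parser.feed(html_text)
    parser.close()
    return "".join(parser.out)


def calibre_command(html_path, epub_path):
    """Build (but do not run) the Calibre command-line conversion call."""
    return ["ebook-convert", html_path, epub_path]


if __name__ == "__main__":
    messy = '<p class="x" style="color:red"><span class="y">Hi &amp; bye</span></p>'
    print(strip_fluff(messy))
    print(" ".join(calibre_command("book.html", "book.epub")))
```

Passing the returned list to `subprocess.run` would perform the conversion, provided Calibre is installed and `ebook-convert` is on the PATH. The manual proofreading of the OCR text itself is the slow part and is not something a script like this replaces.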

Quote:

Originally Posted by Faterson

I indeed have expertise on that, and that expertise says: don't bother!

It's sad but true. My daily reading is split roughly evenly between Marvin and GoodReader, precisely because converting PDFs to EPUBs is often a hopeless undertaking. And, many books (especially old, scanned editions) are only available as PDF files -- or the EPUB versions of the same texts are of such ridiculously bad quality, they are unreadable. (Yes, I'm talking about you, archive.org.)

Only this weekend, I was converting a short novel (novelette), 25 thousand words, 45 pages of PDF source file, from PDF to EPUB, precisely so that I could enjoy reading it in Marvin, rather than GoodReader.

I used the best available OCR software for the conversion, which is FineReader.

Even so, it took me nearly 5 hours (!) to convert the PDF file so that I was satisfied with the EPUB result. It's just impractical. However, this was a novelette I deeply cared about, so I was willing to sacrifice the 5 hours of my time for the conversion. I would, of course, not be ready to do that on a regular basis, because my remuneration for the work was exactly 0 cents. The only reward I'll get will be the pleasure of reading that file in Marvin. Hell, that's enough for me (in this special case).

Quote:

Originally Posted by Jessica Lares

I will add to that and give the same opinion. PDFs are usually designed to be printed and are made in programs like InDesign, Quark, and Acrobat, which work pretty much as WYSIWYG (what you see is what you get) editors.

Most of the text is done in individual boxes: one for the heading, one for each paragraph, column, etc. And they're layered, so you're just hoping that the writer added them one after another in reading order, which is never the case. This becomes apparent when you select some text and something else gets highlighted.
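To make the ordering problem concrete, here is a small Python sketch of how a converter might try to recover reading order from positioned text boxes. Everything in it is an assumption for illustration: the `TextBox` type, the coordinate convention (points measured from the top-left corner of the page), and the `line_tolerance` value are hypothetical, and it only handles a single-column page. Multi-column layouts, sidebars, and overlapping layers would need considerably more work, which is part of why automatic conversion tends to go wrong.

```python
from dataclasses import dataclass


@dataclass
class TextBox:
    text: str
    x: float  # left edge, in points from the left of the page (assumed)
    y: float  # top edge, in points from the top of the page (assumed)


def reading_order(boxes, line_tolerance=5.0):
    """Sort boxes top to bottom, then left to right within one line.

    Boxes whose top edges lie within line_tolerance of the first box
    in the current row are treated as belonging to the same line.
    """
    rows = []
    for box in sorted(boxes, key=lambda b: b.y):
        if rows and abs(box.y - rows[-1][0].y) <= line_tolerance:
            rows[-1].append(box)
        else:
            rows.append([box])
    return [box for row in rows for box in sorted(row, key=lambda b: b.x)]


if __name__ == "__main__":
    # Boxes listed in the order they might be stored in the file,
    # which is not the order a reader sees on the page.
    stored = [
        TextBox("Second paragraph", 72, 300),
        TextBox("Heading", 72, 72),
        TextBox("First paragraph", 72, 150),
    ]
    for box in reading_order(stored):
        print(box.text)
```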

Stick any PDF document into Adobe's Acrobat editor, and you can literally see how awful the setup is.

I would think OCR would work better with a flattened image, as long as it was 300dpi or more.
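For the 300dpi figure, the arithmetic is simple enough to sketch in Python. The function names and the US Letter page size (8.5 x 11 inches) in the example are assumptions chosen for illustration; the 300dpi threshold is the one mentioned above.

```python
def pixels_needed(page_width_in, page_height_in, dpi=300):
    """Pixel dimensions a page image needs in order to reach the given dpi."""
    return round(page_width_in * dpi), round(page_height_in * dpi)


def scan_dpi(pixel_width, pixel_height, page_width_in, page_height_in):
    """Effective resolution of a page image, taking the weaker of the two axes."""
    return min(pixel_width / page_width_in, pixel_height / page_height_in)


if __name__ == "__main__":
    # A US Letter page at 300dpi.
    print(pixels_needed(8.5, 11))
    # Check whether an existing 1275 x 1650 pixel page image is good enough.
    print(scan_dpi(1275, 1650, 8.5, 11) >= 300)
```

So a Letter-sized page flattened to an image would need to be about 2550 x 3300 pixels before it meets that threshold.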