MobileRead Forums - View Single Post - Best way to copy text from a PDF or MOBI?

Tex2002ans · 10-02-2013, 05:50 PM

The OCR step is part of the reason why there are so many errors in ebook versions. Usually the publisher decides to go the cheapest route and pays a (crappy) conversion company to do the PDF -> text conversion.

The quality of these will be slightly better than Archive.org (pure OCR with no intervention), but usually the texts are still riddled with more typos than a reader would want.

Common mistakes:

0 -> O
m -> rn
Hyphenation problems
Missing punctuation
Missing quotation marks
Missing accents: à, ö, ê, ǒ, Å
Missing symbols: ¢, £
Missing ligatures: Æ, œ
Wrong foreign characters: α, ß, ε
Wrongly combined/split paragraphs
Wrongly applied bold/italics
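
For anyone who wants to hunt for a few of these automatically, here is a rough Python sketch that flags three of the mistakes above: a zero sitting inside a word, "rn" where an "m" was probably meant, and hyphens left over from line breaks. The function name find_ocr_suspects and the tiny KNOWN_WORDS set are made up for illustration; a real pass would load a full dictionary, and every hit still needs a human to check it against the PDF.

```python
import re

# Hypothetical stand-in for a real dictionary; swap in a full word list.
KNOWN_WORDS = {"modern", "corner", "time", "economy", "market"}


def find_ocr_suspects(text, known_words=KNOWN_WORDS):
    """Return (kind, found, suggestion) tuples for likely OCR errors."""
    suspects = []

    # Hyphenation left behind by a line break, e.g. "conver-\nsion".
    for left, right in re.findall(r"([A-Za-z]+)-\n([a-z]+)", text):
        suspects.append(("hyphenation", f"{left}-{right}", left + right))

    for token in re.findall(r"[A-Za-z0-9]+", text):
        lower = token.lower()
        has_letter = any(c.isalpha() for c in token)
        only_letters_or_zero = all(c.isalpha() or c == "0" for c in token)

        if "0" in token and has_letter and only_letters_or_zero:
            # A zero mixed in with letters is probably a letter O.
            fixed = token.replace("0", "O" if token.isupper() else "o")
            suspects.append(("zero-for-O", token, fixed))
        elif "rn" in lower and lower not in known_words:
            # Only flag "rn" when swapping in "m" yields a known word.
            candidate = lower.replace("rn", "m")
            if candidate in known_words:
                suspects.append(("rn-for-m", token, candidate))

    return suspects


sample = "The econorny changed over tirne.\nB0OK prices fell in 2013 after the conver-\nsion."
for kind, found, suggestion in find_ocr_suspects(sample):
    print(f"{kind}: {found} -> {suggestion}")
```

Real numbers like 2013 and legitimate "rn" words like "modern" are left alone, which is the whole point of checking against a word list instead of doing a blind search and replace.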

Part of the reason I got into this was the horrors I was running into when reading EPUBs. So I decided to take a stab at it. In the past year I have converted over 160 books from PDF -> EPUB.

I started in ~December 2011 by taking apart EPUBs and fixing typos as I read. I ramped up EPUB production in October 2012, and was officially hired in April 2013... so now I just sit around all day doing PROPER conversions.

Quote:

Originally Posted by mb2u

Tex2002ans, that was one marvelous and informative post! It will take me a while to go through it but much of what you said rings true already.

Thank you for the compliments.

But as I said, it is quite time intensive. When I first got started it took about one to two weeks for me to get from one PDF to a completed EPUB.

Nowadays I have it whittled down to ~8-15 hours of work for your average book, some more, some less. (I convert non-fiction economics books for the most part.)

Also, if you want to convert fiction, just keep in mind that you may spoil the book for yourself while OCRing! After working on these for so long you learn how to "not read" while fixing, but you still risk spoiling the story!

Luckily with non-fiction, if I "accidentally read" I actually learn stuff!