MobileRead Forums - View Single Post - Best way to copy text from a PDF or MOBI?

Tex2002ans · 10-02-2013, 02:53 AM

PDF is just about the worst format to convert FROM. PDF was built as a final output print format.

See this in Calibre's help files: http://manual.calibre-ebook.com/conv...#pdfconversion

In all cases, it is best to go back to the source document and work from there.

The best of the worst-case scenarios is a PDF created directly from the source (InDesign, Quark, etc.). You can tell because, when you zoom in on the PDF, the text and graphs stay extremely crisp.

[Attached image: page5.png]

[Attached image: page5zoom.png]

Text can usually be extracted from these reasonably well (although a lot of errors can, and will, still be introduced). I believe Calibre uses xpdf in the backend to handle pulling text out of PDFs:

http://www.foolabs.com/xpdf/download.html
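If you want to try the extraction outside of Calibre, here is a minimal Python sketch that calls the `pdftotext` command-line tool that ships with xpdf. It assumes `pdftotext` is installed and on your PATH. The helper names `pdftotext_command` and `extract_text` are made up for illustration, and this is not necessarily how Calibre itself does it.

```python
import subprocess


def pdftotext_command(pdf_path, keep_layout=False):
    """Build the argument list for the pdftotext command-line tool."""
    cmd = ["pdftotext", "-enc", "UTF-8"]
    if keep_layout:
        # -layout tries to preserve the physical layout (columns, tables).
        cmd.append("-layout")
    # A "-" as the output file makes pdftotext write the text to stdout.
    cmd += [str(pdf_path), "-"]
    return cmd


def extract_text(pdf_path, keep_layout=False):
    """Run pdftotext and return the extracted text as a string."""
    result = subprocess.run(
        pdftotext_command(pdf_path, keep_layout),
        capture_output=True,
        check=True,  # raise CalledProcessError if pdftotext fails
    )
    return result.stdout.decode("utf-8", errors="replace")


if __name__ == "__main__":
    # "book.pdf" is a placeholder file name.
    print(extract_text("book.pdf")[:500])
```

Running it with and without `keep_layout=True` on the same PDF is a quick way to see which mode mangles your particular file less.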

Someone on the forums probably has a lot more experience with this type. I never work from this type (we usually have the source files for these).

Quote:

Originally Posted by mb2u

How about whole lines missing? Pretty hard to live with!

Sounds to me like you have a scanned book.

This is the worst-case scenario. The text layer in the PDF was most likely produced by feeding the scans through Tesseract, FineReader, the scanner's built-in OCR, etc., and spat out with no human intervention. This is the case, for example, with the conversions to different formats on archive.org. There will be a ton of errors.

Your best bet would be to start from scratch, using the latest version of the OCR programs (later versions most likely have more accurate OCR).
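As a rough idea of what "starting from scratch" can look like with a free tool, here is a small Python sketch that runs the Tesseract command-line program on a single page image. It assumes `tesseract` is installed with the English language data, and that you have already exported the scanned pages as image files. The file name and the helper names `tesseract_command` and `ocr_image` are placeholders.

```python
import subprocess


def tesseract_command(image_path, lang="eng"):
    """Build the argument list for the tesseract command-line tool."""
    # Using "stdout" as the output base makes tesseract print the
    # recognized text instead of writing it to a .txt file.
    return ["tesseract", str(image_path), "stdout", "-l", lang]


def ocr_image(image_path, lang="eng"):
    """OCR one page image and return the recognized text."""
    result = subprocess.run(
        tesseract_command(image_path, lang),
        capture_output=True,
        check=True,  # raise CalledProcessError if tesseract fails
    )
    return result.stdout.decode("utf-8", errors="replace")


if __name__ == "__main__":
    # "page_001.png" is a placeholder for one exported page scan.
    print(ocr_image("page_001.png"))
```

The output will still need proofreading against the page images, whichever OCR program you pick.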

Here is a whole list of different OCR programs: https://en.wikipedia.org/wiki/Compar...ition_software

If you want higher quality output, you would also have to painstakingly go through and manually fix errors that you find. It is very laborious work.

I personally use ABBYY FineReader (this is a paid program, quite expensive, but well worth it if you do a lot of conversions): http://finereader.abbyy.com/

It is very accurate across a wide range of texts and languages, and lets you easily compare the page image with the OCRed text side by side (it highlights uncertain characters in light blue).

Here is a book in FineReader that I am currently working on converting:

[Attached image: FinereaderSidebySide.png]

Left = Original Document
Right = OCRed Text
Bottom = Magnified area in the original document

Even after export, you must still spend a lot of time fixing the output (combining paragraphs, removing accidental hyphens, adding formatting, splitting chapters, etc.).
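Two of those cleanup steps (combining paragraphs and removing accidental hyphens) can be partly automated. Below is a minimal Python sketch, assuming the exported text uses blank lines between paragraphs and hard line breaks inside them. The function name `clean_ocr_text` is made up for illustration, and the hyphen rule is only a heuristic.

```python
import re


def clean_ocr_text(raw):
    """Join hard-wrapped lines into paragraphs and undo line-end hyphenation."""
    # Blank lines (possibly containing stray spaces) are paragraph breaks.
    paragraphs = re.split(r"\n\s*\n", raw.strip())
    cleaned = []
    for para in paragraphs:
        # "for-\nmat" -> "format": a hyphen directly before a line break,
        # followed by a lowercase letter, is assumed to be accidental.
        para = re.sub(r"(\w)-\n\s*([a-z])", r"\1\2", para)
        # Every remaining line break inside the paragraph becomes one space.
        para = re.sub(r"\s*\n\s*", " ", para)
        cleaned.append(para.strip())
    return "\n\n".join(cleaned)


if __name__ == "__main__":
    sample = "PDF is a bad for-\nmat to convert\nfrom.\n\nSecond para-\ngraph here."
    print(clean_ocr_text(sample))
```

Be aware that this also glues together real hyphenated words that happen to break at the end of a line ("well-\nknown" becomes "wellknown"), so the result still needs a manual pass or a spellcheck.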

Overall, PDF -> anything is horrible.

Quote:

Originally Posted by willus

This is somewhat dependent on the PDF file itself. Can you post any examples of the PDF files and the errors you got when extracting the text with Calibre?

Indeed. Post samples.

If the book is in the public domain, try to get it from Project Gutenberg, where texts go through multiple rounds of human proofreading.

Or the MR ebook sections:

Kindle: https://www.mobileread.com/forums/forumdisplay.php?f=128
EPUB: https://www.mobileread.com/forums/forumdisplay.php?f=130