Best practice to OCR and convert PDF to text or html or epub

crankypants · 10-28-2015, 11:33 AM

I'd like to convert some PDF files to EPUB, but the PDF files are just scanned images, and some of the pages are not that great. Example: Book of the Farm.

What's your best set of steps using Windows or Ubuntu software?

PandathePanda · 10-28-2015, 11:41 AM

I'm on Windows and I use OneNote (part of MS Office). I get good to great results, depending on how good the scan is.

crankypants · 10-28-2015, 01:21 PM

I've never used OneNote so I poked around a bit and I don't see a way to open or import a PDF file. How do you do it in OneNote?

PandathePanda · 10-28-2015, 01:31 PM

Quote:

Originally Posted by crankypants

I've never used OneNote so I poked around a bit and I don't see a way to open or import a PDF file. How do you do it in OneNote?

What I do is open the PDF, then right-click and select Copy Picture if it's available; otherwise I use Print Screen. Then I paste it into OneNote, right-click the picture, and select Copy Text from Picture.

crankypants · 10-28-2015, 01:57 PM

Oh, that won't work for me. I have 500+ scanned pages to convert.

eschwartz · 10-28-2015, 02:28 PM

The real professionals swear by ABBYY Finereader. But that doesn't come cheap.

Of course, depending on how much use you get out of it, it might be worth the expenditure.

The best free alternative is the open-source Tesseract OCR engine, which can be used by various graphical frontends.
You could try k2pdfopt -- it is a PDF reflow tool that can also embed OCR data using Tesseract.

And it's a CLI tool, so easily scriptable.
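
Since Tesseract is a command-line program too, a batch run over a folder of page images could be scripted along these lines. This is only a minimal sketch: it assumes the `tesseract` executable is on the PATH and that the page images already exist as separate files. The function names, the `book.txt` output name, and the list of image extensions are made up for illustration.

```python
import subprocess
from pathlib import Path

# Image types to pick up; an assumption for this sketch, adjust as needed.
IMAGE_SUFFIXES = {".jpg", ".png", ".ppm", ".pbm", ".tif"}


def tesseract_command(image: Path, out_dir: Path, lang: str = "eng") -> list[str]:
    """Build the Tesseract call for one page image.

    Tesseract takes an output base name and appends ".txt" itself.
    """
    out_base = out_dir / image.stem
    return ["tesseract", str(image), str(out_base), "-l", lang]


def ocr_pages(image_dir: Path, out_dir: Path, lang: str = "eng") -> Path:
    """OCR every page image in image_dir and join the text into one file."""
    out_dir.mkdir(parents=True, exist_ok=True)
    # Zero-padded page names sort into reading order.
    pages = sorted(p for p in image_dir.iterdir()
                   if p.suffix.lower() in IMAGE_SUFFIXES)
    for page in pages:
        subprocess.run(tesseract_command(page, out_dir, lang), check=True)
    book = out_dir / "book.txt"  # hypothetical name for the combined text
    with book.open("w", encoding="utf-8") as out:
        for page in pages:
            out.write((out_dir / f"{page.stem}.txt").read_text(encoding="utf-8"))
            out.write("\n")
    return book
```

Calling `ocr_pages(Path("pages"), Path("ocr"))` would leave one text file per page plus the combined file in `ocr`, all of which still need proofreading.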

Notjohn · 10-29-2015, 06:02 AM

I thought Finereader had a website option for single jobs, but apparently not.

The trial version will do 100 pages, but I think only 3 pages at a time.

I got mine on a bit of a sale. I think it was under $100, worth it just for the one book I scanned.

(Unfortunately, the book was about ski-bums, and Finereader interpreted the lower case m as rn, so I got a lot of ski-burns in the Word doc. I suppose that was font-specific.)

HarryT · 10-29-2015, 11:11 AM

Quote:

Originally Posted by Notjohn

(Unfortunately, the book was about ski-bums, and Finereader interpreted the lower case m as rn, so I got a lot of ski-burns in the Word doc. I suppose that was font-specific.)

That's the kind of issue that no software can fix for you, and why it's essential to proof-read. I had a similar issue in a book with the words "clock" and "dock".
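
Software can't decide which word was meant, but it can at least point a proofreader at the risky spots. The sketch below lists every word that turns into another known word when one of the look-alike pairs from this thread ("rn"/"m", "cl"/"d") is swapped. The pair list, the function names, and the tiny word list in the usage note are assumptions for illustration; a real run would need a full dictionary.

```python
import re

# OCR look-alike pairs mentioned in this thread: "rn" vs "m", "cl" vs "d".
CONFUSABLE = [("rn", "m"), ("m", "rn"), ("cl", "d"), ("d", "cl")]


def variants(word: str) -> set[str]:
    """Return the words obtained by swapping one confusable sequence."""
    found = set()
    for wrong, right in CONFUSABLE:
        start = word.find(wrong)
        while start != -1:
            found.add(word[:start] + right + word[start + len(wrong):])
            start = word.find(wrong, start + 1)
    found.discard(word)
    return found


def suspicious_words(text: str, dictionary: set[str]) -> dict[str, list[str]]:
    """Map each word in text to the dictionary words it might be a misread of."""
    flagged = {}
    for word in sorted(set(re.findall(r"[a-z]+", text.lower()))):
        candidates = sorted(variants(word) & dictionary)
        if candidates:
            flagged[word] = candidates
    return flagged
```

With a word list containing "bums", "burns", "clock" and "dock", the text "The ski burns watched the clock" flags "burns" (could be "bums") and "clock" (could be "dock"). It flags correct words as well, so it only narrows down where to look; a person still has to check each one against the scan.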

AlanHK · 11-05-2015, 12:16 AM

Quote:

Originally Posted by PandathePanda

What I do is open the PDF, then right-click and select Copy Picture if it's available; otherwise I use Print Screen. Then I paste it into OneNote, right-click the picture, and select Copy Text from Picture.

An easier way to extract all the images from a PDF (especially a "scan-PDF") is with xpdf's pdfimages.

Get it from http://www.foolabs.com/xpdf/download.html

There may be a GUI way to do it, but from the command line:

pdfimages -j book.pdf book

will create a series of numbered images named after the root (such as book-000.jpg) from the input file book.pdf.
Usually these will be JPEGs, but if the images were stored as bitmaps, it gives you PPM images (or PBM for monochrome pages).
You can convert those to PNG with e.g. IrfanView if you can't read them directly.

This extracts the images as they are stored, so they aren't degraded by recompression.
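
For a whole shelf of scans, the same call can be wrapped in a short script. This is a rough sketch, assuming `pdfimages` is installed and on the PATH. The function names and the output folder layout are made up for illustration, and the script only collects whatever files the tool writes under the chosen root, since the exact numbering differs between versions.

```python
import subprocess
from pathlib import Path


def pdfimages_command(pdf: Path, image_root: Path) -> list[str]:
    """Build the pdfimages call; -j keeps JPEG-encoded images as JPEG files."""
    return ["pdfimages", "-j", str(pdf), str(image_root)]


def extract_images(pdf: Path, out_dir: Path) -> list[Path]:
    """Extract all images from pdf into out_dir and return them in page order."""
    out_dir.mkdir(parents=True, exist_ok=True)
    root = out_dir / pdf.stem
    subprocess.run(pdfimages_command(pdf, root), check=True)
    # Output names start with the image root followed by a zero-padded number,
    # so a plain sort puts them in extraction order.
    return sorted(out_dir.glob(f"{pdf.stem}*"))
```

`extract_images(Path("book.pdf"), Path("pages"))` would return the extracted files, ready to hand to an OCR program.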

On the original question: ABBYY FineReader is what I use, but it's Windows only.

Kennth · 11-05-2015, 07:31 AM

Use ABBYY FineReader or the Tesseract OCR engine (which is open source) for the OCR, and then convert the resulting files to EPUB.

senhal · 11-05-2015, 08:03 AM

Quote:

Originally Posted by AlanHK

On the original question: ABBYY FineReader is what I use, but it's Windows only.

I tried FineReader 8 (FR8) on Ubuntu with Wine: it works almost perfectly, with some minor bugs.

willus · 11-05-2015, 09:00 AM

Quote:

Originally Posted by crankypants

I'd like to convert some PDF files to EPUB, but the PDF files are just scanned images, and some of the pages are not that great. Example: Book of the Farm. ...

This is a nice thread--lots of good suggestions. Just checking--I figure you know this, and I know it's not the point of the thread, but for that particular example a decent EPUB conversion has already been done and is available on that same web site.
PS. There is another thread on this topic that was last posted in about a year ago. Also, here is my PDF conversion tips page.

crankypants · 11-10-2015, 08:26 AM

I just found this website: http://pdftotext.com/ which seems to convert PDF files to text pretty reliably, maybe even competing with the Abbyy product. However it doesn't grab images. And I suspect if your PDF is just a series of scanned images (true for most Google public domain books) it won't work. It's not an OCR program, it just extracts the text from the PDF.

eschwartz · 11-10-2015, 03:28 PM

ABBYY is an OCR program.

It isn't hard to extract the actual text from a PDF with text, although getting the paragraph breaks right can be tricky.
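
To show why the paragraph breaks are the tricky part: extracted text usually comes out with one line per printed line, so the lines have to be joined back together. Here is a minimal sketch that assumes blank lines mark the paragraph breaks, which is often not true of real extracted text. The function name and the hyphen rule are illustrative choices, not how any particular converter works.

```python
import re


def reflow(text: str) -> str:
    """Join hard-wrapped lines into paragraphs; blank lines separate paragraphs."""
    paragraphs = []
    for block in re.split(r"\n\s*\n", text.strip()):
        lines = [line.strip() for line in block.splitlines() if line.strip()]
        joined = ""
        for line in lines:
            if joined.endswith("-") and line[:1].islower():
                # Word hyphenated across a line break: drop the hyphen and join.
                joined = joined[:-1] + line
            elif joined:
                joined += " " + line
            else:
                joined = line
        if joined:
            paragraphs.append(joined)
    return "\n\n".join(paragraphs)
```

The hyphen rule will also glue together genuine compounds that happen to break at a line end (for example "well-" followed by "known"), which is one more reason to proofread the result.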

crankypants · 12-01-2015, 11:43 AM

I also found that Adobe Acrobat (not the free Reader) can do decent OCR, but you need the full paid version of Acrobat. I used Acrobat X.
