|  10-28-2015, 11:33 AM | #1 | 
| Hmm.            Posts: 124 Karma: 2016606 Join Date: Oct 2015 Device: Android 4.2 Google Play Reader | 
				
				Best practice to OCR and convert PDF to text or html or epub
			 
			
			I'd like to convert some PDF files to EPUB, but the PDF files are just scanned images, and some of the pages are not that great. Example: Book of the Farm. What's your best set of steps using Windows or Ubuntu software? | 
|   |   | 
|  10-28-2015, 11:41 AM | #2 | 
| a toy panda            Posts: 2,568 Karma: 26020474 Join Date: Mar 2014 Location: Onboard the Queen Anne's Revenge Device: Various Android dvices | 
			
			I'm on Windows and I use OneNote (part of MS Office). I've had good to great results, depending on how good the scan is.
		 | 
|   |   | 
|  10-28-2015, 01:21 PM | #3 | 
| Hmm.            Posts: 124 Karma: 2016606 Join Date: Oct 2015 Device: Android 4.2 Google Play Reader | 
			
			I've never used OneNote so I poked around a bit and I don't see a way to open or import a PDF file. How do you do it in OneNote?
		 | 
|   |   | 
|  10-28-2015, 01:31 PM | #4 | 
| a toy panda            Posts: 2,568 Karma: 26020474 Join Date: Mar 2014 Location: Onboard the Queen Anne's Revenge Device: Various Android dvices | 
			
			What I do is open the PDF, then right-click and select "Copy picture" if it's available; otherwise I use Print Screen. Then I paste it into OneNote, right-click the picture, and select "Copy Text from Picture".
		 | 
|   |   | 
|  10-28-2015, 01:57 PM | #5 | 
| Hmm.            Posts: 124 Karma: 2016606 Join Date: Oct 2015 Device: Android 4.2 Google Play Reader | 
			
			Oh, that won't work for me. I have 500+ scanned pages to convert.
		 | 
|   |   | 
|  10-28-2015, 02:28 PM | #6 | 
| Ex-Helpdesk Junkie            Posts: 19,421 Karma: 85400180 Join Date: Nov 2012 Location: The Beaten Path, USA, Roundworld, This Side of Infinity Device: Kindle Touch fw5.3.7 (Wifi only) | 
			
			The real professionals swear by ABBYY FineReader. But that doesn't come cheap.   Of course, depending on how much use you get out of it, it might be worth the expenditure. The best free alternative is the open-source Tesseract OCR engine, which can be used by various graphical frontends. You could try k2pdfopt -- it is a PDF reflow tool that can also embed OCR data using Tesseract. And it's a CLI tool, so easily scriptable.   Last edited by eschwartz; 10-28-2015 at 02:32 PM. | 
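
For a 500+ page job, the scriptable route could look something like the sketch below. It assumes the pages already exist as image files in one folder and that the `tesseract` binary is on the PATH; the function names and folder layout are made up for illustration, and only the basic `tesseract <image> <outputbase> -l <lang>` form of the command is used.

```python
from __future__ import annotations

import shutil
import subprocess
from pathlib import Path

# File types treated as page scans here (an assumption of this sketch).
IMAGE_SUFFIXES = {".png", ".jpg", ".jpeg", ".tif", ".tiff"}


def tesseract_command(image: Path, out_dir: Path, lang: str = "eng") -> list[str]:
    """Build the argument list for OCR-ing one page image to plain text."""
    # Tesseract adds the ".txt" extension to the output base by itself.
    output_base = out_dir / image.stem
    return ["tesseract", str(image), str(output_base), "-l", lang]


def ocr_pages(image_dir: Path, out_dir: Path, lang: str = "eng") -> list[Path]:
    """Run Tesseract on every page image in image_dir, in file-name order."""
    if shutil.which("tesseract") is None:
        raise RuntimeError("tesseract was not found on PATH")
    out_dir.mkdir(parents=True, exist_ok=True)
    pages = sorted(p for p in image_dir.iterdir()
                   if p.suffix.lower() in IMAGE_SUFFIXES)
    for page in pages:
        # check=True stops the batch as soon as one page fails.
        subprocess.run(tesseract_command(page, out_dir, lang), check=True)
    return [out_dir / (page.stem + ".txt") for page in pages]
```

Calling `ocr_pages(Path("pages"), Path("text"))` would leave one text file per page in `text`, ready to be joined and proof-read.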
|   |   | 
|  10-29-2015, 06:02 AM | #7 | 
| mostly an observer            Posts: 1,519 Karma: 996810 Join Date: Dec 2012 Device: Kindle | 
			
			I thought FineReader had a website option for single jobs, but apparently not. The trial version will do 100 pages, but I think only 3 pages at a time. I got mine on a bit of a sale. I think it was under $100, worth it just for the one book I scanned. (Unfortunately, the book was about ski-bums, and FineReader interpreted the lower case m as rn, so I got a lot of ski-burns in the Word doc. I suppose that was font-specific.) | 
|   |   | 
|  10-29-2015, 11:11 AM | #8 | 
| eBook Enthusiast            Posts: 85,560 Karma: 93980341 Join Date: Nov 2006 Location: UK Device: Kindle Oasis 2, iPad Pro 10.5", iPhone 6 | 
			
			That's the kind of issue that no software can fix for you, and why it's essential to proof-read. I had a similar issue in a book with the words "clock" and "dock".
		 | 
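
Software can't decide between "bums" and "burns" for you, but it can at least point out where to look. The following is a minimal sketch of that idea, not a feature of any OCR product: it flags words that turn into another known word when one commonly confused letter pair is swapped. The two pairs come from the examples in this thread, and the `vocabulary` set is a stand-in for whatever word list you have available.

```python
from __future__ import annotations

import re

# Letter groups that look alike in scans; taken from the two examples above.
CONFUSIONS = [("rn", "m"), ("cl", "d")]


def variants(word: str) -> set[str]:
    """Return the words obtained by swapping one confusable group at one position."""
    found = set()
    for left, right in CONFUSIONS:
        for src, dst in ((left, right), (right, left)):
            for match in re.finditer(re.escape(src), word):
                found.add(word[:match.start()] + dst + word[match.end():])
    return found


def flag_suspects(text: str, vocabulary: set[str]) -> list[tuple[str, list[str]]]:
    """List each word that has a look-alike in the vocabulary, with the look-alikes."""
    suspects = []
    for word in re.findall(r"[a-z]+", text.lower()):
        alternatives = sorted(v for v in variants(word) if v in vocabulary)
        if alternatives:
            suspects.append((word, alternatives))
    return suspects


vocabulary = {"ski", "bums", "burns", "clock", "dock"}
print(flag_suspects("The ski burns met at the dock", vocabulary))
# prints [('burns', ['bums']), ('dock', ['clock'])]
```

Every flagged word still needs a human to check it against the page image; the script only shortens the list of places to look.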
|   |   | 
|  11-05-2015, 12:16 AM | #9 | |
| Guru            Posts: 681 Karma: 929286 Join Date: Apr 2014 Device: PW-3, iPad, Android phone | Quote: 
 Get from http://www.foolabs.com/xpdf/download.html There may be a GUI way to do it, but the command line: pdfimages book.pdf -j book will create a series of images book001.jpg... from the input file book.pdf. Usually these will be jpegs, but if the images were bitmaps, it gives you ppm images. You can convert those to png with e.g. IrfanView if you can't read them directly. This extracts the images as they are stored, so they aren't degraded by recompression. On the original question; ABBYY FineReader is what I use, but it's Windows only. | |
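
The extraction step can also be driven from a script. This is a small sketch that assumes `pdfimages` is installed and on the PATH; the helper names and the `book` image root are illustrative. It places `-j` before the file names, matching the tool's usage synopsis, and collects whatever files the tool writes instead of guessing their exact numbering.

```python
from __future__ import annotations

import subprocess
from pathlib import Path


def pdfimages_command(pdf: Path, image_root: Path) -> list[str]:
    """Build the argument list: -j keeps JPEG images as JPEG files."""
    return ["pdfimages", "-j", str(pdf), str(image_root)]


def extract_images(pdf: Path, out_dir: Path, root: str = "book") -> list[Path]:
    """Extract the stored page images of pdf into out_dir and list them."""
    out_dir.mkdir(parents=True, exist_ok=True)
    subprocess.run(pdfimages_command(pdf, out_dir / root), check=True)
    # Output names start with the image root; numbering and extension
    # (.jpg, .ppm, .pbm) are left to the tool.
    return sorted(out_dir.glob(root + "*"))
```

`extract_images(Path("book.pdf"), Path("pages"))` would return the extracted files in name order, which is normally page order.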
|   |   | 
|  11-05-2015, 07:31 AM | #10 | 
| Junior Member            Posts: 7 Karma: 380010 Join Date: Sep 2015 Location: New York Device: none | 
			
			Use ABBYY FineReader or the Tesseract OCR engine, which is open source, for the OCR, and then convert the files into EPUB.
		 | 
|   |   | 
|  11-05-2015, 08:03 AM | #11 | 
| Connoisseur            Posts: 82 Karma: 25684 Join Date: Sep 2014 Device: Kindle NT | |
|   |   | 
|  11-05-2015, 09:00 AM | #12 | |
| Fuzzball, the purple cat            Posts: 1,312 Karma: 11087488 Join Date: Jun 2011 Location: California Device: iPad | Quote: 
 PS. There is another thread on this topic, last posted in about a year ago. Also, here is my PDF conversion tips page. Last edited by willus; 11-05-2015 at 09:07 AM. | |
|   |   | 
|  11-10-2015, 08:26 AM | #13 | 
| Hmm.            Posts: 124 Karma: 2016606 Join Date: Oct 2015 Device: Android 4.2 Google Play Reader | 
			
			I just found this website: http://pdftotext.com/ which seems to convert PDF files to text pretty reliably, maybe even competing with the ABBYY product. However, it doesn't grab images. And I suspect that if your PDF is just a series of scanned images (true for most Google public domain books), it won't work. It's not an OCR program; it just extracts the text from the PDF.
		 | 
|   |   | 
|  11-10-2015, 03:28 PM | #14 | 
| Ex-Helpdesk Junkie            Posts: 19,421 Karma: 85400180 Join Date: Nov 2012 Location: The Beaten Path, USA, Roundworld, This Side of Infinity Device: Kindle Touch fw5.3.7 (Wifi only) | 
			
			ABBYY is an OCR program. It isn't hard to extract the actual text from a PDF with text, although getting the paragraph breaks right can be tricky. | 
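
To show why the paragraph breaks are the tricky part, here is a rough sketch of one common heuristic for extracted text. It assumes blank lines mark paragraph boundaries and that a line ending in a hyphen is a word split across lines; both assumptions fail on some books (genuinely hyphenated words at a line end get glued together, for instance), so treat it as a starting point and not as what any particular converter does.

```python
from __future__ import annotations


def rejoin_paragraphs(text: str) -> list[str]:
    """Merge hard-wrapped lines into paragraphs, one string per paragraph."""
    paragraphs = []
    current = []
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line:
            # A blank line closes the paragraph collected so far.
            if current:
                paragraphs.append(" ".join(current))
                current = []
            continue
        if current and current[-1].endswith("-"):
            # Assumed to be a word split across lines: drop the hyphen and glue.
            current[-1] = current[-1][:-1] + line
        else:
            current.append(line)
    if current:
        paragraphs.append(" ".join(current))
    return paragraphs


sample = (
    "Scanned books need con-\n"
    "version before they\n"
    "reflow nicely.\n"
    "\n"
    "A second paragraph\n"
    "starts here.\n"
)
print(rejoin_paragraphs(sample))
# prints ['Scanned books need conversion before they reflow nicely.', 'A second paragraph starts here.']
```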
|   |   | 
|  12-01-2015, 11:43 AM | #15 | 
| Hmm.            Posts: 124 Karma: 2016606 Join Date: Oct 2015 Device: Android 4.2 Google Play Reader | 
			
			I also found that Adobe Acrobat (not the free Reader) can do decent OCR, but you need the full paid version of Acrobat. I used Acrobat X.
		 | 
|   |   | 