#1 |
Hmm.
Posts: 124
Karma: 2016606
Join Date: Oct 2015
Device: Android 4.2 Google Play Reader
|
Best practice to OCR and convert PDF to text or html or epub
I'd like to convert some PDF files to EPUB but the PDF files are just scanned images, some of the pages are not that great. Example: Book of the Farm.
What's your best set of steps using windows or Ubuntu software? |
#2 |
a toy panda
Posts: 2,568
Karma: 26020474
Join Date: Mar 2014
Location: Onboard the Queen Anne's Revenge
Device: Various Android devices
|
I'm on Windows and I use OneNote (part of MS Office). I get good to great results, depending on how good the scan is.
|
#3 |
Hmm.
Posts: 124
Karma: 2016606
Join Date: Oct 2015
Device: Android 4.2 Google Play Reader
|
I've never used OneNote so I poked around a bit and I don't see a way to open or import a PDF file. How do you do it in OneNote?
|
#4 |
a toy panda
Posts: 2,568
Karma: 26020474
Join Date: Mar 2014
Location: Onboard the Queen Anne's Revenge
Device: Various Android devices
|
What I do is open the PDF, then right-click and select Copy Picture if it's available; otherwise I use Print Screen. Then I paste it into OneNote, right-click the picture, and select Copy Text from Picture.
|
#5 |
Hmm.
Posts: 124
Karma: 2016606
Join Date: Oct 2015
Device: Android 4.2 Google Play Reader
|
Oh, that won't work for me. I have 500+ scanned pages to convert.
|
#6 |
Ex-Helpdesk Junkie
Posts: 19,421
Karma: 85400180
Join Date: Nov 2012
Location: The Beaten Path, USA, Roundworld, This Side of Infinity
Device: Kindle Touch fw5.3.7 (Wifi only)
|
The real professionals swear by ABBYY FineReader. But that doesn't come cheap.
Of course, depending on how much use you get out of it, it might be worth the expenditure.
The best free alternative is the open-source Tesseract OCR engine, which can be used through various graphical frontends. You could try k2pdfopt: it is a PDF reflow tool that can also embed OCR data using Tesseract. And it's a CLI tool, so it's easily scriptable.
Last edited by eschwartz; 10-28-2015 at 02:32 PM. |
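For a 500+ page job, the Tesseract route can be scripted. Here is a rough sketch in Python, not a tested recipe: it assumes the `tesseract` binary is on your PATH, that you already have one image per page, and that the `scans` folder and `book.txt` names are just placeholders I made up.

```python
import subprocess
from pathlib import Path


def tesseract_command(image: Path, lang: str = "eng") -> list[str]:
    """Build the CLI call: tesseract <image> <output base> -l <lang>."""
    # Tesseract appends ".txt" to the output base on its own.
    return ["tesseract", str(image), str(image.with_suffix("")), "-l", lang]


def ocr_directory(image_dir: Path, pattern: str = "*.png") -> str:
    """OCR every matching page image in file-name order and join the text."""
    pages = []
    for image in sorted(image_dir.glob(pattern)):
        subprocess.run(tesseract_command(image), check=True)
        pages.append(image.with_suffix(".txt").read_text(encoding="utf-8"))
    return "\n".join(pages)


if __name__ == "__main__":
    # "scans" is a hypothetical folder holding one image per page.
    Path("book.txt").write_text(ocr_directory(Path("scans")), encoding="utf-8")
```

Zero-padded file names (page-001.png, page-002.png, ...) matter here, since the pages are joined in sorted name order.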
#7 |
mostly an observer
Posts: 1,518
Karma: 987654
Join Date: Dec 2012
Device: Kindle
|
I thought FineReader had a website option for single jobs, but apparently not.
The trial version will do 100 pages, but I think only 3 pages at a time. I got mine on a bit of a sale. I think it was under $100, worth it just for the one book I scanned. (Unfortunately, the book was about ski-bums, and FineReader interpreted the lowercase "m" as "rn", so I got a lot of ski-burns in the Word doc. I suppose that was font-specific.) |
#8 |
eBook Enthusiast
Posts: 85,544
Karma: 93383099
Join Date: Nov 2006
Location: UK
Device: Kindle Oasis 2, iPad Pro 10.5", iPhone 6
|
That's the kind of issue that no software can fix for you, and why it's essential to proof-read. I had a similar issue in a book with the words "clock" and "dock".
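Software can't fix these, but it can at least point the proofreader at the risky spots. A minimal sketch in Python, assuming you supply your own word list (the tiny `vocabulary` below is made up for illustration) and that the swap pairs are just the two confusions mentioned in this thread:

```python
import re

# Character swaps reported in this thread: "m" read as "rn", "cl" vs "d".
CONFUSABLE = (("rn", "m"), ("m", "rn"), ("cl", "d"), ("d", "cl"))


def flag_confusable_words(text: str, vocabulary: set[str]) -> list[tuple[str, str]]:
    """Return (word, alternative) pairs where a swap yields another known word."""
    flagged = []
    for word in sorted(set(re.findall(r"[a-z]+", text.lower()))):
        for old, new in CONFUSABLE:
            if old not in word:
                continue
            alternative = word.replace(old, new)
            if alternative in vocabulary:
                flagged.append((word, alternative))
    return flagged


if __name__ == "__main__":
    # Hypothetical word list; a real run would load a full dictionary.
    vocabulary = {"bums", "burns", "clock", "dock"}
    print(flag_confusable_words("The ski-burns slept by the dock", vocabulary))
```

It only flags candidates; a human still has to decide which reading the book intended.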
|
#9 | |
Guru
Posts: 681
Karma: 929286
Join Date: Apr 2014
Device: PW-3, iPad, Android phone
|
Quote:
Get it from http://www.foolabs.com/xpdf/download.html
There may be a GUI way to do it, but the command line:
pdfimages book.pdf -j book
will create a series of images book001.jpg... from the input file book.pdf. Usually these will be JPEGs, but if the images were stored as bitmaps, it gives you PPM images. You can convert those to PNG with e.g. IrfanView if you can't read them directly. This extracts the images as they are stored, so they aren't degraded by recompression.
On the original question: ABBYY FineReader is what I use, but it's Windows only. |
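If you want to drive the extraction step from a script, something like this Python sketch should do. It assumes `pdfimages` is on your PATH; the helper names are mine, and the exact numbering of the output files varies between versions, so the sketch matches on the image root rather than on a fixed name pattern.

```python
import subprocess
from pathlib import Path

IMAGE_SUFFIXES = {".jpg", ".ppm", ".pbm"}


def pdfimages_command(pdf: Path, root: str) -> list[str]:
    """Build the call from the post; -j writes JPEG data out as stored."""
    # Options are placed before the file name, the conventional order.
    return ["pdfimages", "-j", str(pdf), root]


def extracted_images(folder: Path, root: str) -> list[Path]:
    """List files starting with the image root, sorted by name (page order)."""
    return sorted(p for p in folder.glob(root + "*") if p.suffix in IMAGE_SUFFIXES)


if __name__ == "__main__":
    # book.pdf and the "book" image root are the names used in the post.
    subprocess.run(pdfimages_command(Path("book.pdf"), "book"), check=True)
    for image in extracted_images(Path("."), "book"):
        print(image)
```

The resulting list can then be handed to whatever OCR program you settle on.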
|
#10 |
Junior Member
Posts: 7
Karma: 380010
Join Date: Sep 2015
Location: New York
Device: none
|
Use ABBYY FineReader or the Tesseract OCR engine, which is open source, for OCR and then convert the files into EPUB.
|
#11 |
Connoisseur
Posts: 82
Karma: 25684
Join Date: Sep 2014
Device: Kindle NT
|
|
#12 | |
Fuzzball, the purple cat
Posts: 1,296
Karma: 11087488
Join Date: Jun 2011
Location: California
Device: iPad
|
Quote:
PS. There is another thread on this topic, last posted in about a year ago. Also, here is my PDF conversion tips page.
Last edited by willus; 11-05-2015 at 09:07 AM. |
|
#13 |
Hmm.
Posts: 124
Karma: 2016606
Join Date: Oct 2015
Device: Android 4.2 Google Play Reader
|
I just found this website: http://pdftotext.com/ which seems to convert PDF files to text pretty reliably, maybe even competing with the ABBYY product. However, it doesn't grab images. And I suspect that if your PDF is just a series of scanned images (true for most Google public domain books), it won't work. It's not an OCR program; it just extracts the text from the PDF.
|
#14 |
Ex-Helpdesk Junkie
Posts: 19,421
Karma: 85400180
Join Date: Nov 2012
Location: The Beaten Path, USA, Roundworld, This Side of Infinity
Device: Kindle Touch fw5.3.7 (Wifi only)
|
ABBYY is an OCR program.
It isn't hard to extract the actual text from a PDF with text, although getting the paragraph breaks right can be tricky. |
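To show why the paragraph breaks are the tricky part, here is a small Python sketch of one common heuristic. It assumes the extracted text is hard-wrapped with blank lines between paragraphs, which is not true of every PDF, and the function name is made up.

```python
def rejoin_paragraphs(raw: str) -> str:
    """Join hard-wrapped lines; treat blank lines as paragraph breaks."""
    paragraphs, current = [], []
    for line in raw.splitlines():
        line = line.strip()
        if not line:
            # A blank line closes the paragraph collected so far.
            if current:
                paragraphs.append(" ".join(current))
                current = []
            continue
        # Rejoin a word split across a line break ("scan-" + "ned").
        # Caveat: this also merges genuinely hyphenated compounds.
        if current and current[-1].endswith("-"):
            current[-1] = current[-1][:-1] + line
        else:
            current.append(line)
    if current:
        paragraphs.append(" ".join(current))
    return "\n\n".join(paragraphs)


if __name__ == "__main__":
    sample = "The pages were scan-\nned badly.\n\nNext para\ncontinues."
    print(rejoin_paragraphs(sample))
```

When the extractor emits no blank lines at all, you would need another signal (indentation, short last lines) to find the breaks, and that is where the proofreading comes back in.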
#15 |
Hmm.
Posts: 124
Karma: 2016606
Join Date: Oct 2015
Device: Android 4.2 Google Play Reader
|
I also found that Adobe Acrobat (not the free Reader) can do decent OCR, but you need the full paid version of Acrobat. I used Acrobat X.
|
|