Old 10-28-2015, 11:33 AM   #1
crankypants
Zealot
Posts: 112
Karma: 2016606
Join Date: Oct 2015
Device: Android 4.2 Google Play Reader
Best practice to OCR and convert PDF to text or html or epub

I'd like to convert some PDF files to EPUB, but the PDF files are just scanned images, and some of the pages are not that great. Example: Book of the Farm.

What's your best set of steps using Windows or Ubuntu software?
Old 10-28-2015, 11:41 AM   #2
PandathePanda
a toy panda
Posts: 1,530
Karma: 13824288
Join Date: Mar 2014
Location: The Elizabeth Arkham Asylum for the Criminally Insane
Device: Blackberry, Proline Android Tablet, Samsung Galaxy Neo
I'm on Windows and I use OneNote (part of MS Office). I get good to great results, depending on how good the scan is.
Old 10-28-2015, 01:21 PM   #3
crankypants
I've never used OneNote so I poked around a bit and I don't see a way to open or import a PDF file. How do you do it in OneNote?
Old 10-28-2015, 01:31 PM   #4
PandathePanda
Quote:
Originally Posted by crankypants
I've never used OneNote so I poked around a bit and I don't see a way to open or import a PDF file. How do you do it in OneNote?
What I do is open the PDF, then right-click and select copy picture, if it's available. Otherwise, I use Print Screen. Then I paste it into OneNote, right-click on the picture, and select copy text from picture.
Old 10-28-2015, 01:57 PM   #5
crankypants
Oh, that won't work for me. I have 500+ scanned pages to convert.
Old 10-28-2015, 02:28 PM   #6
eschwartz
Irrational Optimist
Posts: 18,323
Karma: 76285381
Join Date: Nov 2012
Location: The Beaten Path, USA, Roundworld, This Side of Infinity
Device: Kindle Touch fw5.3.7 (Wifi only)
The real professionals swear by ABBYY FineReader. But that doesn't come cheap.
Of course, depending on how much use you get out of it, it might be worth the expenditure.


The best free alternative is the open-source Tesseract OCR engine, which can be used by various graphical frontends.
You could try k2pdfopt -- it is a PDF reflow tool that can also embed OCR data using Tesseract.


And it's a CLI tool, so easily scriptable.
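
To make the "easily scriptable" point concrete for plain Tesseract: here is a minimal Python sketch for batch-OCRing a folder of page images. It assumes the `tesseract` binary is installed and on the PATH, and that the pages already exist as image files. The directory names in the usage comment are hypothetical, and the helper functions are my own illustration, not part of Tesseract.

```python
import subprocess
from pathlib import Path

# Image types this sketch looks for; adjust to whatever your scans use.
PAGE_SUFFIXES = {".png", ".jpg", ".jpeg", ".tif", ".tiff"}


def tesseract_command(image: Path, out_dir: Path, lang: str = "eng") -> list[str]:
    """Build the argument list for OCRing one page image to plain text.

    Tesseract's CLI takes the image path and an output base name (it appends
    ".txt" itself); "-l" selects the language data.
    """
    out_base = out_dir / image.stem
    return ["tesseract", str(image), str(out_base), "-l", lang]


def ocr_pages(image_dir: Path, out_dir: Path, lang: str = "eng") -> list[Path]:
    """OCR every page image in image_dir, in filename order."""
    out_dir.mkdir(parents=True, exist_ok=True)
    pages = sorted(p for p in image_dir.iterdir()
                   if p.suffix.lower() in PAGE_SUFFIXES)
    results = []
    for page in pages:
        # check=True stops the batch on the first page Tesseract rejects.
        subprocess.run(tesseract_command(page, out_dir, lang), check=True)
        results.append(out_dir / f"{page.stem}.txt")
    return results


# Hypothetical usage: ocr_pages(Path("pages"), Path("ocr_text"))
```

With 500+ pages this just runs one page after another; the resulting text files still need proofreading before they go into an EPUB.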

Last edited by eschwartz; 10-28-2015 at 02:32 PM.
Old 10-29-2015, 06:02 AM   #7
Notjohn
hanging on by fingernails
Posts: 837
Karma: 344490
Join Date: Dec 2012
Device: Kindle
I thought FineReader had a website option for single jobs, but apparently not.

The trial version will do 100 pages, but I think only 3 pages at a time.

I got mine on a bit of a sale. I think it was under $100, worth it just for the one book I scanned.

(Unfortunately, the book was about ski-bums, and FineReader interpreted the lower case m as rn, so I got a lot of ski-burns in the Word doc. I suppose that was font-specific.)
Old 10-29-2015, 11:11 AM   #8
HarryT
eBook Enthusiast
Posts: 75,381
Karma: 62678544
Join Date: Nov 2006
Location: UK
Device: Kindle Voyage, iPad Mini, iPhone 6, MS Surface Pro, N7
Quote:
Originally Posted by Notjohn
(Unfortunately, the book was about ski-bums, and FineReader interpreted the lower case m as rn, so I got a lot of ski-burns in the Word doc. I suppose that was font-specific.)
That's the kind of issue that no software can fix for you, and why it's essential to proof-read. I had a similar issue in a book with the words "clock" and "dock".
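
Software can't decide which word is right, but it can at least point the proofreader at the risky spots. A rough Python sketch, assuming you supply your own vocabulary set (the one in the usage note is made up for illustration). The confusion pairs are the two mentioned in this thread; anything else would have to be added per book and typeface.

```python
import re

# Glyph pairs that OCR can confuse. These two come from the posts above;
# extend the list for the typeface you are dealing with.
CONFUSIONS = [("rn", "m"), ("cl", "d")]


def variants(word: str) -> set[str]:
    """All strings reachable by swapping one confusable glyph pair once."""
    found = set()
    for a, b in CONFUSIONS:
        for src, dst in ((a, b), (b, a)):
            start = word.find(src)
            while start != -1:
                found.add(word[:start] + dst + word[start + len(src):])
                start = word.find(src, start + 1)
    found.discard(word)
    return found


def flag_suspects(text: str, vocabulary: set[str]) -> list[tuple[str, list[str]]]:
    """Return (word, alternatives) for words that have a confusable twin.

    A word is flagged even when it is itself a valid word ("burns"),
    because that is exactly the case a spell checker misses.
    """
    suspects = []
    for word in re.findall(r"[A-Za-z]+", text):
        alternatives = sorted(v for v in variants(word.lower()) if v in vocabulary)
        if alternatives:
            suspects.append((word, alternatives))
    return suspects
```

For example, `flag_suspects("The ski-burns slept by the dock.", {"bums", "burns", "clock", "dock", "ski"})` reports "burns" (could be "bums") and "dock" (could be "clock"). It only produces a list to check by eye; the proofreading itself is still manual.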
Old 11-05-2015, 12:16 AM   #9
AlanHK
Groupie
Posts: 151
Karma: 10
Join Date: Apr 2014
Device: Android phone
Quote:
Originally Posted by PandathePanda
What I do is open the PDF, then right-click and select copy picture, if it's available. Otherwise, I use Print Screen. Then I paste it into OneNote, right-click on the picture, and select copy text from picture.
An easier way to extract all the images from a PDF (especially a "scan-PDF") is with xpdf's pdfimages.

Get from http://www.foolabs.com/xpdf/download.html

There may be a GUI way to do it, but from the command line:

pdfimages -j book.pdf book
will create a series of numbered images (for example book-000.jpg, book-001.jpg, ...; the number of digits depends on the version)
from the input file book.pdf.
Usually these will be JPEGs, but if the images were stored as bitmaps, it gives you PPM images (PBM for monochrome pages).
You can convert those to PNG with, e.g., IrfanView if you can't read them directly.

This extracts the images as they are stored, so they aren't degraded by recompression.
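
If you want to drive this from a script (say, as the first step before OCR), here is a small Python sketch. It assumes `pdfimages` from xpdf or Poppler is on the PATH and that the output files are named `<root>-<number>.<ext>`; the helper names are my own, and the paths in the usage comment are hypothetical.

```python
import re
import subprocess
from pathlib import Path


def pdfimages_command(pdf: Path, image_root: Path) -> list[str]:
    """Argument list for: pdfimages -j <pdf> <image-root>."""
    return ["pdfimages", "-j", str(pdf), str(image_root)]


def page_number(path: Path) -> int:
    """Pull the trailing image number out of a name like 'book-012.jpg'."""
    match = re.search(r"(\d+)$", path.stem)
    return int(match.group(1)) if match else -1


def extract_images(pdf: Path, out_dir: Path) -> list[Path]:
    """Run pdfimages and return the extracted files in page order."""
    out_dir.mkdir(parents=True, exist_ok=True)
    root = out_dir / pdf.stem
    subprocess.run(pdfimages_command(pdf, root), check=True)
    # Sort numerically, since .jpg and .ppm/.pbm files may be mixed.
    # Note: a PDF name containing glob characters like "[" would need escaping.
    return sorted(out_dir.glob(f"{pdf.stem}-*"), key=page_number)


# Hypothetical usage: extract_images(Path("book.pdf"), Path("pages"))
```

Sorting on the number rather than the full filename keeps the pages in order even when some come out as JPEG and others as PPM.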


On the original question: ABBYY FineReader is what I use, but it's Windows only.
Old 11-05-2015, 07:31 AM   #10
Kennth
Junior Member
Posts: 7
Karma: 380010
Join Date: Sep 2015
Location: New York
Device: none
Use ABBYY FineReader or the open-source Tesseract OCR engine for the OCR, and then convert the files into EPUB.
Old 11-05-2015, 08:03 AM   #11
senhal
Connoisseur
Posts: 60
Karma: 25684
Join Date: Sep 2014
Device: Kindle NT
Quote:
Originally Posted by AlanHK
On the original question: ABBYY FineReader is what I use, but it's Windows only.
I tried FR8 on Ubuntu + Wine: almost perfect, some minor bugs.
Old 11-05-2015, 09:00 AM   #12
willus
Fuzzball, the purple cat
Posts: 808
Karma: 3625277
Join Date: Jun 2011
Location: California
Device: Kindle 2, iPad
Quote:
Originally Posted by crankypants
I'd like to convert some PDF files to EPUB but the PDF files are just scanned images, some of the pages are not that great. Example: Book of the Farm. ...
This is a nice thread--lots of good suggestions. Just checking--I figure you know this, and I know it's not the point of the thread, but for that particular example a decent EPUB conversion has already been done and is available on that same web site.
PS. There is another thread on this topic, last posted in about a year ago. Also, here is my PDF conversion tips page.

Last edited by willus; 11-05-2015 at 09:07 AM.
Old 11-10-2015, 08:26 AM   #13
crankypants
I just found this website: http://pdftotext.com/ which seems to convert PDF files to text pretty reliably, maybe even competing with the ABBYY product. However, it doesn't grab images. And I suspect that if your PDF is just a series of scanned images (true for most Google public domain books), it won't work. It's not an OCR program; it just extracts the text from the PDF.
Old 11-10-2015, 03:28 PM   #14
eschwartz
ABBYY is an OCR program.

It isn't hard to extract the actual text from a PDF with text, although getting the paragraph breaks right can be tricky.
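
To show why the paragraph breaks are the tricky part, here is a minimal Python sketch of one common reflow heuristic for text that has already been extracted with hard line breaks. The two rules it uses (a blank line ends a paragraph; a trailing hyphen followed by a lowercase letter is a split word) are assumptions that fit many books but certainly not all, and the function is purely illustrative.

```python
def reflow(raw: str) -> list[str]:
    """Rebuild paragraphs from hard-wrapped text.

    Heuristics (assumptions, not rules): a blank line ends a paragraph, and
    a line ending in a hyphen followed by a lowercase continuation is a
    single word that was split across lines.
    """
    paragraphs = []
    current = ""
    for line in raw.splitlines():
        line = line.strip()
        if not line:
            # Blank line: close the paragraph collected so far.
            if current:
                paragraphs.append(current)
                current = ""
            continue
        if current.endswith("-") and line[0].islower():
            current = current[:-1] + line  # rejoin "conver-" + "sion"
        elif current:
            current += " " + line
        else:
            current = line
    if current:
        paragraphs.append(current)
    return paragraphs
```

The weak spot is obvious: a genuinely hyphenated word that happens to break at the line end (like "ski-" / "bums") gets glued into "skibums", and PDFs that separate paragraphs by indentation instead of blank lines will come out as one giant paragraph. That is why the results need checking by eye.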
Old 12-01-2015, 11:43 AM   #15
crankypants
I also found that Adobe Acrobat (not the free Reader) can do decent OCR, but you need the full paid version of Acrobat. I used Acrobat X.
All times are GMT -4. The time now is 04:50 PM.

