MobileRead Forums - View Single Post - Problems converting K2PDF Opt files to EPUB

Tex2002ans · 10-08-2014, 08:48 PM

Quote:

Originally Posted by eschwartz

Our resident OCR expert here on MobileRead, @Tex2002ans, would heartily recommend the investment in purchasing Finereader.

I wrote a lot about OCR in this thread:

https://www.mobileread.com/forums/sho...d.php?t=243327

It covers many of the pitfalls of the free solutions compared to the paid ones, the areas where OCR is lacking, and the areas where you will have to do a lot of manual fixing.

For a simple novel, the free stuff would probably work just fine... but once you start getting into more complicated books/layouts, things start getting hairy with the free solutions. I have a more detailed list in that post, but the trouble spots are things like footnotes, figures/tables, dropcaps, superscript/subscript, etc.

Also, if you follow the pyramid of links to more of my explanation posts, they explain every single thing with OCR, and how to go from PDF -> EPUB.

Quote:

Originally Posted by ittiandro

I tried ABBYY Fine Reader 11 Corporate Ed on my friend's computer

In the New Task tab I've chosen the E-Book, File PDF (Image) to EPUB option, which seemed to be exactly what I wanted to do, but the EPUB conversion does not render charts and diagrams, only some scribble. In addition, I have not seen any OCR option, unless it kicks in automatically.

Once you open up Finereader, you need to push File - Open PDF File/Image, and find where your PDF is and open it. After you open the PDF, Finereader should look like something along these lines. What you want to do then is press Read:

[Screenshot: Finereader1.png]

Finereader will then take a while trying to figure out the layout of the book (Text/Images/Tables), and OCR the entire book.

Text will get a Green rectangle around it, Images get a Red rectangle, Tables get a Blue rectangle.

Then you will have to manually go through and fix any mistakes Finereader made in the layout. For example, here you can see that the dropcap 'T' was accidentally recognized as an image (see the red box):

[Screenshot: Finereader2.png]

What you want to do is use the Text/Picture/Table buttons, or readjust the boxes by dragging the edges:

[Screenshot: Finereader3.png]

You can see that an unrecognized box is a slightly lighter color (Light Green/Blue/Red). You want to right click on the page, and press "Read Selected Pages":

[Screenshot: Finereader4.png]

Then you have to go through the entire book, making sure that all your charts are in Image (Red) boxes, all the text is in Text (Green) boxes, and all the tables are in Table (Blue) boxes.

Quote:

Originally Posted by ittiandro

The source PDF (Image) file was already a k2PDF OPT conversion from the original PDF scanned file. I don't know if I should have used perhaps this original file.

Always work from as close to the original source as possible. In this case, you have the original PDF, so use it.

Quote:

Originally Posted by Toxaris

You would be better off OCRing it to a Word or HTML file than to an ePUB file.

The EPUB export is definitely buggy with footnotes in particular (makes me want to pull my hair out). It tries to automatically create links at the end of the chapters that jump back/forth (like in your typical ebook), but many times entire footnotes just disappear into thin air, or it never "links" them (and just keeps the footnotes in the regular flow of text). Besides that, I haven't run into many other problems with EPUB output.
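If you do stick with the EPUB output, part of the footnote checking can be roughed out with a script. This is only a sketch, not anything Finereader provides: it assumes the footnote markers are plain `<a href="#...">` links pointing at an `id` in the same XHTML file, and the names `LinkAudit` and `dangling_links` are made up for illustration. It lists links whose target no longer exists, which is what a vanished footnote would look like. It will not catch footnotes that were never linked at all (those still need eyeballing against the PDF).

```python
from html.parser import HTMLParser


class LinkAudit(HTMLParser):
    """Collect every id and every same-file '#fragment' link in one XHTML file."""

    def __init__(self):
        super().__init__()
        self.ids = set()
        self.fragment_links = []

    def handle_starttag(self, tag, attrs):
        attr_map = dict(attrs)
        # Any element can be a link target, so record every id we see.
        if attr_map.get("id"):
            self.ids.add(attr_map["id"])
        # Only same-file links ("#something") are checked in this sketch.
        href = attr_map.get("href") or ""
        if tag == "a" and href.startswith("#") and len(href) > 1:
            self.fragment_links.append(href[1:])


def dangling_links(xhtml_text):
    """Return the sorted link targets that have no matching id in the file."""
    audit = LinkAudit()
    audit.feed(xhtml_text)
    return sorted({t for t in audit.fragment_links if t not in audit.ids})


# Made-up sample: two footnote markers, but only one footnote survived.
sample = (
    '<p>Text<a href="#fn1">1</a> more<a href="#fn2">2</a></p>'
    '<p id="fn1">Note one.</p>'
)
print(dangling_links(sample))  # ['fn2']
```

To use it on a real book, unzip the EPUB and feed each chapter file's text to `dangling_links`. If the export puts footnotes in a separate file, the links will look like `notes.xhtml#fn2` and this sketch will skip them.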

Depending on which tools you are more comfortable with, you might work much better in Word. If you do export to DOC(X), I would highly recommend Toxaris's ePUB Tools (see the bottom of his signature).

If you are more comfortable working directly in HTML, you might prefer the EPUB output.

Either way, you would still have to do a lot of A/B checking and fixing.