MobileRead Forums - View Single Post - Scanned PDF + steps I've made so far. Need help.

Wolfrott · 10-22-2015, 07:10 AM

Quote:

Originally Posted by ger0g3n

First of all, hello everyone, and apologies in advance for any mistakes in this post; I'm from Poland and English is not my native language.

Secondly, I've got a problem, hence the post.

I'm a newbie to the ebook reader world and I decided to order a Kindle 5 All New Touch 7th Generation ebook reader. The price was decent and I thought it was a good choice.

I've seen people having the kind of problems I ran into with my PDFs, but those were old posts and none of them actually solved my issue.

I study at a technical university and most ebooks I've got on my PC are crappy scanned pages put into PDFs. I've tried to work with them: I downloaded a billion programs like Calibre (ofc), Wondershare PDFelement and ABBYY PDF Transformer+, and tried to make a readable copy of one of my scanned physics books for my Kindle. Steps I've taken so far:

1. I ran OCR on my PDF in Wondershare and it looks like this (not bad, I think, even though I know it still contains images):
http://s1130.photobucket.com/user/ge...a/ex1.jpg.html

2. Then I read that I should convert my OCR'd PDF to one of the formats that Calibre can read and convert to AZW3/MOBI, which my Kindle will display nicely. So in Wondershare I converted my OCR'd PDF first to EPUB, then to DOCX, then to HTML, and all the results were the same. In Calibre it looked like this every single time, no matter what the format was:
http://s1130.photobucket.com/user/ge...a/ex2.jpg.html

3. I tried the same thing with ABBYY and the result wasn't much better: letters in random places and huge blank spaces (sometimes a whole page was white and the next one had a single word, etc.).

So my question is:

Is there any way I can convert my PDF to AZW3/MOBI for my Kindle 5 so that it reads like any other 'normal' ebook?

Thanks in advance!

ger0g3n

I've found ABBYY to be terrible, TBPH. Glaring errors. Adobe's OCR feature is better, but it's neither easy nor flawless. PDFs turn out ugly and riddled with errors.

The best trick I've found is to grab the text from the scan using Microsoft OneNote's OCR add-on, then manually proofread it as always in a Word document, and then go from there, converting to whatever format you want using Calibre. Depending on the scanned book's native font, I rarely find errors. The most common ones I've found are the letter pairs mm, nn and rr being misread, i's becoming 1's, and quotation marks sometimes not being recognised.
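To speed up the proofreading, something like the following could flag the usual suspects before you read through the text by hand. It's only a rough sketch: the function name `flag_suspects` and both patterns are my own assumptions based on the error types listed above, not anything OneNote or Word provides. The doubled-letter check in particular will flag plenty of perfectly good words (e.g. "summer"), so treat the output as a list of places to look, not a list of errors.

```python
import re

# Illustrative patterns only (assumptions), modelled on the OCR errors
# described in the post: a digit 1 inside a word, and mm/nn/rr pairs.
SUSPECT_PATTERNS = {
    "digit 1 inside a word": re.compile(
        r"\b(?=[A-Za-z1]*[A-Za-z])(?=[A-Za-z1]*1)[A-Za-z1]+\b"
    ),
    "doubled mm/nn/rr": re.compile(r"\b\w*(?:mm|nn|rr)\w*\b"),
}


def flag_suspects(text):
    """Return (line_number, word, reason) tuples for words worth rechecking."""
    hits = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        for label, pattern in SUSPECT_PATTERNS.items():
            for match in pattern.finditer(line):
                hits.append((lineno, match.group(0), label))
    return hits


if __name__ == "__main__":
    sample = "Th1s is a summer test\nnothing here"
    for lineno, word, reason in flag_suspects(sample):
        print(f"line {lineno}: {word!r} ({reason})")
```

You'd paste the OCR'd text into a plain text file, run it through this, and then check the flagged words against the scan.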

So the steps:
I. Scan book.
II. Paste pages into OneNote + OCR.
III. Proofread resulting text.

Takes me about a month.
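Once the text is proofread, the Calibre conversion mentioned above can be scripted as well, using Calibre's `ebook-convert` command-line tool (it takes an input file and an output file). A minimal sketch, assuming Calibre is installed and `ebook-convert` is on your PATH; the helper names and the file name `physics.docx` are made up for the example:

```python
import shutil
import subprocess
from pathlib import Path


def build_convert_command(source, target_format="azw3"):
    """Build the ebook-convert command: same file name, new extension."""
    source = Path(source)
    target = source.with_suffix("." + target_format)
    return ["ebook-convert", str(source), str(target)]


def convert(source, target_format="azw3"):
    """Run the conversion and return the path of the output file."""
    # Assumption: Calibre's command-line tools are installed and on PATH.
    if shutil.which("ebook-convert") is None:
        raise RuntimeError("ebook-convert not found; is Calibre installed?")
    cmd = build_convert_command(source, target_format)
    subprocess.run(cmd, check=True)  # raises if the conversion fails
    return Path(cmd[2])


if __name__ == "__main__":
    # Hypothetical file name; only prints the command, does not run it.
    print(" ".join(build_convert_command("physics.docx")))
```

Calling `convert("physics.docx")` would then write `physics.azw3` next to the Word file, ready to copy to the Kindle.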