Quote:
Originally Posted by klmmc13
I'd like to be able to (hopefully with no-cost/low-cost software) locate the scanned copy of books I own (or Public Domain) and, through whatever processes, end up with a text/RTF/whatever file of the words on that page/book.
Is there a tutorial, or a section of the Wiki, that I've overlooked or not found?
|
I just posted a few days ago here, pointing to a lot of the other topics where I discussed OCR in-depth:
https://www.mobileread.com/forums/sho...d.php?t=243021
Overall, I would say it comes down to how much you value your own time.
- Typing stuff in manually by hand
- Takes FOREVER.
- No matter how accurate a typist you are, you WILL make lots of mistakes.
- I did a really short book like this (~25 pages)... NEVER AGAIN
- You will have to manually type all of the formatting. (Italics, Bold, Smallcaps, Headings, etc. etc.)
- This is overhead you don't give much thought to, but it takes up a very large amount of time.
- Going with one of the many free OCR programs (see the Tesseract sketch after this list)
- OK accuracy (DEFINITELY way faster/better than doing everything by hand from scratch).
- Because it is not as accurate, you will still spend a lot of time cleaning up and typo-checking the text afterwards.
- Going with a more expensive OCR program
- Much more accurate
- Finereader costs a nice chunk of change
- Purchase an older version; Finereader 9/10/11 are all perfectly fine
- The amount of time you save from getting more accurate OCR is well worth the money.
- You will spend a heck of a lot less time editing the final product.
- This means you can go work on converting WAY more books in the same time period!
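Side note: if you want to kick the tires on the free route first, the usual open-source engine is Tesseract. Here is a minimal sketch using the pytesseract Python wrapper (the file names are placeholders; you need Tesseract itself installed and on your PATH):
Code:
# Minimal free-OCR sketch: one page image -> plain text.
# Requires the Tesseract engine installed, plus:
#   pip install pytesseract pillow
from PIL import Image
import pytesseract

# "page_001.png" is a placeholder for your scanned page image.
text = pytesseract.image_to_string(Image.open("page_001.png"), lang="eng")

with open("page_001.txt", "w", encoding="utf-8") as f:
    f.write(text)
You'd loop that over every page image and stitch the text files together. It gets you raw text only, with none of the formatting I talk about below.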
I have conversions down to an average of ~8-15 hours to go from OCR -> completed EPUB (I tackle non-fiction economics books, different genres are probably faster/slower, and when you first start out, it will be much slower).
Manually typing in everything, or working from much less accurate OCR, while "free" (as in, I didn't pay any money for tools), would cost you WAY more in manhours.
Quote:
Originally Posted by Jellby
Hmm... That's one error per 100 characters. There are usually many more than 100 characters per page. It's usually something like 200 characters every 3 lines, with ~30 lines per page, which makes about 2000 characters per page. That is 20 errors per page.
Unless the 1% error rate refers to words, not to characters. Then we can estimate around 250 words per page, and get 2-3 errors per page.
|
Hmmm, well Finereader calls it "Low Confidence Characters" (it highlights those in light blue). I just checked a few pages of the journal I am converting, and the pages ranged anywhere from 0%-5% "Low Confidence Characters" (~20-120 out of ~3300 characters per page). And out of those few percent it was "unsure" about, it had guessed nearly all of them correctly.
And remember, pure character accuracy doesn't take into account a HECK of a lot of other formatting in a book as well:
- bold/italic/smallcaps
- superscript/subscript
- chapter headings
- paragraph breaks
- lists
- footnotes
- headers/footers
- images/formulas/figures/tables
- whether that's an actual hyphen at the end of a line, or just a soft hyphen (see the dehyphenation sketch after this list)
- ...
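That hyphen point is sneakier than it looks. If your OCR program doesn't handle it, you end up hand-fixing it, or scripting something like this rough sketch, where the wordlist is just a stand-in for a real dictionary file:
Code:
import re

# Stand-in wordlist; in practice you'd load a real dictionary.
WORDS = {"economics"}

def dehyphenate(text, words=WORDS):
    def join(match):
        candidate = match.group(1) + match.group(2)
        # Joined form is a known word -> it was a soft (line-break) hyphen.
        if candidate.lower() in words:
            return candidate
        # Otherwise assume a real hyphenated compound and keep the hyphen.
        return match.group(1) + "-" + match.group(2)
    return re.sub(r"(\w+)-\n(\w+)", join, text)

print(dehyphenate("eco-\nnomics"))  # -> economics (soft hyphen removed)
print(dehyphenate("well-\nknown"))  # -> well-known (real hyphen kept)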
A more expensive OCR program would typically handle these much better than the free OCR stuff.
Also, the overall character accuracy depends on what kind of text you are converting.
Cookbooks are probably going to have a lot of lists and fractions and images. I bet Finereader would do a much more accurate job at recognizing and accounting for these than a lot of the free OCR programs out there.
If you are working from scanned older material, working from a crappy picture (let's say you take it with your phone), or the scans are subpar (people who write/underline in the books, water damage, blotches, etc. etc.), accuracy goes WAY down.
Archive.org scans are probably going to have much higher OCR errors than if you were working from a crisp digital image from a newer book.
Here too, the paid programs will probably handle crappier source material better than the free OCR solutions.
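One thing that can help on the free end: clean up the image before you OCR it. A bare-bones preprocessing sketch with Pillow (the 128 threshold is a guess you'd have to tune per book):
Code:
# Bare-bones pre-OCR cleanup for poor scans (phone photos, etc.).
# Requires: pip install pillow
from PIL import Image, ImageOps

img = Image.open("crappy_scan.jpg")   # placeholder file name
img = ImageOps.grayscale(img)         # drop color noise
# Crude black/white binarization; 128 is an arbitrary cutoff to tune.
img = img.point(lambda p: 255 if p > 128 else 0)
img.save("cleaned_scan.png")          # feed this to your OCR program
It won't fix underlining or water damage, but it can sometimes make a noticeable difference on murky phone photos.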
Quote:
Originally Posted by mrmikel
The error rates I found are per character. That only makes close proofreading more imperative; almost word-for-word examination is required. Close guesses will work much of the time, but when they don't, the whole meaning of the sentence is easily changed.
|
That is another advantage I found while using something like Finereader.
A lot of these free OCR programs will just export the raw OCR output. In Finereader, you can work in the GUI.
It highlights characters that it is "unsure" about. You can then easily look through and pay much closer attention to THOSE sections only. This saves a massive amount of time, since you don't have to look at every word in the entire book, and you can focus on the 1-5% that is "unsure".
You also get the dictionary support, so it underlines words that are spelled wrong. (Again, you can focus a lot more attention on these than if you had to closely scrutinize every word under the sun).
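You can fake a rough version of this workflow with free tools, too. Tesseract, for example, reports a per-word confidence that you can use to build your own "look here first" list; a sketch (the 70 cutoff is an arbitrary number you'd tune):
Code:
# Rough sketch: flag low-confidence words from Tesseract so you can focus
# proofreading on them (a poor man's version of Finereader's highlighting).
# Requires the Tesseract engine installed, plus: pip install pytesseract pillow
from PIL import Image
import pytesseract

data = pytesseract.image_to_data(
    Image.open("page_001.png"),          # placeholder file name
    output_type=pytesseract.Output.DICT,
)

for word, conf in zip(data["text"], data["conf"]):
    conf = float(conf)
    # conf of -1 means "no word here"; 70 is an arbitrary cutoff.
    if word.strip() and 0 <= conf < 70:
        print(f"CHECK: {word!r} (confidence {conf:.0f})")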
You can also QUICKLY A/B compare with the source: you can have a magnification view or a side-by-side view set up. For example, here are two images just showing off the two types of A/B compare, magnification and side-by-side:
Quote:
Originally Posted by mrmikel
Columns are a particular bugbear. It is very common for parts of paragraphs to be shifted down the page and it can be remarkably hard to spot in casual reading. The tops of pages where the text in the right column is only a line or two is where it often happens.
|
Ugh... double-column text.... makes me want to pull my hair out. Luckily I rarely have to convert that.
The ABSOLUTE WORST is "newspaper"-type material, where stories get cut into pieces and "continue on Page C3". A single page can have two or three running stories on it that connect together like a giant spaghetti monster.