Can you OCR the images inside of .pdf files?

klmmc13 · 07-25-2014, 04:55 PM

Hello All--

Can you OCR the images inside of .pdf files? Do you have to extract the images somehow to .tiff images, and then put them to an OCR software? HOW??

I'd like to be able to (hopefully with no-cost/low-cost software) locate the scanned copy of books I own (or Public Domain) and thru whatever processes, end up with a text / rtf/ whatever... file of the words on that page/book.

Is there a tutorial, or section of the Wiki that I've overlooked/not found ??

I know I sound like a babbling idiot......

I've downloaded one-too-many public domain epubs from both Amazon and B&N, as well as Google & Internet Archive, that needed to be worked on. I've typed out by hand several books, especially cookbooks, and it's a tedious and long term project. There's got to be a better way!!!

(Moderators- please move this thread to whatever section it belongs in, If I'm in the wrong place.)

TIA
Kathy
MamaDragon

harriska2 · 07-25-2014, 11:05 PM

Kathy, I am a scanner/OCR lover. I use ABBYY Finereader. I just drag the file (PDF) into the software and it does its magic. I can run them for you if you need. You would then compare the opened PDF with the OCR version and correct errors. It can take a bit of time.

mrmikel · 07-26-2014, 11:59 AM

Be aware that a good error rate for OCR is 1%. That translates to an error PER PAGE, so it will take some good proofreading to make sure there are no errors. It would be a mighty shame to put in 1 cup when 1/4 cup was in the original!!!!

Jellby · 07-26-2014, 05:03 PM

Quote:

Originally Posted by mrmikel

Be aware that a good error rate for OCR is 1%. That translates to an error PER PAGE

Hmm... That's one error per 100 characters. There are usually many more than 100 characters per page. It's usually something like 200 every 3 lines, with ~30 lines per page, which would make 2000 characters per page. That is 20 errors per page.

Unless the 1% error rate refers to words, not to characters. Then we can estimate around 250 words per page, and 2-3 errors per page.
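To make the estimate above concrete, here is a small Python sketch of the same arithmetic. The page sizes are the rough figures from this post (2000 characters or 250 words per page), not measurements, and the function name is just for illustration.

```python
def expected_errors_per_page(units_per_page: int, error_rate: float) -> float:
    """Expected number of OCR errors on one page."""
    return units_per_page * error_rate

# Rough page sizes quoted in the post (estimates, not measurements).
CHARS_PER_PAGE = 2000
WORDS_PER_PAGE = 250
ERROR_RATE = 0.01  # the "good" 1% rate under discussion

char_errors = expected_errors_per_page(CHARS_PER_PAGE, ERROR_RATE)
word_errors = expected_errors_per_page(WORDS_PER_PAGE, ERROR_RATE)

print(f"1% per character: about {char_errors:.0f} errors per page")
print(f"1% per word: about {word_errors:.1f} errors per page")
```

Whether the 1% is counted per character or per word changes the proofreading load by roughly a factor of eight under these assumptions.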

mrmikel · 07-27-2014, 06:55 AM

I think you are right, Jellby. The error rates I found are per character. That only makes close proofreading more imperative, and almost word-for-word examination is required. Close guesses will work much of the time, but when they don't, the whole meaning of the sentence is easily changed.

Columns are a particular bugbear. It is very common for parts of paragraphs to be shifted down the page and it can be remarkably hard to spot in casual reading. The tops of pages where the text in the right column is only a line or two is where it often happens.

Tex2002ans · 07-27-2014, 10:22 AM

Quote:

Originally Posted by klmmc13

I'd like to be able to (hopefully with no-cost/low-cost software) locate the scanned copy of books I own (or Public Domain) and thru whatever processes, end up with a text / rtf/ whatever... file of the words on that page/book.

Is there a tutorial, or section of the Wiki that I've overlooked/not found ??

I just posted a few days ago here, pointing to a lot of the other topics where I discussed OCR in-depth:

https://www.mobileread.com/forums/sho...d.php?t=243021

Overall, I would say it is how much you value your own time.

Typing stuff in manually by hand:
- Takes FOREVER.
- No matter how accurate a typist you are, you WILL make lots of mistakes.
  - I did a really short book like this (~25 pages)... NEVER AGAIN
- You will have to manually type all of the formatting. (Italics, Bold, Smallcaps, Headings, etc. etc.)
  - This is overhead you don't give much thought, but it takes up a very large amount of time.

Going with a lot of the free OCR programs might get you:
- OK accuracy (DEFINITELY way faster/better than doing everything by hand from scratch).
- Because it is not as accurate, you will still be spending a lot of time cleaning up and typo checking the text afterwards.

Going with a more expensive OCR program:
- Much more accurate
- Finereader costs a nice chunk of change
  - Purchase an older version, Finereader 9/10/11 are all perfectly fine
- The amount of time you save from getting more accurate OCR is well worth the money.
- You will spend a heck of a lot less time editing the final product.
  - This means you can go work on converting WAY more books in the same time period!

I have conversions down to an average of ~8-15 hours to go from OCR -> completed EPUB (I tackle non-fiction economics books, different genres are probably faster/slower, and when you first start out, it will be much slower).

Manually typing in everything, or working from much less accurate OCR, while "free" (as in, I didn't pay any money for tools), would cost you WAY more in man-hours.

Quote:

Originally Posted by Jellby

Hmm... That's one error per 100 characters. There are usually many more than 100 characters per page. It's usually something like 200 every 3 lines, with ~30 lines per page, which would make 2000 characters per page. That is 20 errors per page.

Unless the 1% error rate refers to words, not to characters. Then we can estimate around 250 words per page, and 2-3 errors per page.

Hmmm, well Finereader calls it "Low Confidence Characters" (it highlights those in light blue). I just checked a few pages of the journal I am converting, and the pages were anywhere from 0%-5% of "Low Confidence Characters". (~20-120 out of ~3300 characters per page). Although out of those few percent "unsure", it has guessed nearly all of those correctly.

And remember, pure character accuracy doesn't take into account a HECK of a lot of other formatting in a book as well:

bold/italic/smallcaps
superscript/subscript
chapter headings
paragraph breaks
lists
footnotes
headers/footers
images/formulas/figures/tables
actual hyphen at the end of line or is that just a soft-hyphen
...

A more expensive OCR program would typically handle these much better than the free OCR stuff.
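As an illustration of the hyphen item in the list above, here is a minimal Python sketch of one common heuristic for cleaning up plain OCR text: rejoin a word split across a line break only if the joined form is a known word. This is not how Finereader or any particular tool does it; the tiny `KNOWN_WORDS` set and the function name are placeholders, and a real run would load a full word list.

```python
import re

# Tiny stand-in vocabulary; a real run would load a full word list.
KNOWN_WORDS = {"example", "proofreading", "cookbook"}

def join_line_end_hyphens(text: str, vocabulary: set[str]) -> str:
    """Rejoin words split across a line break by a hyphen.

    If the joined word without the hyphen is in the vocabulary, the hyphen
    is treated as a soft one and dropped. Otherwise the hyphen is kept,
    since it may be a real one (as in "well-known").
    """
    def fix(match: re.Match) -> str:
        head, tail = match.group(1), match.group(2)
        if (head + tail).lower() in vocabulary:
            return head + tail
        return head + "-" + tail

    return re.sub(r"(\w+)-\n(\w+)", fix, text)

sample = "A short exam-\nple with a well-\nknown problem."
print(join_line_end_hyphens(sample, KNOWN_WORDS))
```

The weak point is the vocabulary: any ordinary word missing from the list keeps a hyphen it should have lost, so the output still needs a proofreading pass.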

Also, the overall character accuracy depends on what kind of text you are converting.

Cookbooks are probably going to have a lot of lists and fractions and images. I bet Finereader would do a much more accurate job at recognizing and accounting for these than a lot of the free OCR programs out there.

If you are working from scanned older material, working from a crappy picture (lets say, you take it with your phone), or the scans are subpar (people who write/underline in the books, water damage, blotches, etc. etc.), accuracy goes WAY down.

Archive.org scans are probably going to have much higher OCR errors than if you were working from a crisp digital image from a newer book.

Here too, the paid programs will probably handle crappier source material better than the free OCR solutions.

Quote:

Originally Posted by mrmikel

The error rates I found are per character. That only makes close proofreading more imperative, and almost word-for-word examination is required. Close guesses will work much of the time, but when they don't, the whole meaning of the sentence is easily changed.

That is another advantage I found while using something like Finereader.

A lot of these free OCR programs will just export the OCR output. In Finereader, you can review it in the GUI.

It highlights characters that it is "unsure" about. You can then easily look through and pay much closer attention to THOSE sections only. This saves a massive amount of time, since you don't have to waste much of your time looking at every word in the entire book, and you can focus on that 1-5% that is "unsure".

You also get the dictionary support, so it underlines words that are spelled wrong. (Again, you can focus a lot more attention on these than if you had to closely scrutinize every word under the sun).
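If your OCR tool only gives you plain text, a rough version of that dictionary check can be done yourself. The Python sketch below flags words that are not in a word list so you can look at those first. The `DICTIONARY` set and the sample line are made up for illustration; a real check would load a proper dictionary file.

```python
import re

# Stand-in word list; a real check would load a full dictionary file.
DICTIONARY = {"add", "one", "cup", "of", "flour", "and", "stir", "well"}

def suspect_words(text: str, dictionary: set[str]) -> list[str]:
    """Return words that are not in the dictionary, in order of appearance."""
    words = re.findall(r"[A-Za-z0-9']+", text)
    return [w for w in words if w.lower() not in dictionary]

ocr_line = "Add one cup of fl0ur and stir we11"
print(suspect_words(ocr_line, DICTIONARY))
```

This only catches errors that produce non-words. A misread that is still a valid word or number (1 cup instead of 1/4 cup) passes straight through, which is why comparing against the page image still matters.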

You can also QUICKLY A/B compare with the source, and you can set up a magnification view. For example, here are two images showing the two types of A/B compare, magnification and side-by-side:

[Attached image: MagnificationABCompare1.png]

[Attached image: MagnificationABCompare2.png]

Quote:

Originally Posted by mrmikel

Columns are a particular bugbear. It is very common for parts of paragraphs to be shifted down the page and it can be remarkably hard to spot in casual reading. The tops of pages where the text in the right column is only a line or two is where it often happens.

Ugh... double-column text.... makes me want to pull my hair out. Luckily I rarely have to convert that.

The ABSOLUTE WORST is "newspaper" type material, where they have stories that get cut into pieces and "continue on Page C3". So a single page can have about two or three running stories on it, that connect together like a giant spaghetti monster.

klmmc13 · 07-27-2014, 10:30 AM

Quote:

Originally Posted by harriska2

Kathy, I am a scanner/OCR lover. I use ABBYY Finereader. I just drag the file (PDF) into the software and it does its magic. I can run them for you if you need. You would then compare the opened PDF with the OCR version and correct errors. It can take a bit of time.

Harriska- I've been out of the arena too long... what is the output format that Abbyyfinereader gives you? I'm interested... Can you explain to this Old Dog who's trying to learn new tricks??

I'm looking at working thru books like McGuffeys' Readers, cookbooks, and so forth, that most of the rest of the people who are rejuvenating the Public Domain reading have passed over.

Thanks!
Kathy
MamaDragon

klmmc13 · 07-27-2014, 10:44 AM

Don't you just LOVE cross-posting...?!

Tex2002ans- THANKS!!! I'll have to tough it out for now with something less expensive, until I can either find a way to make Finereader pay for itself (performing services for others), or my pocket money funds get a raise....
[in other words - Time I Got.. Funds... not so much]

I'm sure that a LOT of your information will still teach me things I need to know.

MrMikeL - yes, I'm still looking at word-by-word proofing, but that's better MOST of the time than typing it all in from scratch.

Thanks All
Kathy

Tex2002ans · 07-27-2014, 10:45 AM

Quote:

Originally Posted by klmmc13

Harriska- I've been out of the arena too long... what is the output format that Abbyyfinereader gives you?

DOCX, DOC, RTF, ODT, EPUB (This was added in Finereader 10), FB2, HTML, TXT, PDF, DJVU.

Those are probably the most common, there are a few more (XLS, CSV) that probably wouldn't be used in your typical book.

Quote:

Originally Posted by klmmc13

Don't you just LOVE cross-posting...?!

Well when the same stuff gets said again and again (I mean, someone JUST posted this topic last week).... It sort of gets boring having to type out a lot of the same info when it was already covered about a thousand times. :P

Quote:

Originally Posted by klmmc13

Tex2002ans- THANKS!!! I'll have to tough it out for now with something less expensive, until I can either find a way to make Finereader pay for itself (performing services for others), or my pocket money funds get a raise....
[in other words - Time I Got.. Funds... not so much]

Well again, just think of it in productivity. You spend a little now, and you just saved yourself HOURS and HOURS of headache. That means WAY more cookbooks can get out there! :P

And let's be serious.... proofing crappy OCR is boring stuff... proofing better OCR is much less boring. The quicker you finish, the more time you can spend actually READING the cookbooks (or cooking)!

You can get a copy of Finereader much cheaper off Ebay or something similar:

http://www.ebay.com/sch/i.html?_from...eader&_sacat=0

As I said, just hunt for 9 or 10, they are perfectly fine. No need for 12 (I would actually recommend AGAINST 12). Stick with 9/10/11.

Quote:

Originally Posted by klmmc13

I'm sure that a LOT of your information will still teach me things I need to know.

Heh, definitely don't fear asking. I can probably hunt down some other posts I made explaining all the questions. Although as you can see, I am quite technical. (Perhaps not everyone's cup of tea).

Also, if you use Microsoft Word (2007+), Toxaris came up with this ePUB Tools addon which really speeds things up: https://www.mobileread.com/forums/sho...d.php?t=213372

mrmikel · 07-27-2014, 10:56 AM

In forums, TMI is better than TLI!

u238110 · 07-27-2014, 08:04 PM

Quote:

Originally Posted by Tex2002ans

And remember, pure character accuracy doesn't take into account a HECK of a lot of other formatting in a book as well:

bold/italic/smallcaps
superscript/subscript
chapter headings
paragraph breaks
lists
footnotes
headers/footers
images/formulas/figures/tables
actual hyphen at the end of line or is that just a soft-hyphen
...

Tex2002ans, what would you say is the best guide on how to handle these complicated elements? For EPUB...

Toxaris · 07-28-2014, 01:15 AM

By checking. Some of them are handled correctly by the OCR, but not all. That is one of the reasons that I created my add-on to automate a lot of tasks to fix these (and a lot of other) OCR mistakes. In case of doubt, manual intervention is required.

Tex2002ans · 07-28-2014, 12:00 PM

Quote:

Originally Posted by u238110

Tex2002ans, what would you say is the best guide on how to handle these complicated elements? For EPUB...

Also toss images + tables onto that "complex list" too.

Finereader does a pretty decent job at separating images from the text, and it is pretty dang good at figuring out tables. (Let me tell you, doing tables manually will make you want to kill yourself.)

Here is a list of a bunch of different OCR programs: https://en.wikipedia.org/wiki/Compar...ition_software

There isn't really a "guide", just that in my experience, the Free OCR tools (Tesseract, FreeOCR, etc. etc.) do not recognize a lot of that "complex" formatting as accurately as something like Finereader.

And it is exactly as Toxaris stated:

Quote:

Originally Posted by Toxaris

By checking. Some of them are handled correctly by the OCR, but not all. That is one of the reasons that I created my add-on to automate a lot of tasks to fix these (and a lot of other) OCR mistakes. In case of doubt, manual intervention is required.

There is just nothing you can do besides manual checking/fixing. PDF was designed as a final/output format, and is dreadful as an INPUT format.

Also, another disadvantage of the free stuff, you are most likely going to have to do A LOT of your own training. For example, here is the training manual for Tesseract:

https://code.google.com/p/tesseract-...ningTesseract3

While the default training included with the program probably works perfectly fine for basic things like novels, and cleaner scans, it will probably require more training if the book you are dealing with has older/more obscure fonts, or when dealing with non-English languages. (Even a lot of "English" books have a lot of accented characters, and letters out of the usual A-Z subset).

In Finereader, you are also paying for the massive amount of training that THEY have already done for you (on the millions and millions of documents they process). This again, will lead to more accurate results than otherwise.

Remember, the more accurate the OCR is, the less time you have to spend actually cleaning up the wrong output.

So with free, sure, it might cost you $0 initially, but then you spend many more hours double-checking/cleaning up the output.

Edit: Actually, now that I reread u238110's post, he MAY have meant how I handle coding those things in actual EPUB.

I explained Tables/Footnotes/Formulas/Figures/Images towards the bottom of this post (with links to the specific topics/real-life examples):

https://www.mobileread.com/forums/sho...68&postcount=8

Headers/Footers can just be trashed. Finereader does a good job at recognizing them in the document, and just allows you to easily export without those included (again, this is an area where the free stuff might lack, and you would have to spend time manually removing).
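For anyone stuck removing headers/footers by hand from plain OCR text, here is a minimal Python sketch of one way to automate it: drop the first or last line of a page when (ignoring digits) the same line shows up on most pages. This is only an assumption about a workable heuristic, not how Finereader does it; the function name, the 60% threshold and the sample pages are all made up.

```python
from collections import Counter

def strip_running_lines(pages: list[list[str]], min_share: float = 0.6) -> list[list[str]]:
    """Drop first/last lines that repeat on at least min_share of pages."""
    def normalise(line: str) -> str:
        # Ignore digits so "Page 12" and "Page 13" count as the same line.
        return "".join(ch for ch in line if not ch.isdigit()).strip().lower()

    tops = Counter(normalise(p[0]) for p in pages if p)
    bottoms = Counter(normalise(p[-1]) for p in pages if p)
    threshold = min_share * len(pages)

    cleaned = []
    for page in pages:
        lines = list(page)
        if lines and tops[normalise(lines[0])] >= threshold:
            lines = lines[1:]
        if lines and bottoms[normalise(lines[-1])] >= threshold:
            lines = lines[:-1]
        cleaned.append(lines)
    return cleaned

pages = [
    ["THE HOME COOK", "Beat the eggs.", "Page 12"],
    ["THE HOME COOK", "Add the flour.", "Page 13"],
    ["THE HOME COOK", "Bake for an hour.", "Page 14"],
]
print(strip_running_lines(pages))
```

Books that alternate the header between left and right pages (title on one, chapter name on the other) would need a lower threshold or separate handling for odd and even pages.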

klmmc13 · 07-30-2014, 12:21 AM

Thanks All, and especially Tex2002ans !!

I do appreciate you all taking the time to answer my query.....

I THINK the answer was Yes, AbbyyFinereader, and possibly a few others are able to OCR the page images within a PDF WITHOUT disassembling the PDF first. If the PDF needs to be disassembled first, I better get someone to teach me how...

FineReader will definitely make it onto my Xmas Wish List... In the meantime, I'll work on the books that others have already done the heavy lifting on.

My area of concentration / preference is the PD books that a lot of the Homeschoolers use: McGuffey's, Ray's (if I can figure out how to make those math problems do what I want WHERE I want them to do it), Primary Source Documents for History; Pleasure / Literature Reading for the younger bunch, and so forth.

Some of the story chapter books require minor updating, but a lot more are excellent as they are for vocabulary building, as well as "just for fun."

While a lot of the Homeschool crowd are printing PDFs and binding at 5x8, usually it's because they don't have other options. I want to present them with another option -- most, if not all, of that year's books ready to load onto a reader.
(When my kids were Homeschooling, they'd much rather have their books on the readers, than a half-size notebook or home-bound edition.)

Again, Thanks All
Kathy MamaDragon

mrmikel · 07-30-2014, 06:44 AM

I think you could break pdfs apart using the combination of ghostscript and gsview. But finding a server that actually works to download gsview can be a challenge.
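For the free route, the page images can be pulled out with Ghostscript alone from the command line (no gsview needed) and then fed to Tesseract. Below is a rough Python sketch that builds and runs those two commands. It assumes the `gs` and `tesseract` programs are installed and on the PATH (on Windows the Ghostscript console program is usually named `gswin64c` instead); `book.pdf`, the `pages` folder and the function names are placeholders.

```python
import subprocess
from pathlib import Path

def ghostscript_command(pdf: str, out_dir: str, dpi: int = 300) -> list[str]:
    """Build a Ghostscript call that renders each PDF page to a grayscale TIFF."""
    return [
        "gs", "-dNOPAUSE", "-dBATCH", "-dSAFER",
        "-sDEVICE=tiffgray", f"-r{dpi}",
        f"-sOutputFile={out_dir}/page-%04d.tif",
        pdf,
    ]

def tesseract_command(image: str, lang: str = "eng") -> list[str]:
    """Build a Tesseract call; it writes <image stem>.txt next to the image."""
    out_base = str(Path(image).with_suffix(""))
    return ["tesseract", image, out_base, "-l", lang]

def ocr_pdf(pdf: str, out_dir: str) -> None:
    """Render the PDF to page images, then OCR each page (needs both tools installed)."""
    Path(out_dir).mkdir(parents=True, exist_ok=True)
    subprocess.run(ghostscript_command(pdf, out_dir), check=True)
    for image in sorted(Path(out_dir).glob("page-*.tif")):
        subprocess.run(tesseract_command(str(image)), check=True)

if __name__ == "__main__":
    # "book.pdf" and "pages" are placeholder names; this only prints the commands.
    print(ghostscript_command("book.pdf", "pages"))
    print(tesseract_command("page-0001.tif"))
```

Calling `ocr_pdf("book.pdf", "pages")` would leave one `.tif` and one `.txt` per page in the folder. The output is plain text only, so all the formatting and proofreading work discussed earlier in the thread still applies.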
