07-25-2014, 05:55 PM   #1
klmmc13
Member
Posts: 21
Karma: 12696
Join Date: Dec 2013
Device: nook color
Can you OCR the images inside of .pdf files?

Hello All--

Can you OCR the images inside of .pdf files? Or do you have to extract the images somehow to .tiff files and then feed them to OCR software? HOW??

I'd like to be able to (hopefully with no-cost/low-cost software) locate the scanned copy of books I own (or Public Domain) and, through whatever process, end up with a text / rtf / whatever file of the words on each page of the book.

Is there a tutorial, or section of the Wiki that I've overlooked/not found ??

I know I sound like a babbling idiot......

I've downloaded one-too-many public domain epubs from both Amazon and B&N, as well as Google & Internet Archive, that needed to be worked on. I've typed out by hand several books, especially cookbooks, and it's a tedious and long term project. There's got to be a better way!!!

(Moderators- please move this thread to whatever section it belongs in, if I'm in the wrong place.)

TIA
Kathy
MamaDragon

07-26-2014, 12:05 AM   #2
harriska2
Connoisseur
Posts: 92
Karma: 648532
Join Date: Oct 2010
Location: Corvallis, OR
Device: None
Kathy, I am a scanner/OCR lover. I use ABBYY Finereader. I just drag the file (PDF) into the software and it does its magic. I can run them for you if you need. You would then compare the opened PDF with the OCR version and correct errors. It can take a bit of time.

07-26-2014, 12:59 PM   #3
mrmikel
Color me gone
Posts: 2,086
Karma: 1444487
Join Date: Apr 2008
Location: Central Oregon Coast
Device: PRS-300
Be aware that a good error rate for OCR is 1%. That translates to an error PER PAGE, so it will take some good proofreading to make sure there are no errors. It would be a mighty shame to put in 1 cup when 1/4 cup was in the original!!!!

07-26-2014, 06:03 PM   #4
Jellby
frumious Bandersnatch
Posts: 6,310
Karma: 4898871
Join Date: Jan 2008
Location: Spaniard in Sweden
Device: Cybook Orizon, Kobo Aura
Quote:
Originally Posted by mrmikel View Post
Be aware that a good error rate for OCR is 1%. That translates to an error PER PAGE
Hmm... That's one error per 100 characters. There are usually many more than 100 characters per page. It's usually something like 200 every 3 lines, with ~30 lines per page, which would make 2000 characters per page. That is 20 errors per page.

Unless the 1% error rate refers to words, not to characters. Then we can estimate around 250 words per page, and 2-3 errors per page.
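
For anyone who wants to check the arithmetic above, here is a minimal Python sketch. The page sizes are just the rough estimates from this post (about 2000 characters or 250 words per page), and the function name is made up for illustration.

```python
def errors_per_page(units_per_page, error_percent):
    """Expected number of errors on a page for a given error rate in percent."""
    return units_per_page * error_percent / 100


# Rough page size from the estimate above: 200 characters per 3 lines, ~30 lines
chars_per_page = 200 * 30 // 3
words_per_page = 250

print(errors_per_page(chars_per_page, 1))  # 1% counted per character
print(errors_per_page(words_per_page, 1))  # 1% counted per word
```

Either way, the count scales linearly with page length, so denser pages need proportionally more proofreading.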

07-27-2014, 07:55 AM   #5
mrmikel
I think you are right, Jellby. The error rates I found are per character. That only makes close proofreading more imperative, and almost word-for-word examination is required. Close guesses will work much of the time, but when they don't, the whole meaning of the sentence is easily changed.

Columns are a particular bugbear. It is very common for parts of paragraphs to be shifted down the page and it can be remarkably hard to spot in casual reading. The tops of pages where the text in the right column is only a line or two is where it often happens.

Last edited by mrmikel; 07-27-2014 at 08:00 AM.

07-27-2014, 11:22 AM   #6
Tex2002ans
Fanatic
Posts: 539
Karma: 562971
Join Date: Jul 2012
Device: Nook
Quote:
Originally Posted by klmmc13 View Post
I'd like to be able to (hopefully with no-cost/low-cost software) locate the scanned copy of books I own (or Public Domain) and thru whatever processes, end up with a text / rtf/ whatever... file of the words on that page/book.

Is there a tutorial, or section of the Wiki that I've overlooked/not found ??
I just posted a few days ago here, pointing to a lot of the other topics where I discussed OCR in-depth:

http://www.mobileread.com/forums/sho...d.php?t=243021

Overall, I would say it comes down to how much you value your own time.
  • Typing stuff in manually by hand
    • Takes FOREVER.
    • No matter how accurate a typist you are, you WILL make lots of mistakes.
      • I did a really short book like this (~25 pages)... NEVER AGAIN
    • You will have to manually type all of the formatting. (Italics, Bold, Smallcaps, Headings, etc. etc.)
      • This is overhead you don't give much thought, but it takes up a very large amount of time.
  • Going with a lot of the free OCR programs might get you
    • OK accuracy (DEFINITELY way faster/better than doing everything by hand from scratch).
    • Because it is not as accurate, you will still be spending a lot of time cleaning up and typo checking the text afterwards.
  • Going with a more expensive OCR program
    • Much more accurate
    • Finereader costs a nice chunk of change
      • Purchase an older version, Finereader 9/10/11 are all perfectly fine
    • The amount of time you save from getting more accurate OCR is well worth the money.
    • You will spend a heck of a lot less time editing the final product.
      • This means you can go work on converting WAY more books in the same time period!

I have conversions down to an average of ~8-15 hours to go from OCR -> completed EPUB (I tackle non-fiction economics books, different genres are probably faster/slower, and when you first start out, it will be much slower).

Manually typing in everything, or working from much less accurate OCR, while "free" (as in, I didn't pay any money for tools), would cost you WAY more in man-hours.

Quote:
Originally Posted by Jellby View Post
Hmm... That's one error per 100 characters. There are usually many more than 100 characters per page. It's usually something like 200 every 3 lines, with ~30 lines per page, which would make 2000 characters per page. That is 20 errors per page.

Unless the 1% error rate refers to words, not to characters. Then we can estimate around 250 words per page, and 2-3 errors per page.
Hmmm, well Finereader calls it "Low Confidence Characters" (it highlights those in light blue). I just checked a few pages of the journal I am converting, and the pages were anywhere from 0%-5% of "Low Confidence Characters". (~20-120 out of ~3300 characters per page). Although out of those few percent "unsure", it has guessed nearly all of those correctly.

And remember, pure character accuracy doesn't take into account a HECK of a lot of other formatting in a book as well:
  • bold/italic/smallcaps
  • superscript/subscript
  • chapter headings
  • paragraph breaks
  • lists
  • footnotes
  • headers/footers
  • images/formulas/figures/tables
  • whether a hyphen at the end of a line is an actual hyphen or just a soft hyphen
  • ...

A more expensive OCR program would typically handle these much better than the free OCR stuff.
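
To illustrate the end-of-line hyphen item from the list above, here is a rough Python sketch of one way to clean it up after OCR. This is only an assumption about how such a cleanup could work, not what Finereader or any other program does internally. The function name and the tiny word list are made up; a real run would need a proper dictionary.

```python
import re


def join_line_end_hyphens(text, known_words):
    """Rejoin words split by a hyphen at the end of a line.

    If the joined word is in known_words, the hyphen is treated as a soft
    hyphen and dropped; otherwise the hyphen is kept as a real one.
    """
    pattern = re.compile(r"(\w+)-\n(\w+)[ \t]*")

    def fix(match):
        first, second = match.group(1), match.group(2)
        if (first + second).lower() in known_words:
            return first + second + "\n"   # soft hyphen: drop it
        return first + "-" + second + "\n"  # real hyphen: keep it

    return pattern.sub(fix, text)


sample = "a conver-\nsion of a well-\nknown book"
print(join_line_end_hyphens(sample, {"conversion"}))
```

Anything the word list does not recognize keeps its hyphen, so those cases still need a human look.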

Also, the overall character accuracy depends on what kind of text you are converting.

Cookbooks are probably going to have a lot of lists and fractions and images. I bet Finereader would do a much more accurate job at recognizing and accounting for these than a lot of the free OCR programs out there.

If you are working from scanned older material, working from a crappy picture (let's say, you take it with your phone), or the scans are subpar (people who write/underline in the books, water damage, blotches, etc. etc.), accuracy goes WAY down.

Archive.org scans are probably going to have much higher OCR errors than if you were working from a crisp digital image from a newer book.

Here too, the paid programs will probably handle crappier source material better than the free OCR solutions.

Quote:
Originally Posted by mrmikel View Post
The error rates I found are per character. That only makes close proofreading more imperative, and almost word-for-word examination is required. Close guesses will work much of the time, but when they don't, the whole meaning of the sentence is easily changed.
That is another advantage I found while using something like Finereader.

A lot of these free OCR programs will just export the OCR output. In Finereader, you can use the GUI.

It highlights characters that it is "unsure" about. You can then easily look through and pay much closer attention to THOSE sections only. This saves a massive amount of time, since you don't have to waste much of your time looking at every word in the entire book, and you can focus on that 1-5% that is "unsure".

You also get the dictionary support, so it underlines words that are spelled wrong. (Again, you can focus a lot more attention on these than if you had to closely scrutinize every word under the sun).

You can also QUICKLY A/B compare with the source, and you can have a magnification view set up. For example, here are two images just showing off the types of A/B compare. Magnification, or side-by-side:

[Attached image: MagnificationABCompare1.png]
[Attached image: MagnificationABCompare2.png]

Quote:
Originally Posted by mrmikel View Post
Columns are a particular bugbear. It is very common for parts of paragraphs to be shifted down the page and it can be remarkably hard to spot in casual reading. The tops of pages where the text in the right column is only a line or two is where it often happens.
Ugh... double-column text.... makes me want to pull my hair out. Luckily I rarely have to convert that.

The ABSOLUTE WORST is "newspaper" type material, where they have stories that get cut into pieces and "continue on Page C3". So a single page can have about two or three running stories on it, that connect together like a giant spaghetti monster.

Last edited by Tex2002ans; 07-27-2014 at 11:57 AM. Reason: Added Images

07-27-2014, 11:30 AM   #7
klmmc13

Quote:
Originally Posted by harriska2 View Post
Kathy, I am a scanner/OCR lover. I use ABBYY Finereader. I just drag the file (PDF) into the software and it does its magic. I can run them for you if you need. You would then compare the opened PDF with the OCR version and correct errors. It can take a bit of time.
Harriska- I've been out of the arena too long... what is the output format that ABBYY FineReader gives you? I'm interested... Can you explain to this Old Dog who's trying to learn new tricks??

I'm looking at working thru books like McGuffey's Readers, cookbooks, and so forth, that most of the other people who are rejuvenating Public Domain reading have passed over.

Thanks!
Kathy
MamaDragon

07-27-2014, 11:44 AM   #8
klmmc13
Don't you just LOVE cross-posting...?!

Tex2002ans- THANKS!!! I'll have to tough it out for now with something less expensive, until I can either find a way to make Finereader pay for itself (performing services for others), or my pocket money funds get a raise....
[in other words - Time I Got.. Funds... not so much]

I'm sure that a LOT of your information will still teach me things I need to know.

MrMikeL - yes, I'm still looking at word-by-word proofing, but that's better MOST of the time than typing it all in from scratch.

Thanks All
Kathy

07-27-2014, 11:45 AM   #9
Tex2002ans
Quote:
Originally Posted by klmmc13 View Post
Harriska- I've been out of the arena too long... what is the output format that ABBYY FineReader gives you?
DOCX, DOC, RTF, ODT, EPUB (This was added in Finereader 10), FB2, HTML, TXT, PDF, DJVU.

Those are probably the most common; there are a few more (XLS, CSV) that probably wouldn't be used in your typical book.

Quote:
Originally Posted by klmmc13 View Post
Don't you just LOVE cross-posting...?!
Well when the same stuff gets said again and again (I mean, someone JUST posted this topic last week).... It sort of gets boring having to type out a lot of the same info when it was already covered about a thousand times. :P

Quote:
Originally Posted by klmmc13 View Post
Tex2002ans- THANKS!!! I'll have to tough it out for now with something less expensive, until I can either find a way to make Finereader pay for itself (performing services for others), or my pocket money funds get a raise....
[in other words - Time I Got.. Funds... not so much]
Well again, just think of it in terms of productivity. You spend a little now, and you just saved yourself HOURS and HOURS of headache. That means WAY more cookbooks can get out there! :P

And let's be serious.... proofing crappy OCR is boring stuff... proofing better OCR is much less boring. The quicker you finish, the more time you can spend actually READING the cookbooks (or cooking)!

You can get a copy of Finereader much cheaper on eBay or something similar:

http://www.ebay.com/sch/i.html?_from...eader&_sacat=0

As I said, just hunt for 9 or 10, they are perfectly fine. No need for 12 (I would actually recommend AGAINST 12). Stick with 9/10/11.

Quote:
Originally Posted by klmmc13 View Post
I'm sure that a LOT of your information will still teach me things I need to know.
Heh, definitely don't fear asking. I can probably hunt down some other posts I made explaining all the questions. Although as you can see, I am quite technical. (Perhaps not everyone's cup of tea).

Also, if you use Microsoft Word (2007+), Toxaris came up with this ePUB Tools addon which really speeds things up: http://www.mobileread.com/forums/sho...d.php?t=213372

Last edited by Tex2002ans; 07-27-2014 at 11:56 AM.

07-27-2014, 11:56 AM   #10
mrmikel
In forums, TMI is better than TLI!

07-27-2014, 09:04 PM   #11
u238110
Connoisseur
Posts: 54
Karma: 10
Join Date: Feb 2014
Location: Long Island, NY
Device: none
Quote:
Originally Posted by Tex2002ans View Post
And remember, pure character accuracy doesn't take into account a HECK of a lot of other formatting in a book as well:
  • bold/italic/smallcaps
  • superscript/subscript
  • chapter headings
  • paragraph breaks
  • lists
  • footnotes
  • headers/footers
  • images/formulas/figures/tables
  • whether a hyphen at the end of a line is an actual hyphen or just a soft hyphen
  • ...
Tex2002ans, what would you say is the best guide on how to handle these complicated elements? For EPUB...

07-28-2014, 02:15 AM   #12
Toxaris
Wizard
Posts: 3,182
Karma: 7180223
Join Date: Oct 2009
Location: Heemskerk, NL
Device: PRS-300, PRS-T1
By checking. Some of them are handled correctly by the OCR, but not all. That is one of the reasons that I created my add-on to automate a lot of tasks to fix these (and a lot of other) OCR mistakes. In case of doubt, manual intervention is required.

07-28-2014, 01:00 PM   #13
Tex2002ans
Quote:
Originally Posted by u238110 View Post
Tex2002ans, what would you say is the best guide on how to handle these complicated elements? For EPUB...
Also toss images + tables onto that "complex list" too.

Finereader does a pretty decent job at separating images from the text, and it is pretty dang good at figuring out tables. (Let me tell you, doing tables manually will make you want to kill yourself).

Here is a list of a bunch of different OCR programs: https://en.wikipedia.org/wiki/Compar...ition_software

There isn't really a "guide", just that in my experience, the Free OCR tools (Tesseract, FreeOCR, etc. etc.) do not recognize a lot of that "complex" formatting as accurately as something like Finereader.

And it is exactly as Toxaris stated:

Quote:
Originally Posted by Toxaris View Post
By checking. Some of them are handled correctly by the OCR, but not all. That is one of the reasons that I created my add-on to automate a lot of tasks to fix these (and a lot of other) OCR mistakes. In case of doubt, manual intervention is required.
There is just nothing you can do besides manual checking/fixing. PDF was designed as a final/output format, and is dreadful as an INPUT format.

Another disadvantage of the free stuff: you are most likely going to have to do A LOT of your own training. For example, here is the training manual for Tesseract:

https://code.google.com/p/tesseract-...ningTesseract3

While the default training included with the program probably works perfectly fine for basic things like novels, and cleaner scans, it will probably require more training if the book you are dealing with has older/more obscure fonts, or when dealing with non-English languages. (Even a lot of "English" books have a lot of accented characters, and letters out of the usual A-Z subset).
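
Before getting into training, the stock language data can be selected on the Tesseract command line with `-l`. Here is a minimal Python sketch, assuming the `tesseract` executable is installed and on the PATH, that the language data for each code is installed, and that `page-0001.tif` is a hypothetical page image. The helper names are made up for illustration.

```python
import subprocess


def tesseract_command(image_path, output_base, languages=("eng",)):
    """Build a Tesseract CLI call; several languages are joined with '+'."""
    return ["tesseract", image_path, output_base, "-l", "+".join(languages)]


def run_ocr(image_path, output_base, languages=("eng",)):
    """Run Tesseract; it writes the recognized text to <output_base>.txt."""
    subprocess.run(tesseract_command(image_path, output_base, languages), check=True)


# English plus French language data, for a page with accented characters
print(tesseract_command("page-0001.tif", "page-0001", ("eng", "fra")))
```

Nothing is executed until `run_ocr` is called, so the command can be inspected first.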

In Finereader, you are also paying for the massive amount of training that THEY have already done for you (on the millions and millions of documents they process). This again, will lead to more accurate results than otherwise.

Remember, the more accurate the OCR is, the less time you have to spend actually cleaning up the wrong output.

So with free, sure, it might cost you $0 initially, but then you spend many more hours double-checking/cleaning up the output.

Edit: Actually, now that I reread u238110's post, he MAY have meant how I handle coding those things in actual EPUB.

I explained Tables/Footnotes/Formulas/Figures/Images towards the bottom of this post (with links to the specific topics/real-life examples):

http://www.mobileread.com/forums/sho...68&postcount=8

Headers/Footers can just be trashed. Finereader does a good job at recognizing them in the document, and just allows you to easily export without those included (again, this is an area where the free stuff might lack, and you would have to spend time manually removing).

07-30-2014, 01:21 AM   #14
klmmc13
Thanks All, and especially Tex2002ans !!

I do appreciate you all taking the time to answer my query.....

I THINK the answer was Yes: ABBYY FineReader, and possibly a few others, are able to OCR the page images within a PDF WITHOUT disassembling the PDF first. If the PDF needs to be disassembled first, I'd better get someone to teach me how...

FineReader will definitely make it onto my Xmas Wish List... In the meantime, I'll work on the books that others have already done the heavy lifting on.


My area of concentration / preference is the PD books that a lot of the Homeschoolers use: McGuffey's, Ray's (if I can figure out how to make those math problems do what I want WHERE I want them to do it), Primary Source Documents for History, Pleasure / Literature Reading for the younger bunch, and so forth.

Some of the story chapter books require minor updating, but a lot more are excellent as they are for vocabulary building, as well as "just for fun."

While a lot of the Homeschool crowd are printing PDFs and binding at 5x8, usually it's because they don't have other options. I want to present them with another option -- most, if not all, of that year's books ready to load onto a reader.
(When my kids were Homeschooling, they'd much rather have their books on the readers, than a half-size notebook or home-bound edition.)

Again, Thanks All
Kathy MamaDragon

07-30-2014, 07:44 AM   #15
mrmikel
I think you could break pdfs apart using the combination of ghostscript and gsview. But finding a server that actually works to download gsview can be a challenge.
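
A rough sketch of the Ghostscript half of this, in Python: it only builds the command that renders each PDF page to its own grayscale TIFF, which could then be fed to an OCR program. It assumes Ghostscript is installed (the executable is usually `gs` on Linux/Mac and `gswin64c` on 64-bit Windows). The file names, the 300 dpi default, and the helper name are made up for illustration.

```python
import subprocess


def ghostscript_tiff_command(pdf_path, out_pattern="page-%04d.tif", dpi=300, exe="gs"):
    """Build a Ghostscript call that writes one TIFF per PDF page."""
    return [
        exe,
        "-dNOPAUSE",                      # do not pause between pages
        "-dBATCH",                        # exit when the file is done
        "-sDEVICE=tiffgray",              # 8-bit grayscale TIFF output
        f"-r{dpi}",                       # render resolution in dpi
        f"-sOutputFile={out_pattern}",    # %04d becomes the page number
        pdf_path,
    ]


def split_pdf(pdf_path):
    """Run Ghostscript on pdf_path; raises if the command fails."""
    subprocess.run(ghostscript_tiff_command(pdf_path), check=True)


print(" ".join(ghostscript_tiff_command("scanned-book.pdf")))
```

This skips gsview entirely, since the command-line program alone is enough to split the pages.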