Quote:
Originally Posted by Marcy
Sorry for hijacking the thread, but how long does it take you to do an entire paperback? How are you scanning?
The only way I see to do this with any speed is to take apart the book so the pages could be put through an ADF instead of having to turn the pages and flatten the book each time. Is that what you're doing or do you have an alternative?
|
There are a lot of methods of digitizing text described at
http://www.diybookscanner.org/
They pretty much come down to:
- Destructive
  - Cut off the binding, feed the pages through a machine.
    - Advantage: Fast, high quality scans.
    - Disadvantage: You "lose" the book. (You just get sheets of paper out of it.)
- Non-destructive
  - Take images using a camera
    - Advantage: Fast
    - Disadvantage: Might not be high enough resolution/DPI (may look fine to the human eye, but be inaccurate when OCRed). Depending on your setup, you may get inconsistent images.
  - Scanner
    - Advantage: High quality.
    - Disadvantage: SLOOOOOOOW
Quote:
Originally Posted by rkomar
Maybe you're just guesstimating the accuracy, but 95% is not good. 95% for characters is terrible, and 95% for words is marginally acceptable. A typical printed page has something like 50 characters per line and 40 lines per page, so about 2000 characters per page. A 95% success rate per character would result in about 100 bad characters per page. A 95% success rate per word would bring that down to about 20 or 25 bad words per page. Even 99% accuracy produces more errors than most people like. You'd have to get to about 99.9% accuracy before you could think about not proofing the text afterwards.
|
Even "99.9%" accuracy leaves an unacceptable number of errors when reading. I just completed a 430-page non-fiction economics book; the character count is 854196 characters. 99.9% accuracy means that there would be ~850 errors. I do not believe the OCR "accuracies" the companies throw out take into account formatting errors (wrong italic/bold/superscript/subscript, ...), which get introduced as well.
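If you want to run the numbers for your own book, here is a rough sketch of the arithmetic above. The function name `expected_errors` is just something I made up for illustration, and it assumes errors are spread evenly, so treat the result as a ballpark only. It also only counts misread characters (or words), not the formatting errors.

```python
def expected_errors(total_units: int, accuracy: float) -> int:
    """Ballpark count of misread units (characters or words) at a given accuracy.

    Assumes every unit has the same chance of being wrong, which real OCR
    output will not follow exactly.
    """
    return round(total_units * (1.0 - accuracy))


# Character count of the 430-page book mentioned above.
book_chars = 854196

for accuracy in (0.95, 0.99, 0.999, 0.9999):
    errors = expected_errors(book_chars, accuracy)
    print(f"{accuracy:.2%} accurate: roughly {errors} bad characters in the book")

# The per-page estimate from the quoted post: about 2000 characters per page.
print(expected_errors(2000, 0.95), "bad characters per page at 95%")
```

Swap in your own character count (most editors or `wc -m` will give you one) to see how many fixes you are signing up for before you start proofing.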
Then on top of the OCR, you have to fix broken paragraphs, add in proper indentation, check for missing quotation marks, add in blockquotes, check for actual typos/errors in the physical/PDF book, etc. etc.
I do book conversion professionally, and mostly work with non-fiction economics books (lots of footnotes). Other types of books might be easier/faster, but if I want to completely proof a book and get a finalized EPUB out of it, it takes me ~8-15 hours of work (although when I first started it used to take me ~2 weeks to convert a book).
I explained a lot of the method in here:
https://www.mobileread.com/forums/sho...d.php?t=223817
and in here:
https://www.mobileread.com/forums/sho...d.php?t=234146
I personally use ABBYY FineReader (because in my testing it has been the most accurate). But the same methods should apply no matter what OCR program you are using.