07-17-2010, 08:30 AM   #34
nyrath
Addict
Posts: 281
Karma: 52007
Join Date: Jun 2010
Device: nook
Based on recommendations from this forum, I got a Plustek OptiBook 3600. I've scanned four paperbacks so far, and it has worked reasonably well.

However, I have read reviews suggesting that the bulb in the scanner tends to burn out quickly, though those reviews were several years old.

The main problem I found is that some paperbacks are printed so close to the spine that occasionally a couple of letters get clipped from the words at the inner margin. This is not a problem with hardbacks or larger books.

The bundled OCR package seems to work as well as the $100 OCR program I bought years ago (TextBridge Pro 9.0). I get about one misrecognized word every four pages or so.

The scanner software saves all the scanned pages on your hard drive, so you could use another OCR program if you wish. I have a one-terabyte external hard drive, so space is not an issue. I scan in grayscale at 300 dpi and save as TIFF, so an entire paperback can take up 400 MB or so. Of course, once you've done the OCR, you can delete all the TIFF files.

The time-consuming part is the post-production. I scan, run the OCR, which loads the text into Microsoft Word, and save it as filtered HTML (I want to keep all the italic and bold formatting). I use a text editor (UltraEdit) to strip out all the <SPAN> tags and turn all the <P attribute1="xxx", attribute2="xxx"... tags into plain <P> tags. I use Calibre to turn the HTML into ePub. Then I use Sigil to put <h1> tags on the chapter headings (which generates the table of contents), manually strip out the headers/footers that say NOVEL NAME page x, and manually correct any spelling mistakes.
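
For anyone who would rather script the tag cleanup than do it with search-and-replace in an editor, here is a rough Python sketch of the same idea. It is not what I actually run (I use UltraEdit); the function names `clean_word_html` and `drop_running_headers` are made up for illustration, and it assumes the running header ends up as its own paragraph in the form "NOVEL NAME page 12". Regular expressions are good enough for Word's filtered HTML in simple cases, but they would trip over a ">" inside an attribute value.

```python
import re

# Opening or closing <span ...> tags, any case, any attributes.
SPAN_TAG = re.compile(r"</?span\b[^>]*>", re.IGNORECASE)
# Opening <p ...> tags with whatever attributes Word attached.
P_OPEN_TAG = re.compile(r"<p\b[^>]*>", re.IGNORECASE)
# Closing </p> tags, so the output uses one consistent case.
P_CLOSE_TAG = re.compile(r"</p\s*>", re.IGNORECASE)


def clean_word_html(html: str) -> str:
    """Drop <span> wrappers and strip attributes from <p> tags.

    Inline formatting such as <i> and <b> is left untouched.
    """
    html = SPAN_TAG.sub("", html)        # remove the tags, keep the text inside
    html = P_OPEN_TAG.sub("<p>", html)   # <P class=... style=...> becomes <p>
    html = P_CLOSE_TAG.sub("</p>", html)
    return html


def drop_running_headers(html: str, title: str) -> str:
    """Remove paragraphs that contain only '<title> page <number>'.

    Assumes clean_word_html() has already been applied, so every
    paragraph opens with a bare <p>.
    """
    pattern = re.compile(
        r"<p>\s*" + re.escape(title) + r"\s+page\s+\d+\s*</p>\s*",
        re.IGNORECASE,
    )
    return pattern.sub("", html)
```

You would read the filtered HTML file into a string, pass it through `clean_word_html`, then through `drop_running_headers` with the book's title, and write the result back out before handing it to Calibre. Headers that OCR garbled will not match the pattern, so a manual pass in Sigil is still needed.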

I can get a paperback up to the Sigil step in an hour or two, but proofreading and correcting can take quite a long time.

Last edited by nyrath; 08-01-2010 at 07:20 PM.