MobileRead Forums - View Single Post

wayrad · 04-26-2010, 05:12 PM

It sounds like you need a basic understanding of the process before you worry too much about details of equipment and the final destination filetype.

OK, what you need to do, in brief, is 1) get pictures of your book pages - this is where the scanner comes in, and is probably 5% or less of the work, 2) "recognize" the letters and words in the image (OCR) and convert them to a text format that can be edited, searched, "read" by other programs, and converted to other text formats, 3) clean up, spellcheck, and proofread the file, 4) convert the file to your final format of choice, and 5) go back and fix all the problems that you missed before or that just popped up (this is practically inevitable).

You're at Step 1, and giving that file to Calibre (Step 4) is sort of like trying to feed a man a picture of a sandwich.

As for specific tools, only a sheet-fed scanner requires removing the binding. Your other options are a flatbed scanner (there is a specialized flatbed for books called the OpticBook, but ordinary flatbeds work) or digital photography. You can scan with whatever software your scanner came with and save page images in any format your OCR software will accept. You may have gotten an OCR package with your scanner - some even come with a basic ABBYY FineReader version called Sprint. If not, you'll need to buy an OCR package (I recommend FineReader). Once it has "recognized" the text, FineReader can save it to numerous formats, but quite a few of us like to save to Word because of its excellent search-and-replace capabilities; you can also use Word's spellchecker instead of FineReader's if you prefer. For final conversion, you've already discovered Calibre, and there are other specialized tools out there too.
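If you would rather script part of the cleanup than do it by hand in Word, here is a minimal sketch in Python of the kind of search-and-replace pass described above. It assumes the OCR output has been saved as plain text. The function name `clean_ocr_text` and the particular patterns are my own illustration, not anything FineReader or Calibre provides, and the rules will almost certainly need adjusting for each book.

```python
import re

def clean_ocr_text(text):
    """Apply a few common search-and-replace fixes to raw OCR text.

    The rules here are illustrative assumptions, not a complete cleanup.
    """
    # Rejoin words hyphenated across a line break: "exam-\nple" -> "example".
    # Caution: this also merges genuine hyphenated compounds split at a line end.
    text = re.sub(r"(\w)-\n(\w)", r"\1\2", text)
    # Turn single line breaks inside a paragraph into spaces,
    # leaving blank lines (paragraph breaks) untouched.
    text = re.sub(r"(?<!\n)\n(?!\n)", " ", text)
    # Collapse runs of spaces or tabs into one space.
    text = re.sub(r"[ \t]{2,}", " ", text)
    # Remove the stray space OCR often leaves before punctuation.
    text = re.sub(r" +([,.;:!?])", r"\1", text)
    return text.strip()
```

Run it on a copy of the file, never the only one, since a blanket replace can easily do damage that only proofreading will catch.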

Exact formats, software packages, and workflows vary greatly from one person to another, so this is a very rough guide. It is important to remember that scanning is the easiest part of the job - some equipment is faster than others, and some may give fewer OCR errors thanks to better image quality, but there will be errors, and they will require painstaking, nitpicking, laborious proofreading. I can produce a book a week if I spend all my spare time on it, and even then it's not good enough to show anyone else, even if copyright law permitted.
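To give an idea of why the proofreading is so laborious, here is a rough Python sketch of a helper that lists words worth a second look. It is only an assumption of how one might do it: `flag_suspect_words` and the `known_words` list are hypothetical names, and you would have to supply a real word list yourself. It flags tokens that mix letters and digits (a classic OCR slip, such as "1" read for "l") and tokens not found in the word list.

```python
import re

def flag_suspect_words(text, known_words):
    """Return tokens that deserve a human look, in order of appearance.

    known_words is a set of lowercase words assumed to be spelled correctly.
    """
    suspects = []
    for token in re.findall(r"[A-Za-z0-9']+", text):
        has_letter = any(ch.isalpha() for ch in token)
        has_digit = any(ch.isdigit() for ch in token)
        if has_letter and has_digit:
            # Letters mixed with digits, e.g. "c1ean" for "clean".
            suspects.append(token)
        elif has_letter and token.lower() not in known_words:
            # Not in the supplied word list; may be a misread or just a rare word.
            suspects.append(token)
    return suspects
```

A list like this only narrows the search. A misread that produces another real word ("modem" for "modern") passes any dictionary check, which is why a full read-through is still needed.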

Hope this helps.

P.S. One thing that may be causing confusion is that there are such things as "searchable PDFs", which contain information about the actual letters and words represented. That's usually because the PDF was made from a file that already had the information. With a nonsearchable PDF, your computer has no way of knowing whether the image represents War and Peace or a snapshot of your cat.

Last edited by wayrad; 04-26-2010 at 07:39 PM.