MobileRead Forums - View Single Post

ereszet · 09-24-2007, 08:06 AM

Here I share my experience of converting tens of thousands of paper documents into digital form using my camera, the repro v-cradle setup of my own design described earlier, and lots of free, demo, and commercial software.

The source images of paper books and documents can come from a scanner, a digital camera, or djvu/pdf files found on the internet. I am discussing image pdf files rather than text pdf files. The best pdf format is "text under image" because you get an exact image of the original page plus a text layer underneath that can be indexed and searched on your computer (alas, no search is available on the Sony Reader or other book readers that I know).

If the original images are of good quality, the only tool you need is pdflrf by cacapee. It will resize and rotate the pages, remove the white background surrounding the text (if it is a clean background), and fatten the fonts so that they stand out more when displayed on the Sony Reader. It is extremely fast in comparison to any other program I know. As far as I understand, the most time-consuming stage of pdf to lrf conversion is extracting the images from the pdf. Usually djvu to lrf (by pdflrf) is much faster. I believe that the fastest process would be to use scanned or camera images as direct input to pdflrf, but I do not dare to ask cacapee to do that (I have already asked him to add png images to the output, for reasons that I will explain later). Pdflrf is available for DOS with a nice Windows interface and for Ubuntu Linux. Djvu conversion is faster in Ubuntu, but pdf conversion in Ubuntu is much slower (same algorithm according to cacapee, but apparently the DOS image extraction implementation is better than the one in Ubuntu). In addition, pdflrf in Ubuntu frequently complains with warning messages about missing fonts and "blocks" while still producing good results. Why it complains about missing fonts when the input is an image rather than text is beyond me.
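The two clean-up steps mentioned above (removing the white border and fattening the strokes) can be sketched roughly in Python. This is not pdflrf's actual code, which is not described here; the function names, the page representation (a list of rows of grayscale values) and the whiteness cutoff of 250 are all assumptions made for illustration. Fattening is approximated by a 3x3 "darkest neighbour" filter.

```python
# Illustrative only: a page is a list of rows, each row a list of
# grayscale values (0 = black ink, 255 = white paper).

def crop_white_margins(page, white=250):
    """Return the smallest rectangle containing every pixel darker than `white`."""
    rows = [y for y, row in enumerate(page) if any(p < white for p in row)]
    if not rows:
        return page  # blank page: nothing to crop
    cols = [x for x in range(len(page[0])) if any(row[x] < white for row in page)]
    top, bottom = rows[0], rows[-1]
    left, right = cols[0], cols[-1]
    return [row[left:right + 1] for row in page[top:bottom + 1]]


def fatten(page):
    """Thicken dark strokes: each pixel takes the darkest value in its 3x3 neighbourhood."""
    height, width = len(page), len(page[0])
    out = []
    for y in range(height):
        new_row = []
        for x in range(width):
            neighbours = [
                page[ny][nx]
                for ny in range(max(0, y - 1), min(height, y + 2))
                for nx in range(max(0, x - 1), min(width, x + 2))
            ]
            new_row.append(min(neighbours))
        out.append(new_row)
    return out


W, B = 255, 0
page = [
    [W, W, W, W, W, W],
    [W, W, W, W, W, W],
    [W, W, B, W, W, W],
    [W, W, W, W, W, W],
    [W, W, W, W, B, W],
    [W, W, W, W, W, W],
]
cropped = crop_white_margins(page)  # keeps only the area between the two ink pixels
fat = fatten(page)                  # each ink pixel grows into a 3x3 blob
```

Note that the crop only works on a clean background, as said above: a single dark speck near the edge of the page is enough to stop the margin from being removed.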

Of course, to convert your scanned or camera images to pdf or djvu you need a program that can do that. As there is a plethora of free programs for both Windows and Linux, I am not going to discuss them. Just note that you can print to pdf or djvu any document or image that can be opened in a program that allows printing (you cannot print lrf though - a challenge to developers).

In a perfect world pdflrf would be all you need to get lrf files readable on your Sony Reader (a few hundred pages in a free evening, including the time to photograph the documents). But the input to pdflrf may be of poor quality, either because you didn't take the time to shoot the photos properly, or because you use poor quality Google Books pdfs, or because you get djvu files from digital libraries that are based on old microfilms. So you need some preprocessing.

I will concentrate now on camera images. The first stage is to get your images from the camera to the computer. You may do it with the software provided by the camera manufacturer or by other developers, but I use Picasa, offered free by Google for both Windows and Linux. I also use it to automatically correct contrast and color in a batch of photos. Contrast is especially important for further processing. I once took a number of document photos just by "shooting from the hip" in a dark hotel room. In the resulting images the text was hardly discernible from the background. Picasa took care of that.
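What an automatic contrast correction does to a dim photo can be sketched as a simple linear stretch. How Picasa actually does it is not known to me; the function below, its name and its percentile parameters are assumptions for illustration only. It rescales the grayscale values so that the darkest pixels become black and the brightest become white.

```python
def stretch_contrast(page, low_pct=1, high_pct=99):
    """Linearly rescale grayscale values so the low/high percentiles map to 0/255.

    `page` is a list of rows of grayscale values (0 = black, 255 = white).
    Values outside the chosen percentiles are clipped.
    """
    values = sorted(p for row in page for p in row)
    lo = values[(len(values) - 1) * low_pct // 100]
    hi = values[(len(values) - 1) * high_pct // 100]
    if hi == lo:
        return [row[:] for row in page]  # flat image: nothing to stretch
    return [
        [min(255, max(0, (p - lo) * 255 // (hi - lo))) for p in row]
        for row in page
    ]


# A "dark hotel room" page: ink at 100, paper at only 120.
dim = [
    [100, 110],
    [120, 100],
]
fixed = stretch_contrast(dim, low_pct=0, high_pct=100)
```

After the stretch the ink and the paper are as far apart as the grayscale range allows, which is what the later black and white conversion needs.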

If you have a lot of time and patience, you can use Picasa to correct other aspects of your photos (cropping to remove the background, deskewing, correcting white balance, etc.), but only image by image rather than in a batch. It is manageable for a dozen images or so but would be rather time-consuming for a book with several hundred pages. Therefore we need other programs that can do the additional processing in batch.
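The difference between image-by-image and batch work is essentially one loop. A minimal sketch, assuming the pages sit in one folder as jpg files and that a single correction can be written as a function from file bytes to file bytes; the function name, the folder names and the `*.jpg` pattern are hypothetical:

```python
import tempfile
from pathlib import Path


def batch_process(src_dir, dst_dir, step, pattern="*.jpg"):
    """Apply `step` (bytes -> bytes) to every matching file in src_dir.

    Results are written under the same file name into dst_dir; the
    originals are left untouched. Returns the processed names in order.
    """
    src, dst = Path(src_dir), Path(dst_dir)
    dst.mkdir(parents=True, exist_ok=True)
    done = []
    for path in sorted(src.glob(pattern)):
        (dst / path.name).write_bytes(step(path.read_bytes()))
        done.append(path.name)
    return done


# Demo with throwaway files; bytes.upper stands in for a real image operation.
with tempfile.TemporaryDirectory() as tmp:
    src = Path(tmp) / "camera"
    src.mkdir()
    for name in ("page_002.jpg", "page_001.jpg", "notes.txt"):
        (src / name).write_bytes(b"raw")
    processed = batch_process(src, Path(tmp) / "out", lambda data: data.upper())
    outputs = sorted(p.name for p in (Path(tmp) / "out").iterdir())
```

Writing to a separate output folder keeps the original photos safe, so a bad setting only costs a rerun.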

The one I use for further batch processing is the commercial Finereader 8 program. Basically it is for OCRing image/pdf input (it does not take djvu), but its preprocessing abilities are very useful. With it you can split double pages, adjust the resolution (from 96 or even 72 dpi to 600 dpi and more), convert to black and white (the algorithm is quite good), deskew the lines of text (the algorithm is rather poor), clean small spots, and remove anything which is not text or images by saving only the recognized blocks (for batch saving this requires a special trick that I will describe in another post). Finally you can save all the pages to pdf ("text under image") or a number of other formats, but not djvu. I eagerly await the next version of Finereader, possibly with a better cropping option, better deskewing, and an option to save blocks of text and images in a batch process. Unfortunately, in my opinion, Finereader marketing policies keep them from issuing new versions fast (version 8 is more than 2 years old). Their developers are ingenious, but the marketing people do not want my money for a new version (same with Canon - my Powershot Pro 1 is more than 2 years old).
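Three of the preprocessing steps listed above (splitting double pages, converting to black and white, cleaning small spots) can be sketched in a few lines of Python. This is not how Finereader does it, since its algorithms are not public; the function names, the fixed threshold of 128, and the assumption that the gutter lies exactly in the middle of the spread are all simplifications for illustration.

```python
# A page is a list of rows of grayscale values (0 = black, 255 = white).

def split_double_page(spread):
    """Cut a two-page spread at the middle column (assumes a centred gutter)."""
    mid = len(spread[0]) // 2
    return [row[:mid] for row in spread], [row[mid:] for row in spread]


def to_black_and_white(page, threshold=128):
    """Map every pixel darker than `threshold` to black, the rest to white."""
    return [[0 if p < threshold else 255 for p in row] for row in page]


def remove_specks(page):
    """Turn black pixels with no black neighbour (8-connected) white."""
    height, width = len(page), len(page[0])
    out = [row[:] for row in page]
    for y in range(height):
        for x in range(width):
            if page[y][x] != 0:
                continue
            has_black_neighbour = any(
                page[ny][nx] == 0
                for ny in range(max(0, y - 1), min(height, y + 2))
                for nx in range(max(0, x - 1), min(width, x + 2))
                if (ny, nx) != (y, x)
            )
            if not has_black_neighbour:
                out[y][x] = 255
    return out


spread = [
    [200,  40, 210, 220, 230, 225],
    [205,  45, 215, 225,  60, 235],
    [210, 215, 220, 230,  70, 240],
    [ 30, 220, 225, 235, 240, 245],  # the 30 is a lone speck on the left page
]
left, right = split_double_page(spread)
left_clean = remove_specks(to_black_and_white(left))
right_clean = remove_specks(to_black_and_white(right))
```

A despeckle filter this crude will also eat isolated single-pixel marks that belong to the text, so on low resolution images it is safer to raise the resolution first, as in the order of steps given above.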

That's all for now, folks. My description of specialized software for processing poor images will follow soon. In the meantime, you can have a look (search the net) at BookRestorer (very expensive, comes with commercial photo scanners), ScanKromsator by bolega (Russian interface, free and powerful), or Snapter by Atiz (demo available - it is slow and not fit for real batch processing, but it is fun to experiment with). Those of you who have any experience with the free GIMP and ImageMagick, or with the commercial Adobe Photoshop/Lightroom plus various plug-ins, or with similar heavyweights, are welcome to share it. But remember that we are considering here speed, batch processing, ease of use, and a special application (not just improving family photos).

#177 · ereszet · Zealot · Posts: 118 · Karma: 306 · Join Date: Sep 2007 · Device: Sony PRS-500 Archos 704 wifi
Post title: Software tools to convert paper documents to lrf with thanks to cacapee for pdflrf