MobileRead Forums - View Single Post - Google Books -> eBook.

rrm3 · 08-04-2007, 11:38 PM

Hello there, I have not posted here before. I wanted to start documenting a good process for downloading a book from Google Books (which consists of scanned page images) and preparing it for submission to pgdp.net using netpbm, imagemagick, tesseract, and abiword (all of which I believe can be made to run on most platforms).

The idea is for the professionals who are already doing this with free software to share how to do it quicker and better.

# first, convert the PDF into image files (one image per page).
$ pdftoppm Book.pdf Book

# next, remove any extraneous border from each image file.
$ for i in Book-*.ppm; do pnmcrop "$i" > "$(basename "$i" .ppm)-crop.pnm"; done

# convert each image file into a format the ocr software likes.
$ for i in Book-*-crop.pnm; do pnmtotiff "$i" > "$(basename "$i" -crop.pnm).tiff"; done
$ for i in Book-*.tiff; do convert "$i" -colorspace GRAY -depth 8 "$(basename "$i" .tiff)-ocr.tiff"; done
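
The loops above build each output name by stripping a suffix with basename and appending a new one. As a minimal sketch of the same naming scheme, the suffix can also be stripped with the shell's own parameter expansion. The function names crop_name, tiff_name and ocr_name are hypothetical helpers, not part of the original steps, and the page name Book-01.ppm is only an example (the exact digits pdftoppm uses may differ).

```shell
#!/bin/sh
# Hypothetical helpers (not part of the original post): derive each output
# name from the input name with ${var%suffix} instead of calling basename.

crop_name() { printf '%s\n' "${1%.ppm}-crop.pnm"; }   # Book-01.ppm      -> Book-01-crop.pnm
tiff_name() { printf '%s\n' "${1%-crop.pnm}.tiff"; }  # Book-01-crop.pnm -> Book-01.tiff
ocr_name()  { printf '%s\n' "${1%.tiff}-ocr.tiff"; }  # Book-01.tiff     -> Book-01-ocr.tiff

# Example: print the planned names for one page without touching any files.
page="Book-01.ppm"
crop=$(crop_name "$page")
tiff=$(tiff_name "$crop")
printf '%s\n' "$page" "$crop" "$tiff" "$(ocr_name "$tiff")"
```

Unlike basename, the expansion keeps any leading directory, which makes no difference here because every file sits in the current directory.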

# run each image file through the ocr software.
$ for i in Book-*-ocr.tiff; do tesseract "$i" "$(basename "$i" -ocr.tiff)"; done
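
Once the OCR pass has run, it can be worth checking that every page actually produced some text. The sketch below is an optional sanity check, not one of the original steps. It assumes tesseract wrote Book-NN.txt next to each Book-NN-ocr.tiff, as the loop above asks it to, and the function name empty_ocr_pages is made up for illustration.

```shell
#!/bin/sh
# Hypothetical check (not from the original post): list OCR text files that
# are missing or empty, given the page images tesseract was run on.

empty_ocr_pages() {
    # $1 = directory holding Book-*-ocr.tiff and the matching Book-*.txt files
    for img in "$1"/Book-*-ocr.tiff; do
        [ -e "$img" ] || continue               # the glob matched nothing
        txt="${img%-ocr.tiff}.txt"              # same name tesseract was given, plus .txt
        [ -s "$txt" ] || printf '%s\n' "${txt##*/}"   # -s: exists and is non-empty
    done
}

# Check the current directory; prints nothing if every page has text.
empty_ocr_pages .
```

Blank pages will legitimately show up in this list, so it is a prompt to look at those pages rather than a sign that something failed.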

# convert page images to png files.
$ for i in Book-*-ocr.tiff; do convert "$i" "$(basename "$i" -ocr.tiff).png"; done

# make directories for Guiprep.
$ mkdir text pngs
$ mv *.png pngs/
$ mv *.txt text/
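
After the move, each page should have one file in pngs/ and one in text/ with the same base name. Here is a rough sketch of a check for that. It assumes the matching-name layout produced by the steps above is what the next tool wants, which I have not confirmed, and the function name unpaired_pages is hypothetical.

```shell
#!/bin/sh
# Hypothetical check (not from the original post): report page images in
# pngs/ with no same-named text file in text/, and the other way round.

unpaired_pages() {
    # $1 = directory containing the pngs/ and text/ subdirectories
    for png in "$1"/pngs/*.png; do
        [ -e "$png" ] || continue               # the glob matched nothing
        base=${png##*/}; base=${base%.png}      # strip directory and extension
        [ -e "$1/text/$base.txt" ] || printf 'no text for %s.png\n' "$base"
    done
    for txt in "$1"/text/*.txt; do
        [ -e "$txt" ] || continue
        base=${txt##*/}; base=${base%.txt}
        [ -e "$1/pngs/$base.png" ] || printf 'no image for %s.txt\n' "$base"
    done
}

# Check the current directory; prints nothing if every page is paired.
unpaired_pages .
```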

This is as far as I've gotten. Nice to meet you.
