Old 08-04-2007, 11:38 PM   #1
rrm3
Location: Eugene, Oregon
Device: Sony PRS-500
Google Books -> eBook.

Hello there, I have not posted here before. I want to start documenting a good process for downloading a book from Google Books (which provides scanned page images) and preparing it for submission to pgdp.net using Netpbm, ImageMagick, Tesseract, and AbiWord (all of which I believe can be made to run on most platforms).
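
Before starting, it may help to confirm that the command-line tools are actually installed. The sketch below is only an illustration: missing_tools is a hypothetical helper name, and the command list is taken from the steps in this post (pdftoppm ships with Xpdf or Poppler rather than with Netpbm).

```shell
# Hypothetical helper: print each given command that is not on the PATH.
missing_tools() {
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
    done
}

# Commands used in the steps of this post.
missing=$(missing_tools pdftoppm pnmcrop pnmtotiff convert tesseract)
if [ -n "$missing" ]; then
    # $missing is left unquoted on purpose so the names print on one line.
    echo "Not found on PATH:" $missing >&2
fi
```

If nothing is printed, all five commands were found.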

The idea is for the professionals who already do this with free software to share how to do it faster and better.

# first, convert the PDF into image files (one image per page).
$ pdftoppm Book.pdf Book

# next, remove any extraneous border from each image file.
$ for i in Book-*.ppm; do pnmcrop $i > `basename $i .ppm`-crop.pnm; done
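
If the PDF name contains spaces, the page file names will too, and every expansion then needs quoting. A minimal sketch of the cropping loop with quoting throughout, deriving the output name with parameter expansion instead of basename. swap_suffix is a hypothetical helper, not part of Netpbm, and unlike basename it does not strip a leading directory, so it assumes the files are in the current directory.

```shell
# Hypothetical helper: replace one file name suffix with another,
# e.g. Book-001.ppm -> Book-001-crop.pnm.
swap_suffix() {
    # $1 = file name, $2 = suffix to strip, $3 = suffix to append
    printf '%s%s\n' "${1%"$2"}" "$3"
}

for i in Book-*.ppm; do
    [ -e "$i" ] || continue    # the glob matched nothing; skip
    pnmcrop "$i" > "$(swap_suffix "$i" .ppm -crop.pnm)"
done
```

The same helper could be reused for the later renaming steps by passing different suffixes.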

# convert each image file into a format the ocr software likes.
$ for i in Book-*-crop.pnm; do pnmtotiff $i > `basename $i -crop.pnm`.tiff; done
$ for i in Book-*.tiff; do convert $i -colorspace GRAY -depth 8 `basename $i .tiff`-ocr.tiff; done

# run each image file through the ocr software.
$ for i in Book-*-ocr.tiff; do tesseract $i `basename $i -ocr.tiff`; done

# convert page images to png files.
$ for i in Book-*-ocr.tiff; do convert $i `basename $i -ocr.tiff`.png; done

# make directories for Guiprep.
$ mkdir text pngs
$ mv *.png pngs/
$ mv *.txt text/
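
After sorting the files into the two directories, it is probably worth confirming that every page image has a non-empty text file next to it, since OCR can fail silently on blank or badly cropped pages. This is only a sketch: unmatched_pages is a hypothetical helper, and it assumes the pngs/ and text/ layout created above.

```shell
# Hypothetical helper: print each page name that has a PNG
# but no text file, or only an empty one.
unmatched_pages() {
    # $1 = directory of page images, $2 = directory of OCR text
    for png in "$1"/*.png; do
        [ -e "$png" ] || continue            # no PNG files at all
        page=$(basename "$png" .png)
        [ -s "$2/$page.txt" ] || printf '%s\n' "$page"
    done
}

unmatched_pages pngs text
```

Any page names it prints are candidates for re-running the OCR step or checking the scan by eye.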

This is as far as I've gotten. Nice to meet you.