

Google Books -> eBook.


rrm3
08-04-2007, 10:38 PM
Hello there, I have not posted here before. I want to start documenting a good process for downloading a book from Google Books (which comes as scanned page images) and preparing it for submission to pgdp.net using netpbm, imagemagick, tesseract, and abiword (all of which I believe can be made to run on most platforms).

The idea is for the professionals who are already doing this with free software to share how to do it quicker and better.

# first, convert the PDF into image files (one image per page).
$ pdftoppm Book.pdf Book

# next, remove any extraneous border from each image file.
$ for i in Book-*.ppm; do pnmcrop "$i" > "$(basename "$i" .ppm)-crop.pnm"; done

# convert each image file into a format the ocr software likes.
$ for i in Book-*-crop.pnm; do pnmtotiff "$i" > "$(basename "$i" -crop.pnm).tiff"; done
$ for i in Book-*.tiff; do convert "$i" -colorspace GRAY -depth 8 "$(basename "$i" .tiff)-ocr.tiff"; done

# run each image file through the ocr software (writes one .txt file per page).
$ for i in Book-*-ocr.tiff; do tesseract "$i" "$(basename "$i" -ocr.tiff)"; done

# convert page images to png files.
$ for i in Book-*-ocr.tiff; do convert "$i" "$(basename "$i" -ocr.tiff).png"; done

# make directories for Guiprep.
$ mkdir text pngs
$ mv *.png pngs/
$ mv *.txt text/
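
The same steps can also be run one page at a time, so a failure points at a specific page. This is only a rough sketch, not a tested replacement for the commands above: the script name book2pages.sh and the function names page_stem and process_page are made up for illustration, and it assumes pdftoppm, pnmcrop, pnmtotiff, convert, and tesseract are all on the PATH.

```shell
#!/bin/sh
# Sketch: run every step on one page before moving on to the next page.
# page_stem and process_page are hypothetical helper names.

# "Book-01.ppm" -> "Book-01" (basename also drops any leading directory).
page_stem() {
    basename "$1" .ppm
}

# Crop, convert to grayscale TIFF, OCR, and export one page image.
# The && chain stops at the first step that fails.
# tesseract writes $stem.txt, which is then moved into text/.
process_page() {
    stem=$(page_stem "$1")
    pnmcrop "$1" > "$stem-crop.pnm" &&
    pnmtotiff "$stem-crop.pnm" > "$stem.tiff" &&
    convert "$stem.tiff" -colorspace GRAY -depth 8 "$stem-ocr.tiff" &&
    tesseract "$stem-ocr.tiff" "$stem" &&
    convert "$stem-ocr.tiff" "pngs/$stem.png" &&
    mv "$stem.txt" text/
}

# Only do real work when a PDF is named on the command line,
# e.g.:  sh book2pages.sh Book.pdf
if [ "$#" -ge 1 ]; then
    book=$(basename "$1" .pdf)
    mkdir -p text pngs
    pdftoppm "$1" "$book" || exit 1
    for page in "$book"-*.ppm; do
        [ -e "$page" ] || continue   # glob matched nothing
        process_page "$page" || { echo "failed on $page" >&2; exit 1; }
    done
fi
```

The intermediate -crop.pnm, .tiff, and -ocr.tiff files are left in place, as in the original commands, so a failed page can be inspected by hand.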

This is as far as I've gotten. Nice to meet you.

kovidgoyal
08-05-2007, 12:37 AM
Does tesseract preserve text formatting (bold, italic)? Does it extract images (regions it cannot interpret as text), and if it does, does it preserve the position of the image on the page?

rrm3
08-05-2007, 01:01 AM
No, tesseract does not. I think that is or will be implemented in ocropus (also hosted on Google Code). I have not tried it yet, though.

kovidgoyal
08-05-2007, 01:03 AM
Then you're better off using pdftohtml, at least for text-based PDFs. Though I guess this is still useful for scan-based PDFs like the Google Books ones. I'm surprised Google doesn't offer an OCRed version.
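
One rough way to tell the two kinds of PDF apart is to see how much text pdftotext can pull out before choosing a route. The sketch below is only an illustration: count_text_chars and looks_text_based are made-up helper names, and the 200-byte default threshold is an arbitrary guess, not a tested value.

```shell
#!/bin/sh
# Sketch: guess whether a PDF is text-based from the amount of extractable text.
# Both function names and the default threshold are hypothetical.

# Print the number of non-whitespace bytes read from stdin.
count_text_chars() {
    tr -d '[:space:]' | wc -c | tr -d ' '
}

# Succeed when stdin holds at least $1 (default 200) non-whitespace bytes.
looks_text_based() {
    min=${1:-200}
    [ "$(count_text_chars)" -ge "$min" ]
}

# Only do real work when a PDF is named on the command line.
if [ "$#" -ge 1 ]; then
    if pdftotext "$1" - | looks_text_based; then
        pdftohtml "$1"
    else
        echo "$1 looks scan-based; use the OCR route instead" >&2
    fi
fi
```

A scanned book with no text layer should yield little more than page breaks from pdftotext, which the whitespace filter discards.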

rrm3
08-05-2007, 01:30 AM
Err... right. It would sure be nice if they did. Google Books was the whole reason I ordered this thing a few days ago. It was so frustrating to find out that I couldn't view the books on my little reader.