#1
Junior Member
Posts: 4
Karma: 10
Join Date: Aug 2007
Location: Eugene, Oregon
Device: Sony PRS-500
Google Books -> eBook.
Hello there, I have not posted here before. I wanted to start documenting a good process to download a book from Google Books (which is delivered as scanned page images) and prepare it for submission to pgdp.net using netpbm, imagemagick, tesseract, and abiword (all of which I believe can be made to run on most platforms).
The idea is for the professionals who are already doing this with free software to share how to do it quicker and better.
# first, convert the PDF into image files (one image per page).
$ pdftoppm Book.pdf Book
# next, remove any extraneous border from each image file.
$ for i in Book-*.ppm; do pnmcrop "$i" > "$(basename "$i" .ppm)-crop.pnm"; done
# convert each image file into a format the OCR software likes.
$ for i in Book-*-crop.pnm; do pnmtotiff "$i" > "$(basename "$i" -crop.pnm).tiff"; done
$ for i in Book-*.tiff; do convert "$i" -colorspace GRAY -depth 8 "$(basename "$i" .tiff)-ocr.tiff"; done
# run each image file through the OCR software.
$ for i in Book-*-ocr.tiff; do tesseract "$i" "$(basename "$i" -ocr.tiff)"; done
# convert page images to png files.
$ for i in Book-*-ocr.tiff; do convert "$i" "$(basename "$i" -ocr.tiff).png"; done
# make directories for Guiprep.
$ mkdir text pngs
$ mv *.png pngs/
$ mv *.txt text/
This is as far as I've gotten. Nice to meet you.
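In case it helps anyone following along, the steps above could be folded into a few shell functions so that each page goes through the whole chain in one pass. This is only a rough sketch, not a tested workflow: the function names (missing_tools, page_stem, ocr_page, book_to_pages) are made up for illustration, it assumes the same tools as above are on the PATH, and it writes its intermediate files into the current directory.

```shell
# Sketch only: helper names are hypothetical, tool invocations are the
# same ones used in the step-by-step commands above.

# Print the name of every listed tool that is not on PATH (one per line).
missing_tools() {
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
    done
}

# Strip the directory and a known suffix from a file name.
page_stem() {
    basename "$1" "$2"
}

# Crop, convert, OCR and export one page image produced by pdftoppm.
ocr_page() {
    page=$1
    base=$(page_stem "$page" .ppm)
    pnmcrop "$page" > "$base-crop.pnm" &&
    pnmtotiff "$base-crop.pnm" > "$base.tiff" &&
    convert "$base.tiff" -colorspace GRAY -depth 8 "$base-ocr.tiff" &&
    tesseract "$base-ocr.tiff" "$base" &&
    convert "$base-ocr.tiff" "$base.png"
}

# Run the whole pipeline for one PDF, then sort the results for Guiprep.
book_to_pages() {
    pdf=$1
    prefix=$(page_stem "$pdf" .pdf)
    missing=$(missing_tools pdftoppm pnmcrop pnmtotiff convert tesseract)
    if [ -n "$missing" ]; then
        # $missing is left unquoted on purpose: one message per tool.
        printf 'missing tool: %s\n' $missing >&2
        return 1
    fi
    pdftoppm "$pdf" "$prefix" || return 1
    for page in "$prefix"-*.ppm; do
        ocr_page "$page" || return 1
    done
    mkdir -p text pngs &&
    mv "$prefix"-*.png pngs/ &&
    mv "$prefix"-*.txt text/
}
```

After loading the functions into a shell, running `book_to_pages Book.pdf` should leave the page images in pngs/ and the OCR text in text/. Checking for missing tools first avoids ending up with a half-processed directory.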
#2
creator of calibre
Posts: 45,201
Karma: 27110894
Join Date: Oct 2006
Location: Mumbai, India
Device: Various
Does tesseract preserve text formatting (bold, italic)? Does it extract images (regions it cannot interpret as text), and if it does, does it preserve the position of the image on the page?
#3
Junior Member
Posts: 4
Karma: 10
Join Date: Aug 2007
Location: Eugene, Oregon
Device: Sony PRS-500
No, tesseract does not. I think that is or will be implemented in ocropus (also hosted on Google Code). I have not tried it yet, though.
#4
creator of calibre
Posts: 45,201
Karma: 27110894
Join Date: Oct 2006
Location: Mumbai, India
Device: Various
Then you're better off using pdftohtml, at least for text-based PDFs. Though I guess this is still useful for scan-based PDFs like the Google Books ones. I'm surprised Google doesn't offer an OCRed version.
Last edited by kovidgoyal; 08-05-2007 at 02:18 AM.
#5
Junior Member
Posts: 4
Karma: 10
Join Date: Aug 2007
Location: Eugene, Oregon
Device: Sony PRS-500
err... right. It would sure be nice if they did. They were the whole reason I ordered this thing a few days ago. It was so frustrating to find out that I couldn't view them on my little reader.
#6
Grand Arbiter
Posts: 447
Karma: 1574837
Join Date: Oct 2007
Location: Arizona
Device: iPod Touch, Amazon Kindle, Motorola Droid
Correct me if I'm wrong, but I thought the only books you can download as PDFs from Google are public domain books, which are offered as free ePub files in Sony's store. If that's the case, why would you even want to bother with the PDFs?
#7
the snarky blue one
Posts: 6,001
Karma: 3877825
Join Date: Mar 2009
Location: deep in the heart
Device: PRS500, 505 & 600, PRST1 & T2, Kindle PW, Moto Razr, Galaxy Tab 2-10"
Forgive me if my understanding of all of this is wrong, but the previous posts about Google PDFs were made over a year ago. I don't think Google's public domain ePub books through Sony's eBook Store even existed back then.
#8
Grand Arbiter
Posts: 447
Karma: 1574837
Join Date: Oct 2007
Location: Arizona
Device: iPod Touch, Amazon Kindle, Motorola Droid
Ah, I didn't notice that. I have no idea how I came across this thread. Even so, it seems like it would make more sense just to get the books from Project Gutenberg.
#9
the snarky blue one
Posts: 6,001
Karma: 3877825
Join Date: Mar 2009
Location: deep in the heart
Device: PRS500, 505 & 600, PRST1 & T2, Kindle PW, Moto Razr, Galaxy Tab 2-10"
#10
Wizard
Posts: 4,395
Karma: 1358132
Join Date: Nov 2007
Location: UK
Device: Palm TX, CyBook Gen3
#11
the snarky blue one
Posts: 6,001
Karma: 3877825
Join Date: Mar 2009
Location: deep in the heart
Device: PRS500, 505 & 600, PRST1 & T2, Kindle PW, Moto Razr, Galaxy Tab 2-10"
#12
Wizard
Posts: 4,395
Karma: 1358132
Join Date: Nov 2007
Location: UK
Device: Palm TX, CyBook Gen3
Quote:
Quote:
Quote:
Or am I totally misunderstanding?
#13
Wizard
Posts: 2,627
Karma: 406616
Join Date: Dec 2008
Location: Northern Virginia
Device: SurfacePro, SurfaceBook 2
Quote:
#14
Wizard
Posts: 4,395
Karma: 1358132
Join Date: Nov 2007
Location: UK
Device: Palm TX, CyBook Gen3
#15
Wizard
Posts: 2,627
Karma: 406616
Join Date: Dec 2008
Location: Northern Virginia
Device: SurfacePro, SurfaceBook 2
Quote:
It's actually the second post like that that I've run across since I started welcoming new members. Now I check the date every time!