Old 08-04-2007, 11:38 PM   #1
rrm3
Junior Member
Posts: 4
Karma: 10
Join Date: Aug 2007
Location: Eugene, Oregon
Device: Sony PRS-500
Google Books -> eBook.

Hello there, I have not posted here before. I want to start documenting a good process for downloading a book from Google Books (which provides scanned page images) and preparing it for submission to pgdp.net using netpbm, ImageMagick, tesseract, and AbiWord (all of which I believe can be made to run on most platforms).

The idea is for the professionals who are already doing this with free software to share how to do it quicker and better.

# first, convert the PDF into image files (one image per page).
$ pdftoppm Book.pdf Book

# next, remove any extraneous border from each image file.
$ for i in Book-*.ppm; do pnmcrop "$i" > "$(basename "$i" .ppm)-crop.pnm"; done

# convert each image file into a format the OCR software likes.
$ for i in Book-*-crop.pnm; do pnmtotiff "$i" > "$(basename "$i" -crop.pnm).tiff"; done
$ for i in Book-*.tiff; do convert "$i" -colorspace GRAY -depth 8 "$(basename "$i" .tiff)-ocr.tiff"; done

# run each image file through the OCR software (writes one .txt file per page).
$ for i in Book-*-ocr.tiff; do tesseract "$i" "$(basename "$i" -ocr.tiff)"; done

# convert the page images to PNG files.
$ for i in Book-*-ocr.tiff; do convert "$i" "$(basename "$i" -ocr.tiff).png"; done

# make directories for Guiprep.
$ mkdir text pngs
$ mv *.png pngs/
$ mv *.txt text/
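
For what it's worth, the per-step loops above could probably be folded into one function that handles a single page, so a failure on one page is easy to spot. This is only a rough sketch: page_base, process_page, and missing_tools are names made up for illustration, and it assumes the same pnmcrop, pnmtotiff, convert, and tesseract commands used above are on the PATH.

```shell
#!/bin/sh
# Sketch only: the function names below are hypothetical helpers,
# not part of netpbm, ImageMagick, or tesseract.

# Strip the directory and a known suffix, e.g. Book-01.ppm -> Book-01.
page_base() {
    basename "$1" "$2"
}

# Run one page image through crop -> TIFF -> grayscale -> OCR -> PNG.
# Stops at the first step that fails and returns its exit status.
process_page() {
    src=$1
    base=$(page_base "$src" .ppm)
    pnmcrop "$src" > "$base-crop.pnm" &&
    pnmtotiff "$base-crop.pnm" > "$base.tiff" &&
    convert "$base.tiff" -colorspace GRAY -depth 8 "$base-ocr.tiff" &&
    tesseract "$base-ocr.tiff" "$base" &&
    convert "$base-ocr.tiff" "$base.png"
}

# Print the name of every listed command that is not found on the PATH.
missing_tools() {
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || echo "$tool"
    done
}

# Possible usage, after pdftoppm has produced Book-*.ppm:
#   missing_tools pdftoppm pnmcrop pnmtotiff convert tesseract
#   for i in Book-*.ppm; do process_page "$i" || echo "failed: $i" >&2; done
```

Nothing runs until process_page is called, so the file can be sourced first and tried on a single page before looping over the whole book.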

This is as far as I've gotten so far. Nice to meet you.
Old 08-05-2007, 01:37 AM   #2
kovidgoyal
creator of calibre
Posts: 43,856
Karma: 22666666
Join Date: Oct 2006
Location: Mumbai, India
Device: Various
Does tesseract preserve text formatting (bold, italic)? Does it extract images (regions it cannot interpret as text), and if it does, does it preserve the position of the image on the page?
Old 08-05-2007, 02:01 AM   #3
rrm3
Junior Member
No, tesseract does not. I think that is/will be implemented in OCRopus (also hosted on Google Code). I have not tried it yet, though.
Old 08-05-2007, 02:03 AM   #4
kovidgoyal
creator of calibre
Then you're better off using pdftohtml, at least for text-based PDFs. Though I guess this is still useful for scan-based PDFs like the Google Books ones. I'm surprised Google doesn't offer an OCRed version.
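
One rough way to tell the two kinds of PDF apart is to see whether pdftotext gets any text out at all. The sketch below assumes pdftotext and pdftohtml (from xpdf or poppler) are installed; count_nonspace, has_text_layer, and convert_pdf are made-up helper names, and the "any text at all" test is a guess that may need a higher threshold for PDFs that carry a text notice page in front of the scans.

```shell
#!/bin/sh
# Sketch only: the function names are hypothetical helpers.

# Count the non-whitespace characters arriving on stdin.
count_nonspace() {
    tr -d '[:space:]' | wc -c | tr -d ' '
}

# Succeed if pdftotext extracts at least one non-whitespace character.
# "-" sends the extracted text to stdout instead of a file.
has_text_layer() {
    n=$(pdftotext "$1" - 2>/dev/null | count_nonspace)
    [ "${n:-0}" -gt 0 ]
}

# Use pdftohtml for text-based PDFs; otherwise point at the OCR route.
convert_pdf() {
    if has_text_layer "$1"; then
        pdftohtml -noframes "$1" "$(basename "$1" .pdf)"
    else
        echo "no text layer in $1: use the image/OCR steps instead" >&2
        return 1
    fi
}

# Possible usage:
#   convert_pdf Book.pdf
```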

Last edited by kovidgoyal; 08-05-2007 at 02:18 AM.
Old 08-05-2007, 02:30 AM   #5
rrm3
Junior Member
Err... right. It would sure be nice if they did. Google Books was the whole reason I ordered this reader a few days ago. It was so frustrating to find out that I couldn't view the books on it.
Old 07-24-2009, 11:11 AM   #6
SpiderMatt
Grand Arbiter
Posts: 447
Karma: 1574837
Join Date: Oct 2007
Location: Arizona
Device: iPod Touch, Amazon Kindle, Motorola Droid
Correct me if I'm wrong, but I thought the only books you can download as PDFs from Google are public domain books, which are offered as free ePub files in Sony's store. If that's the case, why would you even want to bother with the PDFs?
Old 07-24-2009, 01:58 PM   #7
Lady Blue
the snarky blue one
Posts: 6,001
Karma: 3877825
Join Date: Mar 2009
Location: deep in the heart
Device: PRS500, 505 & 600, PRST1 & T2, Kindle PW, Moto Razr, Galaxy Tab 2-10"
Quote:
Originally Posted by SpiderMatt View Post
Correct me if I'm wrong, but I thought the only books you can download as PDFs from Google are public domain books, which are offered as free ePub files in Sony's store. If that's the case, why would you even want to bother with the PDFs?
Forgive me if my understanding of all of this is wrong, but the previous posts about Google PDFs were made over a year ago. I don't think Google's public domain ePub books through Sony's eBook Store even existed back then.
Old 07-26-2009, 09:34 AM   #8
SpiderMatt
Grand Arbiter
Ah, I didn't notice that. I have no idea how I came across this thread. Even so, it seems like it would make more sense just to get the books from Project Gutenberg.
Old 07-26-2009, 01:54 PM   #9
Lady Blue
the snarky blue one
Quote:
Originally Posted by SpiderMatt View Post
Ah, I didn't notice that. I have no idea how I came across this thread. Even so, it seems like it would make more sense just to get the books from Project Gutenberg.



Good point.
Old 07-26-2009, 02:26 PM   #10
Sparrow
Wizard
Posts: 4,395
Karma: 1358132
Join Date: Nov 2007
Location: UK
Device: Palm TX, CyBook Gen3
Quote:
Originally Posted by SpiderMatt View Post
... it seems like it would make more sense just to get the books from Project Gutenberg.
But, not all the books on Google Books (or Internet Archive) are available on Project Gutenberg.
Old 07-27-2009, 07:33 AM   #11
Lady Blue
the snarky blue one
Quote:
Originally Posted by Sparrow View Post
But, not all the books on Google Books (or Internet Archive) are available on Project Gutenberg.


. . . and vice versa.

It goes without saying that not all books are available from all sources (free or not).
Old 07-27-2009, 09:44 AM   #12
Sparrow
Wizard
Quote:
Originally Posted by Lady Blue View Post
. . . and vice versa.
It goes without saying that not all books are available from all sources (free or not.)
Yes, but the OP wrote:

Quote:
Originally Posted by rrm3 View Post
I wanted to start documenting a good process to download a book from Google Books...
and then SpiderMatt wrote:

Quote:
Originally Posted by SpiderMatt View Post
..it seems like it would make more sense just to get the books from Project Gutenberg.
But that wouldn't necessarily solve the problem - if the text is in Google Books, but not Project Gutenberg.

Or am I totally misunderstanding?
Old 07-27-2009, 01:11 PM   #13
kazbates
Wizard
Posts: 2,627
Karma: 406616
Join Date: Dec 2008
Location: Northern Virginia
Device: SurfacePro, SurfaceBook 2
Quote:
Originally Posted by Sparrow View Post
Yes, but the OP wrote:



and then SpiderMatt wrote:



But that wouldn't necessarily solve the problem - if the text is in Google Books, but not Project Gutenberg.

Or am I totally misunderstanding?
Yes! You're missing that the original post is almost 2 years old and the original poster has probably moved on!!
Old 07-27-2009, 02:19 PM   #14
Sparrow
Wizard
Quote:
Originally Posted by kazbates View Post
Yes! You're missing that the original post is almost 2 years old and the original poster has probably moved on!!
- We should have different background colours for elderly posts - I never check the dates.
Old 07-27-2009, 10:03 PM   #15
kazbates
Wizard
Quote:
Originally Posted by Sparrow View Post
- We should have different background colours for elderly posts - I never check the dates.
I started to type a welcome message and for some reason happened to look at the posting date. I guess my MR guardian angel was looking over my shoulder!

It's actually the second post like that that I've run across since I started welcoming new members. Now I check the date every time!