MobileRead Forums > E-Book General > News

Old 09-30-2009, 06:13 PM   #1
gazza
Member
gazza began at the beginning.
 
Posts: 10
Karma: 15
Join Date: Sep 2009
Device: iPod Touch
Scanning in your own books

Hi,

I have a very, very large library of traditional books. My wife, because of illness, can only use the iPod Touch although I am sure other equally small and suitable devices will come along.
I work as a journalist at very strange hours but often have short breaks of an hour or so.
The thought came -- why not scan in our own library and then we will have no problems with DRM?
The scanner we think is the answer is the Plustek OpticBook 3600 Plus which is a tad slow but is very affordable -- say US $300 -- and outputs text in an OCR readable format.

The problem with every digital book I have seen -- including those I paid serious money for -- is the proofreading, the shining exception being Gutenberg. Google is a total farce. When a book has more than 1,000 errors, something is seriously wrong.

So my idea is to scan in a book a day and proof-read -- a pleasant hobby -- into .txt format from which it can be used in many formats.
Does any reader have experience with this sort of thing and are there any major snags?

Gareth Powell in Sydney where it looks like we have a major storm blowing up
Old 09-30-2009, 06:37 PM   #2
Moejoe
Banned
Moejoe did not drink the Kool Aid.
 
Posts: 5,110
Karma: 72193
Join Date: Feb 2009
Location: South of the Border
Device: Coffin
I scanned in my library of Banana Yoshimoto novels (Japanese author unavailable in ebook format).

It all depends on the OCR software you use. I did the Yoshimoto on Windows 7 using ABBYY FineReader 9.0 to do the scanning (directly to HTML, which is the best format for archiving). I then took the HTML and converted it to ePub in Calibre (free here on MR and programmed by the awesome Kovid Goyal). Recently I've been using Sigil (found here on MR again) to tidy up any obvious mistakes in the ePubs (Sigil edits ePub directly).

With ABBYY FineReader the recognition is very, very good and is adjustable. The process was quite slow on my scanner (two pages maximum at a time) but worth it in the end. It took me about three or four hours per novel (depending on length) to scan and make corrections as I went.

I tried Readiris on the Mac but it wasn't great, and as yet I haven't found a great solution on Linux (if someone knows of one, please let me know).
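
For anyone who wants to script the HTML-to-ePub step described above, here is a minimal sketch. It assumes Calibre's `ebook-convert` command-line tool is installed and on the PATH; the function names and the file name in the usage note are made up for illustration, and this is not necessarily how the conversion was done in the post.

```python
import shutil
import subprocess
from pathlib import Path


def build_convert_command(html_path: Path, epub_path: Path) -> list[str]:
    """Return the argv for Calibre's ebook-convert: input file, then output file."""
    return ["ebook-convert", str(html_path), str(epub_path)]


def html_to_epub(html_path: Path) -> Path:
    """Convert one OCR'd HTML file to an ePub saved next to it (hypothetical helper)."""
    if shutil.which("ebook-convert") is None:
        raise RuntimeError("ebook-convert not found; is Calibre installed?")
    epub_path = html_path.with_suffix(".epub")
    # check=True raises CalledProcessError if the conversion fails.
    subprocess.run(build_convert_command(html_path, epub_path), check=True)
    return epub_path
```

Calling `html_to_epub(Path("scanned_novel.html"))` would, under those assumptions, write `scanned_novel.epub` alongside the HTML, ready to be tidied up in an editor such as Sigil.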
Old 09-30-2009, 06:38 PM   #3
schmolch
Connoisseur
schmolch knows what time it is
 
Posts: 88
Karma: 2394
Join Date: Jul 2009
Location: Germany
Device: Kindle
That sounds like a lot of work.
Scanning individual book pages is very tedious because you can't do anything else between scans, so you are forced to waste a lot of time.
Then you want to OCR and proofread every book; that sounds like even more work.

I personally don't care about the physical books. I just cut the whole thing apart and put it into the document feeder. I also don't bother with OCR; I just leave the pages as pictures and make a PDF out of them. The disadvantages are a bigger file size and the troubles that come with PDF (on small readers), but it saves a ton of time.

Since you are looking for electronic versions of books you already own, it would probably be legal (at least using common sense) if you look for these books on the internet.
Old 09-30-2009, 06:41 PM   #4
Moejoe
Banned
Moejoe did not drink the Kool Aid.
 
Posts: 5,110
Karma: 72193
Join Date: Feb 2009
Location: South of the Border
Device: Coffin
Quote:
Originally Posted by schmolch
That sounds like a lot of work.
Scanning individual book pages is very tedious because you can't do anything else between scans, so you are forced to waste a lot of time.
Then you want to OCR and proofread every book; that sounds like even more work.

I personally don't care about the physical books. I just cut the whole thing apart and put it into the document feeder. I also don't bother with OCR; I just leave the pages as pictures and make a PDF out of them. The disadvantages are a bigger file size and the troubles that come with PDF (on small readers), but it saves a ton of time.

Since you are looking for electronic versions of books you already own, it would probably be legal (at least using common sense) if you look for these books on the internet.

That's actually a good idea, because ABBYY can OCR a PDF into HTML, TXT, RTF, etc. So if you have a sheetfeeder, the above suggestion is sound. Whip it through the sheetfeeder, output a PDF, then run it through ABBYY at the end.

Now if only I could find an OCR on 'buntu or Mac that worked as well as ABBYY.

EDIT: Or find someone online who's sharing the book (for a lot of what I'm scanning there's no chance of that) and download from them, as the above poster stated.
Old 09-30-2009, 06:47 PM   #5
dmaul1114
Wizard
dmaul1114 ought to be getting tired of karma fortunes by now.
 
Posts: 2,300
Karma: 1121709
Join Date: Feb 2009
Device: Amazon Kindle 1
Not something I'd ever do. I don't have the time for it and would prefer to just read the paper book if there wasn't a good e-book version available.

But to each their own; if it's a hobby you enjoy, more power to you.
Old 09-30-2009, 07:01 PM   #6
edembowski
Zealot
edembowski has a complete set of Star Wars action figures.
 
Posts: 138
Karma: 372
Join Date: Apr 2008
Location: New York, NY
Device: Sony PRS-600, Nook Color, iPad
Scanning is actually very quick and easy if you build your own rig. Take a look at http://www.diybookscanner.org

If you really want to do it, this is probably the fastest way. For a ~400 page book, it takes me about an hour to scan and import the book, then another couple of hours to proofread it. Depending on the book and how accurately you want to preserve it, it may take longer to proof. Things like typeface, font weight, and indentation all take a little longer to check.

- Ed
Old 09-30-2009, 07:32 PM   #7
Hellmark
Wizard
Hellmark ought to be getting tired of karma fortunes by now.
 
Posts: 2,521
Karma: 3638167
Join Date: Jun 2009
Location: Maryland Heights, Missouri, USA
Device: Nokia N800, PRS-505, Nook STR Glowlight
Not too long ago, I saw a post on Make Magazine's blog, and I think on Hack a Day too, about someone who made an apparatus for their scanner to scan books. You may want to look into something like that.
Old 09-30-2009, 07:39 PM   #8
edembowski
Zealot
edembowski has a complete set of Star Wars action figures.
 
Posts: 138
Karma: 372
Join Date: Apr 2008
Location: New York, NY
Device: Sony PRS-600, Nook Color, iPad
The Make blog entry pointed to the Instructable written by the guy who made http://www.diybookscanner.org :-)

Even if you don't build one, it's worth a look at how they put together their setups. A lot of people have posted their hardware designs as well as custom processing software.

- Ed
Old 09-30-2009, 07:46 PM   #9
Hellmark
Wizard
Hellmark ought to be getting tired of karma fortunes by now.
 
Posts: 2,521
Karma: 3638167
Join Date: Jun 2009
Location: Maryland Heights, Missouri, USA
Device: Nokia N800, PRS-505, Nook STR Glowlight
I actually have seen a few lately. One was even made with a LEGO kit.
Old 09-30-2009, 07:50 PM   #10
luqmaninbmore
Da'i
luqmaninbmore ought to be getting tired of karma fortunes by now.
 
Posts: 1,143
Karma: 1217499
Join Date: Oct 2008
Location: Baltimore
Device: Toshiba Thrive, Kobo Touch, Kindle 1, Aluratek Libre, T-Mobile Comet
Quote:
Originally Posted by Moejoe
That's actually a good idea, because ABBYY can OCR a PDF into HTML, TXT, RTF, etc. So if you have a sheetfeeder, the above suggestion is sound. Whip it through the sheetfeeder, output a PDF, then run it through ABBYY at the end.

Now if only I could find an OCR on 'buntu or Mac that worked as well as ABBYY.

EDIT: Or find someone online who's sharing the book (for a lot of what I'm scanning there's no chance of that) and download from them, as the above poster stated.

On Linux, I find that Tesseract OCR works pretty well, provided that you're using TIF files as input and the resolution suits the source (for some old yellowed paperbacks, a lower resolution results in better output).

Luqman
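
A rough sketch of batch-running Tesseract over a folder of TIF scans, for anyone who wants to try it. It assumes the `tesseract` command-line tool is installed and uses its basic form (`tesseract IMAGE OUTPUTBASE -l LANG`, which writes `OUTPUTBASE.txt`); the helper names, folder layout, and language code are assumptions, not settings from this thread.

```python
import subprocess
from pathlib import Path


def tesseract_command(tif_path: Path, lang: str = "eng") -> list[str]:
    """Build argv for `tesseract IMAGE OUTPUTBASE -l LANG`; Tesseract adds .txt itself."""
    output_base = tif_path.with_suffix("")
    return ["tesseract", str(tif_path), str(output_base), "-l", lang]


def ocr_pages(scan_dir: Path, book_txt: Path) -> int:
    """OCR every .tif in scan_dir in file-name order and join the text into book_txt."""
    pages = sorted(scan_dir.glob("*.tif"))
    with book_txt.open("w", encoding="utf-8") as out:
        for tif in pages:
            # check=True stops the run if Tesseract fails on a page.
            subprocess.run(tesseract_command(tif), check=True)
            out.write(tif.with_suffix(".txt").read_text(encoding="utf-8"))
            out.write("\n")
    return len(pages)
```

Naming the scans with zero-padded numbers (for example `page_001.tif`, `page_002.tif`) keeps the sorted order the same as the page order. Changing the scan resolution, as suggested above, would happen before this step.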
Old 09-30-2009, 07:57 PM   #11
Moejoe
Banned
Moejoe did not drink the Kool Aid.
 
Posts: 5,110
Karma: 72193
Join Date: Feb 2009
Location: South of the Border
Device: Coffin
Quote:
Originally Posted by luqmaninbmore
On Linux, I find that Tesseract OCR works pretty well, provided that you're using TIF files as input and the resolution suits the source (for some old yellowed paperbacks, a lower resolution results in better output).

Luqman

I didn't have much luck with Tesseract (probably not using it with the correct settings). I'll give it another shot, though, when I'm back on 'buntu. Thanks for the recommendation.
Old 09-30-2009, 08:00 PM   #12
igorsk
Wizard
igorsk reads XML... blindfolded
 
Posts: 3,443
Karma: 52235
Join Date: Sep 2006
Location: Belgium
Device: PRS-500/505/700, Kindle, Cybook Gen3, Words Gear
You should consider investing in FineReader; its recognition accuracy, plus dictionaries and spellcheck, really helps to reduce the proofreading part.
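
To illustrate why a dictionary check cuts down on proofreading, here is a small sketch that flags words in OCR output that are not in a word list, so a proofreader can jump straight to the likely errors. This is not how FineReader works internally; the function name and the tiny word list are made up for the example.

```python
import re

# Words are runs of letters, optionally with one apostrophe part (e.g. "don't").
WORD_RE = re.compile(r"[A-Za-z]+(?:'[A-Za-z]+)?")


def flag_suspect_words(text: str, known_words: set[str]) -> list[tuple[int, str]]:
    """Return (line_number, word) pairs for words missing from known_words."""
    suspects = []
    for line_number, line in enumerate(text.splitlines(), start=1):
        for word in WORD_RE.findall(line):
            if word.lower() not in known_words:
                suspects.append((line_number, word))
    return suspects


known = {"the", "storm", "is", "coming", "to", "city"}
print(flag_suspect_words("The storm is coming\nto tbe city", known))
# prints [(2, 'tbe')]
```

In practice the word list would be a full dictionary file, and proper names would show up as false alarms, so this only narrows down where to look.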
Old 09-30-2009, 08:02 PM   #13
Moejoe
Banned
Moejoe did not drink the Kool Aid.
 
Posts: 5,110
Karma: 72193
Join Date: Feb 2009
Location: South of the Border
Device: Coffin
Some brilliant videos from the DIY Book Scanner project (Instructables videos):


http://www.instructables.com/id/DIY-...campaign=video

And what's even odder about all this is that Eben Moglen predicted this would happen in a talk I posted a video link to a couple of weeks ago. Strange how things fit together.

Last edited by Moejoe; 09-30-2009 at 08:06 PM.
Old 09-30-2009, 08:17 PM   #14
=X=
Wizard
=X= ought to be getting tired of karma fortunes by now.
 
Posts: 3,672
Karma: 12205348
Join Date: Mar 2008
Device: Galaxy S, Nook w/CM7
Quote:
Originally Posted by luqmaninbmore
On Linux, I find that Tesseract OCR works pretty well, provided that you're using TIF files as input and the resolution suits the source (for some old yellowed paperbacks, a lower resolution results in better output).

Luqman

I used to think that too, until I ran the same book once with Tesseract and once with ABBYY.

There is no comparison; ABBYY is just so superior. ABBYY has a very low error rate, detects images and leaves them as such, converts tables perfectly and even handles white space.

There are some images Tesseract does work great with, but as a general OCR program this tool leaves a lot to be desired.

=X=

Last edited by =X=; 09-30-2009 at 08:19 PM. Reason: grammar
Old 09-30-2009, 08:27 PM   #15
Hellmark
Wizard
Hellmark ought to be getting tired of karma fortunes by now.
 
Posts: 2,521
Karma: 3638167
Join Date: Jun 2009
Location: Maryland Heights, Missouri, USA
Device: Nokia N800, PRS-505, Nook STR Glowlight
Quote:
Originally Posted by =X=
I used to think that too, until I ran the same book once with Tesseract and once with ABBYY.

Problem is, ABBYY only makes it for Windows. OS X and Linux users are screwed. Tesseract is open source, with native ports to those OSes.
All times are GMT -4.