#1 |
Enthusiast
Posts: 49
Karma: 14
Join Date: Jul 2010
Location: Harrogate, England
Device: iPad
|
Digitising Paperbacks
Hi, Guys.
I've around 5,000 paperbacks which are moldering in my garage and I would like to digitise them. With this number, I need to feed them automatically (more or less), and I need the feeder to be reasonably reliable for 10x17cm pages. Ideally it will be duplex. I'm prepared to spend some money on this, both for hardware and software, but there are clearly limits. I would appreciate any suggestions or input that you have.
Iain
#2 | |
Interested Bystander
Posts: 3,726
Karma: 19728152
Join Date: Jun 2008
Device: Note 4, Kobo One
|
Quote:
Best place to ask would be the Distributed Proofreaders site (www.pgdp.net); there are people there who scan in bulk (specifically, in the Content Providers forum). For software, the standard OCR package is ABBYY FineReader.
|
#3 |
Addict
Posts: 197
Karma: 1010202
Join Date: Mar 2010
Device: iPod Touch
|
Yeah, a number of us have done a lot of work for the Gutenberg project, and I can say that the fastest way to do this is not fast at all. But a good, fast flatbed scanner and ABBYY FineReader are pretty much the consensus on the way to go.
If you're going to do this, you want to concentrate first on the books that are out of print and unavailable in any other format. Even if you don't want to re-buy, you always CAN re-buy. (And though this may be a controversial statement: you might also check out the pirate sites. It's illegal and unethical to upload, but if you're planning to scan anyway, you may consider whether it's okay to simply download what you already own. Just remember that there is no quality control.)
Camille
#4 |
Zealot
Posts: 118
Karma: 202232
Join Date: Jun 2010
Location: Texas
Device: Kindle Paperwhite Gen2
|
I don't scan books, but I do have experience with scanners. You really want to put your money into scan resolution. The old saying "garbage in, garbage out" is never more true than here. If the resolution isn't high enough, letter pairs such as "ni" can end up as "m", for example. A high-resolution scan will reduce editing by helping the OCR get it right the first time.
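To put rough numbers on the resolution point, here is a small sketch that works out how many pixels a 10x17cm paperback page gets at a few common scanner settings. The dpi values are illustrative assumptions, not recommendations from this thread; the only figure taken from the thread is the page size.

```python
CM_PER_INCH = 2.54

def pixels_for_page(width_cm, height_cm, dpi):
    """Return (width_px, height_px) for a page scanned at the given dpi."""
    width_px = round(width_cm / CM_PER_INCH * dpi)
    height_px = round(height_cm / CM_PER_INCH * dpi)
    return width_px, height_px

def pixels_per_mm(dpi):
    """How many pixels cover one millimetre of paper at this dpi."""
    return dpi / (CM_PER_INCH * 10)

# The dpi settings below are assumed examples only.
for dpi in (150, 300, 600):
    w, h = pixels_for_page(10, 17, dpi)
    print(f"{dpi} dpi: {w} x {h} px, {pixels_per_mm(dpi):.1f} px per mm")
```

The fewer pixels per millimetre, the fewer pixels are left to show the thin gap between neighbouring letters, which is where "ni" versus "m" confusions come from.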
|
#5 | |
Zealot
Posts: 118
Karma: 202232
Join Date: Jun 2010
Location: Texas
Device: Kindle Paperwhite Gen2
|
Quote:
I'm not an attorney, so don't take this as legal advice, but my understanding is that as long as you own the media in one format, you can get it in whatever other formats you want. So if he owns the book as a paperback, it would not be illegal for him to download the torrent. Don't take that as the final word though. Do your own research before you take my word for it. |
|
#6 | |
curmudgeon
Posts: 1,487
Karma: 5748190
Join Date: Jun 2006
Location: Redwood City, CA USA
Device: Kobo Aura HD, (ex)nook, (ex)PRS-700, (ex)PRS-500
|
Quote:
All of the above assumes U.S. law, and U.S. jurisdiction, post-DMCA. Once again, I am not a lawyer; your mileage may vary; package filled by weight, not by volume. Xenophon |
|
#7 |
Grand Sorcerer
Posts: 11,732
Karma: 128354696
Join Date: May 2009
Location: 26 kly from Sgr A*
Device: T100TA,PW2,PRS-T1,KT,FireHD 8.9,K2, PB360,BeBook One,Axim51v,TC1000
|
On the hardware side, I would suggest paying attention to the return time on flatbed scanners.
Optical resolution is important for clarity (especially with small-text books), but almost as important is the *time* it takes to scan each page. Most scanners will tell you they can do so many pages per minute, but that is a maximum for low-resolution B&W scans; few break it down into scan time and return time. Slow scan times are unavoidable (you want a quality scan, after all), but slow return times are avoidable. They just cost extra to avoid. In an ideal scenario, you want the scan array to return in the time it takes you to flip the page and reposition the book. Otherwise the job can get painfully dreary waiting for...the...head...to...return...
Also, be aware that the best scanners for text tend to be sub-standard for images and vice versa (unless you have really deep pockets).
For a while I had access to a professional Epson GS series scanner that was a joy to use. (I did a couple of 400-page books in less than an hour each.) But at US$3500 new, it was way out of reach. (Even used, on eBay, they ran way too rich for my budget. But I thought of it.) A similar attempt on my cheap home scanner ran me over three hours. For my new scanner, I paid a bit extra to get a fast return time. It's nowhere near the Epson in speed, but it matches my page-turning speed, so my mind doesn't wander much.
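The scan-time versus return-time point can be sketched as a simple cycle model. This is only an illustration under stated assumptions: the head returns while you flip the page, so each cycle costs the scan time plus whichever of the two is slower. All the timings in the example are made up for illustration; none are measurements from the scanners mentioned above.

```python
def seconds_per_scan(scan_s, return_s, flip_s):
    """One flatbed cycle: the scan pass, then the slower of
    the head returning and the operator flipping the page."""
    return scan_s + max(return_s, flip_s)

def hours_per_book(pages, pages_per_scan, scan_s, return_s, flip_s):
    """Total flatbed time for one book, in hours."""
    scans = -(-pages // pages_per_scan)  # ceiling division
    return scans * seconds_per_scan(scan_s, return_s, flip_s) / 3600

# Hypothetical timings: same scan pass, different return times.
fast_return = hours_per_book(400, 2, scan_s=10, return_s=5, flip_s=6)
slow_return = hours_per_book(400, 2, scan_s=10, return_s=30, flip_s=6)
print(f"fast return: {fast_return:.2f} h, slow return: {slow_return:.2f} h")
```

In this model, once the return time drops below the flip time, a faster return buys nothing more, which matches the advice to look for a return time that keeps pace with your page turning.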
#8 |
Zealot
Posts: 118
Karma: 202232
Join Date: Jun 2010
Location: Texas
Device: Kindle Paperwhite Gen2
|
Great summary. Thanks, Xenophon.
|
#9 |
Enthusiast
Posts: 49
Karma: 14
Join Date: Jul 2010
Location: Harrogate, England
Device: iPad
|
Wow!
I really appreciate your comments - lots to digest there. I won't be downloading from BitTorrent - I'm a freelance developer and work a lot with media stuff, so I'm cautious about copyright issues - and I would, in a layman's way, agree with Xenophon about the legal issues.
I'm also mainly going to be doing the books as I find them. They've been locked in boxes in various garages for around 10 years, and one of the reasons for the project is that many are now too 'foosty' (smelly) to read comfortably.
I've had surprisingly good results (well, they surprised me) from the Iris software which came with my HP multi-function device, but I will most certainly check out the ABBYY products. Finally, I'll check out the Distributed Proofreaders site.
I think the most uncertain thing for me is finding a decent ADF scanner. I don't have 2 years to stand in front of a flatbed, however good the results may be!
Thanks
Iain
#10 | |
Wizard
Posts: 2,013
Karma: 251649
Join Date: Apr 2010
Location: Tempe, AZ, USA, Earth
Device: JetBook Lite (away from home) + 1 spare, 32" TV (at home)
|
Quote:
I'm doing the same thing as you. I found the http://www.amazon.com/Fujitsu-ScanSn...8680389&sr=8-1 to be excellent for the task. It's a compact ADF scanner that includes PDF software (Adobe Acrobat 9 Standard), OCR, and some organizational software. There is a deluxe package that includes additional organizational software. It was only $20 more at the time, so I went ahead and got it. I haven't broken the seal on the package yet.
I started out cutting the spines off the books with a bandsaw. It worked, but if the blade goes through any of the hot-melt glue used to bind the books (trust me, it will), the glue will transfer to the blade, then to the tires, and then you get to clean all that up. Also, the cutting process leaves a fine paper dust (more like powder) on the cut surface that is impossible to completely remove before running the pages through the scanner. The scanner does an excellent job of collecting it, and that miserable stuff gets into all the works, especially if one makes the mistake I did and tries to blow out the dust. Since it is still under warranty, I'll have to take it in to be internally cleaned.
I bought a guillotine-type paper cutter and, while it will get the job done, it's a piece of junk. I'm going to use it until it breaks and then look elsewhere for one. It does give clean cuts with practically no dust.
I scan the covers with a color setting as JPEGs and save them to a temporary folder on my desktop, then scan the inside pages with a black-and-white setting directly to PDF. I then use Acrobat to insert the covers into the PDF. I've found I also have to scroll through the pages to make sure I fed the pages into the scanner in the correct order (the scanner will hold an average of 50 sheets, depending on paper thickness) and to check for the rare page that may need cropping. Using the black-and-white setting gives crisp text on a white background, even if the pages are badly yellowed.
Any illustrations will probably look lousy unless they are very simple line drawings (most of my books are unillustrated). There are ways to deal with that, though.
The scanner comes with ABBYY FineReader, but when used from within ScanSnap's software, it will just convert the PDF from images to searchable content. I do not know if this version will work separately to convert the scan to text suitable for e-book readers. Then you would have to edit the results. I don't bother with the OCR; with as many books as I have (roughly 1200), it would take too darned long (a 100-page magazine takes an hour without editing).
There are e-book readers that have decent zoom, though, so you could zoom out the margins to make the text large enough to be readable. I'm waiting for the technology to improve and the prices to come down. I have an Astak I got for over 50% off, but it's suitable only for the smaller paperbacks (fortunately, the majority of them).
At home, I've been reading from my 32" TV (it's patched into my computer) using Adobe Acrobat to read it. I just zoom in until the text is comfortable to read from a distance (roughly ten feet right now) and use a wireless mouse for a "remote" to scroll. I find it's preferable to hanging onto a physical book. I can read in a darkened room, too.
If you decide on the ScanSnap, drop me a PM and I can give you more detailed tips on how to scan the books.
|
#11 |
Evangelist
Posts: 412
Karma: 546196
Join Date: Mar 2009
Location: UK canal boat
Device: sony prs505, prs650, kobo Glo HD liseuses
|
FWIW I recently purchased a Canon P150 scanner - very small footprint, duplex sheet feeder scanner. And I'm absolutely delighted with it. It accepts up to 20 sheets at a time, but to play safe, I load it with 10 sheets and then just leave it to rip. Compared with the dubious entertainment of manually feeding a flat-bed scanner there's no contest. I love it!
The production chain now is:
- Steel rule & sharp knife to disassemble old paperback;
- Break book into bundles of 10 sheets;
- Scan bundles to TIFF files;
- Run TIFF files through ABBYY Scan to Office;
- Perform elementary spell check in Word, export as .txt;
- Spend lots of time beating the text into acceptable HTML shape in editor of your choice (I use NoteTab Light 'cos it's free & I'm stingy);
- (Three loud cheers) Perform 'nice' formatting of text in Sigil;
- Add book to library and to reader in Calibre;
- Enjoy.
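For the bundling step, a rough sketch of how a disassembled book splits into feeder loads is below. It assumes a duplex scanner (two pages per sheet) and that counting starts at the first sheet, which ignores unnumbered front matter; the function name and the 250-page example are hypothetical. The ranges give you something to check each scanned bundle against, so an out-of-order load shows up early.

```python
def bundle_page_ranges(page_count, sheets_per_bundle):
    """Return (first_page, last_page) for each bundle fed to a duplex ADF.

    Assumes two pages per sheet and pages counted from 1 at the first sheet.
    """
    pages_per_bundle = sheets_per_bundle * 2
    ranges = []
    for first in range(1, page_count + 1, pages_per_bundle):
        last = min(first + pages_per_bundle - 1, page_count)
        ranges.append((first, last))
    return ranges

# Example: a hypothetical 250-page paperback in bundles of 10 sheets.
for number, (first, last) in enumerate(bundle_page_ranges(250, 10), start=1):
    print(f"bundle {number}: pages {first}-{last}")
```

The same function works for a larger feeder: pass 50 instead of 10 for a scanner that holds around 50 sheets per load.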
#12 | |
Wizard
Posts: 1,449
Karma: 58383
Join Date: Jul 2009
Device: Kindle, iPad
|
Quote:
|
|
#13 | |
Grand Sorcerer
Posts: 11,732
Karma: 128354696
Join Date: May 2009
Location: 26 kly from Sgr A*
Device: T100TA,PW2,PRS-T1,KT,FireHD 8.9,K2, PB360,BeBook One,Axim51v,TC1000
|
Quote:
If you can set the scan window to cut out the page headers and/or footers, you should get something that is pretty good and readable out of the OCR software. I've used both ABBYY and Nuance ScanSoft (mostly the latter), and what I do is scan to an MS Word dual-column format (for two-pages-at-a-time scans) and save as a "true view" Word doc that retains the dual-column layout. Then I open it in WordPad and resave as RTF. This seamlessly blends the layout into a single text stream. I save both files until I'm ready to proof. Then I open it in MS Word, run a cleanup macro, and begin proofing.
If you skip proofing, you should have a file about as readable as a typical Topaz or PDF file, usually better. The most common problem is with paragraph endings, and that's something the macro can easily handle.
The reason I stick with the extra trouble of the flatbed is that, while I don't care about the smell of books, it goes against the grain to destroy a book under any condition.
I'll have to consider subcontracting the page flipping. I have some young (tween) cousins who can probably use the pocket money. Thanks for the hint.
Last edited by fjtorres; 07-09-2010 at 04:51 PM.
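The Word cleanup macro mentioned above isn't shown in the thread, so the following is not that macro. It is a minimal Python sketch of one common way to handle false paragraph endings, assuming the OCR text has been exported as plain text with one "paragraph" per line: any chunk that doesn't end in sentence-closing punctuation is merged with the chunk that follows. The function name and the heuristic are assumptions for illustration.

```python
import re

# A chunk "really" ends if it finishes with . ! or ?,
# optionally followed by closing quotes or brackets.
SENTENCE_END = re.compile(r'[.!?]["\')\]]*$')

def merge_false_breaks(paragraphs):
    """Merge chunks that were split mid-sentence by a line or page break."""
    merged = []
    for raw in paragraphs:
        text = raw.strip()
        if not text:
            continue  # drop empty chunks left by the export
        if merged and not SENTENCE_END.search(merged[-1]):
            merged[-1] = merged[-1] + " " + text
        else:
            merged.append(text)
    return merged

sample = ["He walked to the", "door and stopped.", "", '"Who is there?"', "No answer came."]
for paragraph in merge_false_breaks(sample):
    print(paragraph)
```

The heuristic is crude: headings and lines of verse don't end in a full stop and will be glued to whatever follows, so the result still needs proofing.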
|
#14 |
Evangelist
Posts: 412
Karma: 546196
Join Date: Mar 2009
Location: UK canal boat
Device: sony prs505, prs650, kobo Glo HD liseuses
|
Post scan, the readability varies a lot:
- the OCR gets hung up over the signature at the bottom of pages;
- the odd spot of mould, squashed fly, etc. doesn't help the process;
- being older texts, with older fonts, the OCR struggles ('lie' for 'the', e.g.);
- usual problems with hyphenation;
- confusion with single quotes and double quotes.
So, readability ranges from OK to Yukh! But since I'm going to put it into Sigil anyway, I might as well make a good job of it.
Being an anorak with pretensions to nerd-hood, I've kept a record (actually a database) of the time taken to create ebooks. I find that a pre-existing electronic text takes about 3 hours, including collection of graphics for front cover etc. Unsurprisingly, the full editing process from original paper takes about 9 hours. Plus an unspecified amount of time to read properly and pick up the remaining typos.
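A few of the problems in the list above lend themselves to a first automated pass before hand editing. The sketch below is one possible approach, not anyone's actual workflow from this thread: it rejoins words hyphenated across line breaks, normalises quote characters (treating two adjacent single quotes as a misread double quote, which is an assumption), and flags words from a hypothetical watch list such as 'lie', which can't be fixed blindly because it is also a real word.

```python
import re

def rejoin_hyphenated(text):
    """Join words split by a hyphen at a line break: 'prob-\\nlem' -> 'problem'."""
    return re.sub(r"(\w)-\n(\w)", r"\1\2", text)

def normalize_quotes(text):
    """Straighten curly quotes and turn doubled single quotes into one double quote."""
    text = text.replace("\u2018", "'").replace("\u2019", "'")
    text = text.replace("\u201c", '"').replace("\u201d", '"')
    return text.replace("''", '"')

def flag_suspects(text, watch_words=("lie",)):
    """Return (line_number, word) for each watch-list word, for manual review."""
    hits = []
    for number, line in enumerate(text.splitlines(), start=1):
        for word in re.findall(r"[A-Za-z]+", line):
            if word.lower() in watch_words:
                hits.append((number, word))
    return hits

page = "''Open lie door,'' she said. It was a prob-\nlem."
cleaned = normalize_quotes(rejoin_hyphenated(page))
print(cleaned)
print(flag_suspects(cleaned))
```

Note that the hyphen rule also removes genuine hyphens that happen to fall at a line end (as in "well-known"), so those need restoring during proofing.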
#15 |
Wizard
Posts: 1,449
Karma: 58383
Join Date: Jul 2009
Device: Kindle, iPad
|
Thanks, fjtorres and alecE.
Sure thing about the subcontracting possibility, fj. What do you think is a fair rate to pay? |