MobileRead Forums (https://www.mobileread.com/forums/index.php)
-   Workshop (https://www.mobileread.com/forums/forumdisplay.php?f=178)
-   -   From physical to digital (https://www.mobileread.com/forums/showthread.php?t=44644)

maynard 04-10-2009 04:33 PM

From physical to digital
 
I have a somewhat large library of real books. They're a real PITA to carry around, and even worse to ship for a household move. So, I'd like to consider scanning in my book collection for use on an ereader. Some questions:

- Do most people just scan to .jpg and format for the screen?

- If so, about how large of an image file is each page?

- What tools do you use to batch process the files?

- auto crop and rendering for a specific device's resolution


Or are folks using OCR to convert to text?

- Do you notice significant errors creeping into the text?

- What do you do to fix those errors?

- How do you find obvious text errors without manual editing?

BTW: I just bought a Sony PRS 505, but I expect to buy one of the larger full page units as soon as they hit the market (or the wireless iRex unit with a functional firmware hits).

pepak 04-10-2009 05:18 PM

Quote:

Originally Posted by maynard (Post 424620)
- Do most people just scan to .jpg and format for the screen?

Don't know about most people, but I OCR the scans and take pains to proofread and correct them.

Quote:

- If so, about how large of an image file is each page?
In 300DPI PNG, which I use, something between 1.5 and 3 MB per page, depending on complexity. Much less with JPEG, obviously.
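For a rough sense of where numbers like these come from, here is a back-of-the-envelope sketch in Python. Only the 300 DPI and the 1.5 to 3 MB per page come from the post above; the 5.5 x 8.5 inch page, the 8-bit grayscale depth, and the 300-page book are assumptions chosen for illustration.

```python
def raw_page_bytes(width_in, height_in, dpi, bits_per_pixel=8):
    """Uncompressed size of one scanned page, in bytes."""
    width_px = round(width_in * dpi)
    height_px = round(height_in * dpi)
    return width_px * height_px * bits_per_pixel // 8


def book_size_mb(pages, mb_per_page_low, mb_per_page_high):
    """Low and high estimate of a whole book's scan size, in MB."""
    return pages * mb_per_page_low, pages * mb_per_page_high


# Assumed page: 5.5 x 8.5 inches, 8-bit grayscale, scanned at 300 DPI.
# That is 1650 x 2550 pixels at one byte each, before PNG compression.
raw = raw_page_bytes(5.5, 8.5, 300)
print(f"raw page: {raw / 1e6:.1f} MB")

# Assumed 300-page book at the 1.5 to 3 MB per page quoted above.
low, high = book_size_mb(300, 1.5, 3.0)
print(f"whole book: {low:.0f} to {high:.0f} MB")
```

Under these assumptions a raw grayscale page is a little over 4 MB, so a lossless PNG of 1.5 to 3 MB is plausible, and a whole book lands somewhere in the hundreds of megabytes.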

Quote:

- Do you notice significant errors creeping into the text?
Way too many, yes.

Quote:

- What do you do to fix those errors?
They need to be hand-fixed.

Quote:

- How do you find obvious text errors without manual editing?
A few typical errors can be found and fixed with regular expressions, but most of them require a manual approach.
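As a minimal sketch of the regular-expression approach, the Python below applies a few substitution rules to OCR output. The specific rules are assumptions about common OCR confusions, not the rules pepak actually uses, and each one can misfire (for example, the hyphen rule will also join genuine compounds such as "well-known" when they break across lines), so the result still needs proofreading.

```python
import re

# Hypothetical rules for common OCR confusions; tune them per book.
RULES = [
    # Rejoin words hyphenated across a line break: "exam-\nple" -> "example"
    (re.compile(r"(\w)-\n(\w)"), r"\1\2"),
    # A digit 1 between lowercase letters is probably a lowercase "l"
    (re.compile(r"(?<=[a-z])1(?=[a-z])"), "l"),
    # A digit 0 between lowercase letters is probably a lowercase "o"
    (re.compile(r"(?<=[a-z])0(?=[a-z])"), "o"),
    # Collapse runs of spaces or tabs into one space
    (re.compile(r"[ \t]{2,}"), " "),
    # Remove stray spaces before punctuation
    (re.compile(r" +([,.;:!?])"), r"\1"),
]


def clean_ocr(text):
    """Apply each rule in order and return the corrected text."""
    for pattern, replacement in RULES:
        text = pattern.sub(replacement, text)
    return text


sample = "The wor1d is a b0ok ,  and exam-\nple"
print(clean_ocr(sample))  # The world is a book, and example
```

Keeping the rules in a list makes it easy to add or drop one after seeing what a particular scan gets wrong.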

Steven Lyle Jordan 04-10-2009 05:46 PM

Guys, a method I've found to improve the quality of the OCR process involves photocopying of the book's pages first, expanding the page image up to letter/A4 size. Then scan those letter-sized pages into an OCR scanner. Many of the better OCR scanners can allow standard-sized paper to be fed into them and read at high-speed, removing the need to manually scan each page (though you'll still end up doing that at the earlier copier stage). And the expanded letters will be easier for the OCR program to read, resulting in fewer errors.

Personally, I feel the 2-step photocopy-scan process is worth the creation of scanned pages with fewer errors.

Occasionally, you luck out and discover that a particular error happens regularly, and you can fix it with a "find-and-replace all" process. But you should still go through every page manually.
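One way to spot errors that recur often enough to justify a "find-and-replace all" is to count the words that are missing from a word list. The Python sketch below does that; `known_words` is a hypothetical set you would load from any dictionary file, and the approach is an illustration, not something described in this thread. It only suggests candidates and does not replace a manual read-through.

```python
import re
from collections import Counter


def suspicious_tokens(text, known_words):
    """Count tokens not found in known_words, most frequent first."""
    counts = Counter()
    for token in re.findall(r"[A-Za-z0-9']+", text):
        if token.isdigit():
            continue  # plain numbers (years, page numbers) are not errors
        word = token.lower()
        if word not in known_words:
            counts[word] += 1
    return counts.most_common()


# Tiny hypothetical word list; a real one would have many thousands of entries.
known_words = {"the", "cat", "saw", "dog", "and", "bird", "in"}
text = "Tbe cat saw tbe dog and Tbe bird in 1984"
print(suspicious_tokens(text, known_words))  # [('tbe', 3)]
```

A token that shows up many times, like "tbe" here, is a good candidate for a single global replacement. Tokens that appear once are more likely names or one-off errors that need a human look.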

pepak 04-10-2009 05:52 PM

Quote:

Originally Posted by Steve Jordan (Post 424737)
Guys, a method I've found to improve the quality of the OCR process involves photocopying of the book's pages first, expanding the page image up to letter/A4 size.

You could just as well scan in higher resolution, e.g. 600 DPI.
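To make the comparison concrete, the Python sketch below estimates the "effective" resolution of the photocopy-then-scan route: enlarging a page before scanning multiplies the scan DPI by the enlargement factor. The 4.25 x 7 inch paperback page and the 300 DPI second-stage scan are assumptions for illustration, and the calculation is purely geometric; it ignores any detail lost in the photocopier itself.

```python
def effective_dpi(scan_dpi, orig_w, orig_h, target_w, target_h):
    """Resolution relative to the original page after enlarging it to fit
    the target sheet (aspect ratio preserved) and scanning at scan_dpi."""
    scale = min(target_w / orig_w, target_h / orig_h)
    return scan_dpi * scale


# Assumed: a 4.25 x 7 inch page enlarged to fit US letter (8.5 x 11 inches),
# then scanned at 300 DPI.
enlarged = effective_dpi(300, 4.25, 7.0, 8.5, 11.0)
print(f"photocopy + 300 DPI scan: about {enlarged:.0f} DPI effective")
print("direct scan: 600 DPI")
```

Under these assumptions the height limits the enlargement to roughly 1.57x, so the two-step route yields less than 600 DPI relative to the original page, which is consistent with the suggestion to simply scan at a higher resolution.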

Quote:

Many of the better OCR scanners can allow standard-sized paper to be fed into them and read at high-speed, removing the need to manually scan each page
Actually, the scanning process is a lot less demanding than I originally thought. With Plustek OpticBook 3600, I am doing slightly more than 100 pages every 20 minutes, getting a normal-sized book completely scanned in about an hour.

maynard 04-10-2009 06:42 PM

Do you cut the book up with a razor and then feed the pages through the feeder? I have a Brother multifunction unit with a feeder that supports double-sided scanning. Or are you doing everything possible to save the original copy and binding?

pepak 04-11-2009 01:23 AM

With OpticBook, there is no need to damage the book.

Steven Lyle Jordan 04-11-2009 11:44 AM

Quote:

Originally Posted by pepak (Post 424744)
Actually, the scanning process is a lot less demanding than I originally thought. With Plustek OpticBook 3600, I am doing slightly more than 100 pages every 20 minutes, getting a normal-sized book completely scanned in about an hour.

Professional autofeed scanners (the kind used by document printers like the Xerox DocuTech and similar platforms, found in many on-demand print shops like Kinko's) can scan 100 pages in 2 minutes or less, and with the right OCR software, generate scanned images of those in under 10 minutes. (Not that the OpticBook rate is bad, I'm just saying there are faster methods that also work well.)

Usually, the only catch is getting shops that have this equipment to allow you to use it, as they tend to assume your scanning a book to do an OCR is probably illegal...

pepak 04-11-2009 01:51 PM

Quote:

Originally Posted by Steve Jordan (Post 425291)
(Not that the OpticBook rate is bad, I'm just saying there are faster methods that also work well.)

They are, but
1) I can't afford the non-destructive ones, and
2) I don't want to use the destructive ones.

Steven Lyle Jordan 04-11-2009 04:27 PM

Quote:

Originally Posted by pepak (Post 425382)
They are, but
1) I can't afford the non-destructive ones, and
2) I don't want to use the destructive ones.

Right. But remember, I said to photocopy the pages first, which does not have to be destructive... and run the photocopied pages through the scanner. Trust me, I have done this, and it's not so bad, and not that hard on the book (provided it doesn't have a flimsy spine).

chumbucket 04-12-2009 12:50 PM

I'm torn as to which method to use to convert some books so that I can read them on my Sony 505. Should I plunk down the cash for an OpticBook scanner or rig up a homebrew setup to take pictures of the pages? I want to save money but I also want the fastest method. I was contemplating getting a new camera soon anyways.
What is the width of the lip on the OpticBook scanner anyways?

pepak 04-12-2009 12:59 PM

Quote:

Originally Posted by chumbucket (Post 426038)
What is the width of the lip on the OpticBook scanner anyways?

Not sure what you mean by "lip", but I assume it to mean "how much space do I need between the spine and the start of the text". With OpticBook, you need at least 6 millimeters or so, with 8-10 millimeters being comfortable enough. It depends on the tightness of the book, of course, and I am having a much easier time with hardcovers than with paperbacks.

chumbucket 04-12-2009 02:39 PM

Quote:

Originally Posted by pepak (Post 426048)
Not sure what you mean by "lip", but I assume it to mean "how much space do I need between the spine and the start of the text". With OpticBook, you need at least 6 millimeters or so, with 8-10 millimeters being comfortable enough. It depends on the tightness of the book, of course, and I am having a much easier time with hardcovers than with paperbacks.

Do you find that you have to hold the book down to get it to stop curling up? Is that why you say that hardcovers work better? How does the OpticBook scanner work with ABBYY FineReader? How long would you say it takes to scan, say, 100 pages or so?

Thanks!

pepak 04-12-2009 03:28 PM

Quote:

Originally Posted by chumbucket (Post 426140)
Do you find that you have to hold the book down to get it to stop curling up?

I probably wouldn't have to hold it, but I get better results if I do.

Quote:

Is that why you say that hardcovers work better?
Mostly because hardcovers are a lot more tolerant to full opening without getting damaged. Paperbacks, I tend to read while opened less than 90 degrees to preserve the spine.

Quote:

How does the OpticBook scanner work with ABBYY FineReader?
I have never tried scanning from inside FineReader (*), but when I scanned in the application provided with the scanner and then imported the images into FineReader, it worked just fine.

*) The scanner and its software were apparently created by someone who thought long and hard about how it would be easiest to scan books, and it shows - it really is very comfortable and easy to do. Most of my scanning is done without looking at the screen, with just occasional glances to make sure everything is still fine.

Quote:

How long would you say it takes to scan say 100 pages or so?
It really depends on the book, but I average 100 pages every 20 minutes (give or take a minute).
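The rate quoted above turns into a per-book estimate with one line of arithmetic, sketched here in Python. The 100 pages per 20 minutes is pepak's figure; the 300-page and 450-page books are assumed examples.

```python
def scan_minutes(pages, pages_per_batch=100, minutes_per_batch=20):
    """Estimated scanning time in minutes at a steady rate."""
    return pages * minutes_per_batch / pages_per_batch


# Assumed book lengths, for illustration only.
for pages in (300, 450):
    print(f"{pages} pages: about {scan_minutes(pages):.0f} minutes")
```

At this rate a 300-page book takes about 60 minutes, which matches the earlier remark about getting a normal-sized book scanned in about an hour.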

Jellby 04-12-2009 04:24 PM

Quote:

Originally Posted by pepak (Post 426169)
Mostly because hardcovers are a lot more tolerant to full opening without getting damaged. Paperbacks, I tend to read while opened less than 90 degrees to preserve the spine.

It's a matter of which kind of binding they have. Hardcovers are usually sewn, while paperbacks are just glued (or whatever the right terms are).


All times are GMT -4.
