I want to digitize my paper books.


llwwss
01-11-2008, 04:25 AM
I have many thick academic books in paper form that i want to read anywhere.
Unfortunately, these books are not available in ebook forms, and i doubt it will be possible any time soon.
So i want to make my own ebooks out of paper books that i already have so that i can read them on the go and on the bed with ebook readers like prs505 or Cybook gen3.

As an individual, is it possible to digitize books into ebook files
or should i contact a company which does book digitization?
If i can do this myself, what equipment do i need?

DMcCunney
01-11-2008, 04:51 AM
As an individual, is it possible to digitize books into ebook files or should i contact a company which does book digitization?
If i can do this myself, what equipment do i need?

Forget it.

It's very difficult and time consuming, even if you already have the required equipment and skill in using it. If you don't have the equipment or the skill, it will be close to impossible.

And a company that does this won't help you either. Their first question will be "Do you have the right to do this?". You don't. Someone else holds the rights and would have to approve it. They would be in violation of the law doing it without that approval, and they won't touch the job.

And even if you got the approval, it would be far more expensive than it was worth. I can easily see a charge of thousands of dollars per book.

If the books you want to read don't exist in electronic format, resign yourself to paper editions. Seriously.

(For an idea of what has to be done to make a paper book into an electronic version, visit the Distributed Proofreader's site, at http://www.pgdp.net/c/default.php . They do the proofing on the files that become Project Gutenberg titles.)
______
Dennis

slayda
01-11-2008, 09:56 AM
I have many thick academic books in paper form that i want to read anywhere.


Unfortunately, as stated, the typical academic book is quite difficult due to such things as images, figures, charts & equations, unless you can be satisfied with the PDF files that result from the scanning. These can be very large files, slow to read on the typical ebook reader as well as being too small to view adequately.

I have done volunteer work for Project Gutenberg and they don't even have the proof readers bother with these. They have what they call formatters do this work.

On the other hand, if it is purely text, then it is a job you can handle yourself with an appropriate scanner and OCR software plus some sort of editing software such as MS Word. It is still a lot of work. How much work partly depends on the equipment and SW and your experience.

vivaldirules
01-11-2008, 10:11 AM
I've tried doing a few chapters of a few books and it's really tough going. The best net rate that I could do was about one minute per page with the result being a PDF file that I had to run through PDFLRF to put on my Sony Reader. At that rate, I'd need to add a few decades to my life expectancy to do this for my library. Ain't happenin'.

recycledelectron
01-11-2008, 10:47 AM
Forget it.

It's very difficult and time consuming, even if you already have the required equipment and skill in using it. If you don't have the equipment or the skill, it will be close to impossible.

Nonsense. $60 and a weekend will get the first one, if you have a decent digital camera. A few hours per book after that.

Don't let these illegitimi carborundum you.

And a company that does this won't help you either. Their first question will be "Do you have the right to do this?". You don't. Someone else holds the rights and would have to approve it. They would be in violation of the law doing it without that approval, and they won't touch the job.

You are ASS-U-ME ing that the books have copyright notices that prevent them from being copied into digital form. That's absurd. Many books have copyright notices that allow you to convert them to another format for your personal use.

And even if you got the approval, it would be far more expensive than it was worth. I can easily see a charge of thousands of dollars per book.

LMAO! I rip several books a day, it's easy.

Here's the hardware you need:

(1) digital camera, preferably SLR, in the 5 MP or better range for most books. Academic books can be large (8.5" x 11" pages) with small text. In that setting, more MP is better. I'm currently using a ($80) 6.2MP Samsung S630, point-and-shoot. It works for text-based hard covers, but it sucks for huge, college math books. I'm saving for a 10-12MP DSLR.

(2) tripod (mine was $18.88 at Wal-Mart.)

(3) book cradle - search this site and Google - I got all my ideas from a few searches. Mine cost $40 in parts. See http://www.mobileread.com/forums/showthread.php?t=13848&highlight=cradle

Unfortunately, as stated, the typical academic book is quite difficult due to such things as images, figures, charts & equations, unless you can be satisfied with the PDF files that result from the scanning. These can be very large files, slow to read on the typical ebook reader as well as being too small to view adequately.


BS

The correct solution is to snap photos of the books, and use these photos (JPEGs.) Yes, a 500 page math book can take over a gig, but it's usable. I use a 2GB SD card in my PRS-505 to store a book, and flip through the images. As processing power increases, we'll be able to use them even more easily. As OCR software improves, we may (one day) be able to OCR the equations.

Andy

JSWolf
01-11-2008, 10:50 AM
Moved since this is a general purpose thread and not just for the 500/505.

vivaldirules
01-11-2008, 11:35 AM
Don't let these illegitimi carborundum you.

My apologies! I think your efforts are valiant, recycledelectron, but I don't think this is an activity for the faint at heart. If a 6.2 Mpixel image is not good enough for a textbook page, my heart flutters to imagine what is. And "paging" through a book by flipping between jpegs that total 1 Gbyte or more has me swooning. I'm glad this works for you but this won't for me - ever. I need a process that is a lot less intense and time-consuming.

I have images of you in a dark basement frantically turning pages. Camera flashes lighten the room every few seconds. The flipping of pages and the whir of fans from a couple of PCs accumulating the photos is the only sound. This goes on from early morning until late at night for days at a time. You've given up work and family and the only time anyone sees you is when the pizza guy shows up. He sees the evil grin on your sweating face as you tell him about how quickly you digitized the complete Oxford English dictionary last week. He happily leaves quickly with a small tip.

Sorry for having fun with you but I couldn't help it. I certainly hope the image I have is wrong!:)

dcalder
01-11-2008, 04:10 PM
If you're looking for "readable" rather than archive-quality, all you need is a decent scanner and VueScan. Heck, even "archive-quality" is do-able, as long as your original book isn't too fragile to handle being opened out flat on the scanner bed.

Using VueScan Professional and my beloved old AcerScan 610ST (with the Adaptec USBXchange SCSI-to-USB adapter), I've done a few doujinshi for scanlation projects and a few out-of-print fanzines for friends who had material published in them but couldn't afford the zine at the time (not all fanzines can afford to give free copies to contributors). A 100+ page fanzine/doujinshi scanned on the "magazine" setting ends up somewhere around 2GB as raw DNG files. Either save as more than one file-type during the scanning process or you can later point VueScan back at those DNG files and re-scan them to a multi-page TIFF or PDF. With a size-reduction setting of 3, you end up with a single file in the neighbourhood of 150+MB - averaging just over 1MB per page for text pages. I think that file with size-reduction setting of none ended up around 330+MB. Keep in mind, of course, that these are pure image PDFs, not text! VueScan can also output as JPEGs, with both the size-reduction setting and file compression setting being configurable.

VueScan can do OCR as well, so text PDFs should be possible, but I have yet to attempt it (though if I do decide to, I can work directly with the DNG files and don't need to re-scan the original). The image PDFs are more than clear enough for the purposes that they're being used for. I highly recommend that anyone who's ever cursed their scanning software take a good long look at VueScan. It's reasonably priced and infinitely better than any other scanner software that I've ever checked out. It can "scan" from disk, scanner, digital camera, etc., and can be set up to do automatic scans at regular intervals, batch scans, etc. A very useful, versatile 'tool' for anyone's software 'toolkit'.

Edit:

Just took a few minutes to toss one of the afore-mentioned scan-generated PDFs on my Cybook. Considering all the previous comments on the complete unsuitability of any ebook reader other than the iLiad for reading PDFs, I hadn't bothered before. But, in the interests of research, I thought it was worth a shot.

The book in question is 109 printed pages, mainly text but with a few drawings and comics; only a couple of pages are full-colour. The PDF file is 168MB and the Cybook handled it with ease. There was a slight delay in turning pages, but then, there's also a slight delay in paging through it on the computer (much like 90% of the PDFs I've ever viewed), so that's rather a moot point. I was able to read it in portrait mode fit to page (yes, really!) but then I run my computer monitor at a resolution that makes other people squint and reach for a magnifying glass, so... *shrug* In landscape mode, fit to width, it was perfectly readable for the average person - probably comparable to the text in the average mass market paperback. And the original of this is an 8 1/2" x 11" fanzine, with text in two columns. So, the answer to the question "is scanning a book as PDF for viewing on the Cybook possible" is definitely a resounding "yes" - at least for someone with reasonably good vision (I wear glasses for distances but not for reading and usually not for the computer monitor either).

I'd suggest, in future, that the more reasonable response to questions about PDFs in general on the Cybook be less of an immediate "no, they're not any good" because, frankly, I think that's a rather inaccurate answer and won't necessarily hold true for everyone. They're not necessarily unreadable, even if they haven't been optimized for viewing on such a small screen. If I were really planning to use this particular file on the Cybook, I'd probably run VueScan back through the raw DNG files, crop out the excess margins to improve display size, maybe play a bit with the file-size reduction settings in hopes of improving display speed, and then generate a new PDF, at which point I would probably be comfortable reading the whole thing in portrait mode.

Note: As far as scanning time goes, this particular 109 page zine took two-three hours to scan - in part because, while I was doing that on the desktop, I was playing a game and browsing the web on the laptop. Theoretically, it should be possible to get the scanning done much more quickly, if that was the only task being carried out.

HarryT
01-12-2008, 05:22 AM
Note: As far as scanning time goes, this particular 109 page zine took two-three hours to scan - in part because, while I was doing that on the desktop, I was playing a game and browsing the web on the laptop. Theoretically, it should be possible to get the scanning done much more quickly, if that was the only task being carried out.

If you're willing to destroy the book, removing the binding and using a scanner with a sheet feeder will get the job done in minutes. That's how DP get their page scans, I believe.

Sparrow
01-12-2008, 03:16 PM
I'd suggest, in future, that the more reasonable response to questions about PDFs in general on the Cybook be less of an immediate "no, they're not any good" because, frankly, I think that's a rather inaccurate answer and won't necessarily hold true for everyone. They're not necessarily unreadable, even if they haven't been optimized for viewing on such a small screen.

This is a good point :2thumbsup.
I've only recently tried PDFs on my CyBook because I'd seen the negative reports here - but was surprised to find that they're actually perfectly readable (for me - I'm nearsighted and can read PDFs on my CyBook without my specs).
I can appreciate some people might have problems; but everyone should see for themselves - they may be pleasantly surprised. :)

RWood
01-12-2008, 04:30 PM
I did some scan/OCR work for the Harvard Classics series. It is not hard. Depending upon your ability at editing it can be a nightmare or something less. (Years of editing helped for me.)

The only way to know for yourself is to try it for yourself. Don't let any of us stand between you and your goal. We all have experience, but not your experience.

slayda
01-12-2008, 04:35 PM
Yes PDF "images" can be readable, depending on the original size. If scanned from a book with pages near the size of the Cybook screen then there should be no trouble reading it. However you will not have a book, only a series of pictures of pages stuck together. It will not reflow, you won't be able to use a dictionary on the images of the words, etc. In addition, most academic books have larger pages, some even larger than 8.5 x 11. Even with young eyes, this will be difficult reading.

What I spoke of is creating editable text from scanned books. And it is true that eventually we will have equation editors, etc. but we don't now, at least in general. (There are some very specialized equation editors.)

As a comparison, I recently scanned a paperback book with over 1000 pages. Scanned at 600DPI, it took almost an hour to scan. Then I spent about 4 hours cleaning up the OCR errors. This was a good quality printed book. Cheaper quality usually generates more errors.

This experience, including one that had a half dozen equations that I kept as JPEGs in the text, is what I based my previous statements on. BTW the scanned PDF file was about 122.6 MB but the final RTF was only about 3.4 MB. IMO a significant reduction.:bookworm:

-Thomas-
01-12-2008, 09:04 PM
I'm a student at a German university, and we have integrated copying and scanning devices all over the campus. With these devices you can scan your books very fast in a readable quality (even for figures) and send the resulting PDF directly via email. Very convenient! They even have those devices in the reference library :2thumbsup

I already scanned a 300 page paperback (-> 150 scans), it took about 15 minutes. Maybe you have something similar nearby?

Patricia
01-12-2008, 09:26 PM
We aren't allowed to do this in the UK. At my university there are signs above the photocopiers saying that we are only allowed to copy one chapter from a book for copyright reasons. And students aren't given access to scanners without the material being checked for copyright by a librarian.

tompe
01-12-2008, 09:47 PM
We aren't allowed to do this in the UK. At my university there are signs above the photocopiers saying that we are only allowed to copy one chapter from a book for copyright reasons.

Is this restriction for copying to yourself or copying to the class? Our rules are that a teacher can copy a whole book for himself. For the class you can copy a maximum of 15% and 15 pages and distribute in the class. If you want to copy more you have to ask for permission.

recycledelectron
01-13-2008, 02:04 AM
My apologies! I think your efforts are valiant, recycledelectron,

When I referred to illegitimi, I was referring to the copyright mafiAA. I hate it when someone is told they can not legally do something.

The only law should be to not deprive anyone of their life, liberty, or property, except in self defense or in the defense of an innocent person. Ripping a book that is not available as an eBook is NOT wrong, as it does not deprive anyone of life, liberty, or property.

Telling someone to give up is very distasteful, as it discourages innovation. Innovation is what allows me to live in an air conditioned home, use PCs, and go hunting instead of getting eaten by big predators.

but I don't think this is an activity for the faint at heart. If a 6.2 Mpixel image is not good enough for a textbook page, my heart flutters to imagine what is.

Actually, the 6.2MP camera works fine when correctly focused, but the auto-focus causes me problems. It will get 2/3 of the page fine, but the print near the edge is a problem when taking in a large page. Therefore, a 6MP DSLR or a 9MP point-and-shoot should work on the worst text books.
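
A rough way to check whether a given camera has enough pixels for a large page is to convert megapixels into an effective dots-per-inch figure. The sketch below is only an estimate under stated assumptions: a 4:3 sensor (typical for point-and-shoots; many DSLRs are 3:2), the page filling the frame with no border, and perfect focus across the whole page. The function name `effective_dpi` is made up for illustration.

```python
import math


def effective_dpi(megapixels, page_w_in, page_h_in, aspect=(4, 3)):
    """Estimate dots per inch when one page fills as much of the frame as it can.

    Assumes a sensor with the given aspect ratio, the page's long side along
    the sensor's long side, and no wasted border; real shots will do worse.
    """
    long_ratio, short_ratio = aspect
    total_pixels = megapixels * 1_000_000
    # Split the pixel count into a long and a short side for this aspect ratio.
    short_px = math.sqrt(total_pixels * short_ratio / long_ratio)
    long_px = total_pixels / short_px
    page_long = max(page_w_in, page_h_in)
    page_short = min(page_w_in, page_h_in)
    # The page has to fit in both directions, so the tighter one wins.
    return min(long_px / page_long, short_px / page_short)


for mp in (6.2, 9, 12):
    print(f"{mp} MP on 8.5 x 11 in: about {effective_dpi(mp, 8.5, 11):.0f} dpi")
```

By this estimate a 6.2 MP frame lands at roughly 250 dpi on an 8.5" x 11" page, before any loss from focus or borders, which fits the experience that it is borderline for small print on large pages.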

Digital cameras are dropping in price so fast, that if the camera's price fazes you, wait a semester and they will be cheaper.

And "paging" through a book by flipping between jpegs that total 1 Gbyte or more has me swooning. I'm glad this works for you but this won't for me - ever. I need a process that is a lot less intense and time-consuming.

I can photo 500 pages an hour. Then, they copy at a rate of several thousand pages an hour to my PC via USB from a card reader. During that time, I can rename them. This is necessary because I snap pics of the odd pages first, and then do the even pages. After I rename them with the page number as the name, they fall in alphabetical order.
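
The renaming step can be scripted. What follows is a minimal sketch, not the exact procedure used above: it assumes the files come off the card in shooting order, that all odd pages were shot first (1, 3, 5, ...) and then all even pages (2, 4, 6, ...) in ascending order, and that you know how many odd pages there were. The names `plan_renames`, `apply_renames` and the `page_0001.jpg` pattern are invented for the example.

```python
from pathlib import Path


def plan_renames(shot_order, odd_count, first_page=1, width=4):
    """Map camera filenames (in shooting order) to page-numbered names.

    Assumes the first `odd_count` shots are pages first_page, first_page+2, ...
    and the remaining shots are pages first_page+1, first_page+3, ...
    """
    plan = []
    for i, name in enumerate(shot_order):
        if i < odd_count:
            page = first_page + 2 * i
        else:
            page = first_page + 1 + 2 * (i - odd_count)
        suffix = Path(name).suffix.lower()
        # Zero-padding makes alphabetical order match page order.
        plan.append((name, f"page_{page:0{width}d}{suffix}"))
    return plan


def apply_renames(folder, plan):
    """Rename files on disk; refuses to overwrite an existing file."""
    folder = Path(folder)
    for old, new in plan:
        target = folder / new
        if target.exists():
            raise FileExistsError(target)
        (folder / old).rename(target)


# Dry run: print the plan without touching any files.
shots = ["IMG_0101.JPG", "IMG_0102.JPG", "IMG_0103.JPG",
         "IMG_0104.JPG", "IMG_0105.JPG"]
for old, new in plan_renames(shots, odd_count=3):
    print(old, "->", new)
```

Printing the plan first and only then calling `apply_renames` on the real folder is a cheap safeguard, since one skipped page during shooting shifts every number after it.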

It takes a day or two over a weekend to digitize all my text books for that semester, so count off maybe 2 weekends a year to relieve myself of carrying a dozen text books at a time. Instead, I'm the one with the tiny notepad-sized case.

My personal library beats the university library, and fits in the passenger seat of my pickup.

As for the GB size, my PRS-505 changes to the next pic as quickly as it flips between pages in a PDF. The zoom works MUCH better on JPEGs than it does on PDFs. I like JPEGs better than PDFs on the PRS-505.

I have images of you in a dark basement frantically turning pages.

Good lighting is essential to good book ripping ;o)

Sorry for having fun with you but I couldn't help it. I certainly hope the image I have is wrong!:)

You are very wrong. I spend 2 weekends a year digitizing my text books, and am the only person on the faculty who does not drag home massive bags of books. I grab my eBook reader, and a note pad in a small case, and go with that. I've got everything I need right there.

Last semester, during finals week, a student walked up to me while I was eating lunch on campus and asked if I had graded his paper. I had previously scanned it and the other papers with an ADF, and saved it on an SD card. While he watched, I dropped the right card in my eBook, graded it, and recorded the grade on a note pad to mark in my online grade book later. He was astounded that I had everything right there in a 1-pound package.

Andy

P.S.

Most people don't read or study.

What would happen if you always had access to every book ever written, and could instantly switch from reading to listening to the audio book at that exact word? (When you get bored, when your eyes get tired, or when you have to drive somewhere, you switch seamlessly.)

Could a bright, self-motivated kid get an education in the world's least competent school?

What if that reader did not depend on any outside technology? (i.e., it was solar powered, and rugged like a tennis shoe.)

Think of the regimes that have burned books. Could a government keep its people ignorant?

What would happen to the self reliance of individuals, when they can bring up a manual on auto repair on the side of the road?

How much better off would a patient be, if they could pull up a beginner's medical text when trying to understand a life-changing diagnosis? I've driven to the hospital, and would have liked to find the passage, then ask the eBook to read it to me.

vivaldirules
01-13-2008, 11:42 AM
Well, recycledelectron, I'm very impressed. My apologies, again! A day or two to do several textbooks might be acceptable even for me. Also, using JPEGs instead of PDFs put me off but I agree with you that the zooming and panning works fine and I wish Sony supported that for PDFs. But how do you deal with accessing page 123 and then flipping to page 812? Do you advance ten pages (images) at a time from the menu or do you use a hack? Also, I assume there's no linkable table of contents. Does that slow you down or do you have a solution for that, too?

shousa
01-19-2008, 09:16 AM
I have a number of books I am going to convert using recycledelectron's method of camera and tripod.

Any suggestions or tips recycledelectron over and above what you have written so far? eg how close should the camera be, you know the "finer" points.

Like the above question, can you access page 300 then go back to 200? (Not that that would be a deal breaker for me, just wondering.)

This seems good?
http://www.wikihow.com/Scan-a-Book-With-a-Digital-Camera.

jackbrown
01-22-2008, 02:10 PM
A cheap scanner at 300 dpi (black and white!) and software like Abbyy Finereader is all you need for this. Scanning, OCRing and PDFing a book takes a couple of hours. I do it all the time; you can read something else while you do it.

If you're going to use recycledelectron's method, try to figure out a way to quickly turn the images black and white (not grayscale!) as early in the process as possible, and turn the autofocus off; I used a setup like the one he describes for scanning a rare book, and took color pictures (big mistake); also didn't have good enough lighting for a really high contrast ratio. The resulting images basically sucked and I had a nightmarish time making the ebook. It'd be great if your camera could capture in black and white, but it almost certainly can't, so make sure you white balance it against a blank page in the room you are capturing in, then transform the captured files into bw before you OCR. Good luck, and like I said, I think a cheapo scanner is more practical, unless you need really large format captures.
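
As an illustration of the "transform the captured files into bw before you OCR" step, here is a minimal sketch of thresholding a grayscale page into pure black and white. Two loud assumptions: the page is represented as a plain list of rows of 0-255 gray values (in practice you would load and save real photos with an imaging library), and the threshold is picked with Otsu's method, which is a technique swapped in for the example and not something named in this thread. The function names are invented.

```python
def otsu_threshold(pixels):
    """Return the gray level (0-255) that best separates dark ink from light paper."""
    hist = [0] * 256
    for row in pixels:
        for value in row:
            hist[value] += 1
    total = sum(hist)
    sum_all = sum(level * count for level, count in enumerate(hist))
    best_level, best_variance = 0, -1.0
    weight_dark = 0
    sum_dark = 0
    for level in range(256):
        weight_dark += hist[level]
        if weight_dark == 0:
            continue  # no pixels this dark yet
        weight_light = total - weight_dark
        if weight_light == 0:
            break  # every pixel is already on the dark side
        sum_dark += level * hist[level]
        mean_dark = sum_dark / weight_dark
        mean_light = (sum_all - sum_dark) / weight_light
        # Between-class variance: large when ink and paper are well separated.
        variance = weight_dark * weight_light * (mean_dark - mean_light) ** 2
        if variance > best_variance:
            best_level, best_variance = level, variance
    return best_level


def binarize(pixels):
    """Turn a grayscale page into pure black (0) and white (255)."""
    threshold = otsu_threshold(pixels)
    return [[0 if value <= threshold else 255 for value in row] for row in pixels]


# Tiny fake "page": dark ink on the left, light paper on the right.
page = [[20, 30, 220, 230],
        [25, 35, 225, 235]]
for row in binarize(page):
    print(row)
```

A single global threshold like this only works when the lighting is even across the page, which is one more reason the lighting and white-balance advice above matters.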

philodox
01-22-2008, 03:14 PM
I've got a couple old books that are nearly falling apart... might be fun to try a scanner with auto feed. Destroying the books wouldn't be a problem at this point. Are there any decent and cheap ones that will take a scan of each side and keep the pages in the right order?

Once I have the images it would be easy enough [though perhaps time consuming] to reformat them as a PDF and use the built in OCR in Adobe Acrobat. Are there PDF to mobi convertors?

Even though each step may take a long time, if I can get a system working that only requires a small amount of user input between these large steps, it might be worth my while. :)

yvanleterrible
01-22-2008, 03:34 PM
I've got a couple old books that are nearly falling apart... might be fun to try a scanner with auto feed. Destroying the books wouldn't be a problem at this point. Are there any decent and cheap ones that will take a scan of each side and keep the pages in the right order?

Once I have the images it would be easy enough [though perhaps time consuming] to reformat them as a PDF and use the built in OCR in Adobe Acrobat. Are there PDF to mobi convertors?

Even though each step may take a long time, if I can get a system working that only requires a small amount of user input between these large steps, it might be worth my while. :)

Tried that with a circa sixties book. The paper was so bad that the first page actually got shredded in the scanner, causing a paper block and necessitating a dismantling of the device to get at the pieces.
The software included with the machine can take care of the order the pages come out, provided you don't make mistakes in feeding.
Do you have Acrobat Pro? I didn't know it did OCR!?!

aru
01-22-2008, 04:44 PM
Don't forget the Plustek Opticbook 3600, which takes 10-20 sec per page and then, if you want, OCRs it for you. If not, it still gets the orientation for even and odd pages right. It has a big button for the next page on the scanner itself, so you don't have to go back and forth to your computer. It scans paperbacks and bound books without problems due to the binding. You only have to open the book 90 degrees. This makes all the difference. In my opinion it is better than taking pictures with an SLR.
It takes me about an hour to get a reasonable sized book into my PC.

AnemicOak
01-22-2008, 08:01 PM
Here's an automatic book scanner made with legos...

http://www.geocities.jp/takascience/lego/fabs_en.html

slayda
01-22-2008, 08:23 PM
Are there any decent and cheap ones that will take a scan of each side and keep the pages in the right order?



Check out the Scansnap S510 by Fujitsu for a little over $400. (You can check it out on Amazon, though you won't get the best price there, or try the Fujitsu site.) I have the S500. It works very well and comes with good software. It scans two sides at once & you can load up to 50 pages of 20# paper. The better the paper quality (and the larger) the better the final results. It can scan up to 1200 DPI in B&W but I've found that 600 DPI is the best compromise between scan quality & speed.

When not in use it has a very small foot print. It is not TWAIN compliant. Output (as I use it) is searchable PDF. I use Nuance's PDF Converter Assistant to create a RTF file for editing.

The only problem I have had was with very poor paper quality in some cheap paperbacks. That resulted in multiple page feeds on a few occasions but mainly it had numerous OCR errors due to the ink bleeding during the printing process.

If you don't mind destroying the book (i.e. taking the pages apart), I highly recommend it.

Gideon
01-23-2008, 01:58 AM
Aru makes a great point, the OpticBook may be a bit of a unitasker, but it's brilliant for scanning books.

In preparation for getting my Sony Reader I went ahead and scanned one of my books. I used to do this when I had a tablet PC so I had some experience. Moving them from OCR'd PDFs to a text format was really where the hard bit came in.

If you can afford to spend the money on the OpticBook (http://www.amazon.com/gp/redirect.html?ie=UTF8&location=http%3A%2F%2Fwww.amazon.com%2FPlustek-Opticbook-3600-Scanner-Conversion%2Fdp%2FB00065KA72%3Fie%3DUTF8%26s%3Delectronics%26qid%3D1200931677%26sr%3D8-2&tag=cityofdoors-20&linkCode=ur2&camp=1789&creative=9325) (a bit under 300 at Amazon, I believe) it is the single best investment you can make in this area - you can, as someone mentioned, scan very quickly and watch a movie at the same time.

The next part is the OCR. This is where it gets tricky, as most OCR programs will absolutely make a wreck of things. I would use greyscale here, btw... in my experience, it comes out better than black and white. Your mileage may vary.

Depending on your platform, you'll have a few options available to you. Most of the free ones I've tried are crap. The one that comes with Adobe Acrobat is average, and the best I've used is OmniPage Pro (but it's hard for an individual to get hold of and very expensive. Maybe your school or business has it.) Once you OCR it into text, the laborious part is going through and cleaning it all up.

The book I made took me about 5 hrs all around, I'd say - but this was a test run, and so there were lots of false starts. I imagine it'd take me about 2-3 hrs now, for an average sized book, and I'd call it worth it.

I plan on writing a tutorial about this once I nail down some fine points. In the meantime, I suggest you look here - it's aimed at Tablet PC users, but there is an enormous amount of useful material here on the subject.

OpticBook Tutorial (http://www.studenttabletpc.com/2005/01/opticbook_3600_and_scanning.html#more)(other methods are mentioned as well on other pages here)

aru
01-23-2008, 06:28 AM
Hi Gideon, my Opticbook 3600 came with a complete software suite including OCR, effectively a turnkey system including ABBYY finereader Sprint, Presto Page Mgr etc. After I installed the software, everything else was automated (except the proofreading :) ).

There is a post already that describes the scanner (which btw enticed me to buy it) http://www.mobileread.com/forums/showthread.php?t=9666&highlight=opticbook
You may want to build on that.

stxopher
01-23-2008, 10:11 AM
One thing to remember if you are looking at the Plustek scanners is not to confuse the OpticBook with their new Book Reader. It looks exactly the same but there's a $300 price difference. If you didn't know there were two appliances from the same company with the same case, photos and basic purpose (scanning books) you might freak slightly and stop looking.

The new Book Reader has a primary focus more on saving the pages as txt, PDFs, PDF text and audio files. (Yea, that last one was audio files. MP3 and WAVs to be precise.) It seems as if it were designed more for keeping the printed word readable for those of us with failing sight, whereas the OpticBook's mission was the saving and shifting of printed information.

Between the two, the OpticBook series is still the best bet for most of us scanning books. It's fairly fast, easy and simple at what needs to be done. Still, I sure would like to see the Book Reader in action. Ummmm, making my own audio books for the commute. (No, no, NO! Shut up, little voice in my head with no financial sense and a high gadget lust! Shut up! Need more coffee to drown out the voice!)

philodox
01-23-2008, 11:10 AM
The paper was so bad that the first page actually got shredded in the scanner, causing a paper block and necessitating a dismantling of the device to get at the pieces.

Yikes, that is something to keep in mind then. :eek:

Do you have Acrobat Pro? I didn't know it did OCR!?!

I'm actually not sure of the exact version that I have, but I can check when I'm at home. It does have OCR though, that I'm sure of.

Don't forget the Plustek Opticbook 3600.

Never heard of it, I'll do a search and see what I find. Thanks. :)

Check out the Scansnap S510 by Fujitsu for a little over $400.

Cool, I'll check that out. :cool:

Thanks for the info and tutorial for the Opticbook Gideon. ;)

Gideon
01-23-2008, 01:18 PM
Aru-
I forgot about the OCR support it came with. I always used Acrobat Reader so the only software I used was the actual scanning software. I may need to give it a go though, perhaps it's better than OmniPage (and doesn't involve me hauling my stuff to someone with that program!)

snookums
01-29-2008, 03:05 AM
I hear a lot of people here saying that OCR isn't that good. I've found that OCR can be brilliant if you know what you are doing. I feel that OCR gets a bad rep because people don't realize the real magic is in the scanning.

Tip: Scan in RAW format. When you normally scan the data from the scanner is processed with your settings and excess data is discarded. RAW saves all of the data that the scanner gathered. Afterwards you can change settings and see what the result would have been if you had scanned with them. This is especially useful for the first few images where you are trying to find the ideal color balance.

Tip: Scan in Black and White and find the ideal color balance before starting. The color balance is very important. You don't want too much contrast from your scan because that will bring out speckles in the paper that will throw off the OCR software. This is counter-intuitive because you probably want to jack up the resolution and contrast to catch all of the detail in the book. Don't. Scan at 300 dpi and set the color or white balance so that you are only getting the text and not the texture of the page.

Tip: Make it straight. OCR software is built to handle horizontal lines of text. If there is more than a moderate slant in the way that you were holding the page over the scanner, it will spit out garbled text. Some of the more expensive OCR packages offer the ability to rotate text, but it's best just to hold the paper as straight as possible when you are scanning. That can be harder than you think when you are scanning a bound book.
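
To put numbers on the resolution and color-depth advice, the sketch below works out the pixel dimensions and the uncompressed size of one scanned page. The 8.5" x 11" page and the three bit depths (1-bit black and white, 8-bit grayscale, 24-bit color) are assumptions chosen for illustration; files saved as TIFF, PDF or JPEG will be smaller after compression. The function name `scan_size` is made up.

```python
def scan_size(width_in, height_in, dpi, bits_per_pixel):
    """Return (width_px, height_px, uncompressed megabytes) for one page."""
    width_px = round(width_in * dpi)
    height_px = round(height_in * dpi)
    size_mb = width_px * height_px * bits_per_pixel / 8 / 1_000_000
    return width_px, height_px, size_mb


for label, bits in (("black and white", 1), ("grayscale", 8), ("color", 24)):
    for dpi in (300, 600):
        w, h, mb = scan_size(8.5, 11, dpi, bits)
        print(f"{label:>15} at {dpi} dpi: {w} x {h} px, "
              f"about {mb:.1f} MB uncompressed")
```

Doubling the dpi quadruples the pixel count, and color carries 24 times the raw data of a 1-bit scan, so both settings have a large effect on scan time and file size.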

mphuie
01-29-2008, 07:50 PM
As for the GB size, my PRS-505 changes to the next pic as quickly as it flips between pages in a PDF. The zoom works MUCH better on JPEGs than it does on PDFs. I like JPEGs better than PDFs on the PRS-505.


You don't even OCR the pictures, you actually view them on your Sony? Is it even possible to read textbook sized pages scaled down to an ebook screen? You'd have to manually zoom in and pan around to read anything :blink:

Execution sounds highly flawed.

Gladtobemom
02-06-2008, 01:38 AM
I've put about 30 of my technical references on my tablet PC.

We prepared a little room by installing two daylight ceiling fixtures (each with 4x4ft. daylight bulbs). Then DH put hooks on the ceiling and grommets on a king sized white sheet--and slung it up to tent under the lights.

He deconstructs the books for me by taking the spines off and trimming out the signatures and the sewing. He tries to cut the pages as close to the center of the book as possible.

Then I photograph them with my Pentax K100D (I bought this camera because it takes all my old Pentax lenses).

DH and I can do the photography on a 1700 page text in about 8 hours. Yes it's time consuming. Then I make an html web page of them and turn them into a PDF or a Mobi book. It works great.

I have all the texts I need for reference and teaching in my tablet PC.

I also have them in my little VAIO TR2A.

Total outlay in money, about $50 for the fixtures and lightbulbs, maybe $10 for the hooks and grommets (had the sheet). The camera was about $500, but I bought it for other reasons.

It is an investment in time. I am NOT distributing these and I own multiple copies. One advantage, I took pictures of the ones with my notes in the margin and linked each page of the clean version with its annotated version.

I've also put the 3 textbooks that I wrote on Mobi and freely offer the copies to students (after they've bought a copy) in class. I just note it on the copyright page of their copy.

Yep, I destroy the books, so far I've been keeping the copyright pages, pages 16, 99, and the cover. Just to prove that I "own" a legal copy.

Iain
07-08-2010, 12:25 PM
I've just posted a similar question on another thread (before I found this).

Basically, I have 5000 paperbacks and want to scan them. What's currently the most reliable ADF scanner (ideally duplex) for this and what software would you recommend?

Iain

nyrath
07-17-2010, 09:30 AM
Based on recommendations from this forum, I got a Plustek OpticBook 3600. I've scanned four paperbacks so far, and it has worked reasonably well.

However, I have read reviews that suggest the bulb in the scanner tends to burn out quickly. Though those reviews were several years old.

The main problem I found is that some paperbacks print so close to the book spine that occasionally a couple of letters get clipped from the words. This is not a problem with hardbacks or larger books.

The bundled OCR package seems to work as well as the $100 OCR program I bought years ago (TextBridge Pro 9.0). About one mis-recognized word every four pages or so.

It saves all the scanned pages on your hard drive, so you could use another OCR program if you wish. I have a one terabyte external hard drive so space is not an issue. I scan grayscale 300 dpi TIFF format, so an entire paperback can take up 400 meg or so. Of course, once you've done OCR, you can delete all the TIFF files.

The time consuming part is the post production. I scan, use OCR, it loads it into Microsoft Word, and I save it as filtered HTML (I want to keep all the italic and bold formatting). I use a text processor (UltraEdit) to strip out all the <SPAN> tags, and turn all the <P attribute1="xxx", attribute2="xxx"... tags into <P> tags. I use Calibre to turn the HTML into ePub. Then I use Sigil to put <h1> tags on the chapter headings (which generates the table of contents), manually strip out the footers/headers that say NOVEL NAME page x, and manually correct any spelling mistakes.
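For anyone who would rather script the tag clean-up than do it by hand in a text editor, here is a minimal sketch in Python. It is not the UltraEdit procedure described above, just an approximation of the same two steps, and the function name `clean_word_html` and the sample string are made up for illustration. It assumes no attribute value contains a literal ">" character.

```python
import re

def clean_word_html(html: str) -> str:
    """Strip <span> tags and reduce <p ...> to a bare <p> (illustrative only)."""
    # Remove opening and closing span tags but keep the text between them.
    html = re.sub(r"</?span\b[^>]*>", "", html, flags=re.IGNORECASE)
    # Replace any opening paragraph tag, with or without attributes, by <p>.
    # The \b stops the pattern from touching tags such as <pre> or <param>.
    html = re.sub(r"<p\b[^>]*>", "<p>", html, flags=re.IGNORECASE)
    return html

# Hypothetical fragment of Word's "filtered HTML" output.
sample = ('<P class=MsoNormal style="margin:0">'
          '<SPAN lang=EN-US>It was a <i>dark</i> night.</SPAN></P>')
print(clean_word_html(sample))
```

The italic and bold tags are left alone, which is the point of saving as filtered HTML in the first place.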

I can get a paperback up to the Sigil step in an hour or two, but proofreading and correcting can take quite a long time.

HarryT
07-17-2010, 11:12 AM
I can get a paperback up to the Sigil step in an hour or two, but proofreading and correcting can take quite a long time.

I'm afraid there are no shortcuts to thorough proof-reading. I'm in the process (and have been for a couple of years now) of creating a thoroughly-proofed "complete works of Dickens" here at MR. Each novel takes me about 2 months to proof-read, working at it a couple of hours a day. But that's proofing at the "every comma correct" level, which perhaps isn't required for the average paperback.

Franky
07-23-2010, 10:53 AM
i've given myself a Plustek OpticBook 3600 plus. big word for a small scanner. i did a couple of books and i'm satisfied with the results. not bad for such a simple scanner. it takes about 50 min to scan a book of 260 pages. that's the time-consuming part. the negative part is that the PDF is about 10MB big. that's something i try to change. with calibre you're able to convert into ePub and add the front-page to it.

nyrath
07-23-2010, 09:09 PM
I save my scanned books to Microsoft Word/WordPad, not to Adobe. This turns them into text. They wind up being about half a megabyte in size.

charleski
07-23-2010, 09:54 PM
I have 5000 paperbacks and want to scan them.

Unless you have a staff of 100 people ready to work on this project full-time, the best advice I can give you is to forget it.

It's fair to say it takes around an hour to scan a book to image format (no OCR, no conversion, no corrections). That means it will take you over 208 days, working 24/7 around the clock without any breaks, to turn your collection into a huge number of jpegs. It will take several multiples of that time to turn that stack of jpegs into something that is readable, depending on the amount of proofreading you perform.

If you have a few highly-prized books that are out-of-print and unlikely to be released as ebooks, then scanning and converting them would be a worthwhile project that you could perform in a few months of spare time. If your goal is to scan an entire library on your own before you die of old age, then you're chasing a rainbow.
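The 208-day figure is easy to check. The sketch below redoes the arithmetic with the numbers from the post (5000 books, about one hour each); the two-hours-a-day case is purely an assumed example of a spare-time pace, not something anyone in the thread proposed.

```python
BOOKS = 5000
SCAN_HOURS_PER_BOOK = 1.0   # rough figure from the post above

total_hours = BOOKS * SCAN_HOURS_PER_BOOK
days_nonstop = total_hours / 24

# Assumption for illustration only: two hours of scanning per day.
HOURS_PER_DAY = 2
years_part_time = total_hours / HOURS_PER_DAY / 365

print(f"{days_nonstop:.1f} days of non-stop scanning")            # 208.3
print(f"{years_part_time:.1f} years at {HOURS_PER_DAY} hours a day")  # 6.8
```

Neither number includes OCR, conversion or proofreading.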

nyrath
07-26-2010, 03:23 PM
Agreed.

I can scan in a 400 page paperback in about two hours, takes about 15 minutes for the OCR program to convert it to text.

Doing an exceedingly rough proofreading job on it can take a week. Doing a perfect job can take months.

Iain
07-30-2010, 05:24 AM
I'm still in R&D mode.

I have bought a book guillotine and a Fujitsu fi-6130 scanner.

I'm still evaluating software. I very much like FineReader, but it seems to have less automation (at the bottom end of the price range) than OmniPage.

Next step is to write some scanning software which will pull in a book at a time and check that it has the right number of pages. I am astonished at the quality of the ADF on this scanner. I don't think it misfeeds at all.

But to my title. At 600dpi, it's taking something like 1 second per page (probably half that) and I'm loading pages in chunks of 50 (100 sides). So scanning a book takes 4-5 minutes. The problem is it takes that in 3 - 5 chunks which is exactly the wrong timing. I plan to work whilst this is happening and reckon I can handle something as mechanical as throwing paper in a hopper without too much distraction. However, I'm concerned about this and am considering a robotic device which will take batches of pages from a stacker of some kind under control of my scanning program. At the moment I'm looking at (the equivalent of) a Radio Shack robotic arm and a hopper made of balsa wood - just to show the spirit of Heath Robinson still lives over here in Blighty!

So if I can bear the tedium (and with a little bit of overhead for slicing books up and management) in theory, I could do 50 books a day (so the whole lot in 6 months).

FineReader takes roughly the same time to process as the scanning does (on a quad core machine, at least), though OmniPage seems a bit slower. However, providing I can automate them (ideally without paying too much for the privilege), then they can run on overnight and tie things up.

Needless to say, this is destructive, which may not appeal to many.

I'll keep you all posted!

Iain

nyrath
08-01-2010, 08:19 PM
The bundled OCR package seems to work as well as the $100 OCR program I bought years ago (TextBridge Pro 9.0). About one mis-recognized word every four pages or so. But sometimes it loses entire sentences!
Nope, I was wrong. The bundled OCR program works just fine, it does NOT lose entire sentences.

What happened was I was doing some post-processing, and my poorly formed search-and-replace was deleting the sentences. The OCR was fine, the lost sentences were my fault.
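The post does not say which pattern was at fault, so the following is only a guess at the kind of mistake involved. A greedy wildcard is one common way a search-and-replace ends up eating whole sentences. The sample HTML is invented.

```python
import re

html = ('<p><span class="a">First sentence.</span> Second sentence. '
        '<span class="b">Third.</span></p>')

# Greedy: ".*" runs from the first "<span" to the LAST "</span>",
# so everything in between, text included, is deleted.
greedy = re.sub(r"<span.*</span>", "", html)

# Safer: match only the tags themselves and leave the text alone.
tags_only = re.sub(r"</?span[^>]*>", "", html)

print(greedy)     # <p></p>
print(tags_only)  # <p>First sentence. Second sentence. Third.</p>
```

Comparing the word count before and after a replace is a cheap way to catch this sort of thing early.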

Lady Fitzgerald
08-03-2010, 08:18 PM
I'm using a Fujitsu Scansnap s1500 ADF scanner to digitize my books. Since I'm doing over 1200 books, I'm just making PDF copies of the pages, concatenated into a single PDF by the scanner software via Adobe Acrobat Standard (included with the scanner) and dispensing with OCR. I have to cut the spines off the books to do this. I started using a bandsaw to do this but that process left a friable cut edge that shed paperdust like a long haired dog sheds hair in the spring. No amount of cleaning could get rid of that dust. Glue from the binding also got onto the blade and tires and had to be cleaned off frequently. The dust was so bad it got into the scanner cameras and had to be professionally serviced, fortunately under warranty (although it took a bunch of esplainin'). I'm now using a guillotine type paper cutter that can handle up to 1 1/2" at a whack (thicker books have to be split in half first). That has dramatically reduced the dust. I'm using a small vacuum to remove the dust that does get on the scanner surfaces.

I first scan the covers, inside and out, to individual PDFs using a color setting. Then I scan the book pages themselves using the black and white setting at a light setting to help "filter" out specks on the page to a single PDF. The B&W setting also eliminates paper yellowing and gives fairly clean, clear text on a white background (heavily illustrated books would require grayscale or color settings which would give somewhat less desirable results). I then use Adobe Acrobat 9 Standard (came with the scanner) to insert the covers into the text PDF. I also scroll through the book to be sure pages were scanned in the correct order (human error happens) and that there aren't any pages that are oversized due to added margins (happens rarely; they are easily cropped in Acrobat). The whole process averages 15 minutes per book.

The scanned books read fine in Adobe Acrobat Reader or in Acrobat Standard (I use the latter since it works just fine and there is no point in having a redundant program). I found the JetBook Lite has settings that allow the books to be read on a portable e-book reader (it fits in my purse) with some compromises. I set it to landscape and Fit to Width. That eliminates side margins. My largest books (roughly A4 page size) are readable that way although the print size is a bit fine (and I wear trifocals). Smaller books are much easier. Scrolling down each page took a bit of getting used to because each frame overlaps the previous one a bit and the last frame may overlap considerably. I find the advantage of portability outweighs the disadvantages. If a full page has to be viewed in its entirety, a much larger viewer, like a tablet, would be needed.

Of course, one could apply OCR (at least an hour), run a spell checker, check the spell checker, then edit for scanning errors not picked up by the spell checker. Since I'm such a nitpicker, it would take me as long to edit it as to read it. I don't have that much time (or patience; having ADD makes it worse) so I'm content with the PDFs. They are readable with the right readers.

Since the original books are being destroyed in the process, this is a media change and should pose no legal problems.

nyrath
08-04-2010, 02:39 PM
Lady Fitzgerald, what is the average file size of your PDF ebooks?

Lady Fitzgerald
08-04-2010, 11:42 PM
Huge. Off the top of my head, I would say 15MB. Granted, that is much larger than typical e-books; however, I'm estimating that, once I finish scanning my p-book library, my "e-books" will only occupy 20-30GB on the 1TB drive in my desktop computer (I still have 768GB free space on the drive). Even my mp3s (roughly the equivalent of 425 CDs) occupy 36GB (they are rather high quality rips). Hard drive space is cheap nowadays. Once I move the innards of my present computer to the new case I've been prepping, I'll have room for 5 more hard drives. With that, I have the potential of having more room than I'll ever use anytime soon, even after I start ripping my DVDs.

Obviously, no e-book reader is likely to be able to hold all my books but I don't need for them to. A 1GB card can hold roughly 25 books, more than enough to keep me busy for months since I will read from a reader only when away from the house. I use my 32" TV screen to read from when at home (a wireless mouse makes a fairly decent remote). I had been reading my p-books before cutting them up but I'm finding reading from the computer and e-book reader (currently, the JBL is the only one working) to be so convenient, I'll probably chop and scan the next one in my unread stack and read it from my TV or reader (yes, that means reading two books at the same time; doesn't bother me).

Mr. Dalliard
08-05-2010, 01:07 AM
If you are prepared to rip the spine off the book, your task will be a lot easier, otherwise it is a lot of work.

It is far from being impossible though.

Lady Fitzgerald
08-05-2010, 02:23 AM
If you are prepared to rip the spine off the book, your task will be a lot easier, otherwise it is a lot of work.

It is far from being impossible though.

Not sure who you are addressing this to. On paperbacks, I just cut the spine off. Takes 15-30 seconds. On hardbacks, I have to cut the covers off before cutting the spines off. Cutting the covers off is easy, just run a knife over the "hinges" formed by the endpapers. If there is a corded ribbon (I forget the technical name and I'm too lazy to look it up right now) at each end of the spine, I rip those off. If the spine has been rounded, I can usually "break" it in several places by bending it back sharply which will usually let me flatten it enough to cut it off in the guillotine. If it is too stubborn to flatten or the book is too thick for the guillotine (it will handle only 1 1/2"), then I "split" the book into sections by "breaking" the spine and scoring it with a knife enough to let me snap it apart (same goes for really thick paperbacks). At worst, it only takes a minute or two to prepare a hardback for the guillotine. It takes about 15-30 seconds to actually cut the spine off.

nyrath
08-05-2010, 11:58 AM
Huge. Off the top of my head, I would say 15MB. Granted, that is much larger than typical e-books
But you do not have to proof-read. Which is no small consideration.

For the record, my OCR eBooks seem to average at about 0.5MB.

Lady Fitzgerald
08-05-2010, 12:59 PM
But you do not have to proof-read. Which is no small consideration...

True that!

For the record, my OCR eBooks seem to average at about 0.5MB.

That seems to be about average for the few ePubs I do have.

Keep in mind reading the PDFs is a bit of an awkward compromise. A larger reader, like a tablet would be better but portability wins out in my case. A tablet won't fit in my purse but a reader will.

Iain
08-29-2010, 03:24 PM
Firstly, thanks for the comments I've read on this forum and people who've answered my questions.


I've finally completed starting my digitising task! This whole thing has turned from a task into a fairly complex project, with a good deal of custom written software. And that's before I've digitised more than a few books!

I've blogged about this (horrid word and this is one of my first attempts at blogging) in some detail here (Iain's blog (http://iaindownsconsulting.spaces.live.com/blog/cns!EAFBB89B2261F6D8!128.entry)) but the short form goes like this.

I start off by cutting the spines off with a guillotine and counting the pages.

I've written a scanning program which talks to my Fujitsu fi-6130. It captures the ISBN (bar code scanner or human entry) and finds the publication details (isbndb.com). I enter the subject and the number of pages and start the scan.

The program scans the first pages (the cover pages) in colour and the rest in monochrome. I do, of course, have to reload the hopper every minute or so, but that's quick and not too distracting. On completion, the tiff file (500MB - 2GB!) is queued for OCR and so on. If there are problems, then you can edit the tiff and delete pages or add new scans.
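Since the ISBN can be typed in by hand, it may be worth checking it before looking anything up. The post does not say whether the scanning program does this; the sketch below is simply the standard ISBN-10 and ISBN-13 check-digit test, and the function name `is_valid_isbn` is made up.

```python
def is_valid_isbn(raw: str) -> bool:
    """Return True if raw is a well-formed ISBN-10 or ISBN-13."""
    s = raw.replace("-", "").replace(" ", "").upper()

    if len(s) == 13 and s.isascii() and s.isdigit():
        # ISBN-13: weights alternate 1, 3, 1, 3, ...; total must divide by 10.
        total = sum(int(c) * (1 if i % 2 == 0 else 3) for i, c in enumerate(s))
        return total % 10 == 0

    if (len(s) == 10 and s.isascii() and s[:9].isdigit()
            and (s[9].isdigit() or s[9] == "X")):
        # ISBN-10: weights run 10 down to 1; a final "X" counts as 10;
        # total must divide by 11.
        values = [int(c) for c in s[:9]] + [10 if s[9] == "X" else int(s[9])]
        total = sum((10 - i) * v for i, v in enumerate(values))
        return total % 11 == 0

    return False

print(is_valid_isbn("978-0-306-40615-7"))  # True
print(is_valid_isbn("978-0-306-40615-8"))  # False (wrong check digit)
print(is_valid_isbn("0-306-40615-2"))      # True
```

A check like this catches most single-digit typos; it says nothing about whether the book is actually in the lookup database.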

The OCR processing side uses FineReader 10. I'm controlling FineReader through AutoHotKey so I don't have to interact with it. FineReader processes the document and saves it in word, html and text formats.

The word document is processed (again by a program of my own devising) and generates an ePub file which actually looks pretty good (though I say so myself).

Finally all the book details and the text are put in a database so that I can find books in a variety of ways.

That's the short form! The blog has a good deal more detail and I would welcome comments!

In particular, having spent a good deal of time writing code for this, I'm wondering if there is an opportunity to commercialise this.

Do you think people would be interested in a book digitisation service (I think I would have to charge about $2 a book and the book would be destroyed).

Do you think people would be interested in a more or less off the shelf system which could efficiently turn their mouldering paperbacks into pristine eBooks?

Let me know here or privately at iain AT idcl DOT co DOT uk

HarryT
08-30-2010, 06:22 AM
Do you think people would be interested in a book digitisation service (I think I would have to charge about $2 a book and the book would be destroyed).


I'm sure they would be, but I'm not sure about the legality of it in the UK. Format shifting is NOT legal here.

Lady Fitzgerald
08-30-2010, 10:17 AM
You do not edit after OCR?

On average, how much time did you spend on each book?

Iain
08-31-2010, 07:10 AM
Thanks for your comment Harry - I'd not started this off as a commercial venture, so had not researched it. I see you are quite right and the whole thing is a complete mess.

It would appear, however, that I could manufacture and sell hardware and software which 'format-shifted' books without infringing any law. The user of the equipment would be in breach (if they cared!) but not I.

There does seem some indication that the EU are moving, Leviathan-like, to some resolution of this and I may still be alive when they manage to get there!

Iain
08-31-2010, 07:36 AM
You do not edit after OCR?

On average, how much time did you spend on each book?


I do not edit after OCR. It's still early days and I'm refining the Word->ePub transformation. Also, it takes a good deal longer to READ the book than the whole rest of the process.

I'll report when I've read a dozen or so books, but so far I seem to have almost no character mis-recognitions. I'm talking of a handful in a book.

The other flaws I'm encountering may be artefacts of my word->ePub translation or of the OCR. I'm not sure which, yet. I'm expecting to be able to fix many of these either by fixing my code (:) ) or by applying a bit of intelligence to the process.

So far (and this is NOT statistically reliable), I'm seeing a missing space about every 4 pages, a space added after a correctly- hyphenated (sic!) term about as often and a line break in a paragraph every 10 pages or so (I think I know what's causing this and *may* be able to fix it).
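Two of those three flaws look mechanical enough to patch with a regular expression, whichever stage of the pipeline introduces them. The sketch below is a rough heuristic, not a description of the actual conversion code, and `tidy_ocr_paragraph` is a made-up name. Missing spaces are not handled, since they need a dictionary, and the hyphen rule will wrongly join deliberate constructions such as "pre- and post-war".

```python
import re

def tidy_ocr_paragraph(text: str) -> str:
    """Heuristic clean-up of two common OCR/conversion artefacts."""
    # "well- known" -> "well-known": drop a space that follows a hyphen
    # sitting between a letter and a lowercase letter.
    text = re.sub(r"(?<=[A-Za-z])- (?=[a-z])", "-", text)
    # Join a line break that falls mid-sentence: the previous character is
    # a lowercase letter or comma and the next line starts in lowercase.
    text = re.sub(r"(?<=[a-z,])\n(?=[a-z])", " ", text)
    return text

print(tidy_ocr_paragraph("a well- known\nstory about the sea."))
# a well-known story about the sea.
```

Anything this changes is worth logging, so the false positives can be reviewed later.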

Actually, I'm delighted with the quality, though as I mentioned in my post I'm not the best person to proofread things.

As far as time is concerned, I've been doing some Hammond Innes this morning. It took me about 13 minutes to trim a dozen books. They are almost consistently sized and quite thin (280 pages or so) so they are about the easiest of all books to slice.

I've scanned about two whilst I've been writing this. One of my main objectives is to be able to scan whilst I work. If there are no issues with the scan, then it takes probably a minute of my time for a book this size to scan (bar code) the ISBN, enter the pages (and subject) and feed the hopper.

Issues (I seem to be fumble fingered this morning! - I've been putting the covers in the wrong way round) add some minutes.

I bought a Thomas Hardy (for 5 pence!) at a car boot sale yesterday and plan to scan that and compare it to a Gutenberg version to get a more formal comparison. At some point!
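A comparison like that can be roughed out with Python's standard `difflib`. This is only a sketch of one way to do it: the function name is invented, it assumes both texts are already plain text of the same edition, and the figure it returns is an approximate word mismatch rate rather than a formal word error rate.

```python
import difflib

def word_mismatch_rate(reference: str, ocr: str) -> float:
    """Share of reference words involved in a difference from the OCR text."""
    ref_words = reference.split()
    ocr_words = ocr.split()
    matcher = difflib.SequenceMatcher(None, ref_words, ocr_words, autojunk=False)
    errors = 0
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            # Count the longer side of each replaced/inserted/deleted run.
            errors += max(i2 - i1, j2 - j1)
    return errors / max(len(ref_words), 1)

reference = "The quick brown fox jumps over the lazy dog"
ocr_text = "The quick brovvn fox jumps over thelazy dog"
print(f"{word_mismatch_rate(reference, ocr_text):.3f}")
```

In the example, one misread word and one missing space touch three of the nine reference words. Differences in edition, spelling or front matter will inflate the figure, so the two texts need trimming to the same span first.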

Hope this is interesting...

Lady Fitzgerald
08-31-2010, 10:12 AM
It is interesting for me since I'm in the process of digitizing my book collection.

No matter how good an OCR program may be, it will still take a fair amount of time to run. I have the version of ABBYY FineReader that came with my Fujitsu ScanSnap s1500. I've only used it to give me searchable PDFs of tech magazines I have (obviously, no editing is required since there is no visible text generated other than the image of each page taken by the scanner). It takes around 30 minutes to an hour (I don't remember exactly) for the OCR to run on a 100 page magazine in addition to cutting and scanning the magazine (fortunately, I do not have very many magazines). Without OCR, I can scan, save, and catalogue 3-4 books per hour if I'm paying attention (usually I'm not; having ADD doesn't help). Since I have over 1500 books to do and want to finish before the end of the year, OCR just isn't an option, even without editing. I could always run my PDFs through OCR later but I don't plan on it. I'm able to easily read all but the largest books with the smaller print on a Jetbook Lite. Even the large page, small print books can be read without eyestrain on the JBL but it's a bit more awkward to scroll and good lighting becomes more critical. Using the JBL instead of a larger reader is a tradeoff to gain portability (it fits in my purse).

You said that your OCR process has few errors. How well does it deal with page headers and footers and page numbers? How about drop caps at the beginning of a sentence? Some of those use pretty intricate, decorative fonts. How about when fonts change within a book, such as bold text or italics? Is your OCR process able to replicate or accurately read those? Often, certain passages in a book have increased margins to denote a quoted passage, such as a paragraph from a letter. How does that get handled? Many fonts used in books have characters that are similar or identical to others, such as the upper and lower case j being identical or the letters l and I being similar to each other and the number 1 (sometimes even identical). How well is that handled? How do images get handled? You said you can tolerate some mistakes. How many is some? Unfortunately, I would find any mistakes very distracting and annoying. For me, editing would take about as long as it would take to read the book. I can't spare even 30-60 minutes just to run the OCR because of the large number of books I have and limited time available, even considering I'm retired now.
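On the l / I / 1 question, one cheap check after OCR is to list every word that mixes letters and digits, since that is where a "1" standing in for an "l" usually shows up. This is a rough sketch with a made-up function name, not a feature of any OCR package mentioned here. It will also flag legitimate tokens such as "1st" or "A4", and it cannot see an "I" swapped for an "l".

```python
import re

# A whole word containing at least one letter and at least one digit.
SUSPECT = re.compile(r"\b(?=\w*[A-Za-z])(?=\w*\d)\w+\b")

def suspicious_tokens(text: str) -> list:
    """Return words that mix letters and digits, for manual review."""
    return SUSPECT.findall(text)

print(suspicious_tokens("He had 12 apples and a1l of them were sma11."))
# ['a1l', 'sma11']
```

It only narrows down where to look; each hit still needs a human decision.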

I wish getting an occasional cover wrong way around was my only operator error. I have been known to insert a set of pages in the ADF the wrong way. If the pages were merely upside down, it would be easy to correct in Adobe Acrobat 9 but if I get the order reversed, it's much faster to rescan those pages, then replace the incorrect pages with the newly scanned ones, again using Acrobat.

How many cuts have you made with your guillotine? Mine broke after only 250 books. Although I'm currently doing battle with Amazon over it since the guillotine they sold me apparently is an inferior knock off, I would consider spending the extra money to get a more reliable one.

My guillotine has a different clamping mechanism than yours but the fence is the same as yours. I also had problems trying to figure out where to set it because of no easy way to see where the cut will occur. I found the easiest way to align the fence (which also kept my fingers away from that vicious cutter blade) was to leave the blade dropped after the previous cut (I also store it that way), slip the book into place with the spine against the blade, lower the clamp until it lightly touches the book (but still allows free movement), push the fence tightly against the book until the pages are flush with the fence face, then tighten the clamp on the fence. I then raise the blade and lock it, push the book away from the fence slightly, slip a shim or two (thin pieces of cardboard; the number and thickness based on previous experience) between the fence and the book, then pull the book back against the fence. I then tighten the clamp a bit more, use a thin tool to gently bump the spine snug against the spine (the idea of the tool is to avoid getting my fingers near the blade; I almost lost the tip of a thumb to it when I first got it), then finish tightening the clamp and make the cut. I found this procedure goes quickly, is safe, and is more accurate than trying to eyeball where the cut is going to take place.

If a book has a very curved spine and the gutter margin is too small to comfortably accommodate the curvature when cutting the spine off, on hard backs (I strip the cover off hardbacks before cutting to avoid excessively stressing the guillotine), I try "breaking" the spine by folding it sharply back in several places to try and make it easier to flatten the spine. If that doesn't work (and on paper backs), I cut the book apart into several smaller pieces, which minimizes the curvature of each section of book, then cut each piece one at a time.

Iain
09-01-2010, 05:10 AM
OK. Lots of questions there. I'll try and get answers to all in.

Firstly, my books are mainly of the 'pulp fiction' variety so tend to be light on posh formatting. I'm also still tuning the whole process so there's the what is being done and what can be done.

For a paperback book the OCR process takes roughly the same time as the scanning process. Somewhere between 4 and 10 minutes. That is with the latest FineReader running on a quadcore machine, so I can see how it could get to be 30 mins on an older machine with an older version.

The system I've written makes the processing automatic so I can do it on another machine or even overnight.

The OCR does a good job of italic and bold changes. It should do well for margin changes (the information is there in the word doc), though I've not yet processed (or at least proofed) a book which uses this.

I think there are around half a dozen character misreads in the 300 page book I've just 'proofed' (though my disclaimer about my proofing skills remains!).

The more complex stuff which happens before and after the book (with decorative fonts and mixed up with graphics) can be a mess, so I would imagine anything complex in the middle will also be a mess. I'll look at dealing with the messes as I come across them!

I actually deliberately discard headers and footers. If you want pages to reflow as font sizes change then they aren't helpful. Having said that you've just made me realise I can use them to enhance chapter detection.

I suspect that I've been lucky with the books I've proofed so far and I also suspect I have a higher level of tolerance for errors!
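How the headers and footers get discarded isn't described, so the following is only one plausible approach: treat any first or last line that recurs on most pages, ignoring the page number, as a running head and drop it. The function names, the 50% threshold and the sample pages are all assumptions.

```python
import re
from collections import Counter

def normalise(line: str) -> str:
    # Replace digit runs so "THE WRECK 10" and "THE WRECK 11" compare equal.
    return re.sub(r"\d+", "#", line.strip().lower())

def strip_running_heads(pages, min_share=0.5):
    """pages is a list of pages, each a list of text lines."""
    counts = Counter()
    for page in pages:
        if page:
            # A set, so a one-line page is not counted twice.
            for key in {normalise(page[0]), normalise(page[-1])}:
                counts[key] += 1
    threshold = min_share * len(pages)
    repeated = {k for k, n in counts.items() if k and n >= threshold}

    cleaned = []
    for page in pages:
        body = list(page)
        if body and normalise(body[0]) in repeated:
            body.pop(0)
        if body and normalise(body[-1]) in repeated:
            body.pop()
        cleaned.append(body)
    return cleaned

pages = [
    ["THE WRECK 10", "It was dark.", "He ran."],
    ["THE WRECK 11", "She waited.", "Nothing came."],
    ["THE WRECK 12", "Morning broke.", "They left."],
]
print(strip_running_heads(pages))
```

A page whose first line is not the usual running head might be a chapter opening, which could be one way to use the headers for chapter detection, though that would need testing on real books.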

Thanks for the advice on the guillotine. That all sounds like a good deal of sense - I too have lightly touched the blade (I had to remove the guard to see what is going on) and found it astonishingly sharp! I wish my kitchen knives were that sharp.

I suppose I have it in mind that if there are serious problems in a book I can go back to the original and tweak the OCR. I've also thought about writing an editing eBook reader for the iPad to tweak the minor errors. However, I doubt I will ever have the time or energy to do this.

In a couple of weeks I'll have a much better idea of the quality and will keep you posted on what I discover!

Iain

Lady Fitzgerald
09-01-2010, 06:20 AM
Thanks, Iain.

Iain
09-02-2010, 04:49 AM
Just an update on that.

What I seem to be seeing is that character recognition is very accurate and most of the errors with spaces and line feeds I'm seeing are bugs in my conversion to ePub.

It also handles italics and bold and font size changes well.

However, at the moment it does not spot section indentation or justification changes. So some 'poems' are not inset and chapter headings not centered.

I may be able to get round this by using the more formatted output as a source, but haven't tried yet.

I'll keep you posted.

Iain