Best format for scanned books? - Page 2

AnemicOak · 03-04-2012, 08:10 PM

Quote:

Originally Posted by jmaejr

That SEEMS to be the consensus...at least the majority opinion here.

Just keep in mind only 195 people participated in that poll.

jmaejr · 03-04-2012, 09:11 PM

That is probably the number of people that are truly active on this forum... Either way, HarryT responded it was basically okay given those circumstances, and I merely asked why go to the trouble when one of the board's 'highest rated' members gives it a green light.

I wonder how many of the TOTAL members of this forum have the Harry Potter series in some e-format...

TechSarge · 04-03-2012, 05:22 PM

As the OP, I'd like to give an update:

Finished my first book a couple of weeks ago. It's a paperback of which there is no e-copy available (BTW folks, in this instance, scanning a book which you already own isn't piracy, it is fair use and legal. Same as making a backup copy of a music CD you own, or ripping said CD to MP3.).

I scanned all pages to TIFs, using an ancient Lexmark X1100 series AIO scanner I have here (I was very careful with the book, as I don't like flattening it out on that flatbed scanner). Pages were run through ScanTailor to straighten out any misaligned scans and to cut the double pages apart. Pages were then run through Adobe Acrobat 9 Enhanced's OCR function, with Clear Scan enabled. The OCR output was saved as html, as I didn't know how to save as xhtml then (do now). Files were then opened in Sigil, for editing, proofreading, etc.

I have to say that for this particular book, Acrobat's OCR engine sucks. It took me probably 36 hours of proofing to fix everything, as I had to read and re-read the book to catch all of the errors - everything from a single wrong letter in a word, to entire sentences missing from the text. Forget about italics, they were always wrong or nonexistent.

A few things I'd like to change:

Sigil did a good job formatting the things I thought it would choke on, such as the map at the beginning of the book. It did choke on line drawings at the beginning of each chapter, though, so I had to cut n' paste one from one of the original scans as a bitmap and use that for each chapter. Ugly, but worked.

The gobs of extra lines in the text have to go. Thankfully, I found out how to deal with this in Calibre, along with paragraph indentation. Sigil has no capacity for this, and it's a serious oversight, as it's touted as a friggin' editor! In this day and age, one shouldn't need to go into the code to do such obvious tweaking.
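For anyone who would rather do the extra-line cleanup outside Calibre, here is a minimal Python sketch, assuming the OCR export marks blank lines as empty `<p>` elements (whitespace, `&nbsp;` or `<br>` only). The function name `strip_empty_paragraphs` and the `INDENT_CSS` rule are made up for illustration; this is not what Calibre or Sigil does internally, and real OCR output may need a looser pattern.

```python
import re

# Paragraphs that hold nothing but whitespace, &nbsp; entities or <br> tags.
# The \b keeps the pattern from matching other tags that start with "p" (e.g. <pre>).
EMPTY_PARA = re.compile(
    r"<p\b[^>]*>(?:\s|&nbsp;|<br\s*/?>)*</p>\s*",
    re.IGNORECASE,
)

# Illustrative stylesheet rule (an assumption, not taken from any tool):
# no gap between paragraphs, first line of each paragraph indented.
INDENT_CSS = "p { margin: 0; text-indent: 1.5em; }"


def strip_empty_paragraphs(html: str) -> str:
    """Return html with empty paragraphs (and the whitespace after them) removed."""
    return EMPTY_PARA.sub("", html)


if __name__ == "__main__":
    sample = "<p>One</p>\n<p>&nbsp;</p>\n<p><br/></p>\n<p>Two</p>"
    print(strip_empty_paragraphs(sample))
```

Run it on a copy of the exported file first; a regex like this knows nothing about the rest of the markup, so it is only safe on simple OCR output.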

Sigil changes things in the book once you save it. I saved changes to Chapter Two FOUR TIMES (a simple justify center of the word "TWO" in the beginning of the file). Each time when I opened the book on my device, "TWO" was justify left instead of center. As it is the last noticeable error in the book, I said "screw it" and am leaving it as-is, as I'm not going to mess with it anymore.

If I didn't have the paper copy of the book here to proof the OCR against, I couldn't have finished this sane (and this was only a 250 page paperback!). If I was working solely from original scans, on only a laptop and not a multiple monitor setup, the constant flipping back and forth would have driven me nuts. The next few books are going to be much more challenging, with triple column text on each page, and/or lots of inserted line art or photos. The fonts are a lot older as well, which will (I'm sure) give Acrobat's OCR even more fits. I gotta either figure out how to improve Acrobat's accuracy, or get a different OCR engine.

I am very, very proud of the job I did on this e-book, though. It is as attractive to look at and to read as any commercially published e-book I've read.

Suggestions as to better software or changes to workflow are quite welcome. I'm starting on my second project very soon.

Keroberos · 04-03-2012, 09:47 PM

Quote:

Suggestions as to better software or changes to workflow are quite welcome. I'm starting on my second project very soon.

I would definitely recommend switching OCR software (Acrobat's OCR sucks). I use ABBYY FineReader Professional--$170, but worth every penny in my opinion (with training, I don't think I spend more than an hour or two spell checking). They have a cheaper express version for $50, but I don't know how good it is. There are free OCR programs out there, can't say how good or user friendly they are (I tried Tesseract with a GUI front-end but gave up).
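If the GUI front-end is what puts you off Tesseract, it can also be driven from a script. A minimal sketch, assuming the `tesseract` command-line program is installed and on the PATH, and that the cleaned page images sit in one folder as `.tif` files. The helper names and the folder names are hypothetical, and accuracy on old fonts or multi-column pages is not something this sketch can promise.

```python
import subprocess
from pathlib import Path


def tesseract_command(tif, out_dir, lang="eng"):
    """Build the command line that OCRs one page image to hOCR (HTML) output.

    Tesseract appends the file extension itself, so the output argument
    is a base name without a suffix.
    """
    tif = Path(tif)
    out_base = Path(out_dir) / tif.stem
    return ["tesseract", str(tif), str(out_base), "-l", lang, "hocr"]


def run_batch(scan_dir, out_dir, lang="eng"):
    """OCR every .tif in scan_dir, in file-name order."""
    Path(out_dir).mkdir(parents=True, exist_ok=True)
    for tif in sorted(Path(scan_dir).glob("*.tif")):
        # check=True stops the batch on the first page that fails.
        subprocess.run(tesseract_command(tif, out_dir, lang), check=True)


if __name__ == "__main__":
    # Hypothetical folder names; adjust to wherever Scan Tailor wrote its output.
    run_batch("scantailor_out", "ocr_out")
```

Building the command in its own function makes it easy to print and inspect before committing to a 250-page run.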

For scanning, I use digital camera based rigs like those described here, one for hardcovers and one for paperbacks and small hardcovers. I then batch crop the images with JPEGCrops, process them with Scan Tailor, OCR with FineReader, and export the text as html. Next I clean out all the junk code that FineReader can add (and I'm sure Acrobat does too) with Toxaris's excellent Word macro. Then I format the cleaned html into an epub with Sigil.

Toxaris · 04-04-2012, 02:28 AM

Quote:

Originally Posted by TechSarge

Sigil changes things in the book once you save it. I saved changes to Chapter Two FOUR TIMES (a simple justify center of the word "TWO" in the beginning of the file). Each time when I opened the book on my device, "TWO" was justify left instead of center. As it is the last noticeable error in the book, I said "screw it" and am leaving it as-is, as I'm not going to mess with it anymore.

Centering text can be a bit cumbersome sometimes, but usually that is due to the reading software. Sigil does some sanity checks before saving. If you use a style with the attribute 'text-align: center' it should work.
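As a rough illustration of the style approach, here is a minimal Python sketch that tags a chapter heading with a class and supplies the matching stylesheet rule. The class name `center` and the function `center_heading` are assumptions made up for this example, and it only handles a heading or paragraph tag that has no attributes yet. Whether the device honours the rule still depends on the reading software.

```python
import re

# Rule to add to the book's stylesheet; the class name is an arbitrary choice.
CENTER_RULE = ".center { text-align: center; }"


def center_heading(xhtml: str, text: str) -> str:
    """Add class="center" to the first bare <h1>-<h6> or <p> whose content is `text`."""
    pattern = re.compile(
        r"<(h[1-6]|p)>(\s*" + re.escape(text) + r"\s*)</\1>"
    )
    # count=1: only the first match (the chapter title) is changed.
    return pattern.sub(r'<\1 class="center">\2</\1>', xhtml, count=1)


if __name__ == "__main__":
    print(center_heading("<p>TWO</p><p>Body text.</p>", "TWO"))
```

Keeping the alignment in a stylesheet class rather than in an inline attribute gives an editor's clean-up pass less to rewrite on save.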

I personally save two formats: one ePUB, since it is an archive with files in an open format, and one PDF/A. The PDF/A contains both the scanned image and the OCR-ed text. That makes it easy to search the text and still be able to see the original image.
