02-09-2012, 02:36 PM | #1 |
Junior Member
Posts: 7
Karma: 10
Join Date: Feb 2012
Location: Florida USA
Device: Kindle 4 SO (Died), Kindle Fire HD 7"
|
Best format for scanned books?
I'm a bit new to the ebook world. I'm trying to make a few ebooks from some obscure, out-of-print books. The original formatting runs the gamut: All text, mostly text with some B&W line art (a map, etc.), a book with greyscale photos and text, to a coffee table type book with lots of colour photos inside the text. The text is sometimes standard, single column; but there are at least two books with double column text and one with triple column! This is a nightmare.
The books were all scanned on a flatbed scanner a few years ago and saved as TIFF files. Some are single page, some are double page. Some pages turned out OK, but some are horrible and require manual tweaking. I discovered Scan Tailor a few days ago and have run it on a few books which are good examples of the headaches above. I LOVE Scan Tailor! What I had started out trying to do a page at a time in Photoshop 5 a few years ago, with no experience, ST did in minutes across an entire book.
A few problems now, though. ST's output TIFFs for the colour photo coffee table book were over 10 times the file size of the originals (originals about 8 MB, output ~80-100 MB or more per file). Everything was set to 600 dpi, as that was what they were scanned at, IIRC. I had to run them through PIXresizer to get them to a manageable size again. The B&W output files were wonderful, though.
Not sure where to go from here. I thought so long about how to get the photos retouched that I never considered what to do once they were done! The obvious thing to do is to PDF them at this point, but I'm unsure about that. I would really like to read some of these books on my Kindle 4, so going straight to PDF isn't the best option. Shall I OCR, and spend a month proofreading? I'm also very concerned about being able to use the new ebooks in the future with little to no additional manual labour. I'm trying to get this done as quickly as possible, with as few steps as possible, but with good quality: archive quality is not necessary, but close to it is the goal.
I am using a Win7 PC to do this, with Scan Tailor "enhanced" 0.9.11pre, Adobe Acrobat 9 Pro Extended, and the latest Calibre 0.8.3x. |
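As a rough sanity check on those file sizes, here is a small sketch of how much raw pixel data a 600 dpi page holds with no compression at all. The 8.5 x 11 inch page size is an assumption (the thread does not give the page dimensions), and header overhead and row padding are ignored.

```python
def uncompressed_bytes(width_in, height_in, dpi, bits_per_pixel):
    """Raw pixel data for one page, ignoring file headers and row padding."""
    width_px = round(width_in * dpi)
    height_px = round(height_in * dpi)
    return width_px * height_px * bits_per_pixel // 8


# Assumed page size: 8.5 x 11 inches (not stated in the thread).
colour = uncompressed_bytes(8.5, 11, 600, 24)   # 24-bit RGB
grey = uncompressed_bytes(8.5, 11, 600, 8)      # 8-bit greyscale
bilevel = uncompressed_bytes(8.5, 11, 600, 1)   # 1-bit black and white

for label, size in [("24-bit colour", colour),
                    ("8-bit greyscale", grey),
                    ("1-bit B&W", bilevel)]:
    print(f"{label}: {size / 1_000_000:.1f} MB")
```

Under that assumption, a colour page comes to roughly 101 MB uncompressed, which is in the same range as the 80-100 MB output files, while a 1-bit page is only a few MB.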
02-09-2012, 04:51 PM | #2 |
Linux User
Posts: 2,279
Karma: 6123806
Join Date: Sep 2010
Location: Heidelberg, Germany
Device: none
|
If you want the best possible result - yes.
But if the scans are very good and you have decent OCR software (like ABBYY FineReader), you probably won't need all that much proofreading. If the scans are horrible, obtaining better scans might actually be less work than fixing the errors caused by bad-quality images...
Also, I proofread while I read the books. I keep an old-fashioned pencil and paper next to me, and when I spot an error I just write it down. Next time I'm on the PC I can correct the errors I found (and if it's a recurring error, correct all the others like it too, if I can find a search-and-replace regexp that works). That's only worthwhile for a book you might read again after some time, though.
For royalty-free books you could just share them and ask others to report back any errors they find ... |
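The search-and-replace idea for recurring OCR errors could look something like the sketch below. The misread/correction pairs are hypothetical examples, not taken from any real scan; in practice the list would come from the errors noted down while reading. Word boundaries (`\b`) keep the pattern from touching longer words that merely contain a misread.

```python
import re

# Hypothetical misread -> correction pairs; replace with your own findings.
CORRECTIONS = {"tbe": "the", "wbich": "which", "tlie": "the"}

# One pattern that matches any listed misread as a whole word.
PATTERN = re.compile(
    r"\b(" + "|".join(map(re.escape, CORRECTIONS)) + r")\b",
    re.IGNORECASE,
)


def _fix(match):
    """Return the correction, keeping a leading capital if the misread had one."""
    word = match.group(0)
    fixed = CORRECTIONS[word.lower()]
    return fixed.capitalize() if word[0].isupper() else fixed


def fix_ocr_errors(text):
    return PATTERN.sub(_fix, text)


sample = "Tbe map wbich shows tlie harbour is on tbe next page."
print(fix_ocr_errors(sample))
```

Only add a pair to the list when the misread is never a real word in the book; something like "arid" for "and" would need a manual check each time.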
02-10-2012, 04:11 AM | #3 | |
Evangelist
Posts: 450
Karma: 343115
Join Date: Nov 2009
Location: Romania
Device: PW2 2014
|
Pencil and paper? My god... Use the reader's highlight function, for crying out loud!
Scan Tailor outputs uncompressed TIFFs and that's why they take up so much space. But don't resize them! Resizing can add antialiasing (blurriness), which doesn't compress well. If you run them through Acrobat (reduced PDF then optimized PDF) they will get compressed. Just be careful not to choose lossy compression because TIFF can support either lossless or lossy. Quote:
Last edited by DSpider; 02-10-2012 at 06:16 AM. |
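To illustrate why blurred (antialiased) pages compress worse than crisp ones, here is a small self-contained experiment. It is only a sketch: the "page" is synthetic random black/white runs rather than a real scan, the blur is a simple 3-pixel box average, and zlib's Deflate stands in for the lossless compression that TIFF and PDF can use.

```python
import random
import zlib

WIDTH, HEIGHT = 512, 256


def make_sharp_row(rng):
    """One row of alternating white (255) and black (0) runs, like crisp text."""
    row = bytearray()
    value = 255
    while len(row) < WIDTH:
        row.extend([value] * rng.randint(3, 20))
        value = 255 - value
    return row[:WIDTH]


def blur_row(row):
    """3-pixel box blur: each pixel becomes the mean of itself and its neighbours."""
    last = len(row) - 1
    return bytearray(
        (row[max(i - 1, 0)] + row[i] + row[min(i + 1, last)]) // 3
        for i in range(len(row))
    )


rng = random.Random(0)  # fixed seed so the run is repeatable
sharp_rows = [make_sharp_row(rng) for _ in range(HEIGHT)]
sharp = b"".join(bytes(r) for r in sharp_rows)
blurred = b"".join(bytes(blur_row(r)) for r in sharp_rows)

sharp_size = len(zlib.compress(sharp, 9))
blurred_size = len(zlib.compress(blurred, 9))

print(f"raw size of each image:      {len(sharp)} bytes")
print(f"sharp, Deflate-compressed:   {sharp_size} bytes")
print(f"blurred, Deflate-compressed: {blurred_size} bytes")
```

The blurred version has extra grey values at every edge, so the compressor has more to encode even though both images start at the same raw size.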
|
02-10-2012, 09:52 AM | #4 | |
Zealot
Posts: 128
Karma: 238654
Join Date: Aug 2009
Device: Kobo Mini (4GB), Nook Classic wi-fi, iPod Touch (Bluefire Reader)
|
Quote:
For archiving I would use XHTML. Yeah, OCRing can be a pain, but it only takes me a couple of hours of work to get the OCR'd text fairly clean on the PC (it would probably be less if I took the time to build a decent scanning cradle), plus a quick read-through on my reader to catch the rest. |
|
02-13-2012, 08:44 PM | #5 |
Junior Member
Posts: 7
Karma: 10
Join Date: Feb 2012
Location: Florida USA
Device: Kindle 4 SO (Died), Kindle Fire HD 7"
|
I was not aware of the issues with resizing the TIFFs. Thanks for the tip - I'll refrain in future, but what's done is now done, as I had to delete the large files due to hard drive space - or lack thereof.
Yeah, it's an all-new and improved "format wars" all over again. At least the VHS/Beta war, the Netscape/IE war and the Blu-ray/HD DVD war were only two-player wars. This one is multiplayer, and if you choose wrong you're screwed good. I personally am starting to HATE PDF. It's SO easy to corrupt the files by accident, and then they're unusable. I have so many PDFs that are nothing but graphics and photos inside; they'll be a major pain to convert to anything readable.
I'd seriously consider archiving with XHTML, but I know less about that than about other formats. There are so many HTML derivations today that I can't keep up. Would XHTML be able to handle the line art and other non-text problems? What do I use to create/edit/view XHTML files? Do these convert to ePub and Mobi well? |
02-13-2012, 09:06 PM | #6 |
Grand Sorcerer
Posts: 11,470
Karma: 13095790
Join Date: Aug 2007
Location: Grass Valley, CA
Device: EB 1150, EZ Reader, Literati, iPad 2 & Air 2, iPhone 7
|
ePub is nothing more than XHTML files, image files and some metadata inside a zip container. It is a good format for keeping your archives since it is compressed and contains all the needed items in one file. It can be edited by taking it apart or by directly using a suitable ePub editing program. See the MobileRead wiki for technical details on all things ebook.
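To make the "zip container" point concrete, here is a sketch that assembles a bare-bones ePub-style archive with Python's standard `zipfile` module. The title, identifier, file names and the image bytes are all placeholders invented for the illustration. It has not been run through an ePub validator, and it leaves out the table of contents file (`toc.ncx`) that a complete EPUB 2 book needs, so treat it as a picture of the structure rather than a finished book.

```python
import io
import zipfile

CONTAINER_XML = """<?xml version="1.0" encoding="UTF-8"?>
<container version="1.0" xmlns="urn:oasis:names:tc:opendocument:xmlns:container">
  <rootfiles>
    <rootfile full-path="OEBPS/content.opf" media-type="application/oebps-package+xml"/>
  </rootfiles>
</container>
"""

# Metadata plus a manifest of every file in the book (placeholder values).
CONTENT_OPF = """<?xml version="1.0" encoding="UTF-8"?>
<package xmlns="http://www.idpf.org/2007/opf" version="2.0" unique-identifier="bookid">
  <metadata xmlns:dc="http://purl.org/dc/elements/1.1/">
    <dc:title>Example Scanned Book</dc:title>
    <dc:language>en</dc:language>
    <dc:identifier id="bookid">urn:uuid:00000000-0000-0000-0000-000000000000</dc:identifier>
  </metadata>
  <manifest>
    <item id="ch1" href="chapter1.xhtml" media-type="application/xhtml+xml"/>
    <item id="map" href="images/map.png" media-type="image/png"/>
  </manifest>
  <spine>
    <itemref idref="ch1"/>
  </spine>
</package>
"""

# Text lives in XHTML; line art is just an image referenced from it.
CHAPTER_XHTML = """<?xml version="1.0" encoding="UTF-8"?>
<html xmlns="http://www.w3.org/1999/xhtml">
<head><title>Chapter 1</title></head>
<body>
<h1>Chapter 1</h1>
<p>OCR'd text goes here.</p>
<p><img src="images/map.png" alt="Line-art map"/></p>
</body>
</html>
"""


def build_epub(image_bytes):
    """Return the bytes of a minimal ePub-style zip archive."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as epub:
        # The mimetype entry has to be first and stored without compression.
        epub.writestr("mimetype", "application/epub+zip",
                      compress_type=zipfile.ZIP_STORED)
        for name, data in [
            ("META-INF/container.xml", CONTAINER_XML),
            ("OEBPS/content.opf", CONTENT_OPF),
            ("OEBPS/chapter1.xhtml", CHAPTER_XHTML),
            ("OEBPS/images/map.png", image_bytes),
        ]:
            epub.writestr(name, data, compress_type=zipfile.ZIP_DEFLATED)
    return buffer.getvalue()


# Placeholder bytes; a real book would pass the contents of the scanned image.
epub_bytes = build_epub(b"placeholder image data")
with zipfile.ZipFile(io.BytesIO(epub_bytes)) as archive:
    print(archive.namelist())
```

Renaming any ePub to `.zip` and opening it shows the same kind of layout, which is what "taking it apart" amounts to.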
|
02-13-2012, 09:22 PM | #7 |
Grand Sorcerer
Posts: 11,470
Karma: 13095790
Join Date: Aug 2007
Location: Grass Valley, CA
Device: EB 1150, EZ Reader, Literati, iPad 2 & Air 2, iPhone 7
|
By the way, the format wars are pretty well settled into two basic camps: Amazon and everybody else. Basically Mobi vs. ePub, with PDF still used when a fixed format is needed, although several people are using ePub variants even for that. Of course there is still the problem with DRM, as there is no standard way to do that, unlike video formats.
|
02-14-2012, 09:49 AM | #8 |
Chocolate Grasshopper ...
Posts: 27,600
Karma: 20821184
Join Date: Mar 2008
Location: Scotland
Device: Muse HD , Cybook Gen3 , Pocketbook 302 (Black) , Nexus 10: wife has PW
|
Of course, just scanning and saving as PDF files would save an awful lot of work.
|
02-25-2012, 07:57 PM | #9 | |
Addict
Posts: 340
Karma: 43106
Join Date: Apr 2009
Location: Germany
Device: BeBook One, Pocketbook Touch, Pocketbook Touch HD
|
Quote:
My Bebook One is already a few years old and slow compared to current e-readers, but it is still good enough for reading. By the way, many people are surprised when they see me writing down something from an ebook. Some even make fun of me, saying that I need as much paper as for a printed book. |
|
02-29-2012, 04:37 AM | #10 |
BioReader
Posts: 292
Karma: 42568
Join Date: Apr 2009
Location: Germany
Device: Various
|
Depends on your objective:
- If you have an old (really old) book and want to conserve it this way -> TIFF or PDF with facsimile-like images embedded: readable on PC/tablet/10" e-reader. Chances are good that you can still read it in a few years.
- If you want to read immediately -> cleaning and a fast screening for errors -> searchable PDF: readable on PC/tablet/10" e-reader (6" only if unavoidable).
- If you are the "editor" type of person with a sense for aesthetics -> cleaning, thorough editing, layout and publishing to ePub or whatever: readable on all current e-readers. What happens in a few years?
This sequence also represents a scale of effort.
Klaus |
03-03-2012, 09:26 AM | #11 |
Addict
Posts: 340
Karma: 43106
Join Date: Apr 2009
Location: Germany
Device: BeBook One, Pocketbook Touch, Pocketbook Touch HD
|
Personally, I keep all the books that I scanned as XHTML. I think it is very unlikely that the XHTML format will stop working from one day to the next.
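For anyone wondering what keeping a scanned book as XHTML involves, here is a minimal sketch that wraps OCR'd plain text in an XHTML page and checks that the result is well-formed XML. The function name, the template and the sample text are all made up for the illustration; real OCR output would usually need headings, images and cleanup on top of this.

```python
from html import escape
import xml.etree.ElementTree as ET

XHTML_TEMPLATE = """<?xml version="1.0" encoding="UTF-8"?>
<html xmlns="http://www.w3.org/1999/xhtml">
<head><title>{title}</title></head>
<body>
<h1>{title}</h1>
{body}
</body>
</html>
"""


def text_to_xhtml(title, ocr_text):
    """Wrap blank-line-separated OCR paragraphs in a minimal XHTML page."""
    # Collapse line breaks inside a paragraph; OCR output is usually hard-wrapped.
    paragraphs = [
        " ".join(block.split())
        for block in ocr_text.split("\n\n")
        if block.strip()
    ]
    # escape() turns &, < and > into entities so the markup stays valid.
    body = "\n".join(f"<p>{escape(p)}</p>" for p in paragraphs)
    return XHTML_TEMPLATE.format(title=escape(title), body=body)


sample = "Ships & harbours of the\nold coast.\n\nChapter <one> begins here."
page = text_to_xhtml("Coastal Notes", sample)

# Raises xml.etree.ElementTree.ParseError if the page is not well-formed XML.
ET.fromstring(page.encode("utf-8"))
print(page)
```

Because the result is plain text with simple tags, it can be opened in any browser or text editor, which is a large part of why it is unlikely to become unreadable.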
|
03-04-2012, 06:37 AM | #12 |
eBook Enthusiast
Posts: 85,544
Karma: 93383043
Join Date: Nov 2006
Location: UK
Device: Kindle Oasis 2, iPad Pro 10.5", iPhone 6
|
It's a very good idea to keep the original page scans (e.g. as PDF) so that you can proof the text against them. OCR is far from perfect.
|
03-04-2012, 04:06 PM | #13 |
Banned
Posts: 132
Karma: 566638
Join Date: Aug 2011
Location: Wouldn't you like to know.
Device: Sony PRS-350:Sony PRS-T1:Rooted Nook Tablet
|
Based on the consensus of this board, if you own the tree copy of the book it is okay to download a copy of the e-book...regardless of the source. So why go to the trouble of scanning, proofing, and formatting when the book is probably already out there somewhere? Just a question...
|
03-04-2012, 04:10 PM | #14 |
eBook Enthusiast
Posts: 85,544
Karma: 93383043
Join Date: Nov 2006
Location: UK
Device: Kindle Oasis 2, iPad Pro 10.5", iPhone 6
|
I really don't think you're right in saying that that's a "consensus". Most people would, I think, call that "piracy". Owning a paper book certainly doesn't entitle you to a "free" eBook, any more than owning a hardback entitles you to a free paperback.
|
03-04-2012, 07:56 PM | #15 | |||
Banned
Posts: 132
Karma: 566638
Join Date: Aug 2011
Location: Wouldn't you like to know.
Device: Sony PRS-350:Sony PRS-T1:Rooted Nook Tablet
|
Quote:
On one particular poll almost 65% of the people said they 'pirate' books they have in tree format currently. Quote:
Quote:
Those two reasons apply to the OP, they have a book that A) They own and B) it is not available in electronic format. Using that as a basis for my question, I merely asked 'why go to the trouble of scanning, proofing, and formatting when the book is probably already out there somewhere?' |
|||
|