09-29-2014, 01:43 AM | #1 |
Enthusiast
Posts: 43
Karma: 28554
Join Date: Mar 2013
Device: Kindle Keyboard, KPW2
|
Converting a scanned book from 1DollarScan to ePub
Hello guys,
This is a sample page from one of the books scanned using 1DollarScan (600 dpi): https://www.dropbox.com/s/j18r16ed7t...0Page.pdf?dl=0
I was thinking of trying Custom Book Scanning for the following reasons:
1. They offer ePub/MOBI for $10 more.
2. Their PDF scan is supposedly 1200 dpi.
I saw posts from users here who convert their PDFs to ePub by first converting them to HTML with Abbyy FineReader. Here's that page converted to HTML (please refer to the attachment).
1. Based on the results, I feel that an ePub would be terrible for my book.
2. Also, I read that scanning at a higher DPI hurts OCR. Is that true?
The main use of these eBooks is just text searching. I would have hard copies of the same books.
I would really appreciate any comments on this. So sorry for the long post!
Last edited by adrenaline; 09-29-2014 at 01:48 AM. |
09-29-2014, 01:56 AM | #2 |
Resident Curmudgeon
Posts: 73,983
Karma: 128903378
Join Date: Nov 2006
Location: Roslindale, Massachusetts
Device: Kobo Libra 2, Kobo Aura H2O, PRS-650, PRS-T1, nook STR, PW3
|
No matter how you go from PDF > ePub, you have to A/B compare the PDF to the ePub. You have to A/B compare every character, every space, every punctuation mark, EVERYTHING in order to make 100% sure your ePub has no errors added by the conversion.
I've seen too many PDF > ePub conversions where you know the source was a PDF and the errors are due to the conversion. |
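Part of an A/B comparison like this can be automated when both versions exist as plain text. The following is only a sketch: it assumes the text of the PDF and of the ePub has already been extracted into strings (the extraction step is not shown), it compares word by word rather than character by character, and the function name `ab_compare` is made up for illustration. It uses Python's standard `difflib` module.

```python
import difflib


def ab_compare(pdf_text: str, epub_text: str) -> list[str]:
    """Return one line per word-level difference between the two texts."""
    a = pdf_text.split()
    b = epub_text.split()
    # autojunk=False keeps frequent words from being treated as noise
    matcher = difflib.SequenceMatcher(None, a, b, autojunk=False)
    report = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            continue  # identical runs need no attention
        report.append(f"{tag}: {' '.join(a[i1:i2])!r} -> {' '.join(b[j1:j2])!r}")
    return report


# Hypothetical sample strings, not taken from a real book
pdf_text = "The clock in the corner struck ten."
epub_text = "The dock in the comer struck ten."
for line in ab_compare(pdf_text, epub_text):
    print(line)
```

A diff of this kind only finds errors introduced between the two text versions. If the PDF's own text layer came from OCR, mistakes already present in it will match on both sides and go unreported, so the page image still has to be checked by eye.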
09-29-2014, 02:04 AM | #3 |
Enthusiast
Posts: 43
Karma: 28554
Join Date: Mar 2013
Device: Kindle Keyboard, KPW2
|
Thanks a lot JSWolf.
What do you think about the 1200 dpi scanning compared to 600? Thanks again. |
09-29-2014, 02:34 AM | #4 | |
Wizard
Posts: 1,358
Karma: 5766642
Join Date: Aug 2010
Device: Nook
|
Quote:
Don't most book scanning services include an OCR option? It's a heck of a lot easier to convert a Word file to epub than a PDF. |
|
09-29-2014, 03:05 AM | #5 |
eBook Enthusiast
Posts: 85,544
Karma: 93383043
Join Date: Nov 2006
Location: UK
Device: Kindle Oasis 2, iPad Pro 10.5", iPhone 6
|
Moved to the "Workshop" forum.
|
09-29-2014, 04:32 AM | #6 |
Wizard
Posts: 4,520
Karma: 121692313
Join Date: Oct 2009
Location: Heemskerk, NL
Device: PRS-T1, Kobo Touch, Kobo Aura
|
Sorry, not completely true. For most books 300 dpi would be sufficient, but it really depends on the source. I scan at 400 dpi and get far fewer OCR errors, especially with older pocket books and paperbacks. Everything over 600 dpi would be overkill. The downside of a higher resolution is the decrease in scanning speed. I find that 400 dpi is a good tradeoff.
|
09-29-2014, 05:53 AM | #7 |
I am what I am
Posts: 6,625
Karma: 62235665
Join Date: Sep 2011
Device: iPad3, Voyage
|
Hi adrenaline
I routinely buy out-of-print books and send them for scanning. Based on my own experience:
1. No scanning service can convert a pdf into a decent epub/mobi. There are too many OCR errors in scanning to even consider this, so save the $10.
2. A pdf scan of 1200 dpi for an ebook is overkill and just produces a monstrously large file that will choke most programs.
3. I buy the books to read, so my workflow is to convert the pdf to html in Abbyy FineReader and then convert the html to epub using Sigil. I have a pretty good idea of what to look for now, so the whole process is not that tedious or time-consuming. My accuracy rate is about 95%, which is sufficient for me since I do the conversion for my own use only (I'd rather spend the time reading than comparing every single character).
If you only need the ebooks to search text, wouldn't a simple scan to pdf with OCR work? Why would you need to further convert them to epub/mobi? |
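The file-size point about 1200 dpi can be checked with a little arithmetic: doubling the resolution quadruples the pixel count. The sketch below assumes a 6 x 9 inch page, 8-bit grayscale and no compression. All of these are illustrative assumptions, and real PDFs are compressed, so actual files will be much smaller, but the 4x ratio between 600 and 1200 dpi still holds for the raw image data.

```python
def scan_pixels(width_in: float, height_in: float, dpi: int) -> int:
    """Number of pixels in one scanned page."""
    return round(width_in * dpi) * round(height_in * dpi)


def raw_size_mb(width_in: float, height_in: float, dpi: int,
                bytes_per_pixel: int = 1, pages: int = 1) -> float:
    """Uncompressed size in megabytes (1 MB = 1,000,000 bytes)."""
    total_bytes = scan_pixels(width_in, height_in, dpi) * bytes_per_pixel * pages
    return total_bytes / 1_000_000


# Assumed page size: 6 x 9 inches, 8-bit grayscale (1 byte per pixel)
for dpi in (300, 400, 600, 1200):
    print(f"{dpi:>4} dpi: {raw_size_mb(6, 9, dpi):7.2f} MB per page, uncompressed")
```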
09-29-2014, 06:02 AM | #8 | |
eBook Enthusiast
Posts: 85,544
Karma: 93383043
Join Date: Nov 2006
Location: UK
Device: Kindle Oasis 2, iPad Pro 10.5", iPhone 6
|
Quote:
|
|
09-29-2014, 06:29 AM | #9 |
I am what I am
Posts: 6,625
Karma: 62235665
Join Date: Sep 2011
Device: iPad3, Voyage
|
1 character in 20? I think not. Anyway, 95% was just a guess, because I rarely have more than 20 errors in an entire book. I just wanted to stress that I'm happy with the results (and the time saved) without comparing character for character.
|
09-29-2014, 06:31 AM | #10 |
eBook Enthusiast
Posts: 85,544
Karma: 93383043
Join Date: Nov 2006
Location: UK
Device: Kindle Oasis 2, iPad Pro 10.5", iPhone 6
|
Then you have an accuracy rate of well over 99.9%, which is what one would expect from a decent OCR program like Abbyy FineReader. 95% accuracy for OCR would be truly abysmal.
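To put numbers on this, here is a small sketch of the arithmetic. The book length of 500,000 characters is an assumption chosen for illustration, not a figure from this thread.

```python
def char_accuracy(errors: int, total_chars: int) -> float:
    """Fraction of characters that were recognised correctly."""
    return 1 - errors / total_chars


def expected_errors(accuracy: float, total_chars: int) -> float:
    """Number of wrong characters implied by a given accuracy."""
    return (1 - accuracy) * total_chars


BOOK_CHARS = 500_000  # assumed length of a typical novel, in characters

# 20 errors in the whole book
print(f"{char_accuracy(20, BOOK_CHARS):.3%}")

# What 95% accuracy would really mean: one character in twenty is wrong
print(round(expected_errors(0.95, BOOK_CHARS)), "wrong characters")
```

Under that assumption, 20 errors per book works out to about 99.996% accuracy, while a true 95% rate would mean about 25,000 wrong characters.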
|
09-29-2014, 07:05 AM | #11 |
Color me gone
Posts: 2,089
Karma: 1445295
Join Date: Apr 2008
Location: Central Oregon Coast
Device: PRS-300
|
The main problem I see with super high dpi is that every smudge, dot, etc. on the page becomes a character you have to get rid of. The only benefit at all might be for pictures, if there are lots of them and they are very high quality. I don't think this is the common situation for out-of-print books, and if you are talking about eink, it's a colossal waste of time since the resolution is so low.
|
09-29-2014, 12:09 PM | #12 |
Fanatic
Posts: 563
Karma: 403106
Join Date: Aug 2014
Device: PRS-T1
|
I've seen a lot of scanned books in my life.
Frankly, I would rather type them by hand than correct their spelling mistakes and/or pagination. I believe a lot of the people who answered are native English speakers. Any OCR software can be trained to recognize 26 letters, but for users of non-ASCII scripts (like Bangla above) the errors increase tenfold. With diacritics, scanning artifacts (like random black dots) may even create a new character. A good example of what I mean can be found on archive.org: compare the PDF (scanned, but with a text layer) and the EPUB files. |
09-29-2014, 09:04 PM | #13 | |
Bookmaker & Cat Slave
Posts: 11,462
Karma: 158448243
Join Date: Apr 2010
Location: Phoenix, AZ
Device: K2, iPad, KFire, PPW, Voyage, NookColor. 2 Droid, Oasis, Boox Note2
|
Quote:
I certainly would not consider typing a book instead of scanning it. No offense, but I find the idea crazy. Take a high-quality scan, a good A/B, run it through Toxaris' program, and you have a very, very high quality starting place.
The problem we see on these forums--all the time--is that nobody ever wants to do the "grunty" work of correcting the scanned material. Everybody wants a magic bullet. It doesn't exist.
Hitch |
|
09-30-2014, 06:41 AM | #14 |
Color me gone
Posts: 2,089
Karma: 1445295
Join Date: Apr 2008
Location: Central Oregon Coast
Device: PRS-300
|
I have seen it over and over again that a mistaken scan will produce perfectly plausible and grammatically correct, but wrong, output.
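Errors of this kind usually come from look-alike letter shapes, for example "rn" read as "m" or "cl" read as "d". A spell-checker passes them because the result is a real word, but they can at least be flagged for a second look. The sketch below is only an illustration: the two confusion pairs and the tiny vocabulary are assumptions, and a real check would need a full word list and many more pairs. It cannot tell which reading is correct, so every hit still has to be checked against the page.

```python
# (what the OCR produced, what the page may actually say) -- illustrative pairs
CONFUSIONS = [("m", "rn"), ("d", "cl")]


def suspicious_words(text: str, vocabulary: set[str]) -> list[tuple[str, str]]:
    """Return (word, alternative) pairs where a look-alike swap gives another real word."""
    hits = []
    for raw in text.split():
        word = raw.strip(".,;:!?\"'()").lower()
        for seen, actual in CONFUSIONS:
            start = word.find(seen)
            while start != -1:
                # Swap one occurrence of the look-alike and test the result
                candidate = word[:start] + actual + word[start + len(seen):]
                if candidate in vocabulary and (word, candidate) not in hits:
                    hits.append((word, candidate))
                start = word.find(seen, start + 1)
    return hits


# Hypothetical vocabulary; in practice load a complete dictionary file
vocabulary = {"the", "in", "struck", "ten", "dock", "clock", "comer", "corner"}
for word, alternative in suspicious_words("The dock in the comer struck ten.", vocabulary):
    print(f"check '{word}': could the page say '{alternative}'?")
```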
|
09-30-2014, 06:51 AM | #15 |
eBook Enthusiast
Posts: 85,544
Karma: 93383043
Join Date: Nov 2006
Location: UK
Device: Kindle Oasis 2, iPad Pro 10.5", iPhone 6
|
Precisely. Errors such as "dock" instead of "clock", "comer" instead of "corner", etc., are commonplace, and spell-checkers won't find them. The only way to find such errors (and I must politely disagree with Hitch's assertion that nobody does so) is to do a word-by-word manual comparison of the original document with the OCR'd text. This is extremely labour-intensive: I've had years of practice at it, and I reckon I can proof-read about 15 pages an hour with a typical novel, so that would be about 33 hours' work for a 500-page book.
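The estimate is simple division, and the same sum can be run for any book. In this sketch the 15 pages per hour and the 500 pages are the figures from this post, while the 8-hour working day is an added assumption used only to convert hours into days.

```python
def proofreading_hours(pages: int, pages_per_hour: float) -> float:
    """Hours needed to proof-read a book at a steady rate."""
    return pages / pages_per_hour


hours = proofreading_hours(500, 15)
print(f"about {hours:.0f} hours")

# Assumption: 8 hours of proofreading per day
print(f"roughly {hours / 8:.1f} full days")
```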
|
|