Converting a scanned book from 1DollarScan to ePub

adrenaline · 09-29-2014, 01:43 AM

Hello guys,

This is a sample page from one of the books scanned using 1DollarScan (600 dpi):

https://www.dropbox.com/s/j18r16ed7t...0Page.pdf?dl=0

I was thinking of trying Custom Book Scanning for the following reasons:

1. They offer ePub/MOBI for $10 more.

2. Their PDF scan is supposedly 1200 dpi.

I saw posts of users here trying to convert their PDF to ePub by first converting it to HTML by Abbyy Fine Reader.

Here's that page converted to HTML (Please refer to attachment).

1. Based on the results, I feel that an ePub would be terrible for my book.

2. Also, I read that scanning with higher DPI hurts OCR. Is that true?

The main use of these eBooks is just text searching. I would have hard copies of the same books.

Would really appreciate any comments on this. So sorry for the long post!

JSWolf · 09-29-2014, 01:56 AM

No matter how you go from PDF > ePub, you have to A/B compare the PDF to the ePub. You have to A/B compare every character, every space, every punctuation mark, EVERYTHING in order to make 100% sure your ePub has no errors added by the conversion.

I've seen too many PDF > ePub conversions where you know the source was PDF and the errors are due to the conversion.
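
As a rough illustration of the A/B comparison described above, here is a minimal sketch using Python's standard `difflib` module. It assumes you have already extracted plain text from both the PDF and the ePub (the extraction step is not shown), and the sample strings and the `ab_compare` name are made up for the example. A diff like this only shows where the two texts disagree; it cannot catch errors that are present in both.

```python
import difflib


def ab_compare(source_text: str, converted_text: str):
    """Return (tag, source_fragment, converted_fragment) for each word-level difference."""
    a = source_text.split()
    b = converted_text.split()
    # autojunk=False keeps frequent words from being ignored as "junk" on long texts
    matcher = difflib.SequenceMatcher(a=a, b=b, autojunk=False)
    diffs = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            diffs.append((tag, " ".join(a[i1:i2]), " ".join(b[j1:j2])))
    return diffs


# Hypothetical sample text, standing in for text extracted from each file
pdf_text = "The clock in the corner struck one."
epub_text = "The dock in the comer struck one."

for tag, src, conv in ab_compare(pdf_text, epub_text):
    print(f"{tag}: {src!r} -> {conv!r}")
```

Splitting on whitespace keeps the report readable for prose; a character-level pass would be needed to check spacing and punctuation as strictly as the post asks.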

adrenaline · 09-29-2014, 02:04 AM

Thanks a lot JSWolf.

What do you think about the 1200 dpi scanning compared to 600?

Thanks again.

taustin · 09-29-2014, 02:34 AM

Quote:

Originally Posted by adrenaline

Thanks a lot JSWolf.

What do you think about the 1200 dpi scanning compared to 600?

Thanks again.

Most OCR packages have a recommended range of resolution, including a maximum. Whether or not (or how much) it hurts the process is debatable, and likely depends on a lot of factors. But beyond about 300 dpi, it doesn't really help.

Don't most book scanning services include an OCR option? It's a heck of a lot easier to convert a Word file to epub than PDF.

HarryT · 09-29-2014, 03:05 AM

Moved to the "Workshop" forum.

Toxaris · 09-29-2014, 04:32 AM

Quote:

Originally Posted by taustin

Whether or not (or how much) it hurts the process is debatable, and likely depends on a lot of factors. But beyond about 300 dpi, it doesn't really help.

Sorry, not completely true. For most books 300dpi would be sufficient, but it really depends on the source. I scan at 400dpi and get far fewer OCR errors, especially with older pocket books and paperbacks. Everything over 600dpi would be overkill. The downside is the decrease in scanning speed. I find that 400dpi is a good tradeoff.

JoHunt · 09-29-2014, 05:53 AM

Hi adrenaline

I routinely buy out-of-print books and send them for scanning. Based on my own experience:

1. No scanning service can convert a pdf into a decent epub/mobi. There are too many OCR errors in scanning to even consider this, so save the $10.

2. A pdf scan of 1200 dpi for an ebook is overkill and just produces a monstrously large file that will choke most programs.

3. I buy the books to read, so my workflow is to convert the pdf to html in Abbyy Finereader and then to convert the html to epub using Sigil. I have a pretty good idea of what to look for now, so the whole process is not that tedious and time consuming. My accuracy rate is about 95%, which is sufficient for me since I do the conversion for my own use only (I'd rather spend the time reading instead of comparing every single character).

If you only need the ebooks to search text, would not a simple scan to pdf with OCR work? Why would you need to further convert them to epub/mobi?
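
To put the "monstrously large file" point in numbers, here is a small sketch of how the raw pixel count grows with scan resolution. The 6 x 9 inch page size and 8-bit grayscale depth are assumptions for illustration, and real PDFs are compressed, so actual file sizes will be smaller. The ratio between resolutions is what matters.

```python
def scan_pixels(width_in: float, height_in: float, dpi: int) -> int:
    """Number of pixels in a scan of a page of the given size at the given dpi."""
    return round(width_in * dpi) * round(height_in * dpi)


def uncompressed_bytes(width_in: float, height_in: float, dpi: int,
                       bits_per_pixel: int = 8) -> int:
    """Raw image size before any compression (8 bits per pixel = grayscale)."""
    return scan_pixels(width_in, height_in, dpi) * bits_per_pixel // 8


# Assumed page size in inches; adjust for the actual book
PAGE_W, PAGE_H = 6.0, 9.0

for dpi in (300, 400, 600, 1200):
    mb = uncompressed_bytes(PAGE_W, PAGE_H, dpi) / 1_000_000
    print(f"{dpi:>4} dpi: {mb:8.1f} MB per page, uncompressed")
```

Because both dimensions scale with dpi, doubling the resolution from 600 to 1200 dpi quadruples the pixel count per page.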

HarryT · 09-29-2014, 06:02 AM

Quote:

Originally Posted by joehunt

3. I buy the books to read, so my workflow is to convert the pdf to html in Abbyy Finereader and then to convert the html to epub using Sigil. I have a pretty good idea of what to look for now, so the whole process is not that tedious and time consuming. My accuracy rate is about 95%, which is sufficient for me since I do the conversion for my own use only (I'd rather spend the time reading instead of comparing every single character).

An OCR accuracy rate of 95% is an error in one character in 20, or about 1 in every 4 words, which is pretty appalling. Decent OCR should give you accuracy of about 99.9%, or 1 character in 1000, or about 1 in every 200 words (roughly 1 error per page).
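
The arithmetic above can be sketched as follows. The figure of 5 characters per word is implied by the numbers in the post (1 character in 20 being about 1 word in 4); the 250 words per page is an assumption added for the example, so the per-page figure is only approximate.

```python
def errors_from_accuracy(accuracy_pct: float,
                         chars_per_word: float = 5.0,
                         words_per_page: float = 250.0):
    """Convert a character-level OCR accuracy into error frequencies.

    Returns (chars_per_error, words_per_error, errors_per_page).
    chars_per_word and words_per_page are rough assumptions, not measured values.
    """
    error_pct = 100.0 - accuracy_pct
    chars_per_error = 100.0 / error_pct
    words_per_error = chars_per_error / chars_per_word
    errors_per_page = words_per_page * chars_per_word / chars_per_error
    return chars_per_error, words_per_error, errors_per_page


for accuracy in (95.0, 99.9):
    chars, words, per_page = errors_from_accuracy(accuracy)
    print(f"{accuracy}%: 1 error per {chars:.0f} characters, "
          f"per {words:.0f} words, about {per_page:.1f} per page")
```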

JoHunt · 09-29-2014, 06:29 AM

1 character in 20? I think not. Anyway 95% was just a guess because I rarely have more than 20 errors in an entire book

I just wanted to stress that I'm happy with the results (and time saved) not comparing character for character.

HarryT · 09-29-2014, 06:31 AM

Quote:

Originally Posted by joehunt

1 character in 20? I think not. Anyway 95% was just a guess because I rarely have more than 20 errors in an entire book

Then you have an accuracy rate of well over 99.9%, which is what one would expect from a decent OCR program like Abbyy Finereader. 95% accuracy for OCR would be truly abysmal.

mrmikel · 09-29-2014, 07:05 AM

The main problem I see with super high dpi is that every smudge, dot, etc. on the page becomes a character you have to get rid of. The only benefit at all might be for pictures, if there are lots of them and they are very high quality. I don't think this is the common situation for out-of-print books, and if you are talking about eink, it is a colossal waste of time since the resolution is so low.

Ghitulescu · 09-29-2014, 12:09 PM

I've seen a lot of scanned books in my life.
Frankly, I would rather type them by hand than to correct their spelling mistakes and/or paginations.

I believe a lot of the people who answered are native English speakers. Well, any OCR software can be trained to recognize 26 letters, but for non-ASCII users (like Bangla above) the errors increase tenfold. With diacritics, it may even be that scanning errors (like random black dots) create a new character.

A good example of what I mean can be found on archive.org. Compare the PDF (scanned, but with a text layer) and the EPUB files.

Hitch · 09-29-2014, 09:04 PM

Quote:

Originally Posted by Ghitulescu

I've seen a lot of scanned books in my life.
Frankly, I would rather type them by hand than to correct their spelling mistakes and/or paginations.

I believe a lot of the people who answered are native English speakers. Well, any OCR software can be trained to recognize 26 letters, but for non-ASCII users (like Bangla above) the errors increase tenfold. With diacritics, it may even be that scanning errors (like random black dots) create a new character.

A good example of what I mean can be found on archive.org. Compare the PDF (scanned, but with a text layer) and the EPUB files.

Incompetent scanning and OCR will always result in poor-quality output. A good scanner, with competent OCR, can achieve a 99.995% rate. That's imperfect, but not bad. Of course, $1/scan and that ilk aren't going to give you a 99.995, because they're not running human A/B compares, which is, realistically, the only way to get to that level of quality. {shrug}.

I certainly would not consider typing a book instead of scanning it. No offense, but I find the idea crazy. Take a high-quality scan, a good A/B, run it through Toxaris' program, and you have a very, very high quality starting place.

The problem we see on these forums--all the time--is that nobody ever wants to do the "grunty" work of correcting the scanned material. Everybody wants a magic bullet. It doesn't exist.

Hitch

mrmikel · 09-30-2014, 06:41 AM

I have seen it over and over again that a mistaken scan will produce perfectly plausible and grammatically correct, but wrong, output.

HarryT · 09-30-2014, 06:51 AM

Quote:

Originally Posted by mrmikel

I have seen it over and over again that a mistaken scan will produce perfectly plausible and grammatically correct, but wrong, output.

Precisely. Errors such as "dock" instead of "clock", "comer" instead of "corner", etc, are commonplace, and spell-checkers won't find them. The only way to find such errors (and I must politely disagree with Hitch's assertion that nobody does so) is to do a word-by-word manual comparison of the original document with the OCR'd text. This is extremely labour-intensive: I've had years of practice at it, and I reckon I can proof-read around 15 pages an hour with a typical novel, so that would be about 33 hours' work for a 500-page book.
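
One way to narrow the manual check, sketched below, is to flag words that turn into another valid word when a known OCR confusion is reversed. This is only an aid and not a replacement for the word-by-word comparison: the two confusion pairs are taken from the examples above ("cl" read as "d", "rn" read as "m"), while the tiny vocabulary and the `suspicious_words` name are hypothetical. A real run would need a proper word list and would still leave a human to decide which reading is right.

```python
# (intended letters, what OCR tends to produce), taken from the examples in the post
CONFUSIONS = [("cl", "d"), ("rn", "m")]


def suspicious_words(text: str, vocabulary: set) -> list:
    """Return (word_in_text, possible_intended_word) pairs worth a manual look."""
    hits = []
    for raw in text.split():
        word = raw.strip(".,;:!?\"'()").lower()
        for intended, misread in CONFUSIONS:
            start = word.find(misread)
            while start != -1:
                # Undo the confusion at this position and see if a real word results
                candidate = word[:start] + intended + word[start + len(misread):]
                if candidate in vocabulary and candidate != word:
                    hits.append((word, candidate))
                start = word.find(misread, start + 1)
    return hits


# Hypothetical mini vocabulary; a real check would load a full dictionary
VOCAB = {"clock", "dock", "corner", "comer", "the", "in", "struck", "one"}

print(suspicious_words("The dock in the comer struck one.", VOCAB))
```

Both "dock" and "comer" are valid words, which is why a spell-checker passes them; the sketch flags them only because reversing a known confusion yields another valid word.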


