Old 09-29-2014, 01:43 AM   #1
adrenaline
Enthusiast
Posts: 43
Karma: 28554
Join Date: Mar 2013
Device: Kindle Keyboard, KPW2
Converting a scanned book from 1DollarScan to ePub

Hello guys,

This is a sample page from one of the books scanned using 1DollarScan (600 dpi):

https://www.dropbox.com/s/j18r16ed7t...0Page.pdf?dl=0

I was thinking of trying Custom Book Scanning for the following reasons:

1. They offer ePub/MOBI for $10 more.

2. Their PDF scan is supposedly 1200 dpi.

I have seen posts from users here who convert their PDFs to ePub by first converting them to HTML with Abbyy FineReader.

Here's that page converted to HTML (please refer to the attachment).

1. Based on the results, I feel that an ePub would be terrible for my book.

2. Also, I read that scanning with higher DPI hurts OCR. Is that true?

The main use of these eBooks is just text searching. I will have hard copies of the same books.

I would really appreciate any comments on this. Sorry for the long post!
Attachment: ScreenHunter_96 Sep. 28 23.40.png (70.9 KB)

Last edited by adrenaline; 09-29-2014 at 01:48 AM.
Old 09-29-2014, 01:56 AM   #2
JSWolf
Resident Curmudgeon
Posts: 73,984
Karma: 128903378
Join Date: Nov 2006
Location: Roslindale, Massachusetts
Device: Kobo Libra 2, Kobo Aura H2O, PRS-650, PRS-T1, nook STR, PW3
No matter how you go from PDF > ePub, you have to A/B compare the PDF to the ePub. You have to A/B compare every character, every space, every punctuation mark, EVERYTHING in order to make 100% sure your ePub has no errors added by the conversion.

I've seen too many PDF > ePub conversions where it is obvious that the source was a PDF and the errors came from the conversion.
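
A minimal Python sketch of what a word-level A/B compare could look like when both versions are already available as plain text. It uses the standard-library `difflib` module; the function name `ab_compare` and the two sample strings are hypothetical. For a scanned book the "source" is a page image, so this only helps when a trusted text exists to compare against; otherwise the comparison still has to be done by eye.

```python
import difflib


def ab_compare(source_text: str, converted_text: str) -> list[tuple[str, str, str]]:
    """Return (operation, source_fragment, converted_fragment) for every word-level difference."""
    a = source_text.split()
    b = converted_text.split()
    # autojunk=False keeps difflib from ignoring very frequent words in long texts.
    matcher = difflib.SequenceMatcher(None, a, b, autojunk=False)
    diffs = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            diffs.append((tag, " ".join(a[i1:i2]), " ".join(b[j1:j2])))
    return diffs


# Hypothetical sample: trusted source text versus the converted ePub text.
pdf_text = "The clock in the corner struck ten."
epub_text = "The dock in the comer struck ten."

for op, src, conv in ab_compare(pdf_text, epub_text):
    print(f"{op}: {src!r} -> {conv!r}")
```

Splitting on whitespace means that spacing differences are not reported; comparing the raw strings character by character with the same matcher would catch those as well, at the cost of noisier output.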
Old 09-29-2014, 02:04 AM   #3
adrenaline
Thanks a lot JSWolf.

What do you think about the 1200 dpi scanning compared to 600?

Thanks again.
Old 09-29-2014, 02:34 AM   #4
taustin
Wizard
Posts: 1,358
Karma: 5766642
Join Date: Aug 2010
Device: Nook
Quote:
Originally Posted by adrenaline View Post
Thanks a lot JSWolf.

What do you think about the 1200 dpi scanning compared to 600?

Thanks again.
Most OCR packages have a recommended range of resolution, including a maximum. Whether (or how much) a higher resolution hurts the process is debatable, and likely depends on a lot of factors. But beyond about 300 dpi, it doesn't really help.

Don't most book scanning services include an OCR option? It's a heck of a lot easier to convert a Word file to epub than a PDF.
Old 09-29-2014, 03:05 AM   #5
HarryT
eBook Enthusiast
Posts: 85,544
Karma: 93383043
Join Date: Nov 2006
Location: UK
Device: Kindle Oasis 2, iPad Pro 10.5", iPhone 6
Moved to the "Workshop" forum.
Old 09-29-2014, 04:32 AM   #6
Toxaris
Wizard
Posts: 4,520
Karma: 121692313
Join Date: Oct 2009
Location: Heemskerk, NL
Device: PRS-T1, Kobo Touch, Kobo Aura
Quote:
Originally Posted by taustin View Post
Whether or not (or how much) it hurts the process is debatable, and likely depends on a lot of factors. But beyond about 300 dpi, it doesn't really help.
Sorry, not completely true. For most books 300dpi would be sufficient, but it really depends on the source. I scan at 400dpi and get far fewer OCR errors, especially with older pocket books and paperbacks. Everything over 600dpi would be overkill. The downside is the decrease in scanning speed. I find that 400dpi is a good tradeoff.
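
A rough Python sketch of why the resolution choice matters for file size: the pixel count grows with the square of the dpi. The 6 x 9 inch page size, the 8-bit greyscale depth, and the function name `raw_page_megabytes` are assumptions for illustration. Real scans are compressed, so actual files are smaller, but the scaling is the same.

```python
def raw_page_megabytes(dpi: int, width_in: float = 6.0, height_in: float = 9.0,
                       bytes_per_pixel: int = 1) -> float:
    """Uncompressed size of one scanned page in megabytes (assumed page size and depth)."""
    pixels = round(width_in * dpi) * round(height_in * dpi)
    return pixels * bytes_per_pixel / 1_000_000


for dpi in (300, 400, 600, 1200):
    size = raw_page_megabytes(dpi)
    ratio = size / raw_page_megabytes(300)
    print(f"{dpi:>4} dpi: {size:6.2f} MB per page, {ratio:4.1f}x the 300 dpi size")
```

Under these assumptions a 1200 dpi page holds 16 times as much raw data as a 300 dpi page, and a 400 dpi page a little under twice as much.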
Old 09-29-2014, 05:53 AM   #7
JoHunt
I am what I am
Posts: 6,625
Karma: 62235665
Join Date: Sep 2011
Device: iPad3, Voyage
Hi adrenaline

I routinely buy out-of-print books and send them for scanning. Based on my own experience:

1. No scanning service can convert a pdf into a decent epub/mobi. There are too many OCR errors in scanning to even consider this, so save the $10.

2. A pdf scan of 1200 dpi for an ebook is overkill and just produces a monstrously large file that will choke most programs.

3. I buy the books to read, so my workflow is to convert the pdf to html in Abbyy Finereader and then to convert the html to epub using Sigil. I have a pretty good idea of what to look for now, so the whole process is not that tedious and time-consuming. My accuracy rate is about 95%, which is sufficient for me since I do the conversion for my own use only (I'd rather spend the time reading instead of comparing every single character).

If you only need the ebooks to search text, would not a simple scan to pdf with OCR work? Why would you need to further convert them to epub/mobi?
Old 09-29-2014, 06:02 AM   #8
HarryT
Quote:
Originally Posted by joehunt View Post
3. I buy the books to read, so my workflow is to convert the pdf to html in Abbyy Finereader and then to convert the html to epub using Sigil. I have a pretty good idea of what to look for now, so the whole process is not that tedious and time consuming. My accuracy rate is about 95%, which is sufficient for me since I do the conversion for my own use only (I'd rather spend the time reading instead of comparing every single character).
An OCR accuracy rate of 95% means one wrong character in 20, or about one error in every 4 words, which is pretty appalling. Decent OCR should give you an accuracy of about 99.9%, which is 1 wrong character in 1000, or about 1 error in every 200 words (roughly 1 error per page).
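
The arithmetic above can be sketched in a few lines of Python. The figure of 5 characters per word is implied by the numbers in the post (20 characters per error giving 4 words per error); the 250 words per page and the function name `error_spacing` are assumptions for illustration.

```python
def error_spacing(accuracy_pct: float, chars_per_word: float = 5.0,
                  words_per_page: float = 250.0) -> tuple[float, float, float]:
    """Return (characters per error, words per error, errors per page)."""
    error_rate = 1.0 - accuracy_pct / 100.0
    chars_per_error = 1.0 / error_rate
    words_per_error = chars_per_error / chars_per_word
    errors_per_page = words_per_page / words_per_error
    return chars_per_error, words_per_error, errors_per_page


for accuracy in (95.0, 99.9):
    chars, words, per_page = error_spacing(accuracy)
    print(f"{accuracy}%: one error every {chars:.0f} characters, "
          f"every {words:.0f} words, about {per_page:.2f} errors per page")
```

With these assumptions, 95% works out to roughly 60 errors on every page, while 99.9% is a little over one error per page.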
Old 09-29-2014, 06:29 AM   #9
JoHunt
1 character in 20? I think not. Anyway, 95% was just a guess; I rarely have more than 20 errors in an entire book. I just wanted to stress that I'm happy with the results (and the time saved) without comparing character for character.
Old 09-29-2014, 06:31 AM   #10
HarryT
Quote:
Originally Posted by joehunt View Post
1 character in 20? I think not. Anyway 95% was just a guess because I rarely have more than 20 errors in an entire book
Then you have an accuracy rate of well over 99.9%, which is what one would expect from a decent OCR program like Abbyy Finereader. 95% accuracy for OCR would be truly abysmal.
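
As a quick check of the "well over 99.9%" figure, here is a small Python sketch that turns an error count into a character accuracy. The book length (300 pages), 250 words per page, 5 characters per word, and the function name `accuracy_from_errors` are all assumptions for illustration; only the 20 errors per book comes from the thread.

```python
def accuracy_from_errors(errors: int, pages: int, words_per_page: int = 250,
                         chars_per_word: int = 5) -> float:
    """Character-level accuracy in percent for a book with the given number of errors."""
    total_chars = pages * words_per_page * chars_per_word
    return 100.0 * (1 - errors / total_chars)


# 20 errors in an assumed 300-page book.
print(f"{accuracy_from_errors(20, 300):.4f}%")
```

Under these assumptions, 20 errors in a whole book is an accuracy above 99.99%, which fits the reply above.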
Old 09-29-2014, 07:05 AM   #11
mrmikel
Color me gone
Posts: 2,089
Karma: 1445295
Join Date: Apr 2008
Location: Central Oregon Coast
Device: PRS-300
The main problem I see with super-high dpi is that every smudge, dot, etc. on the page becomes a character you have to get rid of. The only benefit might be for pictures, if there are lots of them and they are very high quality. I don't think that is the common situation for out-of-print books, and if you are reading on eink it is a colossal waste of time, since the screen resolution is so low.
Old 09-29-2014, 12:09 PM   #12
Ghitulescu
Fanatic
Posts: 563
Karma: 403106
Join Date: Aug 2014
Device: PRS-T1
I've seen a lot of scanned books in my life.
Frankly, I would rather type them by hand than correct their spelling mistakes and/or pagination.

I believe a lot of the people who answered are native English speakers. Well, any OCR software can be trained to recognize 26 letters, but for non-ASCII users (like Bangla above) the errors increase tenfold. With diacritics, scanning artifacts (like random black dots) may even create a new character.

A good example supporting my opinion can be found on archive.org. Compare the PDF (scanned, but with a text layer) and the EPUB files.
Old 09-29-2014, 09:04 PM   #13
Hitch
Bookmaker & Cat Slave
Posts: 11,462
Karma: 158448243
Join Date: Apr 2010
Location: Phoenix, AZ
Device: K2, iPad, KFire, PPW, Voyage, NookColor. 2 Droid, Oasis, Boox Note2
Quote:
Originally Posted by Ghitulescu View Post
I've seen a lot of scanned books in my life.
Frankly, I would rather type them by hand than to correct their spelling mistakes and/or paginations.

I believe a lot of the people that answered are English natives. Well, any OCR software can be trained to recognize 26 letters, but to non-ASCII users (like Bangla above) the errors a ten fold increased. For diacritics, it even be that scanning errors (like random black dots) may create a new character.

A good example of my opinion can be found in archive.org. Compare the PDF (scanned but a text layer) and the EPUB files.
Incompetent scanning and OCR will always result in poor-quality output. A good scanner, with competent OCR, can achieve a 99.995% accuracy rate. That's imperfect, but not bad. Of course, $1/scan and that ilk aren't going to give you 99.995%, because they're not running human A/B compares, which is, realistically, the only way to get to that level of quality. {shrug}.

I certainly would not consider typing a book instead of scanning it. No offense, but I find the idea crazy. Take a high-quality scan, a good A/B, run it through Toxaris' program, and you have a very, very high quality starting place.

The problem we see on these forums--all the time--is that nobody ever wants to do the "grunty" work of correcting the scanned material. Everybody wants a magic bullet. It doesn't exist.

Hitch
Old 09-30-2014, 06:41 AM   #14
mrmikel
I have seen it over and over again that a mistaken scan will produce perfectly plausible and grammatically correct, but wrong, output.
Old 09-30-2014, 06:51 AM   #15
HarryT
Quote:
Originally Posted by mrmikel View Post
I have seen it over and over again that a mistaken scan will produce perfectly plausible and grammatically correct, but wrong, output.
Precisely. Errors such as "dock" instead of "clock", "comer" instead of "corner", etc., are commonplace, and spell-checkers won't find them. The only way to find such errors (and I must politely disagree with Hitch's assertion that nobody does so) is to do a word-by-word manual comparison of the original document with the OCR'd text. This is extremely labour-intensive: I've had years of practice at it, and I reckon I can proof-read about 15 pages an hour with a typical novel, so that would be about 33 hours of work for a 500-page book.
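
A small Python sketch of one way to shortlist such words for the manual pass. It does not replace the word-by-word comparison: it only flags valid words that would turn into another valid word if a known OCR confusion were undone. The two-entry confusion table, the tiny vocabulary, and the function name `flag_suspects` are hypothetical; a real run would need a full word list and a longer table.

```python
import re

# Hypothetical, deliberately tiny table: (what OCR produced, what the page may have said).
CONFUSIONS = [("d", "cl"), ("m", "rn")]


def flag_suspects(text: str, vocabulary: set[str]) -> list[tuple[str, str]]:
    """Return (word_in_text, plausible_original) pairs that deserve a manual look."""
    suspects = []
    for word in re.findall(r"[a-z]+", text.lower()):
        for seen, original in CONFUSIONS:
            start = word.find(seen)
            while start != -1:
                # Undo the confusion at this position and see if a real word results.
                candidate = word[:start] + original + word[start + len(seen):]
                if candidate in vocabulary:
                    suspects.append((word, candidate))
                start = word.find(seen, start + 1)
    return suspects


# Hypothetical vocabulary; in practice this would be a dictionary word list.
vocabulary = {"the", "clock", "dock", "in", "corner", "comer", "struck", "ten"}
print(flag_suspects("The dock in the comer struck ten.", vocabulary))
```

A list like this will also flag every legitimate "dock" in a book about harbours, so each hit still has to be checked against the page.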