Old 03-22-2014, 09:37 AM   #16
Tex2002ans
Wizard
Posts: 2,306
Karma: 13057279
Join Date: Jul 2012
Device: Kobo Forma, Nook
Quote:
Originally Posted by HarryT View Post
I find it difficult to believe that you can "completely proof" a book in 8-15h. "Completely proof" to me means comparing the ebook to the original source material comma by comma, letter by letter, word by word, line by line. I've been proof-reading books for many years and can't manage more than 10 pages an hour, or some 40h for a 400 page book.
Ok, perhaps if we are using your definition of "complete proof". (character by character A/B compare, ... there is just no economically feasible way to do this).

I mean going through multiple thorough rounds of successive Formatting/Quality Checking... Applying/searching different fixes each round (Spellcheck, ligatures/accented characters, consistent hyphenation, punctuation errors, inconsistent spelling, etc. etc.)

Feel free to look at any of my EPUBs and let me know of errors. While they are probably not "100%" error-free, the number of errors can probably be counted on one hand.

You soon reach a point of diminishing returns. You can spend ~8-15 hours (average for my genre of book) to whittle it down to a handful of errors... and then you can spend about 40 more hours doing a character-by-character check to catch that final handful of errors... or I could have spent that time converting about 2-5 more books QUITE accurately.
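
As a rough check on the "2-5 more books" figure: dividing the 40 extra hours by the 8-15 hours per book quoted above gives about 2.7 to 5 books. A minimal sketch of that arithmetic (the function name is made up for illustration):

```python
def books_in_extra_time(extra_hours, hours_per_book):
    """How many books could be converted in the time of one extra proofing pass."""
    return extra_hours / hours_per_book


# 40 extra hours, at the slow (15 h) and fast (8 h) ends of the quoted range.
slow_case = books_in_extra_time(40, 15)
fast_case = books_in_extra_time(40, 8)
print(f"{slow_case:.1f} to {fast_case:.1f} books")
```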

Side Note: Should we count actual typos/errors fixed from the original book as negative?

Last edited by Tex2002ans; 03-22-2014 at 09:46 AM.
Old 03-22-2014, 09:45 AM   #17
HarryT
eBook Enthusiast
Posts: 85,544
Karma: 93383099
Join Date: Nov 2006
Location: UK
Device: Kindle Oasis 2, iPad Pro 10.5", iPhone 6
Quote:
Originally Posted by Tex2002ans View Post
Ok, perhaps if we are using your definition of "complete proof". (character by character A/B compare, ... there is just no economically feasible way to do this).

I mean going through multiple thorough rounds of successive Formatting/Quality Checking... Applying/searching different fixes each round (Spellcheck, ligatures/accented characters, consistent hyphenation, punctuation errors, inconsistent spelling, etc. etc.)

Feel free to look at any of my EPUBs and let me know of errors. While they are probably not "100%" error-free, the number of errors can probably be counted on one hand.
Your method will certainly produce a good reading copy. What it won't do - and this is very important - is find missing text. You'd probably be surprised how many books I've proof-read where the scanner has missed text at the top or the bottom of the page, or even completely missed out a double page of text, and it's not always at all obvious from simply reading the text that this has taken place. That's why it's so important to compare to the original, and not simply take the OCR'd text in isolation.
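
One cheap way to point a human at pages worth comparing against the original is to flag pages whose OCR output is much shorter than usual. This is only a heuristic sketch: the per-page text list and the 0.6 threshold are assumptions, chapter endings will trigger false alarms, and it cannot notice a page that was skipped entirely, so it supplements the comparison rather than replacing it.

```python
from statistics import median


def flag_short_pages(page_texts, ratio=0.6):
    """Return 1-based indexes of pages with far fewer words than the median page.

    page_texts: list of strings, one per scanned page, in scan order.
    ratio: hypothetical threshold; a page is flagged when its word count
           is below ratio * median word count.
    """
    if not page_texts:
        return []
    counts = [len(text.split()) for text in page_texts]
    typical = median(counts)
    return [
        index
        for index, count in enumerate(counts, start=1)
        if count < ratio * typical
    ]


# Made-up example: page 3 lost roughly half of its text.
pages = ["word " * 300, "word " * 310, "word " * 150, "word " * 295]
suspects = flag_short_pages(pages)
```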

Last edited by HarryT; 03-22-2014 at 09:50 AM.
Old 03-22-2014, 10:02 AM   #18
Tex2002ans
Wizard
Posts: 2,306
Karma: 13057279
Join Date: Jul 2012
Device: Kobo Forma, Nook
Quote:
Originally Posted by AJ Starr View Post
I have been looking at a portable double sided ASF scanner, a Brother 720D, lately, but haven't purchased it yet. It would necessitate taking my pbs apart and scanning page by page. Does anyone have one?
If you don't mind destroying your books... I would go the feed scanner route over a typical scanner.

It is MUCH faster (and the important thing is you can go do other things while it scans). They do get jammed up every once in a while (especially if you cut the binding, there might be glue, or "flakes" of paper sticking together), so I would still be in the vicinity just in case something goes awry.

You should also double-check that all of the page numbers are in the correct order, and that no pages got stuck to each other while going through.

We typically fed them in piles of 15-20 or so pages (if I recall correctly; it has been a few years since I scanned a book), and then we would double-check that the output was correct before feeding the next pile of pages. Doing it in batches like that will save you headaches in the long run.
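
The per-batch check can be sketched in a few lines, assuming you already have the printed page number of each scanned image in scan order (typed in by hand or pulled from the OCR output; how you get them is not covered here). The function name is hypothetical.

```python
def check_page_sequence(page_numbers):
    """Check one batch of scanned pages.

    page_numbers: printed page numbers in the order the pages were scanned.
    Returns (missing, duplicates, out_of_order) as three lists.
    """
    seen = set()
    duplicates = []
    out_of_order = []
    previous = None
    for number in page_numbers:
        if number in seen:
            duplicates.append(number)      # same page scanned twice
        seen.add(number)
        if previous is not None and number < previous:
            out_of_order.append(number)    # page appears later than it should
        previous = number

    missing = []
    if page_numbers:
        # Any number between the first and last page of the batch that never
        # showed up probably stuck to its neighbour in the feeder.
        expected = range(min(page_numbers), max(page_numbers) + 1)
        missing = [number for number in expected if number not in seen]
    return missing, duplicates, out_of_order


# Made-up batch: page 7 is missing, page 4 is misplaced, page 8 scanned twice.
missing, duplicates, out_of_order = check_page_sequence([1, 2, 3, 5, 6, 4, 8, 8])
```

Running this after each pile of 15-20 pages keeps any problem confined to a small, easy-to-rescan batch.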

Quote:
Originally Posted by HarryT View Post
Your method will certainly produce a good reading copy. What it won't do - and this is very important - is find missing text. You'd probably be surprised how many books I've proof-read where the scanner has missed text at the top or the bottom of the page, or even completely missed out a double page of text, and it's not always at all obvious from simply reading the text that this has taken place.
Yes yes, that is another thing to keep in mind when OCRing. The OCR programs typically compute a "box" around the text, and sometimes they mess up badly for who knows why.

Usually this is noticed during one of the "multiple passes". Chances are VERY HIGH that you catch a missing chunk while double-checking/looking for all those other common errors. I am typically flip-flopping back and forth between the EPUB and FineReader, double-checking spelling, hyphenation, missing punctuation, "is this an actual error/typo", and things like that.

Of course, there can always be that freak perfect storm!

I should actually jot that one down in my notes though, thanks for bringing it up HarryT.

Last edited by Tex2002ans; 03-22-2014 at 10:04 AM.
Old 03-22-2014, 10:20 AM   #19
HarryT
eBook Enthusiast
Posts: 85,544
Karma: 93383099
Join Date: Nov 2006
Location: UK
Device: Kindle Oasis 2, iPad Pro 10.5", iPhone 6
Quote:
Originally Posted by Tex2002ans View Post
It is MUCH faster (and the important thing is you can go do other things while it scans). They do get jammed up every once in a while (especially if you cut the binding, there might be glue, or "flakes" of paper sticking together), so I would still be in the vicinity just in case something goes awry.
To avoid that happening, best thing to do is to find a local printer with a powered guillotine and get them to cut the spine of the book off, rather than just tearing the pages out. My local printer is happy to do this for free (although I do buy stationery supplies from him, which probably helps!).
Old 03-22-2014, 01:12 PM   #20
Hamlet53
Nameless Being
 
Over the past few months I have been digitizing many of my old books. I use a setup similar to what the fellow in the attached video uses to remove the binding and yield uniform size and smoothly cut pages. For those who do not have a professional service to do it for them and who do not want to try and cut a few pages at a time this works well if you have the equipment. I also purchased an auto-feed scanner that came with very good OCR software; that only set me back about $250. I can scan and OCR about 10 pages a minute. Proofing is definitely required to catch missed text, errors in character reads (interspersed italics are a particular problem), and get paragraph breaks all correct. I can proof about 20-40 pages in an hour, depending on how much text is on a page. I find that the larger the font the more accurate the OCR process is.
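
To put those rates side by side, here is a small sketch that estimates scanning and proofing time. The 400-page book length is an assumption for illustration (it matches the example earlier in the thread); the rates are the ones quoted above.

```python
def digitizing_hours(pages, scan_pages_per_minute, proof_pages_per_hour):
    """Return (scan_hours, proof_hours) for one book."""
    scan_hours = pages / scan_pages_per_minute / 60
    proof_hours = pages / proof_pages_per_hour
    return scan_hours, proof_hours


# Hypothetical 400-page book, scanned at 10 pages/minute.
for proof_rate in (20, 40):
    scan_h, proof_h = digitizing_hours(400, 10, proof_rate)
    print(f"proofing at {proof_rate} pages/h: scan {scan_h:.2f} h, proof {proof_h:.1f} h")
```

Under these assumptions the scanning takes well under an hour, while proofing takes 10 to 20 hours, so the proofing rate dominates the total.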

Old 03-23-2014, 12:59 AM   #21
Tex2002ans
Wizard
Posts: 2,306
Karma: 13057279
Join Date: Jul 2012
Device: Kobo Forma, Nook
Quote:
Originally Posted by Hamlet53 View Post
Over the past few months I have been digitizing many of my old books. I use a setup similar to what the fellow in the attached video uses to remove the binding and yield uniform size and smoothly cut pages.
Wow, that double-sided feed scanner seems NICE. The one that I used was single-sided, so we had to run the pages through the other way as well, taking double the time.

If I were doing book scanning seriously, and on more of a mass scale, I would definitely invest more money up front in a double-sided scanner.

Quote:
Originally Posted by Hamlet53 View Post
I find that the larger the font the more accurate the OCR process is.
It doesn't depend on font size so much as on how "crisp" the image is (the DPI, how good the lighting was, how good the scanning hardware is, how good the source material is, ...). A whole bunch of different variables are at play... and as I mentioned in one of the other posts, a scan can "look fine" to the human eye but go horribly wrong when OCRed.

Also keep in mind that writing, highlighting, and other markings will severely lower the speed/accuracy of the OCR (people who write in books MUST BE DESTROYED).

We also had a lot of discussion in this topic (about digitizing/OCRing math books): https://www.mobileread.com/forums/sho...d.php?t=228413

See my Post #16 showing off a few real-life examples of some of the worst markings I have run across: https://www.mobileread.com/forums/sho...2&postcount=16

Also, back to the different OCR programs... There is also a free, open-source OCR engine called Tesseract (originally developed at HP, with development later sponsored by Google): https://en.wikipedia.org/wiki/Tesseract_%28software%29
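
A minimal sketch of driving Tesseract from a script, assuming the `tesseract` command-line program is installed and on the PATH. The basic invocation takes an input image, an output base name (Tesseract appends `.txt`), and a language code via `-l`; the file names and helper functions here are hypothetical.

```python
import subprocess
from pathlib import Path


def tesseract_command(image_path, output_base, lang="eng"):
    """Build the command line: tesseract <image> <output base> -l <lang>."""
    return ["tesseract", str(image_path), str(output_base), "-l", lang]


def ocr_page(image_path, output_base, lang="eng"):
    """Run Tesseract on one page image and return the recognized text.

    Requires the tesseract binary; raises CalledProcessError if it fails.
    """
    subprocess.run(tesseract_command(image_path, output_base, lang), check=True)
    return Path(str(output_base) + ".txt").read_text(encoding="utf-8")


# Example (not executed here): text = ocr_page("page_001.png", "page_001")
```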
Old 03-26-2014, 12:59 AM   #22
qsipl
Enthusiast
Posts: 25
Karma: 412584
Join Date: Feb 2014
Device: IPAD, KF8 & Tablet
Compare extracted text

Thank you.

This discussion about OCR engines really helps us.

Can anyone suggest software for comparing extracted text (from ABBYY FineReader) with the original PDF, other than PDF compare?

Regards,
Qsipl
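
One scriptable option, offered as a sketch rather than a product recommendation: once both sides exist as plain text, Python's standard `difflib` module can list word-level differences. Getting the text out of the original PDF is assumed to have been done already; the function name and sample strings below are made up.

```python
import difflib


def word_differences(reference_text, ocr_text):
    """List word-level differences between a reference text and OCR output.

    Returns tuples of (operation, reference words, OCR words), where
    operation is 'replace', 'delete' (missing from OCR) or 'insert'.
    Splitting on whitespace means line-wrap differences are ignored.
    """
    reference_words = reference_text.split()
    ocr_words = ocr_text.split()
    matcher = difflib.SequenceMatcher(None, reference_words, ocr_words, autojunk=False)
    differences = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            differences.append(
                (tag, " ".join(reference_words[i1:i2]), " ".join(ocr_words[j1:j2]))
            )
    return differences


reference = "The quick brown fox jumps over the lazy dog"
ocr_output = "The quick brovvn fox over the\nlazy dog"
for operation, expected, found in word_differences(reference, ocr_output):
    print(operation, repr(expected), repr(found))
```

A 'delete' entry with a long run of reference words is the signature of text the scanner or OCR dropped.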
Old 03-26-2014, 02:07 AM   #23
chainring
Addict
Posts: 210
Karma: 1000659
Join Date: Jan 2009
Location: Sunnyvale, CA
Device: Kindle Voyage, Kobo Aura H2O, PRS-650 (black), Kindle 3G
ABBYY FineReader is probably the best for accuracy. I use Acrobat XI Pro and scan with a Canon DR-2050C.

http://www.amazon.com/Canon-DR-2050C...words=dr-2050c

I recommended a Canon DR-C125 to a friend, set it up for her, and she LOVES it. It is speedy, scans duplex at around 20 pages per minute, and has a cool U-turn paper path to minimize the desk space taken. I also really like that it uses TWAIN drivers, so any app with a scanning/capture interface can hook into the scanner.

http://www.amazon.com/Canon-imageFOR...QHK536Y1SJV72W
Old 03-30-2014, 10:42 PM   #24
tempura
Connoisseur
Posts: 71
Karma: 47102
Join Date: Dec 2013
Location: Outside the Universe
Device: Kindle PW3
Well if you are looking for an open source alternative then you could try "Tesseract".

Also, a suggestion: after scanning, run the scans through "Scan Tailor", which is really good at producing clean output from a scan, and only then run the OCR.

Last edited by tempura; 03-31-2014 at 12:36 AM.
Old 03-30-2014, 11:10 PM   #25
Marcy
Guru
Posts: 897
Karma: 950683
Join Date: Oct 2009
Device: Kobo Libra2
I don't have the patience of those of you who flatbed scan page by page. We don't have television, so I can't watch while scanning.

I'm thinking of trying to just razor apart my paperbacks and put them through my ADF on my work machine.

I have an old favorite that I'd desperately like as an ebook and a new book I bought a few weeks ago that is printed in such an absurdly small font that it is unreadable for me. I'll have to pick up a few cheapie used paperbacks and test out how well this works, especially as the new one was $20 and I don't want to destroy it for nothing. I don't mind the tedium of proof-reading even a not-so-great OCR copy, but couldn't bear scanning manually having to turn the page and press down the book.
Old 03-30-2014, 11:21 PM   #26
rkomar
Wizard
Posts: 3,054
Karma: 18821071
Join Date: Oct 2010
Location: Sudbury, ON, Canada
Device: PRS-505, PB 902, PRS-T1, PB 623, PB 840, PB 633
Quote:
Originally Posted by Marcy View Post
... I don't mind the tedium of proof-reading even a not-so-great OCR copy, but couldn't bear scanning manually having to turn the page and press down the book.
I'm of the opposite opinion. Scanning is tedious but easy; proof reading is tedious and hard work, and takes a lot longer. Too bad we couldn't split the work!
Old 04-02-2014, 11:26 PM   #27
Kumabjorn
Basculocolpic
Posts: 4,356
Karma: 20181319
Join Date: Jul 2010
Location: Sweden
Device: Kindle 3 WiFi, Kindle 4SO, Kindle for Android, Sony PRS-350 and PRS-T1
Quote:
Originally Posted by Marcy View Post
I don't have the patience of those of you who flatbed scan page by page. We don't have television, so I can't watch while scanning.

I'm thinking of trying to just razor apart my paperbacks and put them through my ADF on my work machine.

I have an old favorite that I'd desperately like as an ebook and a new book I bought a few weeks ago that is printed in such an absurdly small font that it is unreadable for me. I'll have to pick up a few cheapie used paperbacks and test out how well this works, especially as the new one was $20 and I don't want to destroy it for nothing. I don't mind the tedium of proof-reading even a not-so-great OCR copy, but couldn't bear scanning manually having to turn the page and press down the book.
That is what we have boyfriends for.
Old 04-03-2014, 01:49 PM   #28
alanHd
Addict
Posts: 374
Karma: 1408579
Join Date: Jul 2012
Location: UK
Device: Kindle Touch, Ipod Touch, Ipad Air
I have just finished my first ever scan of a book on my flatbed scanner. I set the timer between scans so I didn't have to keep pressing the button, but by the end of the book I was losing the will to live.
Old 04-03-2014, 06:05 PM   #29
Hamlet53
Nameless Being
 
Quote:
Originally Posted by alanHd View Post
I have just finished my first ever scan of a book on my flatbed scanner. I set the timer between scans so I didn't have to keep pressing the button, but by the end of the book I was losing the will to live.
I made my first attempt at digitizing a book using a flatbed scanner. I really can't imagine anyone doing it that way, especially while trying to leave the book intact. You have to get a scan that will yield even a reasonably accurate result from the OCR process while pressing down on the book and, at the same time, making sure that it is correctly aligned. How anyone could manage this at any reasonable production rate is beyond me.
Old 04-03-2014, 06:49 PM   #30
markom
Banned
Posts: 488
Karma: 1080260
Join Date: Sep 2012
Device: sony prs t1 kindle dx ipad
Quote:
Originally Posted by Hamlet53 View Post
I made my first attempt at digitizing a book using a flatbed scanner. I really can't imagine anyone doing it that way, especially while trying to leave the book intact. You have to get a scan that will yield even a reasonably accurate result from the OCR process while pressing down on the book and, at the same time, making sure that it is correctly aligned. How anyone could manage this at any reasonable production rate is beyond me.
I scan at 3-4 passes/minute at 300 dpi, grayscale or color (6-8 pages/minute if both facing pages fit on the glass), on a Canon 9000 F without any problems, listening to music or even watching films if the book is A5 format or smaller.
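
For a sense of what 300 dpi means in practice, here is a minimal sketch that converts a page size to pixel dimensions and an uncompressed image size. The A5 dimensions (148 x 210 mm) are the standard ISO size; the function names and the assumption of 8 bits per channel are mine, and real files will be smaller once compressed.

```python
MM_PER_INCH = 25.4


def pixel_size(width_mm, height_mm, dpi):
    """Pixel dimensions of a page scanned at the given resolution."""
    width_px = round(width_mm / MM_PER_INCH * dpi)
    height_px = round(height_mm / MM_PER_INCH * dpi)
    return width_px, height_px


def raw_bytes(width_mm, height_mm, dpi, channels=1):
    """Uncompressed size, assuming 8 bits per channel (1 = grayscale, 3 = color)."""
    width_px, height_px = pixel_size(width_mm, height_mm, dpi)
    return width_px * height_px * channels


# One A5 page at 300 dpi.
width_px, height_px = pixel_size(148, 210, 300)
gray_megabytes = raw_bytes(148, 210, 300) / 1_000_000
color_megabytes = raw_bytes(148, 210, 300, channels=3) / 1_000_000
print(f"{width_px} x {height_px} px, {gray_megabytes:.1f} MB gray, {color_megabytes:.1f} MB color")
```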

I usually scan one or two books per month and was able to scan at that rate (6-8 pages/min) after just a couple of books scanned.

For correct alignment I rely completely on the scanner's raised edges, sliding the book against them as far as it goes. I already know approximately where the center of the book (the spine) should meet the raised edge, because I put some adhesive tape there to mark the place.

I never press down on the spine too hard and never lower the scanner's lid (it's always up). I usually scan in a dark room (the only light coming from the computer screen) and always click the mouse manually to scan the current page (with the mouse pointer centered on the scan button). I lift the book and flip to the next page the moment the CCD mechanism starts coming back after a scan, so that by the time the returning CCD mechanism stops, my book is usually already in place for the next scan.

After every 20-30 pages I use a soft cloth (usually whatever T-shirt is at hand) to quickly clean the glass of any hairs, dust particles, etc.

I don't care much about OCR precision though, because I always use a PDF with the OCR layer in the background (exact image in ABBYY or ClearScan in Acrobat).

There are also affordable contactless scanners (document cameras), for those who would like to save their books from being cut up for automatic document feeders.

https://www.mobileread.com/forums/sho...hlight=scanner

Last edited by markom; 04-03-2014 at 09:04 PM.
All times are GMT -4.