10-25-2011, 10:11 AM   #8
MrTeatime
Connoisseur
Posts: 61
Karma: 513276
Join Date: Mar 2011
Device: Sony PRS-350
I love mine, and have scanned several thousand(!) pages with it. I knocked up a quick sample page here (with OCR output from Google's open-source "Tesseract" OCR library), with some hints on how to get good, clean scans, including pages where the text is quite near the outer margins (skip down to the "Update!" section):

http://etotheipiplusone.com/vupoint-...r/samples.html

I can generally scan around 200-300 pages per hour, and have never resorted to cutting up or otherwise damaging a book (although one book, with text very close to the inner margins, did need to be "stretched" open a bit, which creased the spine).

Incidentally, what are people using for proofreading scans? I'm currently working on a practice programming project (codenamed "OcrFixer", but ultimately to be called "Qutenberg") aimed at helping with my specific workflow. It will be open-source, cross-platform and scriptable, and will use Tesseract OCR (though I may make the OCR engine pluggable). Planned features: auto-paragraphisation based on indentation [which, in my proof-of-concept tests, works pleasingly well!]; an optional "server" mode for real-time collaborative editing; smart-quoting; HTML/ePub export; OCR'd PDF import; listing and highlighting of spelling errors and other warnings [mis-split words at the end of lines; common OCR errors like 'retumed' instead of 'returned', etc.]; "blessing" regions that you are certain are correct, so they don't show up in the warning list and you can quickly see you don't need to read them again; infinite undo; semantic markup (this is a chapter heading; this is a subchapter heading; this is an ordinary paragraph; this is a paragraph that should be styled as a letter, telegram or whatever; etc.); and spinning off e.g. footnotes into a separate "sub-document" with hyperlinks back and forth.

If anyone would find such a beast handy, then it might light a fire under my arse and stop me procrastinating. If there's already something out there that does all this and more, though, no problem: I'll do it anyway since, as I said, it's a good practice project.

A really, really, really early screenshot is here to show what the basic layout will look like; the "Problems/ Warnings" etc window will probably be added as a dock:

http://etotheipiplusone.com/vupoint-...ally_early.png

The paragraphing was auto-generated based on indentation in the source image. Headers and font stylings (bold, italics, etc.) will ultimately be displayed in the text area in the bottom half, and styled paragraphs (e.g. letters) will be shown with the approximate style (margins, text alignment, etc.). "Blessed" text will probably have a grey background instead of white. The yellow wavy lines are one kind of Problem/Warning: words where Tesseract's uncertainty exceeds a configurable limit.
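
The uncertainty warning boils down to a simple threshold check, sketched here in Python. This is only an illustration: it assumes the OCR step has already produced `(word, confidence)` pairs with confidence on a 0-100 scale, and the name `flag_uncertain` and the `max_uncertainty` default are made up. How the confidences are actually obtained from Tesseract depends on the version and wrapper in use, so that part is left out.

```python
def flag_uncertain(words, max_uncertainty=25.0):
    """Return (index, word, confidence) for words that need a second look.

    words           -- list of (word, confidence) pairs, confidence 0-100
                       (hypothetical input format)
    max_uncertainty -- the configurable limit; a word is flagged when
                       100 - confidence exceeds it (assumed default)
    """
    flagged = []
    for index, (word, confidence) in enumerate(words):
        uncertainty = 100.0 - confidence
        if uncertainty > max_uncertainty:
            flagged.append((index, word, confidence))
    return flagged


ocr_words = [("He", 96.0), ("retumed", 58.5), ("home", 91.2)]
for index, word, confidence in flag_uncertain(ocr_words):
    print(f"word {index}: {word!r} (confidence {confidence})")
```

Returning the word index along with the word makes it easy for the editor to know where to draw the wavy underline, and "blessed" regions could simply be filtered out of the result before display.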

This is all, of course, several months away.