08-06-2013, 05:27 AM   #4
Tex2002ans
As Doitsu mentioned, if the text is outside the Latin character set, the OCR is likely to be of much lower quality.

Quote:
Originally Posted by tebo View Post
[...] Book digitized by Google and uploaded to the Internet Archive by user tpb.[...]

The text versions generated by Archive.org (and Google.com) are usually quite poor. All that happens on their end is that the scans of the book are automatically fed through OCR, and the text output is run through some templates to plop it into different formats (EPUB, Kindle, plain TXT, ...). Nobody checks the text in between.
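
To illustrate why the errors end up in every format, here is a rough Python sketch of that kind of "OCR text in, templates out" step. The templates and the function name render_formats are made up for illustration and are not what Archive.org or Google actually run; the point is only that nothing between the OCR and the output corrects anything.

```python
from html import escape
from string import Template

# Hypothetical templates; a real conversion pipeline is far more elaborate.
TXT_TEMPLATE = Template("$title\n\n$body\n")
HTML_TEMPLATE = Template(
    "<html><head><title>$title</title></head>"
    "<body><h1>$title</h1><p>$body</p></body></html>"
)


def render_formats(title, ocr_text):
    """Pour raw OCR text into each output template without correcting it."""
    return {
        "txt": TXT_TEMPLATE.substitute(title=title, body=ocr_text),
        "html": HTML_TEMPLATE.substitute(
            title=escape(title), body=escape(ocr_text)
        ),
    }


# "tbe" for "the" and "rn" for "m" are typical OCR misreadings.
outputs = render_formats("Sample Book", "It was tbe best of tirnes")
for name, content in outputs.items():
    print(name, "->", content)
```

Whatever the OCR got wrong ("tbe", "tirnes") shows up unchanged in both outputs, which is why the EPUB, Kindle, and TXT downloads all share the same mistakes.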

Once you take into account markings, scanning artifacts, water damage, and aging of the book, the automatic OCR becomes even worse.

Converting images to text is an incredibly hard problem for algorithms to get right without a lot of human assistance.

Project Gutenberg books are fed through multiple rounds of human-assisted checking and editing, to get as accurate a conversion as possible. So if you can, look to Project Gutenberg first.
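
As a small taste of what "human-assisted" can mean in practice, here is a minimal Python sketch that flags words a person should look at. The heuristic (a word mixing letters and digits, like "w0rld") and the function name flag_suspects are my own assumptions for illustration, not a description of the actual Project Gutenberg tooling. It only points at suspects; a human still decides what the scan really says.

```python
import re

# Assumed heuristic: a token that mixes ASCII letters and digits is often
# an OCR misread (0 for o, 1 for l). It also catches legitimate tokens
# such as "1st", so every hit still needs a human decision.
MIXED = re.compile(r"\b(?=\w*[A-Za-z])(?=\w*\d)\w+")


def flag_suspects(text):
    """Return (line_number, token) pairs for a proofreader to check."""
    return [
        (line_no, token)
        for line_no, line in enumerate(text.splitlines(), start=1)
        for token in MIXED.findall(line)
    ]


sample = "He had 3 sons.\nThe w0rld was c1ear in 1850."
suspects = flag_suspects(sample)
for line_no, token in suspects:
    print(f"line {line_no}: check '{token}' against the page scan")
```

Plain numbers like "3" and "1850" are left alone, while "w0rld" and "c1ear" on line 2 get flagged for review.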

A lot more information on the proofreading process (handled by Distributed Proofreaders, which prepares books for Project Gutenberg) can be found here:

http://www.pgdp.net/c/faq/ProoferFAQ.php

Last edited by Tex2002ans; 08-06-2013 at 05:33 AM.