Illegible EPUB Text


tebo
08-05-2013, 12:19 AM
I have a problem where the text of some EPUBs is rendering incorrectly in Nook PC (as well as calibre E-book reader). I have experienced the same issue with books downloaded from B&N as well as Gutenberg.

As an example, take the book "A Greek-English lexicon of the New Testament: being Grimm's Wilke's Clavis ..."

From Content.opf:
Book digitized by Google and uploaded to the Internet Archive by user tpb.

From the HTML page metadata:
<meta content="abbyy to epub tool, v0.2" name="generator"/>
<meta content="application/xhtml+xml; charset=utf-8" http-equiv="Content-Type"/>

Example rendering:
Ιι 1ΐ35 κιίΓνϊνβϋ 1οη βηοιι^ ίοΓ Λε ςορ>τϊ1ιΙ Ιο οχρϊτο 3ΐΐ(3 ΐΗο Ιχ)ο1; Ιο οηΙΟΓ ΐΗο ριιΒΠς 1οπΐ3Ϊη. Α

B&N tells me that the file is corrupt, but I've seen this in many different EPUBs. It seems to be an issue with rendering Unicode characters (there is a mixture of Greek and English in the above example).

Any ideas?

Doitsu
08-05-2013, 02:05 AM
The garbage text is most likely caused by the automated OCR of non-Latin text. I'd recommend downloading this very similar PG Greek English NT lexicon (http://www.gutenberg.org/ebooks/40935) instead.
(In order to read this book on your Nook, you'll most likely have to embed a Greek font, e.g. Galatia SIL (http://scripts.sil.org/cms/scripts/page.php?item_id=GalatiaSIL), which you can embed automatically with Calibre or manually with Sigil (http://web.sigil.googlecode.com/git/files/OEBPS/Text/tutorial_embed_fonts.html).)
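If you want to check other files for the same problem, here is a rough heuristic sketch in Python (not something Calibre or Nook does; the function names are made up for illustration). It assumes that OCR noise shows up as single "words" that mix scripts or mix letters with digits, like "1ΐ35" in the sample above, and reports what fraction of the words look that way:

```python
import unicodedata


def script_of(ch):
    """Return a coarse script label for a letter, or None for non-letters."""
    if not ch.isalpha():
        return None
    name = unicodedata.name(ch, "")
    if name.startswith("GREEK"):
        return "greek"
    if name.startswith("LATIN"):
        return "latin"
    return "other"


def suspicious_ratio(text):
    """Fraction of whitespace-separated tokens that look like OCR noise.

    A token counts as suspicious if it mixes letters from more than one
    script, or mixes letters with digits. This is only a heuristic.
    """
    tokens = text.split()
    if not tokens:
        return 0.0
    bad = 0
    for tok in tokens:
        scripts = {script_of(c) for c in tok} - {None}
        has_digit = any(c.isdigit() for c in tok)
        if len(scripts) > 1 or (has_digit and scripts):
            bad += 1
    return bad / len(tokens)
```

Legitimate tokens such as "v0.2" or "3rd" are flagged too, so the number is only useful for comparison: a cleanly proofread book should score close to zero, while raw OCR of Greek text should score noticeably higher.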

tebo
08-06-2013, 01:01 AM
Thank you Doitsu. Your answer is the first meaningful answer I have been given. It is clear that you took great care to understand my problem.

I believe you are correct about the cause of the garbage text in the ebooks, since Google has removed them from the Google Books site.

I had previously downloaded your suggestion from PG, and installing Galatia SIL fixed the font issue I was having. However, there are still problems with line feeds, so I may just convert the UTF-8 version.

Thanks again.

Tex2002ans
08-06-2013, 06:27 AM
As Doitsu mentioned, if the text is outside the Latin character set, the OCR is likely to be of much lower quality.

[...] Book digitized by Google and uploaded to the Internet Archive by user tpb.[...]

The text versions generated by Archive.org (and Google.com) are usually quite poor. All that happens on their end is that the scans of the book are automatically fed through OCR, and the text output is run through some templates to plop it into different formats (EPUB, Kindle, plain TXT, ...).
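One hint that a file came out of such a pipeline is the generator meta tag quoted in the first post ("abbyy to epub tool"). Here is a rough Python sketch for listing the generator strings in an EPUB. It assumes the EPUB is an ordinary ZIP archive and that the tag is written with double-quoted attributes like the one quoted; the function names are hypothetical, and a regex is used instead of a real XML parser to keep it short:

```python
import re
import zipfile

# Find every <meta ...> tag, then inspect its attributes separately so the
# order of name= and content= inside the tag does not matter.
META_TAG = re.compile(r"<meta\b[^>]*>", re.IGNORECASE)
NAME_IS_GENERATOR = re.compile(r'name\s*=\s*"generator"', re.IGNORECASE)
CONTENT_ATTR = re.compile(r'content\s*=\s*"([^"]*)"', re.IGNORECASE)


def find_generators(markup):
    """Return the content values of all generator meta tags in a string."""
    found = []
    for tag in META_TAG.findall(markup):
        if NAME_IS_GENERATOR.search(tag):
            match = CONTENT_ATTR.search(tag)
            if match:
                found.append(match.group(1))
    return found


def epub_generators(path):
    """Collect generator strings from the HTML/XHTML files inside an EPUB."""
    generators = set()
    with zipfile.ZipFile(path) as zf:
        for name in zf.namelist():
            if name.lower().endswith((".html", ".xhtml", ".htm")):
                text = zf.read(name).decode("utf-8", errors="replace")
                generators.update(find_generators(text))
    return generators
```

Calling epub_generators("book.epub") (the file name is a placeholder) returns a set of generator strings. An automated generator string is only a hint that nobody proofread the text, not proof that the text is bad.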

Factor in markings, scanning artifacts, water damage, and aging of the book, and the automatic OCR gets even worse.

Converting images to text is an incredibly hard problem for algorithms to get right without lots of human assistance.

Project Gutenberg books are fed through multiple rounds of human-assisted checking and editing to get as accurate a conversion as possible. So if possible, look to Project Gutenberg first.

A lot more information on Project Gutenberg's process can be found here:

http://www.pgdp.net/c/faq/ProoferFAQ.php