Illegible EPUB Text

tebo · 08-04-2013, 11:19 PM

I have a problem where the text of some EPUBs is rendering incorrectly in Nook PC (as well as in the calibre E-book reader). I have experienced the same issue with books downloaded from B&N as well as Gutenberg.

As an example from the book "A Greek-English lexicon of the New Testament: being Grimm's Wilke's Clavis ..."

From Content.opf:
Book digitized by Google and uploaded to the Internet Archive by user tpb.

From the HTML page metadata:
<meta content="abbyy to epub tool, v0.2" name="generator"/>
<meta content="application/xhtml+xml; charset=utf-8" http-equiv="Content-Type"/>

Example rendering:
Ιι 1ΐ35 κιίΓνϊνβϋ 1οη§ βηοιι^ ίοΓ Λε ςορ>τϊ§1ιΙ Ιο οχρϊτο 3ΐΐ(3 ΐΗο Ιχ)ο1; Ιο οηΙΟΓ ΐΗο ριιΒΠς »1οπΐ3Ϊη. Α

B&N tells me that the file is corrupt, but I've seen this in many different EPUBs. It seems to be an issue with rendering Unicode characters (there is a mixture of Greek and English in the above example).

Any ideas?

Doitsu · 08-05-2013, 01:05 AM

The garbage text is most likely caused by the automated OCR of non-Latin text. I'd recommend downloading this very similar Project Gutenberg (PG) Greek-English NT lexicon instead.
(In order to read this book on your Nook, you'll most likely have to embed a Greek font, e.g. Galatia SIL, which you can embed automatically with Calibre or manually with Sigil.)
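
One quick way to see whether an EPUB already carries an embedded font is to list the font files inside it, since an EPUB is a ZIP archive. The following is only a minimal sketch: the function name `embedded_fonts` and the list of extensions are assumptions made for illustration, and a font file being present does not prove that the book's CSS actually references it.

```python
import zipfile

# File extensions commonly used for embedded fonts (an assumed, non-exhaustive list).
FONT_EXTENSIONS = (".ttf", ".otf", ".woff", ".woff2")

def embedded_fonts(epub):
    """Return the names of font files stored inside an EPUB.

    `epub` may be a file path or a binary file-like object; an EPUB is a
    ZIP archive, so the standard zipfile module can read it.
    """
    with zipfile.ZipFile(epub) as archive:
        return [name for name in archive.namelist()
                if name.lower().endswith(FONT_EXTENSIONS)]
```

An empty result suggests that no font is embedded, in which case Calibre or Sigil can be used to add one as described above.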

tebo · 08-06-2013, 12:01 AM

Thank you Doitsu. Your answer is the first meaningful answer I have been given. It is clear that you took great care to understand my problem.

I believe you are correct about the cause of the garbage text in the Ebooks, since Google has removed them from the Google Books site.

I had previously downloaded your suggestion from PG, and installing Galatia SIL fixed the font issue I was having. However, there are still problems with line feeds, so I may just convert the UTF-8 version.

Thanks again.

Tex2002ans · 08-06-2013, 05:27 AM

As Doitsu mentioned, if the text is outside of the Latin character set, the OCR is likely to be of much lower quality.

Quote:

Originally Posted by tebo

[...] Book digitized by Google and uploaded to the Internet Archive by user tpb.[...]

The text versions generated by Archive.org (and Google.com) are usually quite poor. All that happens on their end is that the scans of the book are automatically fed through OCR, and the text output is run through some templates to plop it into different formats (EPUB, Kindle, plain TXT, ...).

Then you take into account markings/scanning artifacts/water damage/aging of the book, and the automatic OCR becomes even worse.

Images -> Text is an incredibly hard thing for algorithms to get right without lots of human assistance.
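
A rough way to check whether a file contains OCR garbage rather than a rendering fault is to look for words that mix Greek letters with Latin letters or digits, as in the sample from the first post. This is only a heuristic sketch: the function names `script_of` and `suspicious_tokens` are made up for illustration, and some legitimate text (a Greek ordinal written with a digit, for example) would be flagged too.

```python
import unicodedata

def script_of(ch):
    """Classify a character as 'greek', 'latin', or 'other' by its Unicode name."""
    if not ch.isalpha():
        return "other"
    name = unicodedata.name(ch, "")
    if name.startswith("GREEK"):
        return "greek"
    if name.startswith("LATIN"):
        return "latin"
    return "other"

def suspicious_tokens(text):
    """Return whitespace-separated tokens that mix Greek letters with
    Latin letters or digits, which is typical of English text that was
    misread by OCR as Greek."""
    flagged = []
    for token in text.split():
        scripts = {script_of(ch) for ch in token}
        has_digit = any(ch.isdigit() for ch in token)
        if "greek" in scripts and ("latin" in scripts or has_digit):
            flagged.append(token)
    return flagged
```

If a large share of the words on a page are flagged, the stored text itself is probably wrong, and no font or reader setting will fix it.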

Project Gutenberg books are fed through multiple rounds of human-assisted checking/editing, to try to get as accurate a conversion as possible. So if possible, try Project Gutenberg first.

A lot more information on Project Gutenberg's process can be found here:

http://www.pgdp.net/c/faq/ProoferFAQ.php
