08-04-2013, 11:19 PM | #1 |
Junior Member
Posts: 2
Karma: 10
Join Date: Aug 2013
Device: Nook
|
Illegible EPUB Text
I have a problem where the text of some EPUBs renders incorrectly in Nook PC (as well as in the calibre E-book reader). I have experienced the same issue with books downloaded from B&N as well as Gutenberg.
As an example from the book "A Greek-English lexicon of the New Testament: being Grimm's Wilke's Clavis ..."
From Content.opf: Book digitized by Google and uploaded to the Internet Archive by user tpb.
From the HTML page metadata:
<meta content="abbyy to epub tool, v0.2" name="generator"/>
<meta content="application/xhtml+xml; charset=utf-8" http-equiv="Content-Type"/>
Example rendering: Ιι 1ΐ35 κιίΓνϊνβϋ 1οη§ βηοιι^ ίοΓ Λε ςορ>τϊ§1ιΙ Ιο οχρϊτο 3ΐΐ(3 ΐΗο Ιχ)ο1; Ιο οηΙΟΓ ΐΗο ριιΒΠς »1οπΐ3Ϊη. Α
B&N tells me that the file is corrupt, but I've seen this in many different EPUBs. It seems to be an issue with rendering Unicode characters (there is a mixture of Greek and English in the above example). Any ideas? |
08-05-2013, 01:05 AM | #2 |
Grand Sorcerer
Posts: 5,584
Karma: 22735033
Join Date: Dec 2010
Device: Kindle PW2
|
The garbage text is most likely caused by the automated OCR of non-Latin text. I'd recommend downloading this very similar PG Greek English NT lexicon instead.
(In order to read this book on your Nook, you'll most likely have to embed a Greek font, e.g. Galatia SIL, which you can embed automatically with Calibre or manually with Sigil.) |
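For anyone who wants to confirm that a font really ended up inside the book after running it through Calibre or Sigil: an EPUB is a ZIP archive, so the font files in it can be listed with a few lines of Python. This is only a rough sketch. The `embedded_fonts` helper and the extension list are assumptions for illustration, not something Calibre or Sigil provides, and it only shows that font files are present, not that the stylesheet actually references them.

```python
import zipfile

# File extensions treated as fonts (an assumption; adjust as needed).
FONT_EXTENSIONS = (".ttf", ".otf", ".woff", ".woff2")


def embedded_fonts(epub):
    """Return the names of font files stored inside an EPUB.

    `epub` can be a path or a binary file-like object, since an EPUB
    is an ordinary ZIP archive.
    """
    with zipfile.ZipFile(epub) as archive:
        return [
            name
            for name in archive.namelist()
            if name.lower().endswith(FONT_EXTENSIONS)
        ]
```

Calling `embedded_fonts("lexicon.epub")` (a hypothetical file name) returns an empty list if no font was embedded.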
08-06-2013, 12:01 AM | #3 |
Junior Member
Posts: 2
Karma: 10
Join Date: Aug 2013
Device: Nook
|
Thank you Doitsu. Your answer is the first meaningful answer I have been given. It is clear that you took great care to understand my problem.
I believe you are correct about the cause of the garbage text in the Ebooks, since Google has removed them from the Google Books site. I previously had downloaded your suggestion from PG and the installation of Galatia SIL fixed the fonts issue I was having. However, there are still problems with line feeds, so I may just convert the UTF-8 version. Thanks again. |
08-06-2013, 05:27 AM | #4 | |
Wizard
Posts: 2,297
Karma: 12126329
Join Date: Jul 2012
Device: Kobo Forma, Nook
|
As Doitsu mentioned, if the text is outside of the Latin character set, the OCR is likely to be of much lower quality.
Then you take into account markings, scanning artifacts, water damage, and aging of the book, and the automatic OCR becomes even worse. Converting images to text is an incredibly hard problem for algorithms to get right without lots of human assistance.
Project Gutenberg books are fed through multiple rounds of human-assisted checking and editing to get as accurate a conversion as possible. So if possible, look to Project Gutenberg first.
A lot more information on Project Gutenberg's process can be found here: http://www.pgdp.net/c/faq/ProoferFAQ.php
Last edited by Tex2002ans; 08-06-2013 at 05:33 AM. |
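A quick way to tell OCR garbage apart from an encoding or font problem is to look at the characters themselves. In the sample from the first post, the characters are valid Greek letters mixed with digits and Latin letters inside single words, which fits a misread scan better than a rendering bug. Below is a minimal sketch in Python. The function names and the heuristic are assumptions made up for illustration, and genuine text such as a Greek word with a verse number attached would also be flagged.

```python
import unicodedata


def script_of(ch):
    """Rough script label: the first word of the Unicode character name."""
    return unicodedata.name(ch, "UNKNOWN").split()[0]


def script_profile(text):
    """Count the letters in `text` by rough script label."""
    counts = {}
    for ch in text:
        if ch.isalpha():
            label = script_of(ch)
            counts[label] = counts.get(label, 0) + 1
    return counts


def looks_like_misread_word(word):
    """Heuristic: Greek letters mixed with digits or Latin letters
    inside one word suggest OCR that guessed the wrong script."""
    has_greek = any(script_of(ch) == "GREEK" for ch in word)
    has_other = any(
        ch.isdigit() or (ch.isalpha() and script_of(ch) == "LATIN")
        for ch in word
    )
    return has_greek and has_other


if __name__ == "__main__":
    # A few words copied from the example rendering in the first post.
    for word in ["1ΐ35", "1οη§", "3ΐΐ(3"]:
        print(word, looks_like_misread_word(word))
```

If most words in a book are flagged like this, embedding a Greek font will not help, because the text stored in the file is already wrong.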