MobileRead Forums > E-Book Formats > ePub

Old 08-04-2013, 11:19 PM   #1
tebo
Junior Member
Posts: 2
Karma: 10
Join Date: Aug 2013
Device: Nook
Illegible EPUB Text

I have a problem where the text of some EPUBs renders incorrectly in Nook PC (as well as in the calibre e-book viewer). I have seen the same issue with books downloaded from B&N and from Gutenberg.

As an example from the book "A Greek-English lexicon of the New Testament: being Grimm's Wilke's Clavis ..."

From Content.opf:
Book digitized by Google and uploaded to the Internet Archive by user tpb.

From the HTML page metadata:
<meta content="abbyy to epub tool, v0.2" name="generator"/>
<meta content="application/xhtml+xml; charset=utf-8" http-equiv="Content-Type"/>

Example rendering:
Ιι 1ΐ35 κιίΓνϊνβϋ 1οη§ βηοιι^ ίοΓ Λε ςορ>τϊ§1ιΙ Ιο οχρϊτο 3ΐΐ(3 ΐΗο Ιχ)ο1; Ιο οηΙΟΓ ΐΗο ριιΒΠς »1οπΐ3Ϊη. Α

B&N tells me that the file is corrupt, but I've seen this in many different EPUBs. It seems to be an issue with rendering Unicode characters (there is a mixture of Greek and English in the above example).

Any ideas?
Old 08-05-2013, 01:05 AM   #2
Doitsu
Grand Sorcerer
Posts: 5,583
Karma: 22735033
Join Date: Dec 2010
Device: Kindle PW2
The garbage text is most likely caused by automated OCR of non-Latin text. I'd recommend downloading this very similar Project Gutenberg (PG) Greek-English NT lexicon instead.
(In order to read this book on your Nook, you'll most likely have to embed a Greek font, e.g. Galatia SIL, which you can embed automatically with Calibre or manually with Sigil.)
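To check whether a given EPUB already carries an embedded font, one option is to look inside the file, since an EPUB is a ZIP archive. The following is only a minimal sketch: the function names and the list of font extensions are assumptions made for illustration, and it does not verify that the font actually covers Greek.

```python
import zipfile

# Common font file extensions (an assumed, non-exhaustive list).
FONT_EXTENSIONS = (".ttf", ".otf", ".woff", ".woff2")


def list_embedded_fonts(epub_path):
    """Return the archive paths of font files stored inside the EPUB."""
    with zipfile.ZipFile(epub_path) as epub:
        return [name for name in epub.namelist()
                if name.lower().endswith(FONT_EXTENSIONS)]


def stylesheets_with_font_face(epub_path):
    """Return the archive paths of CSS files that contain an @font-face rule."""
    with zipfile.ZipFile(epub_path) as epub:
        return [name for name in epub.namelist()
                if name.lower().endswith(".css")
                and b"@font-face" in epub.read(name)]
```

If both functions return empty lists for a book, it probably relies on the reader's built-in fonts, which is when embedding something like Galatia SIL is worth trying.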
Old 08-06-2013, 12:01 AM   #3
tebo
Junior Member
Posts: 2
Karma: 10
Join Date: Aug 2013
Device: Nook
Thank you Doitsu. Your answer is the first meaningful answer I have been given. It is clear that you took great care to understand my problem.

I believe you are correct about the cause of the garbage text in the e-books, since Google has removed them from the Google Books site.

I had previously downloaded your suggestion from PG, and installing Galatia SIL fixed the font issue I was having. However, there are still problems with line feeds, so I may just convert the UTF-8 version.

Thanks again.
Old 08-06-2013, 05:27 AM   #4
Tex2002ans
Wizard
Posts: 2,297
Karma: 12126329
Join Date: Jul 2012
Device: Kobo Forma, Nook
As Doitsu mentioned, if the text is outside of the Latin character set, the OCR is likely to be of much lower quality.
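A rough way to tell bad OCR apart from a font or rendering problem is to look at the characters themselves. In the sample above, the words are made of valid Greek letters interleaved with digits and symbols, which a font change cannot fix. The sketch below is an illustrative heuristic only: the function names, the punctuation list, and the rule for what counts as "suspicious" are all assumptions, and it expects precomposed (NFC) text.

```python
import unicodedata

# Punctuation that may legitimately sit at the edges of a word (assumed list).
EDGE_PUNCT = ".,;:!?()[]\"'"


def script_of(ch):
    """Rough script label taken from the Unicode character name."""
    if not ch.isalpha():
        return None
    name = unicodedata.name(ch, "")
    if name.startswith("GREEK"):
        return "Greek"
    if name.startswith("LATIN"):
        return "Latin"
    return "Other"


def suspicious_ratio(text):
    """Fraction of whitespace-separated tokens that look like OCR noise.

    A token is counted as suspicious if it mixes scripts (e.g. Greek and
    Latin letters in one word) or if it has letters with digits or symbols
    inside it. Hyphens and apostrophes inside a word are allowed.
    """
    tokens = text.split()
    if not tokens:
        return 0.0
    bad = 0
    for tok in tokens:
        scripts = {script_of(c) for c in tok} - {None}
        inner = tok.strip(EDGE_PUNCT)
        odd_chars = any(not (c.isalpha() or c in "-'") for c in inner)
        if len(scripts) > 1 or (scripts and odd_chars):
            bad += 1
    return bad / len(tokens)
```

A ratio near zero on ordinary prose and a high ratio on a page from the EPUB would point to the OCR text itself being wrong, in which case a better source edition is the only real fix.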

Quote:
Originally Posted by tebo
[...] Book digitized by Google and uploaded to the Internet Archive by user tpb.[...]
The text versions generated by Archive.org (and Google.com) are usually quite poor. All that happens on their end is that the scans of the book are automatically fed through OCR, and the text output is run through some templates to plop it into different formats (EPUB, Kindle, plain TXT, ...).

Then you take into account markings/scanning artifacts/water damage/aging of the book, and the automatic OCR becomes even worse.

Images -> Text is an incredibly hard area to get algorithms to do correctly without lots of human assistance.

Project Gutenberg books are fed through multiple rounds of human-assisted checking and editing to get as accurate a conversion as possible. So, where you can, look to Project Gutenberg first.

A lot more information on Project Gutenberg's process can be found here:

http://www.pgdp.net/c/faq/ProoferFAQ.php

Last edited by Tex2002ans; 08-06-2013 at 05:33 AM.