Strange character appearing throughout e-book text


mag1
01-21-2010, 10:18 AM
Hello everyone,

I'm new to this forum so please forgive me if this question has already come up. I recently bought the Night's Dawn trilogy by Peter F. Hamilton in EPUB format from Waterstones. The first two books display just fine in Adobe Digital Editions (I can no longer use my Sony PRS-505 due to a progressive disability, MND/ALS). However, the text of the last book, The Naked God, seems to be corrupted: a strange character appears throughout the book, replacing whatever should be there. I e-mailed Waterstones' customer service department asking for help, but after two weeks all they did was refund my money without a single word of explanation. I've not downloaded any further books from their website yet, as I am unsure whether the problem exists at their end or mine.

Have any of you folks experienced a similar problem, with missing characters being replaced by a strange character? Were you able to resolve it? Any help would be greatly appreciated. Thank you.

Kind regards
Mark

omk3
01-21-2010, 10:30 AM
Hello Mark and welcome to Mobileread.

I had a similar problem with another book I bought from Waterstones: a strange character kept appearing (usually at the top of the page), which wouldn't have bothered me that much if the book hadn't had a lot of other, more serious problems as well, like every French word in it being just garbled random letters.

I'm pretty sure the stray character is not a problem at your end. Ebooks should be designed so that they can be read correctly by anyone, end of story.

I'm very surprised you got your money back, because I am now in my third week of correspondence with them about my book and have not received a useful response yet...

charleski
01-22-2010, 11:23 AM
This is a problem with the encoding specified. I've seen it myself a couple of times.
If your ebook is unencrypted, take a look at the XML declaration at the top of each file. If the text was saved as UTF-8 but the declaration says <?xml version="1.0" encoding="ISO-8859-1"?> then you can get strange characters.
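To see why the mismatch produces junk, here is a minimal Python sketch. The sample string is made up for illustration and is not taken from any of the books discussed. It encodes text as UTF-8 and then decodes the same bytes as ISO-8859-1, which is roughly what a reader does when it trusts the wrong declaration. A non-breaking space, for instance, turns into "Â" followed by the space itself.

```python
# Text as the publisher (presumably) saved it: UTF-8.
# The sample contains an accented letter and a non-breaking space (U+00A0).
original = "caf\u00e9\u00a0noir"
utf8_bytes = original.encode("utf-8")

# A reader that believes encoding="ISO-8859-1" decodes every byte on its own,
# so each multi-byte UTF-8 sequence turns into two separate characters.
misread = utf8_bytes.decode("iso-8859-1")
print(repr(misread))  # 'cafÃ©Â\xa0noir'

# No bytes were lost, so reversing the wrong step recovers the original text.
repaired = misread.encode("iso-8859-1").decode("utf-8")
print(repaired == original)  # True
```

Since the bytes in the file are intact, correcting the declaration is enough; the text itself does not need to be retyped.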

omk3
01-22-2010, 12:01 PM
Charleski, that was exactly it! They had charset utf-8 and encoding ISO-8859-1!
I changed it to utf-8 and the stray character at the beginning of chapters disappeared, and the accented characters are no longer gibberish! Thank you!

I get really angry when I think that it was a commercial book! (I ran it through the epub validator, and even after I fixed the encoding there were other errors present!) I still have no useful answer from Waterstones, and of course not only should I not have to tamper with a purchased book, but I'm actually not allowed to! (Many thanks, once again, to all the people who provided us with ways to get round evil DRM!)

mag1
01-22-2010, 03:17 PM
Thank you omk3 for the welcome!

Thank you Charleski for that information but as I'm new to all of this I think I might need some further help. How do I view the XML specification of the EPUB file? As the book was bought commercially from Waterstones I would assume that it is encrypted and protected by DRM. Do I need a special piece of software to view and modify the specification? Do I need another piece of software to first strip out the DRM? Is that possible? Would it be possible please to detail the steps required to correct the encoding and what software would be required? Thank you.

omk3
01-23-2010, 03:12 PM
See this thread: http://www.mobileread.com/forums/showthread.php?t=39423&highlight=epub+circumvented

If you have a book that has no DRM, you can just change the extension from epub to zip; the ebook files are inside this zip. What you will be interested in are the html or xhtml files, of which there will be a lot. You can open them in a text editor and see what encoding and charset are declared. After correcting them, you will have to repack the files and rename the archive to epub again.
It is quite a lot of work, especially if you haven't done it before.
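If you would rather not open every file by hand just to check it, a short Python sketch like the one below can list what each content file declares. It assumes a DRM-free epub and an ASCII-compatible encoding (it will not find a declaration inside a UTF-16 file); the function name `declared_encodings` is just something made up for this example.

```python
import re
import zipfile

# Matches the encoding="..." part of an XML declaration in raw bytes.
DECL_RE = re.compile(rb'<\?xml[^>]*?encoding\s*=\s*["\']([A-Za-z0-9._-]+)["\']')

def declared_encodings(epub_path):
    """Map each (X)HTML file inside the epub to its declared XML encoding, or None."""
    result = {}
    with zipfile.ZipFile(epub_path) as zf:
        for name in zf.namelist():
            if name.lower().endswith((".html", ".xhtml", ".htm")):
                head = zf.read(name)[:200]  # the declaration sits at the very top
                match = DECL_RE.search(head)
                result[name] = match.group(1).decode("ascii") if match else None
    return result
```

Calling `declared_encodings("book.epub")` (with your own file name) returns a dictionary, so any file reporting ISO-8859-1 is a candidate for the fix.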

Obviously with commercial books, we shouldn't have to do anything in order to read them except load them on our reader. Moreover, because of DRM, we are not even allowed to. And I still haven't got any money back, or been sent a corrected version of my book, or even a promise that one is forthcoming... :angry: Waterstones is not going to see any of my money ever again as long as this is not resolved. That I managed to resolve it myself, thanks to people more helpful and knowledgeable than the Waterstones support team, has nothing to do with it!

JSWolf
01-23-2010, 06:30 PM
See this thread: http://www.mobileread.com/forums/showthread.php?t=39423&highlight=epub+circumvented

http://www.mobileread.com/forums/showthread.php?t=39423

Use that URL instead and you won't end up with highlighted words.

omk3
01-23-2010, 06:32 PM
Oops! :o

charleski
01-26-2010, 08:46 AM
I agree wholeheartedly. Users shouldn't have to edit commercial books just to make them readable.

The fault actually lies with Macmillan. They're the ones who publish Peter F. Hamilton, and they also published the books I found to be faulty. Write to them to complain and tell them to fire the idiot they have who can't follow basic principles.

To make the edits, first decrypt your book as detailed in the thread linked above. There are a couple of ways to fix the book, and both are free. Be sure to work on a copy of the book, just in case.

The simplest is to install Sigil (http://code.google.com/p/sigil/). Just open the epub in Sigil and save it, don't do anything else. Sigil strips out a lot of code that isn't strictly necessary, and will also strip out the faulty encoding parameter. This should work fine with faulty English-language books from Macmillan, but sometimes Sigil can mess up the ToC.

The alternative, which involves less radical change to the original code, is to open the book in an application like epubtweak (http://atlantiswordprocessor.blogspot.com/2009/11/tweaking-epubs-its-just-zip-file.html), which is free and a convenient way of looking inside the epub. Go through the list of files, selecting each .html file and clicking 'Edit File', which will make it come up in Notepad. Then just take a look at the first line of the file and change
encoding="ISO-8859-1"
to
encoding="utf-8"
then save the file and move on to the next.
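The same edit can be sketched in Python for a single file's contents. This is only an illustration under one assumption: the bytes really are UTF-8 and only the declaration is wrong. The helper name `fix_declaration` is hypothetical. It refuses to touch a file that is not valid UTF-8, because in that case changing the label would make things worse rather than better.

```python
import re

# Captures the text before, inside and after the declared encoding name.
DECL_RE = re.compile(rb'(<\?xml[^>]*?encoding\s*=\s*["\'])([A-Za-z0-9._-]+)(["\'])')

def fix_declaration(data: bytes) -> bytes:
    """Relabel the file as utf-8, but only if its bytes are valid UTF-8."""
    data.decode("utf-8")  # raises UnicodeDecodeError if the file is not UTF-8
    return DECL_RE.sub(rb"\g<1>utf-8\g<3>", data, count=1)
```

For example, passing in the bytes of a file that starts with `<?xml version="1.0" encoding="ISO-8859-1"?>` gives back the same bytes with `encoding="utf-8"` in the first line and nothing else changed.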

omk3
01-26-2010, 01:27 PM
Mine was published by Random House... And it was not the first epub with errors I've bought, though it was the only one with this specific error. Others had a lot of OCR mistakes here and there.

DaleDe
01-26-2010, 01:38 PM
Mine was published by Random House... And it was not the first epub with errors I've bought, though it was the only one with this specific error. Others had a lot of OCR mistakes here and there.

These are very different errors. OCR errors are due to a lack of proofreading, while incorrect headers are pure stupidity.

Dale

omk3
01-26-2010, 01:39 PM
I know. But both are because of lack of proper care.

JSWolf
01-26-2010, 01:53 PM
The alternative, which involves less radical change to the original code, is to open the book in an application like epubtweak (http://atlantiswordprocessor.blogspot.com/2009/11/tweaking-epubs-its-just-zip-file.html), which is free and a convenient way of looking inside the epub. Go through the list of files, selecting each .html file and clicking 'Edit File', which will make it come up in Notepad. Then just take a look at the first line of the file and change
encoding="ISO-8859-1"
to
encoding="utf-8"
then save the file and move on to the next.

Actually, it would be a lot easier to edit the html files using Notepad++. You can open them all in different tabs and then do a search/replace among all open tabs. So once loaded, it's just a single simple search/replace and then a save all and done except for putting them back into the ePub.
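For anyone comfortable with a script, the search/replace and the repacking can be done in one pass. The following is a rough Python sketch, not a tested tool: it assumes a DRM-free epub whose files are genuinely UTF-8, and the function name `rewrite_epub` and its default search strings are only illustrative. It writes the `mimetype` entry first and uncompressed, which is what the EPUB container format expects.

```python
import zipfile

def rewrite_epub(src_path, dst_path,
                 old=b'encoding="ISO-8859-1"', new=b'encoding="utf-8"'):
    """Copy an epub, replacing `old` with `new` in every (X)HTML file."""
    with zipfile.ZipFile(src_path) as src, zipfile.ZipFile(dst_path, "w") as dst:
        names = src.namelist()
        # "mimetype" should be the first entry and stored without compression.
        if "mimetype" in names:
            dst.writestr("mimetype", src.read("mimetype"),
                         compress_type=zipfile.ZIP_STORED)
        for name in names:
            if name == "mimetype" or name.endswith("/"):
                continue  # already written, or a bare directory entry
            data = src.read(name)
            if name.lower().endswith((".html", ".xhtml", ".htm")):
                data = data.replace(old, new)
            dst.writestr(name, data, compress_type=zipfile.ZIP_DEFLATED)
```

Writing to a new file rather than modifying the original in place keeps the untouched copy around in case something goes wrong.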

DaleDe
01-26-2010, 02:15 PM
I know. But both are because of lack of proper care.

I don't think any amount of care will fix stupidity or ignorance. It takes education, not care.

Dale

omk3
01-26-2010, 02:19 PM
I don't disagree with you there. But if anyone ever cared enough to try and (proof)read the finished ebook before selling it, they would discover both the OCR errors and the stupid encoding one...

Valloric
01-26-2010, 03:04 PM
The simplest is to install Sigil (http://code.google.com/p/sigil/). Just open the epub in Sigil and save it, don't do anything else. Sigil strips out a lot of code that isn't strictly necessary, and will also strip out the faulty encoding parameter.

Actually no. Sigil (actually embedded HTML Tidy) fixes most markup errors and also extracts any inline CSS into a style tag, but it won't strip out code. Although Tidy can refactor some parts of the markup on rare occasions. It will also pretty-print it, but that's just whitespace.

I put great emphasis on preserving the user's original code.

charleski
01-26-2010, 09:10 PM
Actually no. Sigil (actually embedded HTML Tidy) fixes most markup errors and also extracts any inline CSS into a style tag, but it won't strip out code. Although Tidy can refactor some parts of the markup on rare occasions. It will also pretty-print it, but that's just whitespace.

I put great emphasis on preserving the user's original code.
Sorry, but while Sigil is a very useful program, it (or HTML Tidy) engages in code refactoring based on certain assumptions, which results in some elements being lost. In the vast majority of cases this has no impact, or (as here) results in errors being automatically fixed.

But it has the possibility to introduce errors. I've attached 2 tiny epubs I made to demonstrate this.

The first ('Sigil test Original.epub') was deliberately created using ANSI encoding in Notepad++ and (correctly) specifies ISO-8859-1 encoding in the xml specification. The accented character appears correctly in both ADE and Calibre's epub reader.

The second ('Sigil test opened in Sigil.epub') is the same file which has simply been opened in Sigil and immediately saved without any editing. The accented character has now become a '?' in ADE and Calibre, because Sigil assumed that the encoding was utf-8, disregarding the encoding specified in the file, and changed the encoding attribute in the specification.

I don't know what you'd call this, but I'd say that was a significant change in the code. I don't think it's something that necessarily needs to be fixed, and as I said before, this behaviour can be used to fix sloppy mistakes without the user needing to know much about what they're doing. But it is something that Sigil users need to be aware of - Sigil rigidly assumes that all the text it processes is UTF-8, and any edits need to be made with that in mind. For Western languages this isn't an issue, and in fact the use of UTF-8 should be encouraged - there's no reason for people to be using ancient ANSI encoding in epubs. But it might be a problem for those who need to use UTF-16.

Sigil also strips out metadata elements in the body text xhtml that are irrelevant. Again, not a big problem for most users, though if you have a workflow that uses custom metadata fields it's something you really need to know about. If you look at the html inside the two epubs you'll see that's happened here, the custom metadata has been stripped.
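The '?' symptom described above is the reverse of the original problem in this thread, and it can be reproduced in a few lines of Python. This is only a sketch with a made-up sample word, not the content of the attached test files: bytes saved as ISO-8859-1 ("ANSI") are decoded as if they were UTF-8, and the lone accented byte is not a valid UTF-8 sequence.

```python
# A word with an accented letter, saved in a single-byte "ANSI" encoding.
latin1_bytes = "caf\u00e9".encode("iso-8859-1")  # b'caf\xe9'

# Decoding those bytes as UTF-8 fails on the lone 0xE9 byte; with
# errors="replace" it becomes U+FFFD, which many readers show as '?'.
as_utf8 = latin1_bytes.decode("utf-8", errors="replace")
print(as_utf8 == "caf\ufffd")  # True

# A real conversion decodes with the true encoding first, then re-encodes.
converted = latin1_bytes.decode("iso-8859-1").encode("utf-8")
print(converted)  # b'caf\xc3\xa9'
```

So changing only the label is safe when the bytes are already UTF-8, but a file that truly is ISO-8859-1 needs its bytes converted as well.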

Valloric
01-27-2010, 08:13 AM
The first ('Sigil test Original.epub') was deliberately created using ANSI encoding in Notepad++ and (correctly) specifies ISO-8859-1 encoding in the xml specification. The accented character appears correctly in both ADE and Calibre's epub reader.

The second ('Sigil test opened in Sigil.epub') is the same file which has simply been opened in Sigil and immediately saved without any editing. The accented character has now become a '?' in ADE and Calibre, because Sigil assumed that the encoding was utf-8, disregarding the encoding specified in the file, and changed the encoding attribute in the specification.

Sigil doesn't just "change" the encoding attribute and then pretend everything will work right. Come on. It tries to recognize the original encoding of the file and convert the text to UTF-8. After some recent changes, it's actually become pretty successful at this.
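To make the idea of "recognize the encoding, then convert" concrete, here is a rough Python sketch. It is not Sigil's actual algorithm, which this thread does not show; it is one simple heuristic (byte order mark first, then strict UTF-8, then a single-byte fallback), and the name `to_utf8` and the choice of ISO-8859-1 as fallback are assumptions made for the example.

```python
import codecs
import re

DECL_RE = re.compile(r'(<\?xml[^>]*?encoding\s*=\s*["\'])([A-Za-z0-9._-]+)(["\'])')

def to_utf8(data: bytes, fallback: str = "iso-8859-1") -> bytes:
    """Guess the encoding of an (X)HTML file and return it re-encoded as UTF-8."""
    if data.startswith(codecs.BOM_UTF8):
        text = data[len(codecs.BOM_UTF8):].decode("utf-8")
    elif data.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        text = data.decode("utf-16")  # this codec reads and drops the BOM
    else:
        try:
            text = data.decode("utf-8")   # strict: fails on stray high bytes
        except UnicodeDecodeError:
            text = data.decode(fallback)  # assume a single-byte Western encoding
    # Make the declaration agree with the bytes that are about to be written.
    text = DECL_RE.sub(r"\g<1>utf-8\g<3>", text, count=1)
    return text.encode("utf-8")
```

The point of the sketch is that the text is converted and the declaration is updated together, so the label and the bytes cannot drift apart.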


I don't know what you'd call this, but I'd say that was a significant change in the code. I don't think it's something that necessarily needs to be fixed, and as I said before, this behaviour can be used to fix sloppy mistakes without the user needing to know much about what they're doing. But it is something that Sigil users need to be aware of - Sigil rigidly assumes that all the text it processes is UTF-8, and any edits need to be made with that in mind.

That's patently false. Read this blog post (http://sigildev.blogspot.com/2009/12/encodings-and-why-you-shouldnt-trust.html).

But bugs are always possible. I'll check what's going on with this file and report back.

In general, you should report any problems on the tracker so they get scheduled and fixed. You're not doing anyone any favors (least of all yourself) by not reporting bugs. I've heard people feel like reporting bugs or missing features is "undue criticism": couldn't be farther from the truth. The more bug reports, the better.


Sigil also strips out metadata elements in the body text xhtml that are irrelevant. Again, not a big problem for most users, though if you have a workflow that uses custom metadata fields it's something you really need to know about. If you look at the html inside the two epubs you'll see that's happened here, the custom metadata has been stripped.

Actually this is something that's being worked on. Sigil should preserve your custom metadata, I completely agree. See this thread (http://www.mobileread.com/forums/showthread.php?t=66915) for the metadata discussion.

Valloric
01-27-2010, 08:31 AM
The first ('Sigil test Original.epub') was deliberately created using ANSI encoding in Notepad++ and (correctly) specifies ISO-8859-1 encoding in the xml specification. The accented character appears correctly in both ADE and Calibre's epub reader.

The second ('Sigil test opened in Sigil.epub') is the same file which has simply been opened in Sigil and immediately saved without any editing. The accented character has now become a '?' in ADE and Calibre, because Sigil assumed that the encoding was utf-8, disregarding the encoding specified in the file, and changed the encoding attribute in the specification.

And there's your problem.

Try to directly load the XHTML in the first epub file. The accented "e" is preserved, since the encoding is correctly detected and the files converted to UTF-8, just like I said.

If you load the epub file, the accented "e" becomes a question mark. Why?

Because what you have is an ISO-8859-1 encoded XHTML file inside an epub file, and that's against the epub specification. The only encodings the epub specification allows for XHTML files are UTF-8 and UTF-16. You are not allowed to use anything else (like ISO-8859-1).


1.4.1.2: XHTML Content Document Requirements (http://www.idpf.org/2007/ops/OPS_2.0_final_spec.html#TOC1.4.1.2)

A conformant XHTML Content Document must meet these conditions:


it is a well-formed XML document (as defined by XML 1.1); and
it is encoded in UTF-8 or UTF-16; and
it is a valid XML document according to the NVDL schema interaction provided in Appendix A (http://www.idpf.org/2007/ops/OPS_2.0_final_spec.html#AppendixA); and
it has a MIME media type of either application/xhtml+xml or text/x-oeb1-document (deprecated); and
all XHTML elements and attributes not contained in an Inline XML Island are drawn from the XHTML subset identified in this document.




So your file is bad. Sigil is doing the correct thing by assuming the XHTML files in the epub will be either UTF-8 or UTF-16.
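A rough way to test a file against the encoding condition quoted above can be sketched in Python. This is an approximation rather than what epubcheck does: it treats a UTF-16 byte order mark as the sign of UTF-16 and otherwise requires the bytes to decode strictly as UTF-8. The function name `encoding_allowed_by_spec` is made up for the example, and UTF-16 without a byte order mark is not handled.

```python
import codecs

def encoding_allowed_by_spec(data: bytes) -> bool:
    """Return True if the bytes look like UTF-8 or UTF-16 text."""
    if data.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        try:
            data.decode("utf-16")
            return True
        except UnicodeDecodeError:
            return False
    try:
        data.decode("utf-8")  # strict decoding rejects ISO-8859-1 accents
        return True
    except UnicodeDecodeError:
        return False
```

Note that a plain-ASCII file passes this check whatever its declaration says, because ASCII bytes are also valid UTF-8; only the bytes are tested here, not the label.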

But I'm going to change that. I'm going to perform the same encoding detection analysis on XHTML files in the epubs as I do when an (X)HTML file is loaded directly. Why? Because someone not familiar with the epub spec will do the same thing you did and expect everything to work. Sigil should be able to detect this error and correct it, as it can for markup.

And it will, next version onwards.

EDIT: This is now in trunk (http://code.google.com/p/sigil/source/detail?r=1be9496dd9d23b2532606604307a3b7ef6309a96# ).

charleski
01-27-2010, 08:43 PM
Just calm down Valloric. This is getting ridiculous. As far as I'm concerned, this behaviour is a feature, and I presented it as such.

If you want to fix the issue with stripping metadata, fine, but I'd regard that as a very low priority. And yes, you're right about the spec, the file passed epubcheck 1.0.4, but that version contains a bug that doesn't spot encoding errors (which has been fixed, but not released, doh).

I just did a quick test using an epub with some xhtml encoded in (and specified as) UTF-16 (displayed fine in ADE) and Sigil just stripped all the text and output a UTF-8 file. So yeah, that probably needs to be fixed, and if I'd confirmed the UTF-16 problem before now I'd certainly have listed it on your issues tracker. In the circumstances of the issue which this thread concerns, Sigil's assumption that the contents of an ePub are UTF-8 is a useful attribute that makes it easy to fix an apparently common problem.

Both my points were correct, but you've chosen to reply in an overly aggressive manner. I know how easy it is to get worked up over personal projects, but frankly I didn't expect to get attacked for recommending Sigil as a tool to solve people's problems. I regard this thread as closed, I won't be looking at it again.

Valloric
01-28-2010, 07:14 AM
Just calm down Valloric. This is getting ridiculous.

...

you've chosen to reply in an overly aggressive manner. I know how easy it is to get worked up over personal projects, but frankly I didn't expect to get attacked for recommending Sigil as a tool to solve people's problems. I regard this thread as closed, I won't be looking at it again.

Umm... aggressive? I certainly didn't mean to be. I'm perfectly calm. If I came across as aggressive or argumentative, you have my heartfelt apologies, honestly.

But I've reread my responses now a few times and I still can't see any "aggression".


Both my points were correct

I can't agree here.

I just did a quick test using an epub with some xhtml encoded in (and specified as) UTF-16 (displayed fine in ADE) and Sigil just stripped all the text and output a UTF-8 file. So yeah, that probably needs to be fixed, and if I'd confirmed the UTF-16 problem before now I'd certainly have listed it on your issues tracker. In the circumstances of the issue which this thread concerns, Sigil's assumption that the contents of an ePub are UTF-8 is a useful attribute that makes it easy to fix an apparently common problem.

Well, create an issue on the tracker for that (and attach the epub). I'm deeply interested; UTF-16 should be working just fine. I'm using a Qt function for loading Unicode encoded text files in the current version, and this loads all UTF variants without problems... at least the last time I tested it.

mag1
02-01-2010, 07:01 AM
Hi guys,

Thank you for all your comments and assistance. It is very much appreciated. However these replies just go to illustrate, in my mind at least, that we still have a long way to go before e-books are a reliable medium suitable for anyone. If I've understood the problem correctly the issue seems to be with the encoding, so presumably no matter where I buy the book from it will have the same encoding errors which stem from the master copy put out by the publisher?

I would imagine that most people who buy e-books just want to read them. Having to dismantle the files, download and install various other programs to overcome the DRM and correct the errors, and then reassemble the files seems vastly overcomplicated. The publisher should simply acknowledge their errors and supply the customer with a correctly formatted file. Waterstones, the online bookstore I bought the novel from, were hopeless when it came to providing technical support. In fact, after several weeks of waiting for a reply, all they did was reimburse my money without a single word of explanation. So I still have no idea if it would be safe to download again, either from them or from another website. Contrast this with how easy it is to exchange a book in a real shop, which is done in minutes.

Macmillan clearly have a quality control problem if they can release books where every page has encoding errors. I think my next step will be to write to Macmillan, explain the situation and see what their response will be (if any). I will then return to this thread and update it with their reply.

Kind regards

Mark