Strange Â character appearing throughout e-book text - Page 2

Valloric · 01-26-2010, 04:04 PM

Quote:

Originally Posted by charleski

The simplest is to install Sigil. Just open the epub in Sigil and save it, don't do anything else. Sigil strips out a lot of code that isn't strictly necessary, and will also strip out the faulty encoding parameter.

Actually no. Sigil (or rather the embedded HTML Tidy) fixes most markup errors and also extracts any inline CSS into a style tag, but it won't strip out code, although Tidy can refactor some parts of the markup on rare occasions. It will also pretty-print it, but that's just whitespace.

I put great emphasis on preserving the user's original code.

charleski · 01-26-2010, 10:10 PM

Quote:

Originally Posted by Valloric

Actually no. Sigil (or rather the embedded HTML Tidy) fixes most markup errors and also extracts any inline CSS into a style tag, but it won't strip out code, although Tidy can refactor some parts of the markup on rare occasions. It will also pretty-print it, but that's just whitespace.

I put great emphasis on preserving the user's original code.

Sorry, but while Sigil is a very useful program, it (or HTML Tidy) refactors code under certain assumptions, and that results in some elements being lost. In the vast majority of cases this has no impact, or (as here) results in errors being automatically fixed.

But it has the possibility to introduce errors. I've attached 2 tiny epubs I made to demonstrate this.

The first ('Sigil test Original.epub') was deliberately created using ANSI encoding in Notepad++ and (correctly) specifies ISO-8859-1 encoding in the XML declaration. The accented 'é' appears correctly in both ADE and Calibre's epub reader.

The second ('Sigil test opened in Sigil.epub') is the same file, simply opened in Sigil and immediately saved without any editing. The 'é' has now become a '?' in ADE and Calibre, because Sigil assumed that the encoding was UTF-8, disregarding the encoding specified in the file, and changed the encoding attribute in the XML declaration.

I don't know what you'd call this, but I'd say that was a significant change in the code. I don't think it's something that necessarily needs to be fixed, and as I said before, this behaviour can be used to fix sloppy mistakes without the user needing to know much about what they're doing. But it is something that Sigil users need to be aware of - Sigil rigidly assumes that all the text it processes is UTF-8, and any edits need to be made with that in mind. For Western languages this isn't an issue, and in fact the use of UTF-8 should be encouraged - there's no reason for people to be using ancient ANSI encoding in epubs. But it might be a problem for those who need to use UTF-16.
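As a byte-level illustration of the two symptoms in this thread (the '?' in place of 'é', and the stray 'Â' from the thread title), here is a small Python sketch. It only shows what happens when bytes are decoded with the wrong codec; it makes no claim about what Sigil, ADE or Calibre do internally, and the sample strings are made up for illustration.

```python
# Illustrative only: the sample text is hypothetical, not taken from the attached epubs.

# Case 1: ISO-8859-1 bytes read as if they were UTF-8.
latin1_bytes = "café".encode("iso-8859-1")       # b'caf\xe9'
as_utf8 = latin1_bytes.decode("utf-8", errors="replace")
# The lone 0xE9 byte is not valid UTF-8, so it is replaced by U+FFFD,
# which many readers draw as '?' or a boxed question mark.

# Case 2: UTF-8 bytes read as if they were ISO-8859-1.
utf8_bytes = "a\u00a0b".encode("utf-8")          # non-breaking space -> b'\xc2\xa0'
as_latin1 = utf8_bytes.decode("iso-8859-1")
# 0xC2 is 'Â' in ISO-8859-1, so a stray 'Â' shows up before the space.

print(repr(as_utf8))
print(repr(as_latin1))
```

In both cases the bytes on disk are fine; only the label (or the assumption) about their encoding is wrong.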

Sigil also strips out metadata elements in the body text xhtml that are irrelevant. Again, not a big problem for most users, though if you have a workflow that uses custom metadata fields it's something you really need to know about. If you look at the html inside the two epubs you'll see that's happened here, the custom metadata has been stripped.

Valloric · 01-27-2010, 09:13 AM

Quote:

Originally Posted by charleski

The first ('Sigil test Original.epub') was deliberately created using ANSI encoding in Notepad++ and (correctly) specifies ISO-8859-1 encoding in the XML declaration. The accented 'é' appears correctly in both ADE and Calibre's epub reader.

The second ('Sigil test opened in Sigil.epub') is the same file, simply opened in Sigil and immediately saved without any editing. The 'é' has now become a '?' in ADE and Calibre, because Sigil assumed that the encoding was UTF-8, disregarding the encoding specified in the file, and changed the encoding attribute in the XML declaration.

Sigil doesn't just "change" the encoding attribute and then pretend everything will work right. Come on. It tries to recognize the original encoding of the file and convert the text to UTF-8. After some recent changes, it's actually become pretty successful at this.
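To make "recognize the original encoding and convert the text to UTF-8" concrete, here is a minimal Python sketch of one way such a step could work. It is a simplified stand-in under stated assumptions, not Sigil's actual detection code: `to_utf8` and `_DECL` are hypothetical names, and the fallback to ISO-8859-1 is an assumption chosen only because that codec accepts any byte.

```python
import codecs
import re

# Hypothetical helper: a simplified stand-in for encoding detection,
# NOT Sigil's actual algorithm.
_DECL = re.compile(rb'^<\?xml[^>]*encoding=["\']([A-Za-z0-9._-]+)["\']')


def to_utf8(raw: bytes) -> bytes:
    """Guess the encoding of an XHTML byte string and return it as UTF-8."""
    if raw.startswith(codecs.BOM_UTF8):
        text = raw[len(codecs.BOM_UTF8):].decode("utf-8")
    elif raw.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        text = raw.decode("utf-16")  # the utf-16 codec consumes the BOM
    else:
        # No BOM: trust the XML declaration if there is one, else try UTF-8.
        match = _DECL.match(raw)
        declared = match.group(1).decode("ascii") if match else "utf-8"
        try:
            text = raw.decode(declared)
        except (LookupError, UnicodeDecodeError):
            # Assumption: fall back to ISO-8859-1, which accepts any byte.
            text = raw.decode("iso-8859-1")
    # Rewrite the declared encoding so the label matches the new bytes.
    text = re.sub(r'(<\?xml[^>]*encoding=["\'])[^"\']+', r"\g<1>UTF-8",
                  text, count=1)
    return text.encode("utf-8")
```

The point of the sketch is the order of operations: the text is decoded with the detected encoding first, and only then is the declaration rewritten, so the label and the bytes stay in agreement.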

Quote:

Originally Posted by charleski

I don't know what you'd call this, but I'd say that was a significant change in the code. I don't think it's something that necessarily needs to be fixed, and as I said before, this behaviour can be used to fix sloppy mistakes without the user needing to know much about what they're doing. But it is something that Sigil users need to be aware of - Sigil rigidly assumes that all the text it processes is UTF-8, and any edits need to be made with that in mind.

That's patently false. Read this blog post.

But bugs are always possible. I'll check what's going on with this file and report back.

In general, you should report any problems on the tracker so they get scheduled and fixed. You're not doing anyone any favors (least of all yourself) by not reporting bugs. I've heard people feel like reporting bugs or missing features is "undue criticism": that couldn't be further from the truth. The more bug reports, the better.

Quote:

Originally Posted by charleski

Sigil also strips out metadata elements in the body text xhtml that are irrelevant. Again, not a big problem for most users, though if you have a workflow that uses custom metadata fields it's something you really need to know about. If you look at the html inside the two epubs you'll see that's happened here, the custom metadata has been stripped.

Actually this is something that's being worked on. Sigil should preserve your custom metadata, I completely agree. See this thread for the metadata discussion.

Valloric · 01-27-2010, 09:31 AM

Quote:

Originally Posted by charleski

The first ('Sigil test Original.epub') was deliberately created using ANSI encoding in Notepad++ and (correctly) specifies ISO-8859-1 encoding in the XML declaration. The accented 'é' appears correctly in both ADE and Calibre's epub reader.

The second ('Sigil test opened in Sigil.epub') is the same file, simply opened in Sigil and immediately saved without any editing. The 'é' has now become a '?' in ADE and Calibre, because Sigil assumed that the encoding was UTF-8, disregarding the encoding specified in the file, and changed the encoding attribute in the XML declaration.

And there's your problem.

Try to directly load the XHTML from the first epub file. The accented "e" is preserved, since the encoding is correctly detected and the file is converted to UTF-8, just like I said.

If you load the epub file, the accented "e" becomes a question mark. Why?

Because what you have is an ISO-8859-1 encoded XHTML file inside an epub file, and that's against the epub specification. The only encodings the epub specification allows for XHTML files are UTF-8 and UTF-16. You are not allowed to use something else (like ISO-8859-1).

Quote:

Originally Posted by OPS Specification

1.4.1.2: XHTML Content Document Requirements

A conformant XHTML Content Document must meet these conditions:

it is a well-formed XML document (as defined by XML 1.1); and
it is encoded in UTF-8 or UTF-16; and
it is a valid XML document according to the NVDL schema interaction provided in Appendix A; and
it has a MIME media type of either application/xhtml+xml or text/x-oeb1-document (deprecated); and
all XHTML elements and attributes not contained in an Inline XML Island are drawn from the XHTML subset identified in this document.

So your file is bad. Sigil is doing the correct thing by assuming the XHTML files in the epub will be either UTF-8 or UTF-16.
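For anyone who wants to check an epub against the encoding requirement quoted above, here is a rough Python sketch. It is not epubcheck and not part of Sigil; `non_conforming_xhtml` is a hypothetical helper, it picks content documents by file extension rather than by reading the OPF manifest, and it assumes UTF-16 files start with a byte order mark.

```python
import codecs
import zipfile


def non_conforming_xhtml(epub_path):
    """Return names of XHTML entries that are neither valid UTF-8
    nor BOM-marked UTF-16 (hypothetical helper, not epubcheck)."""
    bad = []
    with zipfile.ZipFile(epub_path) as epub:
        for name in epub.namelist():
            # Assumption: content documents are identified by extension;
            # a real check would read the media types from the OPF manifest.
            if not name.lower().endswith((".xhtml", ".html", ".htm")):
                continue
            raw = epub.read(name)
            if raw.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
                encoding = "utf-16"
            else:
                encoding = "utf-8"
            try:
                raw.decode(encoding)
            except UnicodeDecodeError:
                bad.append(name)
    return bad
```

Calling `non_conforming_xhtml("book.epub")` (the filename is just an example) returns the entries that fail to decode. A file that decodes cleanly can still be mislabelled in its XML declaration, so this only catches the byte-level problem.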

But I'm going to change that. I'm going to perform the same encoding detection analysis on XHTML files in the epubs as I do when an (X)HTML file is loaded directly. Why? Because someone not familiar with the epub spec will do the same thing you did and expect everything to work. Sigil should be able to detect this error and correct it, as it can for markup.

And it will, next version onwards.

EDIT: This is now in trunk.

charleski · 01-27-2010, 09:43 PM

Just calm down Valloric. This is getting ridiculous. As far as I'm concerned, this behaviour is a feature, and I presented it as such.

If you want to fix the issue with stripping metadata, fine, but I'd regard that as a very low priority. And yes, you're right about the spec, the file passed epubcheck 1.0.4, but that version contains a bug that doesn't spot encoding errors (which has been fixed, but not released, doh).

I just did a quick test using an epub with some xhtml encoded in (and specified as) UTF-16 (displayed fine in ADE) and Sigil just stripped all the text and output a UTF-8 file. So yeah, that probably needs to be fixed, and if I'd confirmed the UTF-16 problem before now I'd certainly have listed it on your issues tracker. In the circumstances of the issue which this thread concerns, Sigil's assumption that the contents of an ePub are UTF-8 is a useful attribute that makes it easy to fix an apparently common problem.

Both my points were correct, but you've chosen to reply in an overly aggressive manner. I know how easy it is to get worked up over personal projects, but frankly I didn't expect to get attacked for recommending Sigil as a tool to solve people's problems. I regard this thread as closed, I won't be looking at it again.

Valloric · 01-28-2010, 08:14 AM

Quote:

Originally Posted by charleski

Just calm down Valloric. This is getting ridiculous.

...

you've chosen to reply in an overly aggressive manner. I know how easy it is to get worked up over personal projects, but frankly I didn't expect to get attacked for recommending Sigil as a tool to solve people's problems. I regard this thread as closed, I won't be looking at it again.

Umm... aggressive? I certainly didn't mean to be. I'm perfectly calm. If I came across as aggressive or argumentative, you have my heartfelt apologies, honestly.

But I've reread my responses now a few times and I still can't see any "aggression".

Quote:

Originally Posted by charleski

Both my points were correct

I can't agree here.

Quote:

Originally Posted by charleski

I just did a quick test using an epub with some xhtml encoded in (and specified as) UTF-16 (displayed fine in ADE) and Sigil just stripped all the text and output a UTF-8 file. So yeah, that probably needs to be fixed, and if I'd confirmed the UTF-16 problem before now I'd certainly have listed it on your issues tracker. In the circumstances of the issue which this thread concerns, Sigil's assumption that the contents of an ePub are UTF-8 is a useful attribute that makes it easy to fix an apparently common problem.

Well, create an issue on the tracker for that (and attach the epub). I'm deeply interested, since UTF-16 should be working just fine. I'm using a Qt function for loading Unicode encoded text files in the current version, and this loads all UTF variants without problems... at least the last time I tested it.
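For a reproducible report, a tiny UTF-16 test document helps. The following Python sketch builds one in memory and confirms that a BOM-aware decode round-trips it. The markup in `SAMPLE` and the name `make_utf16_sample` are made up for illustration, and this says nothing about how the Qt loader behaves.

```python
import codecs

# Hypothetical minimal content document for a bug report.
SAMPLE = (
    '<?xml version="1.0" encoding="UTF-16"?>\n'
    '<html xmlns="http://www.w3.org/1999/xhtml">'
    "<head><title>UTF-16 test</title></head>"
    "<body><p>caf\u00e9</p></body></html>\n"
)


def make_utf16_sample():
    """Encode SAMPLE as UTF-16; Python's 'utf-16' codec prepends a BOM."""
    return SAMPLE.encode("utf-16")


data = make_utf16_sample()
# The file starts with a byte order mark in the platform's byte order.
has_bom = data.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE))
# A BOM-aware decode recovers the original text exactly.
round_trip_ok = data.decode("utf-16") == SAMPLE
print(has_bom, round_trip_ok)
```

Writing `data` to an `.xhtml` file and packaging it in an epub gives a small attachment that shows whether the text survives a load and save.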

mag1 · 02-01-2010, 08:01 AM

Hi guys,

Thank you for all your comments and assistance. It is very much appreciated. However, these replies just go to illustrate, in my mind at least, that we still have a long way to go before e-books are a reliable medium suitable for anyone. If I've understood the problem correctly, the issue seems to be with the encoding, so presumably no matter where I buy the book from it will have the same encoding errors, which stem from the master copy put out by the publisher?

I would imagine that most people who buy e-books just want to read them. The fact that it becomes necessary to dismantle the files, download and install various other programs to overcome the DRM and correct the errors, and then reassemble the files seems vastly overcomplicated. The publisher should simply acknowledge their errors and supply the customer with a correctly formatted file. Waterstones, the online bookstore I bought the novel from, were hopeless when it came to providing technical support. In fact, after several weeks of waiting for a reply, all they did was reimburse my money without a single word of explanation. So I still have no idea if it would be safe to download again either from them or from another website. Contrast this with how easy it is to exchange a book in a real shop, which is done in minutes.

Macmillan clearly have a quality control problem if they can release books where every page has encoding errors. I think my next step will be to write to Macmillan, explain the situation and see what their response will be (if any). I will then return to this thread and update it with their reply.

Kind regards

Mark


