MobileRead Forums - Strange Â character appearing throughout e-book text

charleski · 01-26-2010, 09:10 PM

Quote:

Originally Posted by Valloric

Actually no. Sigil (actually embedded HTML Tidy) fixes most markup errors and also extracts any inline CSS into a style tag, but it won't strip out code. Although Tidy can refactor some parts of the markup on rare occasions. It will also pretty-print it, but that's just whitespace.

I put great emphasis on preserving the user's original code.

Sorry, but while Sigil is a very useful program, it (or HTML Tidy) does refactor code based on certain assumptions, and that can result in some elements being lost. In the vast majority of cases this has no impact, or (as here) results in errors being fixed automatically.

But it can also introduce errors. I've attached two tiny epubs I made to demonstrate this.

The first ('Sigil test Original.epub') was deliberately created using ANSI encoding in Notepad++ and (correctly) specifies ISO-8859-1 encoding in the XML declaration. The accented 'é' appears correctly in both ADE and Calibre's epub reader.

The second ('Sigil test opened in Sigil.epub') is the same file, simply opened in Sigil and immediately saved without any editing. The 'é' has now become a '?' in ADE and Calibre, because Sigil assumed that the encoding was UTF-8, disregarding the encoding specified in the file, and changed the encoding attribute in the XML declaration.
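To make the mechanism concrete, here's a minimal Python sketch of what happens to that one byte. It doesn't claim to reproduce what Sigil does internally; the helper name `misread_as_utf8` and the sample word are my own, purely for illustration.

```python
# Illustration only: shows the byte-level effect, not Sigil's actual code path.
original = "café"

# In ISO-8859-1 the 'é' is the single byte 0xE9.
latin1_bytes = original.encode("iso-8859-1")


def misread_as_utf8(data: bytes) -> str:
    """Decode bytes as UTF-8, substituting U+FFFD for anything invalid."""
    return data.decode("utf-8", errors="replace")


# Decoding with the declared encoding round-trips cleanly.
print(latin1_bytes.decode("iso-8859-1"))

# Decoding as UTF-8 does not: 0xE9 announces a multi-byte sequence that
# never arrives, so the 'é' is replaced by U+FFFD (the replacement character).
print(misread_as_utf8(latin1_bytes))
```

Once the bad character is written back out and the declaration is changed to UTF-8, the original byte is gone, which would explain why the reader can only show a placeholder.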

I don't know what you'd call this, but I'd say that's a significant change in the code. I don't think it's something that necessarily needs to be fixed, and as I said before, this behaviour can be used to fix sloppy mistakes without the user needing to know much about what they're doing. But it is something that Sigil users need to be aware of - Sigil rigidly assumes that all the text it processes is UTF-8, and any edits need to be made with that in mind. For Western languages this isn't an issue, and in fact the use of UTF-8 should be encouraged - there's no reason for people to be using ancient ANSI encoding in epubs. But it might be a problem for those who need to use UTF-16.
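If you do have files in another encoding, one workaround is to convert them to UTF-8 yourself before opening them in Sigil. Below is a rough sketch of how that could be done; `to_utf8` and `DECL_PATTERN` are hypothetical names of mine, not part of Sigil or any epub library, and it assumes the XML declaration (if present) sits at the very start of the file.

```python
import codecs
import re

# Matches the encoding attribute of an XML declaration at the start of a file.
# Hypothetical helper pattern, written for this sketch only.
DECL_PATTERN = r'^(\s*<\?xml[^>]*?encoding=)(["\'])[A-Za-z0-9._-]+\2'


def to_utf8(raw: bytes) -> bytes:
    """Re-encode an XHTML file as UTF-8, honouring its declared encoding."""
    if raw.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        # The BOM tells the codec the byte order and is dropped on decode.
        text = raw.decode("utf-16")
    else:
        match = re.match(
            rb'^\s*<\?xml[^>]*?encoding=["\']([A-Za-z0-9._-]+)["\']', raw
        )
        # No declaration means UTF-8 by default; "utf-8-sig" also strips a BOM.
        declared = match.group(1).decode("ascii") if match else "utf-8-sig"
        text = raw.decode(declared)
    # Make the declaration agree with the bytes we are about to write.
    text = re.sub(DECL_PATTERN, r"\g<1>\g<2>UTF-8\g<2>", text, count=1)
    return text.encode("utf-8")


sample = '<?xml version="1.0" encoding="ISO-8859-1"?>\n<p>café</p>'.encode(
    "iso-8859-1"
)
print(to_utf8(sample).decode("utf-8"))
```

You'd run this over each XHTML file in the unzipped epub and then re-zip it. It's deliberately strict: if the declared encoding is wrong or unknown to Python, the decode raises an error rather than silently guessing.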

Sigil also strips out metadata elements in the body-text XHTML that it treats as irrelevant. Again, not a big problem for most users, though if you have a workflow that uses custom metadata fields it's something you really need to know about. If you look at the HTML inside the two epubs you'll see that's happened here: the custom metadata has been stripped.
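If your workflow depends on those fields, a quick before/after check can tell you what went missing. This is only a sketch: `meta_names` and `lost_meta` are hypothetical helpers, the sample markup is made up rather than taken from the attached epubs, and the standard-library parser will choke on XHTML that uses named entities such as `&nbsp;`.

```python
import xml.etree.ElementTree as ET

XHTML_NS = "{http://www.w3.org/1999/xhtml}"


def meta_names(xhtml):
    """Return the set of name attributes on <meta> elements in an XHTML document."""
    root = ET.fromstring(xhtml)
    return {m.get("name") for m in root.iter(XHTML_NS + "meta") if m.get("name")}


def lost_meta(before, after):
    """Names of <meta> elements present before a round trip but absent after."""
    return meta_names(before) - meta_names(after)


# Made-up sample documents standing in for the same file before and after saving.
before = (
    '<html xmlns="http://www.w3.org/1999/xhtml"><head><title>t</title>'
    '<meta name="custom-field" content="x"/></head><body/></html>'
)
after = (
    '<html xmlns="http://www.w3.org/1999/xhtml"><head><title>t</title>'
    "</head><body/></html>"
)
print(lost_meta(before, after))
```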