MobileRead Forums


paulpeer 03-08-2010 03:23 PM

Encoding declaration in OPF and TOC?
 
I've made a lot of books with Sigil and sometimes I import them into Calibre. After doing that, I often see accented characters displayed incorrectly (e.g. Chinese characters instead of Latin ones) in the TOC and in the metadata.

About the first problem (TOC), I wrote to the programmer of Calibre and his response was "Bug fixed: When decoding NCX toc files, if no encoding is declared and detection has less than 100% confidence, assume UTF-8."

So I understand that the TOC should have an encoding declaration. Can Sigil add this automatically? As I understand it, Sigil produces valid UTF-8 but doesn't declare it.

About the second problem (errors in the metadata), I also wrote to Calibre, and the answer was similar: stick an encoding declaration in the OPF.

Hence my similar question: Can Sigil add an encoding declaration to the OPF?

Thanks!
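
For readers who want to see what such a declaration looks like in practice, here is a minimal Python sketch. It is not Sigil's code; the `package` element is a hypothetical, heavily simplified stand-in for a real OPF file, and `serialize_with_declaration` is a name made up for this illustration. It relies on the `xml_declaration` argument of `xml.etree.ElementTree.tostring`, which needs Python 3.8 or later.

```python
import xml.etree.ElementTree as ET


def serialize_with_declaration(root: ET.Element) -> bytes:
    """Serialize a tree as UTF-8 bytes, starting with an explicit XML declaration."""
    # xml_declaration=True makes ElementTree write the <?xml ... encoding=...?> line
    # instead of leaving the encoding implicit.
    return ET.tostring(root, encoding="utf-8", xml_declaration=True)


# Hypothetical stand-in for an OPF <package> element (not a valid OPF document).
package = ET.Element("package")
title = ET.SubElement(package, "title")
title.text = "Émile"

data = serialize_with_declaration(package)
print(data.decode("utf-8"))
```

The first line of the output is the XML declaration carrying the encoding name, so a reader such as Calibre does not have to guess.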

Valloric 03-08-2010 03:55 PM

Quote:

Originally Posted by paulpeer (Post 821142)
Hence my similar question: Can Sigil add an encoding declaration to the OPF?

The XML standard states that an XML document without an "encoding" attribute in the XML declaration must be encoded in either UTF-8 or UTF-16. If it is encoded in UTF-16, it MUST begin with a Byte Order Mark. Therefore, if it has neither an encoding attribute nor a BOM, it must be UTF-8.

In plain English, UTF-8 is the default character encoding for XML. I thought everyone knew that.

But I'll add the attribute, it can't hurt.

EDIT: And here's the source, from section 4.3.3 of the XML 1.0 specification. Just invert the negatives. :)

Quote:

... it is a fatal error [...] for an entity which begins with neither a Byte Order Mark nor an encoding declaration to use an encoding other than UTF-8.
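
The rule above can be sketched as a small Python function. This is a simplified illustration, not a full implementation of the spec's autodetection: the name `xml_encoding` is made up here, UTF-32 is ignored, the declaration is assumed to be readable as ASCII-compatible bytes, and any encoding information from an external transport protocol is not considered.

```python
import codecs
import re

# Matches the encoding pseudo-attribute of an XML declaration at the very start of the data.
_DECL_RE = re.compile(
    rb"""^<\?xml\s[^>]*?encoding\s*=\s*["']([A-Za-z][A-Za-z0-9._-]*)["']"""
)


def xml_encoding(data: bytes) -> str:
    """Return the encoding implied by a BOM or an encoding declaration, else UTF-8."""
    # 1. A Byte Order Mark wins.
    if data.startswith(codecs.BOM_UTF8):
        return "utf-8-sig"  # Python codec name that strips the BOM while decoding
    if data.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        return "utf-16"  # Python's utf-16 codec reads the byte order from the BOM
    # 2. Otherwise use the encoding named in the XML declaration, if there is one.
    match = _DECL_RE.match(data)
    if match:
        return match.group(1).decode("ascii")
    # 3. No BOM and no declaration: the document has to be UTF-8.
    return "utf-8"


print(xml_encoding(b"<?xml version='1.0'?><ncx/>"))
print(xml_encoding(b'<?xml version="1.0" encoding="ISO-8859-1"?><ncx/>'))
```

The last branch is the case this thread is about: a file with no BOM and no declaration is treated as UTF-8 without any guessing.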

paulpeer 03-08-2010 04:18 PM

Quote:

Originally Posted by Valloric (Post 821174)
In plain English, UTF-8 is the default character encoding for XML. I thought everyone knew that.

I thought the same, but when a famous programmer says I have to declare that my books are UTF-8, I start to doubt ;-)
Quote:

Originally Posted by Valloric (Post 821174)
But I'll add the attribute, it can't hurt.

Thanks! It will save a lot of trouble in many cases.

kovidgoyal 03-08-2010 04:21 PM

Oh, if only everyone knew that and no one produced XML files in encodings other than UTF-8 with no encoding declaration.

Valloric 03-08-2010 04:32 PM

Quote:

Originally Posted by kovidgoyal (Post 821199)
Oh, if only everyone knew that and no one produced XML files in encodings other than UTF-8 with no encoding declaration.

Kovid, you should have said that this was causing you problems. It's no problem to add the encoding attribute. :)

But you really should fall back to the standard when byte stream fingerprinting isn't 100% sure of the encoding.

Valloric 03-08-2010 04:39 PM

This is now in trunk.

paulpeer 03-08-2010 04:47 PM

Quote:

Originally Posted by Valloric (Post 821212)
This is now in trunk.

You're marvellous, guys! :thanks:

kovidgoyal 03-08-2010 04:48 PM

Quote:

Originally Posted by Valloric (Post 821210)
Kovid, you should have said that this was causing you problems. It's no problem to add the encoding attribute. :)

But you really should fall back to the standard when byte stream fingerprinting isn't 100% sure of the encoding.

The problem is that byte stream fingerprinting is almost never a hundred percent certain.
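
The fallback policy discussed in this thread (use the declared encoding if present, trust detection only at full confidence, otherwise assume UTF-8) might look roughly like the Python sketch below. This is not Calibre's actual code. `choose_encoding`, the `Detector` type and `fake_detector` are hypothetical names; a real detector would be some byte-stream fingerprinting library returning a guess and a confidence between 0.0 and 1.0.

```python
from typing import Callable, Optional, Tuple

# A detector maps raw bytes to (encoding guess, confidence between 0.0 and 1.0).
Detector = Callable[[bytes], Tuple[str, float]]


def choose_encoding(data: bytes, declared: Optional[str], detect: Detector) -> str:
    """Pick an encoding for an XML file such as an NCX or OPF."""
    # An explicit encoding declaration always wins.
    if declared:
        return declared
    guess, confidence = detect(data)
    # Trust fingerprinting only when it is completely certain.
    if confidence >= 1.0:
        return guess
    # Otherwise fall back to the XML default.
    return "utf-8"


def fake_detector(data: bytes) -> Tuple[str, float]:
    """Hypothetical detector that makes an uncertain, wrong guess."""
    return ("big5", 0.6)


raw = "Émile".encode("utf-8")
encoding = choose_encoding(raw, None, fake_detector)
print(encoding, raw.decode(encoding))
```

Since fingerprinting is almost never completely certain, in practice this policy means an undeclared file is nearly always read as UTF-8, which matches the XML default.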


All times are GMT -4.