03-27-2014, 02:55 AM   #1
roger64
Wizard
Posts: 2,608
Karma: 3000161
Join Date: Jan 2009
Device: Kindle PW3 (wifi)
About UTF-16 parsing mistake

Hi

I ran a curious and fairly reproducible experiment using the broken EPUB I showed you yesterday. As I am being more careful today, I'll let you decide whether this is an Editor bug or an advanced feature...

I unchecked this box in the Editor preferences (screenshot 1). This means, if I understand correctly, that no UTF-16 character will be created to replace a named entity such as nbsp (of which there are 611 in the book).

Then I modified one word in chapter 2 with the Editor (I changed one word, then changed it back) and saved the file.

1. - If I open this same chapter 2 file with the Editor, it displays without any problem, but if I check the book, the Editor now reports an error for the modified chapter: "Parsing failed: Document labelled UTF-16 but has UTF-8 content, line 1..." (scr 4 - far right).

2. - Opening this file with Sigil 0.7.4 is even gloomier: Sigil gives a warning (scr 2). Looking at the files, I observed that the DOCTYPE and the nbsp entities have indeed been preserved, as expected, but Sigil declares the modified chapter 2 file unreadable without giving any reason (in fact it is unreadable because it is declared as UTF-16). If I try to open the chapter 2 xhtml file, it looks a little like Chinese, but written by me (scr 3).
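
The "Chinese" in scr 3 is what one would expect when UTF-8 bytes are decoded as UTF-16. Here is a minimal Python sketch of that effect. It assumes a little-endian decode with no byte order mark (what Sigil actually does internally is not known here), and the function name and sample string are hypothetical, not taken from the book.

```python
def misread_as_utf16(text: str) -> str:
    """Encode text as UTF-8, then (wrongly) decode those bytes as UTF-16-LE."""
    raw = text.encode("utf-8")
    # UTF-16 consumes two bytes per code unit, so drop a dangling odd byte.
    if len(raw) % 2:
        raw = raw[:-1]
    # Each pair of ASCII bytes is fused into one 16-bit code unit,
    # and many of those land in the CJK ideograph ranges.
    return raw.decode("utf-16-le", errors="replace")


# Hypothetical first line of a mislabelled chapter file.
print(misread_as_utf16('<?xml version="1.0" encoding="utf-16"?>'))
```

The text shrinks to half its length and most of it becomes ideographs, which matches the look of the screenshot.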

Changing UTF-16 to UTF-8 in the declaration solves all problems for both editors.
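
For anyone who wants to apply this workaround to several files, here is a rough Python sketch of the same manual fix. It is not how the Editor or Sigil handles encodings; the function name `relabel_as_utf8` and the regular expression are assumptions made for illustration. It only touches the label when the bytes really are valid UTF-8 and carry no UTF-16 byte order mark.

```python
import re

# Matches an XML declaration (optionally preceded by a UTF-8 BOM) up to its encoding value.
DECL_RE = re.compile(
    rb'^((?:\xef\xbb\xbf)?<\?xml[^>]*?encoding=["\'])([A-Za-z0-9._-]+)(["\'])'
)


def relabel_as_utf8(raw: bytes) -> bytes:
    """Change a UTF-16 label to UTF-8 when the content is actually UTF-8."""
    if raw.startswith((b"\xff\xfe", b"\xfe\xff")):
        return raw  # UTF-16 byte order mark: probably genuine UTF-16
    try:
        raw.decode("utf-8")
    except UnicodeDecodeError:
        return raw  # not valid UTF-8, so leave the file untouched
    match = DECL_RE.match(raw)
    if match and match.group(2).lower() in (b"utf-16", b"utf16"):
        return match.group(1) + b"utf-8" + match.group(3) + raw[match.end():]
    return raw


# Hypothetical file content, for illustration only.
broken = b'<?xml version="1.0" encoding="utf-16"?>\n<html/>'
print(relabel_as_utf8(broken).decode("utf-8"))
```

Files that are already labelled UTF-8, or that have no declaration at all, come back unchanged.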

If the Editor cannot parse the result and Sigil is bewildered by this change, then why make it?

Proposal: when the user unchecks the preferences checkbox mentioned above (scr 1), not only should the nbsp entities and the DOCTYPE be preserved as they are now, but the file should also stay declared as UTF-8.
Attached Thumbnails
Name: Préférences Editor.png
Name: Sigil 0.7.4. report.png
Name: Sigil 0.7.4 - sweet UTF-16.png
Name: Editor report.png

Last edited by roger64; 03-27-2014 at 03:25 AM.