MobileRead Forums - View Single Post - Sigil, UTF-8 and the emdash

Valloric · 06-30-2010, 12:35 PM

Quote:

Originally Posted by crutledge

I went back to the original HTML file and changed the character set from Windows 1252: Western European to UTF-8 (which Sigil uses) and all of the emdashes disappeared. I then went back to Windows 1252: Western European and replaced all (—) with &#8212;, converted back to UTF-8 and all emdashes re-appeared. I then loaded to Sigil and all emdashes were present. This appears to be a UTF-8 problem.

Your file probably did one of two things:

It didn't state an encoding; if no encoding is specified in the file, UTF-8 is assumed. You need to specify an encoding in the file. Without it, you're playing Russian roulette every time you open it in any application.
It stated an incorrect encoding. Stating two different encodings also falls under this category.
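
If you want to check a file for either problem before loading it, something like the following rough Python sketch would do. To be clear, this is not Sigil's detection code; `declared_encodings` and `check_declaration` are names made up for illustration, and the regex only looks for `encoding=` in the XML prolog and `charset=` in meta tags within the first few kilobytes.

```python
import codecs
import re

# Hypothetical helper, not Sigil's actual logic.
# Matches encoding="..." (XML prolog) and charset=... (meta tags).
_DECL = re.compile(
    rb"""(?:encoding|charset)\s*=\s*["']?\s*([A-Za-z0-9._:-]+)""",
    re.IGNORECASE,
)


def _normalise(name: str) -> str:
    """Map aliases such as 'UTF8' and 'utf-8' to one canonical codec name."""
    try:
        return codecs.lookup(name).name
    except LookupError:
        # Unknown to Python: keep the lowercased name as written.
        return name.lower()


def declared_encodings(raw: bytes, scan_limit: int = 4096) -> set[str]:
    """Return the distinct encodings declared near the top of the file."""
    head = raw[:scan_limit]
    return {_normalise(m.group(1).decode("ascii")) for m in _DECL.finditer(head)}


def check_declaration(raw: bytes) -> str:
    """Classify a file as 'missing', 'conflicting' or 'ok'."""
    found = declared_encodings(raw)
    if not found:
        return "missing"
    if len(found) > 1:
        return "conflicting"
    return "ok"
```

Note that "ok" only means the file states exactly one encoding. It can't tell you whether that stated encoding is the one the bytes were actually written in.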

Quote:

Originally Posted by pholy

If I ran the Sigil world, Sigil wouldn't even try to edit your file if it declared a different encoding.

Frankly, then it's better that you don't. Sigil automatically converts all files from several dozen different encodings into UTF-16 (and then into UTF-8 on export) as long as the file states the original encoding.

Just saying "sorry, I can't open this" would be silly.
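
To illustrate the idea (not Sigil's real code, which isn't Python), the conversion boils down to decoding the bytes with the stated encoding into in-memory Unicode text and then encoding that text as UTF-8 on the way out. `transcode_to_utf8` is a made-up name for this sketch.

```python
def transcode_to_utf8(raw: bytes, declared: str) -> bytes:
    """Decode bytes using the encoding the file states, then re-encode as UTF-8.

    Illustrative only: 'declared' stands in for whatever the file's own
    encoding declaration says.
    """
    text = raw.decode(declared)   # bytes -> Unicode text held in memory
    return text.encode("utf-8")   # Unicode text -> UTF-8 bytes for export


# 0x97 is the em dash in Windows-1252; in UTF-8 it becomes E2 80 94.
source = b"wait\x97what"
exported = transcode_to_utf8(source, "windows-1252")
```

The whole thing hinges on `declared` being right, which is why the file has to state its encoding.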

Quote:

Originally Posted by pietvo

I tried to load an HTML file in Sigil, coded in Windows-1252 with an em dash in it, and there were no problems. The em dash showed as such in Sigil. I checked with a hex editor that the em dash was really coded in Windows-1252 (hex 97). I saved it to epub, and Sigil had converted the whole thing to UTF-8. The em dash shows perfectly in ADE. This is on Mac OS X, so that may be a difference. However, I would suggest checking thoroughly whether your HTML file is correct (for example, does it really use the Windows-1252 code for the em dash, and does it not state two conflicting encodings). But I also advise, as others have posted above, using UTF-8 for all your files. It is a much better encoding.

Exactly.

BTW it works the same on all platforms.

While I also suggest the use of Unicode encodings whenever possible, users are completely free to use any encoding they wish for their input files as long as the files state the encoding in use. Without that, it's anyone's guess.
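
Here is a small Python sketch of what "anyone's guess" looks like in practice, using the hex 97 byte pietvo mentioned. `guess_decodings` is just an illustrative name, and the three encodings tried are my own pick of plausible guesses, not a list of what any particular application does.

```python
def guess_decodings(raw: bytes) -> dict[str, str]:
    """Decode the same bytes under three different guesses at the encoding."""
    results = {
        # Correct guess: 0x97 is the em dash (U+2014) in Windows-1252.
        "windows-1252": raw.decode("windows-1252"),
        # Wrong guess: in Latin-1, 0x97 is an invisible C1 control character.
        "latin-1": raw.decode("latin-1"),
        # Wrong guess: a lone 0x97 is not valid UTF-8, so it is replaced
        # with U+FFFD when decoding leniently.
        "utf-8 (lenient)": raw.decode("utf-8", errors="replace"),
    }
    try:
        raw.decode("utf-8")
        results["utf-8 (strict)"] = "ok"
    except UnicodeDecodeError:
        results["utf-8 (strict)"] = "error"
    return results


outcomes = guess_decodings(b"one\x97two")
```

Same bytes, four different outcomes, and only the first one gives you the em dash back.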