#1 |
eBook FANatic
Posts: 18,301
Karma: 16078357
Join Date: Apr 2008
Location: Alabama, USA
Device: HP ipac RX5915 Wife's Kindle
|
Sigil, UTF-8 and the emdash
I have just run across a strange occurrence with Sigil. I loaded an HTML file into Sigil and noticed that all of the em dashes (—) had disappeared. The original file had 250 occurrences of the em dash.
I went back to the original HTML file and changed the character set from Windows 1252: Western European to UTF-8 (which Sigil uses), and all of the em dashes disappeared. I then switched back to Windows 1252: Western European, replaced every (—) with &#8212;, converted back to UTF-8, and all of the em dashes reappeared. I then loaded the file into Sigil and all of the em dashes were present.
This appears to be a UTF-8 problem. As a pre-process, all HTML files will have to be edited before loading into Sigil unless someone has come up with a workaround. Are there any other characters to watch for? Curiouser and curiouser!!
Last edited by crutledge; 06-29-2010 at 03:28 PM.
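The entity workaround described above can be automated. The sketch below is only an illustration, not anything Sigil itself does: it assumes the source really is Windows-1252, and the function name `to_numeric_entities` is made up for this example. It turns every non-ASCII character into a numeric character reference, so the em dash becomes `&#8212;`. The same step also covers the other Windows-1252 characters in the 0x80 to 0x9F range (curly quotes, en dash, ellipsis, euro sign), which are the likeliest ones to go missing in the same way.

```python
def to_numeric_entities(raw: bytes, source_encoding: str = "cp1252") -> bytes:
    """Decode the bytes, then replace every non-ASCII character with &#NNNN;.

    Assumption: the input really is encoded in `source_encoding`.
    The result is pure ASCII, so it reads the same whether a tool
    treats the file as Windows-1252 or as UTF-8.
    """
    text = raw.decode(source_encoding)
    return text.encode("ascii", errors="xmlcharrefreplace")


# In Windows-1252 the em dash is the single byte 0x97.
sample = "Wait\u2014what?".encode("cp1252")
print(to_numeric_entities(sample))  # b'Wait&#8212;what?'
```

Because the output contains only ASCII bytes, the encoding declaration in the file no longer affects how these characters are displayed.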
#2 | |
Booklegger
Posts: 1,801
Karma: 7999816
Join Date: Jun 2009
Location: Toronto, Ontario, Canada
Device: BeBook(1 & 2010), PEZ, PRS-505, Kobo BT, PRS-T1, Playbook, Kobo Touch
|
Quote:
According to the epub spec, you should only be using UTF-8 for your text input. If I ran the Sigil world, Sigil wouldn't even try to edit your file if it declared a different encoding. So yes, you need something that will translate from Windows-1252 to UTF-8.
|
#3 |
Not who you think I am...
Posts: 374
Karma: 30283
Join Date: Jan 2010
Location: Honolulu
Device: PocketBook 360 -- Ivory
|
I run my favorite editor in a Win2K virtual machine, and as part of creating the text files I run everything through iconv for Windows.
That makes it easy to write batch files and macros that do the conversion to UTF-8. You have to be comfortable with the command line, but it's super-convenient.
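For anyone without iconv to hand, the same batch step can be sketched in Python. This is a stand-in for the iconv workflow, not a copy of it: the names `convert_text` and `convert_folder` and the folder layout are assumptions made for the example. It also rewrites a `charset=windows-1252` declaration, so the converted file does not end up stating one encoding while using another.

```python
import re
from pathlib import Path

# Matches charset=windows-1252, with or without an opening quote (illustrative only).
CHARSET_RE = re.compile(r"""(charset\s*=\s*["']?)windows-1252""", re.IGNORECASE)


def convert_text(raw: bytes) -> bytes:
    """Re-encode Windows-1252 bytes as UTF-8 and update the declared charset.

    Python's cp1252 codec raises UnicodeDecodeError on the five bytes that
    Windows-1252 leaves undefined (0x81, 0x8D, 0x8F, 0x90, 0x9D).
    """
    text = raw.decode("cp1252")
    text = CHARSET_RE.sub(r"\g<1>utf-8", text)
    return text.encode("utf-8")


def convert_folder(src_dir: str, dst_dir: str) -> None:
    """Convert every .html file in src_dir, writing the results to dst_dir."""
    out = Path(dst_dir)
    out.mkdir(parents=True, exist_ok=True)
    for path in Path(src_dir).glob("*.html"):
        (out / path.name).write_bytes(convert_text(path.read_bytes()))
```

Writing to a separate output folder keeps the originals untouched, which matters because running the conversion twice on the same file would garble it.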
#4 | |
Well trained by Cats
Posts: 30,883
Karma: 59840450
Join Date: Aug 2009
Location: The Central Coast of California
Device: Kobo Libra2,Kobo Aura2v1, K4NT(Fixed: New Bat.), Galaxy Tab A
|
Quote:
things from 1 world to another: *nix<->Mac<->Win. Even if Sigil can't (or won't) fix dual-encoding problems, I would like it to pop up a stop message rather than silently throwing characters away or trashing them completely.
|
#5 |
Reader
Posts: 520
Karma: 24612
Join Date: Aug 2009
Location: Utrecht, NL
Device: Kobo Aura 2, iPhone, iPad
|
I tried loading an HTML file into Sigil, encoded in Windows-1252 with an em dash in it, and there were no problems. The em dash showed as such in Sigil. I checked with a hex editor that the em dash was really encoded in Windows-1252 (hex 97). I saved it as EPUB, and Sigil had converted the whole thing to UTF-8. The em dash displays perfectly in ADE. This is on Mac OS X, so that may be a difference. However, I would suggest checking thoroughly that your HTML file is correct (for example, does it really use the Windows-1252 code for the em dash, and does it not state two conflicting encodings?). But I also advise, as others have posted above, using UTF-8 for all your files. It is a much better encoding.
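The two checks suggested above (which bytes the em dash really uses, and whether the file declares conflicting encodings) can be scripted instead of done by eye in a hex editor. The following is a rough diagnostic sketch, not a full HTML parser: `inspect_html` is a hypothetical helper, and the regular expression only looks for `charset=` and `encoding=` declarations.

```python
import re

# Finds encoding names in charset=... and encoding=... declarations.
DECLARED_RE = re.compile(
    rb"""(?:charset|encoding)\s*=\s*["']?([A-Za-z0-9_-]+)""", re.IGNORECASE
)


def inspect_html(raw: bytes) -> dict:
    """Report declared encodings and how em dashes appear in the raw bytes."""
    declared = sorted({name.lower().decode("ascii") for name in DECLARED_RE.findall(raw)})
    try:
        raw.decode("utf-8")
        valid_utf8 = True
    except UnicodeDecodeError:
        valid_utf8 = False
    return {
        "declared": declared,
        "conflicting_declarations": len(declared) > 1,
        "valid_utf8": valid_utf8,
        # UTF-8 encodes the em dash (U+2014) as the three bytes E2 80 94.
        "utf8_em_dashes": raw.count("\u2014".encode("utf-8")),
        # Windows-1252 encodes it as the single byte 0x97. In a valid UTF-8
        # file a 0x97 byte can also be part of another character, so read
        # this count together with valid_utf8.
        "byte_0x97_count": raw.count(b"\x97"),
    }
```

A file that declares UTF-8 but reports `valid_utf8` as false and a non-zero `byte_0x97_count` is most likely Windows-1252 text with the wrong label, which would match the symptoms in the first post.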
|
#6 | |||
Created Sigil, FlightCrew
Posts: 1,982
Karma: 350515
Join Date: Feb 2008
Device: Kobo Clara HD
|
Quote:
Quote:
Just saying "sorry, I can't open this" would be silly.
Quote:
BTW, it works the same on all platforms. While I also suggest the use of Unicode encodings whenever possible, users are completely free to use any encoding they wish for their input files as long as the files state the encoding in use. Without that, it's anyone's guess.
|||