Go Back   MobileRead Forums > E-Book Software > Sigil

Old 06-29-2010, 03:26 PM   #1
crutledge
eBook FANatic
Posts: 15,518
Karma: 13575467
Join Date: Apr 2008
Location: Alabama, USA
Device: HP ipac RX5915 Wife's Kindle
Sigil, UTF-8 and the emdash

I have just run across a strange occurrence with Sigil. I loaded an HTML file into Sigil and noticed that all of the em dashes (—) had disappeared. The original file had 250 occurrences of the em dash.

I went back to the original HTML file and changed the character set from Windows 1252: Western European to UTF-8 (which Sigil uses) and all of the em dashes disappeared. I then went back to Windows 1252: Western European and replaced all (—) with &#8212;, converted back to UTF-8 and all em dashes reappeared. I then loaded the file into Sigil and all em dashes were present. This appears to be a UTF-8 problem.
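
The behaviour described above can be reproduced outside Sigil. The following is a minimal Python sketch, assuming the file stored the em dash as the single Windows-1252 byte 0x97 (as a later reply confirms with a hex editor). It illustrates the encodings only and makes no claim about how Sigil or the editor handle the bytes internally.

```python
# The em dash (U+2014) as stored in a Windows-1252 file: the single byte 0x97.
raw = "a\u2014b".encode("cp1252")

# Read with the matching codec, the dash survives.
as_cp1252 = raw.decode("cp1252")

# Read as UTF-8, 0x97 is a stray continuation byte. A lenient decoder
# substitutes U+FFFD here; an application might instead drop the byte.
as_utf8 = raw.decode("utf-8", errors="replace")

# A numeric character reference is pure ASCII, so it reads the same either way.
entity = b"a&#8212;b"
same_everywhere = entity.decode("cp1252") == entity.decode("utf-8")

print(repr(as_cp1252), repr(as_utf8), same_everywhere)
```

This is why swapping the raw character for a numeric reference sidesteps the problem: the reference contains only ASCII bytes, which are identical in both encodings.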

As a pre-process, all HTML files will have to be edited prior to loading to Sigil unless someone has come up with a work-around. Are there any other characters to watch for?
Curiouser and curiouser!!

Last edited by crutledge; 06-29-2010 at 03:28 PM.
Old 06-29-2010, 11:38 PM   #2
pholy
Booklegger
Posts: 1,798
Karma: 7999034
Join Date: Jun 2009
Location: Toronto, Ontario, Canada
Device: BeBook(1 & 2010), PEZ, PRS-505, Kobo BT, PRS-T1, Playbook, Kobo Touch
Quote:
Are there any other characters to watch for?
I would guess any character whose byte value in Windows-1252 differs from its UTF-8 encoding. Curly quotes, maybe?
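
To put that guess in concrete terms: ASCII (0x00-0x7F) is byte-identical in both encodings, so only bytes 0x80 and above are at risk. The sketch below is a hedged illustration in Python (the function name is made up for this example). It lists every such Windows-1252 byte with the character it stands for and the bytes UTF-8 uses instead.

```python
def cp1252_non_ascii():
    """Map each Windows-1252 byte >= 0x80 to (character, UTF-8 bytes)."""
    table = {}
    for value in range(0x80, 0x100):
        try:
            char = bytes([value]).decode("cp1252")
        except UnicodeDecodeError:
            # A few positions (0x81, for example) are unassigned in
            # Python's cp1252 codec, so they are skipped.
            continue
        table[value] = (char, char.encode("utf-8"))
    return table

table = cp1252_non_ascii()

# Show the usual suspects: ellipsis, curly quotes, en dash, em dash.
for value in (0x85, 0x91, 0x92, 0x93, 0x94, 0x96, 0x97):
    char, utf8 = table[value]
    print(f"0x{value:02X} -> U+{ord(char):04X} {char!r} -> {utf8.hex(' ')}")
```

So curly quotes are indeed affected, along with the ellipsis, both dashes, and every accented letter.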

According to the epub spec, you should only be using UTF-8 for your text input. If I ran the Sigil world, Sigil wouldn't even try to edit your file if it declared a different encoding.

So yes, you need something that will translate from W-1252 to UTF-8.
Old 06-30-2010, 01:05 AM   #3
capidamonte
Not who you think I am...
Posts: 346
Karma: 5337
Join Date: Jan 2010
Location: Honolulu
Device: Sony PRS-350
I use a virtual machine of Win2K for my favorite editor -- as part of the process of creating the text files I run everything through iconv for Windows.

Makes it easy to make batch files and macros that do the conversion to UTF-8. You have to be comfortable with the command line, but it's super-convenient.
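
For readers without iconv, the same kind of batch conversion can be sketched in Python. This is not the batch file described above; the function names, the `*.html` pattern, and the `.utf8` output suffix are all assumptions chosen for the example, and the source encoding is taken to be Windows-1252.

```python
from pathlib import Path

def convert_to_utf8(src, dst, source_encoding="cp1252"):
    """Re-encode one text file as UTF-8; raises UnicodeDecodeError on bad input."""
    text = Path(src).read_bytes().decode(source_encoding)
    Path(dst).write_bytes(text.encode("utf-8"))

def convert_folder(folder, pattern="*.html", source_encoding="cp1252"):
    """Write a .utf8 copy next to every matching file and return the new paths."""
    written = []
    # sorted() materialises the listing first, so new copies are not revisited.
    for src in sorted(Path(folder).glob(pattern)):
        if src.name.endswith(".utf8" + src.suffix):
            continue  # already a converted copy from an earlier run
        dst = src.with_suffix(".utf8" + src.suffix)
        convert_to_utf8(src, dst, source_encoding)
        written.append(dst)
    return written
```

Calling `convert_folder("books")` would leave the originals untouched and write `chapter1.utf8.html` beside `chapter1.html`. Note that this only re-encodes the bytes: any charset declaration inside the file still names the old encoding and has to be updated separately.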
Old 06-30-2010, 05:59 AM   #4
theducks
Grand Sorcerer
Posts: 14,843
Karma: 5654321
Join Date: Aug 2009
Location: (The original) Silicon Valley, USA
Device: Galaxy Tab 2, Astak Pocket Pro, K4NT
Quote:
Originally Posted by capidamonte
I use a virtual machine of Win2K for my favorite editor -- as part of the process of creating the text files I run everything through iconv for Windows.

Makes it easy to make batch files and macros that do the conversion to UTF-8. You have to be comfortable with the command line, but it's super-convenient.
Notepad++ has plugins that convert a whole lot of text things from one world to another: *nix<->Mac<->Win

Even if Sigil can't (won't) fix dual-encoding problems, I would like it to pop up a stop message, rather than silently throwing characters away or trashing them completely.
Old 06-30-2010, 08:56 AM   #5
pietvo
Reader
Posts: 514
Karma: 24612
Join Date: Aug 2009
Location: Cochabamba, BO
Device: Onyx Boox 60, iPod Touch
I tried to load an HTML file in Sigil, coded in Windows-1252 with an em dash in it, and there were no problems. The em dash showed as such in Sigil. I checked with a hex editor that the em dash was really coded in Windows-1252 (hex 97). I saved it to epub, and Sigil had converted the whole thing to UTF-8. The em dash displays perfectly in ADE. This is on Mac OS X, so that may be a difference. However, I would suggest checking thoroughly whether your HTML file is correct (for example, does it really use the Windows-1252 code for the em dash, and does it not state two conflicting encodings). But I also advise, as others have posted above, using UTF-8 for all your files. It is a much better encoding.
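
The hex-editor check can also be scripted. A rough Python sketch follows (the function name and the report keys are invented for this example). It reports whether the raw bytes are valid UTF-8 and counts both the Windows-1252 em dash byte and the UTF-8 em dash sequence. The 0x97 count is only meaningful when the data is not valid UTF-8, because 0x97 can also appear as a continuation byte inside a legitimate UTF-8 sequence.

```python
def inspect_bytes(data):
    """Report UTF-8 validity and em dash byte counts for raw file contents."""
    try:
        data.decode("utf-8")
        valid_utf8 = True
    except UnicodeDecodeError:
        valid_utf8 = False
    return {
        "valid_utf8": valid_utf8,
        "byte_0x97": data.count(b"\x97"),                       # cp1252 em dash
        "utf8_em_dash": data.count("\u2014".encode("utf-8")),   # bytes E2 80 94
    }

print(inspect_bytes(b"one\x97two"))
print(inspect_bytes("one\u2014two".encode("utf-8")))
```

Reading a real file would be `inspect_bytes(open(path, "rb").read())`, with `path` pointing at the HTML file in question.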
Old 06-30-2010, 12:35 PM   #6
Valloric
Created Sigil, FlightCrew
Posts: 1,978
Karma: 350515
Join Date: Feb 2008
Device: Sony Reader PRS 505
Quote:
Originally Posted by crutledge
I went back to the original HTML file and changed the character set from Windows 1252: Western European to UTF-8 (which Sigil uses) and all of the em dashes disappeared. I then went back to Windows 1252: Western European and replaced all (—) with &#8212;, converted back to UTF-8 and all em dashes reappeared. I then loaded the file into Sigil and all em dashes were present. This appears to be a UTF-8 problem.
Your file probably did one of two things:
  1. It didn't state an encoding; if no encoding is specified in the file, UTF-8 is assumed. You need to specify an encoding in the file. Without it, you're playing Russian roulette every time you open it in any application.
  2. It stated an incorrect encoding. Stating two different encodings also falls under this category.
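
Both cases can be checked before a file ever reaches Sigil. The Python sketch below is an illustration only, not Sigil's actual detection logic. The function names and regular expressions are assumptions, and it looks in just the two usual places an (X)HTML file names its encoding: the XML declaration and a meta tag.

```python
import re

# Patterns for an XML declaration and a <meta ... charset=...> tag.
XML_DECL = re.compile(
    rb"<\?xml[^>]*\bencoding\s*=\s*[\"']([A-Za-z0-9._-]+)[\"']", re.I)
META_CHARSET = re.compile(
    rb"<meta[^>]*\bcharset\s*=\s*[\"']?([A-Za-z0-9._-]+)", re.I)

def declared_encodings(data):
    """Return the distinct encoding names declared near the top of the file."""
    head = data[:4096]
    found = XML_DECL.findall(head) + META_CHARSET.findall(head)
    return sorted({name.decode("ascii").lower() for name in found})

def diagnose(data):
    """Classify a file as declaring no, one, or conflicting encodings."""
    names = declared_encodings(data)
    if not names:
        return "none declared"
    if len(names) > 1:
        return "conflicting: " + ", ".join(names)
    return "declared: " + names[0]

sample = (b'<?xml version="1.0" encoding="utf-8"?><html><head>'
          b'<meta http-equiv="Content-Type" '
          b'content="text/html; charset=windows-1252"></head></html>')
print(diagnose(sample))
```

A single declared name still says nothing about whether the bytes actually match it, so this covers a missing or conflicting declaration but not a wrong one.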
Quote:
Originally Posted by pholy
If I ran the Sigil world, Sigil wouldn't even try to edit your file if it declared a different encoding.
Frankly, then it's better that you don't. Sigil automatically converts all files from several dozen different encodings into UTF-16 (and then into UTF-8 on export) as long as the file states the original encoding.

Just saying "sorry, I can't open this" would be silly.

Quote:
Originally Posted by pietvo
I tried to load an HTML file in Sigil, coded in Windows-1252 with an em dash in it, and there were no problems. The em dash showed as such in Sigil. I checked with a hex editor that the em dash was really coded in Windows-1252 (hex 97). I saved it to epub, and Sigil had converted the whole thing to UTF-8. The em dash displays perfectly in ADE. This is on Mac OS X, so that may be a difference. However, I would suggest checking thoroughly whether your HTML file is correct (for example, does it really use the Windows-1252 code for the em dash, and does it not state two conflicting encodings). But I also advise, as others have posted above, using UTF-8 for all your files. It is a much better encoding.
Exactly.

BTW it works the same on all platforms.

While I also suggest the use of Unicode encodings whenever possible, users are completely free to use any encoding they wish for their input files as long as the files state the encoding in use. Without that, it's anyone's guess.
All times are GMT -4.