#1 |
Connoisseur
Posts: 64
Karma: 10
Join Date: Sep 2015
Location: Yerevan, Armenia
Device: none
|
I created a minimal epub with all files in UTF-16.
When I just open the project in Sigil, the program (without asking or even informing me) forcibly changes the encoding of all files (ncx, opf, xhtml, css; everything except the .js file in my epub, maybe because it is not used?) to UTF-8 (both the files and the headers). The same happens when I add any UTF-16 file to the epub. How can I prevent this? I don't want to save files as UTF-8. Thanks. |
#2 |
Grand Sorcerer
Posts: 5,687
Karma: 24031401
Join Date: Dec 2010
Device: Kindle PW2
|
You can't. Sigil will automatically save all text files as utf-8 files, except for files in the Misc folder.
BTW, even though the ePub 2.0.1 standard allows both utf-8 and utf-16 files, using utf-16 will only inflate your files with no real benefit. For example, your original ALPHABET.xhtml is 58 KB as utf-16, while the same file saved as utf-8 is only 30 KB. Since the default reading app of some older eInk readers can't handle chapter files larger than 300 KB, you may want to keep all chapter files as small as possible. (utf-16 is only beneficial for CJK files, and even those files are only about 30% smaller when encoded as utf-16.) |
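The size difference is easy to check. Here is a minimal Python sketch that compares the encoded byte counts of two made-up sample strings (not the ALPHABET.xhtml file from this thread). The helper name `encoded_sizes` is only for illustration.

```python
def encoded_sizes(text):
    """Return (utf8_bytes, utf16_bytes) for text.

    Python's "utf-16" codec prepends a 2-byte BOM, so it is included in the count.
    """
    return len(text.encode("utf-8")), len(text.encode("utf-16"))


# Mostly-ASCII markup: one byte per character in utf-8, two in utf-16.
latin_sample = "<p>Hello, world</p>" * 100

# Pure CJK text: three bytes per character in utf-8, two in utf-16.
cjk_sample = "電子書籍" * 100

for label, sample in (("latin", latin_sample), ("cjk", cjk_sample)):
    utf8_size, utf16_size = encoded_sizes(sample)
    print(f"{label}: utf-8 = {utf8_size} bytes, utf-16 = {utf16_size} bytes")
```

Real XHTML chapters mix ASCII markup with the text, so the utf-16 saving for CJK books will be smaller than for the pure CJK string above.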
#3 | |
Connoisseur
Posts: 64
Karma: 10
Join Date: Sep 2015
Location: Yerevan, Armenia
Device: none
|
Quote:
It is just strange that Sigil makes such significant changes to files, on open or add, and says nothing. Would it be hard to implement a "Save As UTF-16 Copy..." function in the future? If Sigil converts UTF-16 to UTF-8, let us revert it on save. Thanks a lot! PS: We create some large and important HTML files in MSVC C#, so working with UTF-16 files is the fast and easy way. So, as you see, in this case we are looking for a benefit not from ePub itself but from supporting ePub. Sometimes size does not matter. |
|
#4 |
Connoisseur
Posts: 64
Karma: 10
Join Date: Sep 2015
Location: Yerevan, Armenia
Device: none
|
Sigil can open a UTF-16 ePub,
Sigil cannot save a UTF-16 ePub. So does Sigil have full UTF-16 support? |
#5 |
Sigil Developer
Posts: 8,491
Karma: 5703586
Join Date: Nov 2009
Device: many
|
Hi,
Sigil can read utf-16 files (or almost any properly encoded text file) but will convert them to the de facto standard of utf-8 when saving.

In short, ucs-2/utf-16 became popular when people thought 16-bit numbers (65K chars) would be enough. It now takes 21 bits to fully encode all possible unicode characters. So, to be clear, utf-16 requires multi-unit sequences (surrogate pairs) to represent all characters in unicode, just as utf-8 requires multi-byte sequences.

The advantage of utf-8 over utf-16 is that it is not endian dependent (i.e. it is a stream of bytes), whereas utf-16 values must be stored in either big-endian or little-endian byte order. Thus utf-16's need for an extra BOM to tell the end user which order the two bytes that make up each value are provided in. Some cpus (though few today) are big-endian natively (namely the sparc and ppc) while others are little-endian natively (Intel's x86 family); in addition there is network endianness (generally big) to deal with when streaming files and/or serializing objects.

The problem is that there are many ucs-2 routines masquerading as utf-16 routines that will break things, but only for specific ranges of unicode characters. In addition, most of the early e-readers would not work with utf-16.

Because utf-8 can represent all unicode chars, needs no BOM (i.e. is endian independent), generally results in smaller file sizes (as Doitsu rightly points out) except for some CJK uses, and works in all e-readers, it has become the de facto standard. And one that Sigil will not deviate from.

If you want utf-16, simply create an output plugin for Sigil. It can probably be done in just a handful of lines given python's strong encoding capabilities. The plugin simply needs to iterate over the application/xhtml+xml files, read them in as utf-8 text, replace utf-8 with utf-16 in the xml header, then properly encode each file as utf-16, add the right BOM, and write it out as binary data.

Perfect use for a plugin.

KevinH |
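The per-file conversion described above might look roughly like the following Python sketch. It covers only the re-encoding step for one file's bytes. The function name `utf8_xhtml_to_utf16` is hypothetical, little-endian byte order is an arbitrary choice, and the Sigil plugin wiring (iterating over the book's XHTML files and writing them out) is deliberately left out because the plugin API is not shown in this thread.

```python
import codecs
import re


def utf8_xhtml_to_utf16(data: bytes) -> bytes:
    """Re-encode UTF-8 XHTML bytes as UTF-16 LE with a BOM.

    Also rewrites encoding="utf-8" in the XML declaration, if one is present.
    """
    # "utf-8-sig" decodes plain utf-8 and also drops a utf-8 BOM if there is one.
    text = data.decode("utf-8-sig")

    # Only touch the XML declaration at the very start of the file.
    text = re.sub(
        r'^(<\?xml[^>]*?encoding=["\'])utf-8(["\'])',
        r"\g<1>utf-16\g<2>",
        text,
        count=1,
        flags=re.IGNORECASE,
    )

    # Encode with an explicit byte order and prepend the matching BOM.
    return codecs.BOM_UTF16_LE + text.encode("utf-16-le")


# Usage with an in-memory sample document (assumed content, for illustration).
sample = (
    '<?xml version="1.0" encoding="utf-8"?>\n'
    "<html><body><p>Բարև</p></body></html>"
).encode("utf-8")
converted = utf8_xhtml_to_utf16(sample)
print(converted[:2] == codecs.BOM_UTF16_LE)
```

Choosing an explicit byte order plus BOM, rather than Python's plain "utf-16" codec, keeps the output the same on every platform.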
#6 | |
Connoisseur
Posts: 64
Karma: 10
Join Date: Sep 2015
Location: Yerevan, Armenia
Device: none
|
Thanks to KevinH
Quote:
(unzip, reencode, set headers, zip) But 1) I am not sure what to do with mimetype, the first file in the zip: save it as ANSI, UTF-8, or UTF-16... 2) Online tests say everything is OK, but I can't be sure for now... So I have to check and study a lot of material. In any case, thanks a lot for your detailed answer! |
|
#7 |
Sigil Developer
Posts: 8,491
Karma: 5703586
Join Date: Nov 2009
Device: many
|
Hi,
As far as I know, mimetype is simply pure ascii (1 byte = 1 char), should not be compressed, and should be the first file in the archive so that external programs can read the mimetype file without having to unpack the actual zip. That said, I have no idea if a utf-16 encoded mimetype would work. I have never tried it. Take care, Kevin |
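For the zip step, here is a minimal Python sketch of the layout described above: a plain-ASCII, uncompressed mimetype entry written first, followed by the other files. The helper name `write_epub_skeleton` and the sample file path are assumptions for illustration. The sketch does not produce a complete, valid ePub, and it does not check whether any reader accepts utf-16 content files.

```python
import io
import zipfile


def write_epub_skeleton(target, files):
    """Write a zip whose first entry is an uncompressed ASCII mimetype file.

    `target` is a path or a writable binary file object.
    `files` maps archive paths to the bytes to store.
    """
    with zipfile.ZipFile(target, "w") as zf:
        # mimetype goes first and is stored without compression, so other
        # programs can read it without unpacking the archive.
        zf.writestr(
            "mimetype",
            b"application/epub+zip",
            compress_type=zipfile.ZIP_STORED,
        )
        # Everything else can be compressed as usual.
        for name, data in files.items():
            zf.writestr(name, data, compress_type=zipfile.ZIP_DEFLATED)


# Usage with an in-memory archive and one utf-16 encoded sample file.
buffer = io.BytesIO()
write_epub_skeleton(buffer, {"OEBPS/ch1.xhtml": "<p>x</p>".encode("utf-16")})
with zipfile.ZipFile(buffer) as zf:
    print([info.filename for info in zf.infolist()])
```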
#8 | |
Grand Sorcerer
Posts: 13,316
Karma: 78876004
Join Date: Nov 2007
Location: Toronto
Device: Libra H2O, Libra Colour
|
https://en.wikipedia.org/wiki/EPUB
Quote:
|
|