09-22-2015, 01:30 AM   #1
brolny
Connoisseur
brolny began at the beginning.
 
Posts: 64
Karma: 10
Join Date: Sep 2015
Location: Yerevan, Armenia
Device: none
UTF-16 forced to UTF-8

I created a minimal epub with all files in UTF-16.
When I simply open the project in Sigil, the program (without asking or even informing me) forcibly changes the encoding of all files (ncx, opf, xhtml, css; everything except the .js file in my epub, maybe because it is not used?) to UTF-8, both the file contents and the encoding declarations in the headers.
When I add any UTF-16 file to the epub, the same thing happens.

How can I prevent this?
I don't want to save files as UTF-8.
Thanks.
Attached Files
File Type: epub Sakayan1test.epub (10.0 KB, 182 views)
09-22-2015, 02:40 AM   #2
Doitsu
Grand Sorcerer
Doitsu ought to be getting tired of karma fortunes by now.
Posts: 5,685
Karma: 24031401
Join Date: Dec 2010
Device: Kindle PW2
Quote:
Originally Posted by brolny View Post
How can I prevent this?
You can't. Sigil will automatically save all text files as utf-8 files, except for files in the Misc folder.
BTW, even though the ePub 2.0.1 standard allows both utf-8 and utf-16 files, using utf-16 files will only blow up your files with no real benefits.

For example, the file size of your original ALPHABET.xhtml utf-16 file is 58KB while the file size of the same file saved as utf-8 is only 30KB. Since the default reading app of some older eInk readers can't handle chapter files larger than 300 KB you may want to keep all chapter files as small as possible.

(utf-16 is only beneficial for CJK files and even those files are only about 30% smaller when encoded as utf-16 files.)
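
The size argument is easy to check for yourself. Here is a minimal Python sketch; the two sample strings are made up for illustration and are not taken from ALPHABET.xhtml. It assumes the BOM is left out of the count, since it only adds two bytes per file.

```python
def encoded_sizes(text):
    """Return (utf-8 size, utf-16 size) in bytes, not counting any BOM."""
    return len(text.encode("utf-8")), len(text.encode("utf-16-le"))

# Hypothetical sample strings, for illustration only.
markup = "<p>Hello, world</p>"  # ASCII markup: 1 byte per char in utf-8
cjk = "電子書籍"                 # CJK text: 3 bytes per char in utf-8

print(encoded_sizes(markup))  # (19, 38): utf-16 doubles ASCII-only text
print(encoded_sizes(cjk))     # (12, 8): utf-16 is a third smaller here
```

Since XHTML tags, attribute names and CSS are all ASCII, the markup share of a chapter file always doubles in utf-16, which eats into any saving on the text itself.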
09-22-2015, 03:55 AM   #3
brolny
Quote:
Originally Posted by Doitsu View Post
You can't. Sigil will automatically save all text files as utf-8 files, except for files in the Misc folder.
BTW, even though the ePub 2.0.1 standard allows both utf-8 and utf-16 files, using utf-16 files will only blow up your files with no real benefits.

For example, the file size of your original ALPHABET.xhtml utf-16 file is 58KB while the file size of the same file saved as utf-8 is only 30KB. Since the default reading app of some older eInk readers can't handle chapter files larger than 300 KB you may want to keep all chapter files as small as possible.

(utf-16 is only beneficial for CJK files and even those files are only about 30% smaller when encoded as utf-16 files.)
The point about size is clear.
It is just strange that Sigil makes such significant changes to files on open/add and says nothing.
Would it be hard to implement a "Save As UTF-16 Copy..." function in the future?
If Sigil converts UTF-16 -> UTF-8 on open, let us revert it on save.

Thanks a lot!

PS
We create some large and important HTML files in MSVC C#, so working with UTF-16 files is the fast and easy way for us.
So, as you see, in this case we are not looking for a benefit from ePub itself, but from supporting ePub.
Sometimes size does not matter.
09-22-2015, 04:17 AM   #4
brolny
By the way

Sigil can open a UTF-16 ePub,
Sigil cannot save a UTF-16 ePub.
--------------
Does Sigil have full UTF-16 support?
09-22-2015, 08:20 AM   #5
KevinH
Sigil Developer
KevinH ought to be getting tired of karma fortunes by now.
Posts: 8,488
Karma: 5703586
Join Date: Nov 2009
Device: many
Hi,

Sigil can read utf-16 files (or almost any properly encoded text file) but will convert them to the de facto standard of utf-8 when saving.

In short, ucs-2/utf-16 became popular when people thought 16-bit numbers (65K chars) would be enough. It now takes 21 bits to encode all possible unicode characters. So, to be clear, utf-16 requires multi-unit sequences (surrogate pairs) to represent all characters in unicode, just as utf-8 needs multi-byte sequences. The advantage of utf-8 over utf-16 is that it is not endian dependent (i.e. it is a stream of bytes), whereas utf-16 values must be stored in either big-endian or little-endian byte order. Hence utf-16's need for an extra BOM to tell the reader in which order the two bytes that make up each value are stored. Some cpus (though few today) are natively big-endian (namely the sparc and ppc) while others are natively little-endian (Intel's x86 family); in addition, there is network endianness (generally big) to deal with when streaming files and/or serializing objects. The problem is that there are many ucs-2 routines masquerading as utf-16 routines that will break things, but only for specific ranges of unicode characters. In addition, most of the early e-readers would not work with utf-16.
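
To make the surrogate-pair and byte-order points concrete, here is a small Python sketch. It is only an illustration using the standard library codecs, not anything taken from Sigil; the helper name `utf16_code_units` and the sample characters are my own.

```python
import codecs

def utf16_code_units(text):
    """Number of 16-bit code units needed to store text as utf-16 (no BOM)."""
    return len(text.encode("utf-16-le")) // 2

bmp_char = "A"              # U+0041, fits in a single 16-bit unit
astral_char = "\U0001F600"  # U+1F600, beyond the 16-bit range

print(utf16_code_units(bmp_char))     # 1
print(utf16_code_units(astral_char))  # 2: one character, one surrogate pair

# A ucs-2 style routine that assumes "one unit = one character" would
# miscount (or split) astral_char, even though len() sees one character.
print(len(astral_char))               # 1

# Byte order matters for utf-16 but not for utf-8.
print(bmp_char.encode("utf-16-be"))   # b'\x00A'
print(bmp_char.encode("utf-16-le"))   # b'A\x00'
print(bmp_char.encode("utf-8"))       # b'A'

# The BOM is how a reader tells the two utf-16 byte orders apart.
print(codecs.BOM_UTF16_BE, codecs.BOM_UTF16_LE)  # b'\xfe\xff' b'\xff\xfe'
```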

Because utf-8 can represent all unicode chars, needs no BOM (i.e. is endian independent), generally results in smaller file sizes (as Doitsu rightly points out) except for some CJK uses, and works in all e-readers, it has become the de facto standard. And it is one that Sigil will not deviate from.

If you want utf-16, simply create an output plugin for Sigil. Given python's strong encoding capabilities, it can probably be done in just a handful of lines. The plugin simply needs to iterate over the application/xhtml+xml files, read each one in as utf-8 text, replace utf-8 with utf-16 in the xml header, then encode the text as utf-16, add the right BOM, and write it out as binary data.
Perfect use for a plugin.
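
The re-encoding step described above might look roughly like the following. This is only a sketch of the conversion for a single file: it deliberately does not use the Sigil plugin API (the iteration over files is left out), the function name `utf8_xhtml_to_utf16` is hypothetical, and choosing little-endian output with an explicit BOM is my assumption. It also only touches the xml declaration, not any `<meta>` charset that may be present.

```python
import codecs
import re

# Matches the encoding value inside the xml declaration only.
_XML_DECL_ENCODING = re.compile(
    r'(<\?xml[^>]*?encoding\s*=\s*["\'])utf-8(["\'])', re.IGNORECASE
)

def utf8_xhtml_to_utf16(data):
    """Re-encode utf-8 XHTML bytes as utf-16 (little-endian, with BOM)."""
    # "utf-8-sig" also strips a utf-8 BOM if one happens to be present.
    text = data.decode("utf-8-sig")
    # Update the declared encoding so the header matches the actual bytes.
    text = _XML_DECL_ENCODING.sub(r"\g<1>utf-16\g<2>", text, count=1)
    # Write the BOM explicitly so the byte order is unambiguous.
    return codecs.BOM_UTF16_LE + text.encode("utf-16-le")

sample = (
    '<?xml version="1.0" encoding="utf-8"?>\n'
    "<html><body><p>Բարև</p></body></html>"
).encode("utf-8")

converted = utf8_xhtml_to_utf16(sample)
print(converted[:2])                             # b'\xff\xfe'
print(converted.decode("utf-16").splitlines()[0])
```

Decoding with the generic "utf-16" codec in the last line relies on the BOM to pick the byte order, which is exactly what a reading system would have to do.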

KevinH
09-22-2015, 10:18 AM   #6
brolny
Thanks to KevinH

Quote:
Originally Posted by KevinH View Post
Hi,

...If you want utf-16, simply create an output plugin for sigil...

KevinH
Thanks, I did it today in C# in just an hour.
(unzip, re-encode, set headers, zip)
But
1) I am not sure what to do with mimetype, the first file in the zip:
save it as ANSI, utf-8, or utf-16...
2) Online tests say everything is OK, but I can't be sure for now...
So I still have a lot to check and study.

In any case,
thanks a lot for your detailed answer!
09-22-2015, 12:06 PM   #7
KevinH
Hi,
As far as I know, mimetype is simply pure ascii (1 byte = 1 char), should not be compressed, and should be the first file in the archive so that external programs can read the mimetype file without having to unpack the actual zip.

That said, I have no idea if a utf-16 encoded mimetype would work. I have never tried it.

Take care,

Kevin
09-22-2015, 12:56 PM   #8
PeterT
Grand Sorcerer
PeterT ought to be getting tired of karma fortunes by now.
Posts: 13,313
Karma: 78876004
Join Date: Nov 2007
Location: Toronto
Device: Libra H2O, Libra Colour
https://en.wikipedia.org/wiki/EPUB
Quote:
The first file in the archive must be the mimetype file. It must be uncompressed so that non-ZIP utilities can read the mimetype. The mimetype file must be an ASCII file that contains the string application/epub+zip. It must be unencrypted, and the first file in the ZIP archive. This file provides a more reliable way for applications to identify the mimetype of the file than just the .epub extension.
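
For the zip step, the rules quoted above could be followed along these lines. This is a hedged sketch with Python's standard zipfile module rather than the C# code mentioned earlier in the thread; the function name `write_epub`, the `OEBPS/` path and the placeholder file contents are assumptions for illustration, and the result is not a complete, valid epub.

```python
import io
import zipfile

def write_epub(target, files):
    """Write an epub-style zip to target (a path or a file-like object).

    files maps archive names to bytes; mimetype is added automatically.
    """
    with zipfile.ZipFile(target, "w") as zf:
        # mimetype goes first, stored (uncompressed), as plain ASCII bytes,
        # regardless of the encoding used for the other files.
        zf.writestr("mimetype", b"application/epub+zip",
                    compress_type=zipfile.ZIP_STORED)
        for name, data in files.items():
            zf.writestr(name, data, compress_type=zipfile.ZIP_DEFLATED)

# Placeholder contents, for illustration only.
buffer = io.BytesIO()
write_epub(buffer, {
    "META-INF/container.xml": b"<container/>",
    "OEBPS/ALPHABET.xhtml": b"\xff\xfe" + "<html/>".encode("utf-16-le"),
})

first = zipfile.ZipFile(buffer).infolist()[0]
print(first.filename, first.compress_type == zipfile.ZIP_STORED)
```

The point is that the encoding question only applies to the content documents; mimetype stays the same 20 ASCII bytes either way.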
All times are GMT -4.