MobileRead Forums


brolny 09-22-2015 02:30 AM

UTF-16 forced to UTF-8
 
I created a minimal epub with all files in UTF-16.
When I just open the project in Sigil, the program (without asking or even informing me) forcibly changes the encoding of all files (ncx, opf, xhtml, css; everything except the .js file in my epub, maybe because it is not used?) to UTF-8 (both the file contents and the encoding headers).
When I add any UTF-16 file to the epub, the same thing happens.

How can I prevent this?
I don't want to save files as UTF-8.
Thanks.

Doitsu 09-22-2015 03:40 AM

Quote:

Originally Posted by brolny (Post 3175076)
How can I prevent this?

You can't. Sigil will automatically save all text files as utf-8 files, except for files in the Misc folder.
BTW, even though the ePub 2.0.1 standard allows both utf-8 and utf-16 files, using utf-16 files will only blow up your files with no real benefits.

For example, the file size of your original ALPHABET.xhtml utf-16 file is 58KB while the file size of the same file saved as utf-8 is only 30KB. Since the default reading app of some older eInk readers can't handle chapter files larger than 300 KB you may want to keep all chapter files as small as possible.

(utf-16 is only beneficial for CJK files and even those files are only about 30% smaller when encoded as utf-16 files.)
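
The size argument can be checked with a short Python sketch. The sample strings below are made up for illustration and are not taken from the attached epub; `encoded_sizes` is a hypothetical helper name. Note that a real CJK XHTML file also contains ASCII markup, so the saving there would be smaller than for pure CJK text.

```python
# Compare the encoded size of the same text in UTF-8 and UTF-16.
def encoded_sizes(text):
    """Return (utf8_bytes, utf16_bytes); the UTF-16 count includes the 2-byte BOM."""
    return len(text.encode("utf-8")), len(text.encode("utf-16"))

latin = "<p>Hello, world</p>" * 100  # ASCII-only markup and text
cjk = "電子書籍" * 100                # CJK characters, 3 bytes each in UTF-8

for label, sample in (("latin", latin), ("cjk", cjk)):
    u8, u16 = encoded_sizes(sample)
    print(f"{label}: utf-8 = {u8} bytes, utf-16 = {u16} bytes")
```

For the ASCII sample UTF-16 is roughly twice as large; for the pure CJK sample it is about a third smaller.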

brolny 09-22-2015 04:55 AM

Quote:

Originally Posted by Doitsu (Post 3175106)
You can't. Sigil will automatically save all text files as utf-8 files, except for files in the Misc folder.
BTW, even though the ePub 2.0.1 standard allows both utf-8 and utf-16 files, using utf-16 files will only blow up your files with no real benefits.

For example, the file size of your original ALPHABET.xhtml utf-16 file is 58KB while the file size of the same file saved as utf-8 is only 30KB. Since the default reading app of some older eInk readers can't handle chapter files larger than 300 KB you may want to keep all chapter files as small as possible.

(utf-16 is only beneficial for CJK files and even those files are only about 30% smaller when encoded as utf-16 files.)

The point about size is clear.
It is just strange that Sigil makes such significant changes to files on open/add and says nothing.
Would it be hard to implement a "Save As UTF-16 Copy..." function in the future?
If Sigil converts UTF-16 -> UTF-8 on open, let us revert it on Save :)

Thanks a lot!

PS
We create some large and important HTML files in C# (MSVC), so working with UTF-16 files is the fast and easy way for us.
So as you see, in this case we are not looking for a benefit from ePub itself, but from supporting ePub.
Sometimes size does not matter :rolleyes:

brolny 09-22-2015 05:17 AM

By the way
 
Sigil can open UTF-16 ePub,
Sigil can not save UTF-16 ePub,
--------------
Has Sigil full UTF-16 support?

KevinH 09-22-2015 09:20 AM

Hi,

Sigil can read utf-16 files (or almost any properly encoded text file) but will convert them to the de facto standard of utf-8 when saving.

In short, ucs-2/utf-16 became popular when people thought 16-bit numbers (65K chars) would be enough. It now takes 21 bits to fully encode all possible unicode characters. So to be clear, utf-16 requires multi-unit sequences (surrogate pairs) to represent all characters in unicode, just as utf-8 requires multi-byte sequences. One advantage of utf-8 over utf-16 is that it is not endian dependent (i.e. it is a stream of bytes), whereas utf-16 values must be stored in either big-endian or little-endian byte order. Hence utf-16's need for an extra BOM to tell the reader in what order the two bytes that make up each value are provided. Some cpus (though few today) are natively big-endian (namely the sparc and ppc) while others are natively little-endian (Intel's x86 family); in addition there is network endianness (generally big) to deal with when streaming files and/or serializing objects. The problem is that there are many ucs-2 routines masquerading as utf-16 routines that will break things, but only for specific ranges of unicode characters. In addition, most of the early e-readers would not work with utf-16.
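
To make the surrogate-pair and byte-order points concrete, here is a small Python sketch. The code point U+1F600 is just an arbitrary example of a character above U+FFFF, i.e. one that a ucs-2-only routine cannot handle as a single 16-bit value.

```python
import codecs

ch = "\U0001F600"  # one code point above U+FFFF

utf8 = ch.encode("utf-8")          # 4 bytes, identical on every platform
utf16_le = ch.encode("utf-16-le")  # surrogate pair, little-endian, no BOM
utf16_be = ch.encode("utf-16-be")  # same surrogate pair, big-endian, no BOM

print(utf8.hex())      # f09f9880
print(utf16_le.hex())  # 3dd800de
print(utf16_be.hex())  # d83dde00

# With a BOM in front, the generic "utf-16" codec can work out the byte order itself.
assert (codecs.BOM_UTF16_LE + utf16_le).decode("utf-16") == ch
assert (codecs.BOM_UTF16_BE + utf16_be).decode("utf-16") == ch
```

The two utf-16 outputs contain the same two 16-bit values (D83D, DE00) with the bytes of each value swapped, which is exactly what the BOM disambiguates.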

Because utf-8 can represent all unicode chars, needs no BOM (i.e. is endian independent), generally results in smaller file sizes (as Doitsu rightly points out) except for some CJK uses, and works in all e-readers, it has become the de facto standard. And it is one that Sigil will not deviate from.

If you want utf-16, simply create an output plugin for Sigil. Given Python's strong encoding capabilities, it can probably be done in just a handful of lines. The plugin simply needs to iterate over the application/xhtml+xml files, read each one in as utf-8 text, replace utf-8 with utf-16 in the xml header, then properly encode the text as utf-16, add the right BOM, and write it out as binary data.
Perfect use for a plugin.
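
A minimal sketch of just the re-encoding step described above, in plain Python. It deliberately leaves out the Sigil plugin wiring; the function name `to_utf16` and the choice of little-endian output are assumptions made for illustration.

```python
import codecs
import re

# Matches encoding="utf-8" (either quote style, any letter case) in a leading XML declaration.
_DECL = re.compile(r"""^(<\?xml[^>]*?encoding=)(["'])utf-8\2""", re.IGNORECASE)

def to_utf16(utf8_bytes):
    """Re-encode one utf-8 XHTML file as little-endian utf-16 with a BOM."""
    text = utf8_bytes.decode("utf-8-sig")  # utf-8-sig also drops a utf-8 BOM if present
    # Update the encoding named in the XML declaration, if there is one.
    text = _DECL.sub(r"\g<1>\g<2>utf-16\g<2>", text, count=1)
    # Encode without an implicit BOM, then prepend the matching BOM explicitly.
    return codecs.BOM_UTF16_LE + text.encode("utf-16-le")

source = '<?xml version="1.0" encoding="utf-8"?>\n<p>é</p>'.encode("utf-8")
converted = to_utf16(source)
print(converted.decode("utf-16").splitlines()[0])
```

A file without an XML declaration passes through with only its encoding changed, since the substitution then finds nothing to replace.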

KevinH

brolny 09-22-2015 11:18 AM

Thanks to KevinH
 
Quote:

Originally Posted by KevinH (Post 3175197)
Hi,

...If you want utf-16, simply create an output plugin for sigil...

KevinH

Thanks, I did it today in C# in just an hour.
(unzip, reencode, set headers, zip)
But
1) I am not sure what to do with mimetype, the first file in the zip:
should I save it as ANSI, UTF-8 or UTF-16?
2) Online tests say everything is OK, but I can't be sure for now.
So I still have a lot to check and study.

Anyway,
thanks a lot for your detailed answer!

KevinH 09-22-2015 01:06 PM

Hi,
As far as I know, mimetype is simply pure ascii (1 byte = 1 char), should not be compressed, and should be the first file in the archive so that external programs can read the mimetype file without having to unpack the actual zip.

That said, I have no idea if a utf-16 encoded mimetype would work. I have never tried it.

Take care,

Kevin

PeterT 09-22-2015 01:56 PM

https://en.wikipedia.org/wiki/EPUB
Quote:

The first file in the archive must be the mimetype file. It must be uncompressed so that non-ZIP utilities can read the mimetype. The mimetype file must be an ASCII file that contains the string application/epub+zip. It must be unencrypted, and the first file in the ZIP archive. This file provides a more reliable way for applications to identify the mimetype of the file than just the .epub extension.
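
The quoted requirements can be sketched with Python's standard `zipfile` module. The helper name `write_epub` and the file name in the demo are hypothetical, and the demo archive is not a complete EPUB (it has no container.xml or OPF); it only shows the mimetype entry being written first, uncompressed, as ASCII bytes.

```python
import io
import zipfile

def write_epub(target, files):
    """Write a zip container to target (a path or file object).

    files maps archive names to bytes.
    """
    with zipfile.ZipFile(target, "w") as zf:
        # mimetype goes in first, stored (no compression), as plain ASCII bytes.
        zf.writestr("mimetype", b"application/epub+zip",
                    compress_type=zipfile.ZIP_STORED)
        # Everything else may be compressed.
        for name, data in files.items():
            zf.writestr(name, data, compress_type=zipfile.ZIP_DEFLATED)

buf = io.BytesIO()
write_epub(buf, {"OEBPS/ALPHABET.xhtml": "<html/>".encode("utf-16")})

with zipfile.ZipFile(buf) as zf:
    first = zf.infolist()[0]
    print(first.filename, first.compress_type == zipfile.ZIP_STORED)
```

So the content files can be utf-16 while the mimetype entry stays ASCII.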

