MobileRead Forums - View Single Post

Jellby · 05-21-2011, 03:31 AM

Quote:

Originally Posted by bizzybody

Here's a program that can read in a UTF-8 encoded HTML file and replace the UTF-8 HTML codes with the exact extended ASCII equivalent. https://www.mobileread.com/forums/sho...d.php?t=109996

I use recode, which is very easy:

Code:

recode utf8..html file.html
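If recode is not at hand, roughly the same idea can be sketched in Python. This is only an approximation of the conversion, not how recode itself works: the function name `to_ascii_entities` is made up for illustration, it uses the HTML 4 entity names from the standard `html.entities` module, and it leaves ASCII (including markup characters) untouched.

```python
from html.entities import codepoint2name


def to_ascii_entities(text: str) -> str:
    """Replace every non-ASCII character with an HTML entity.

    Hypothetical helper: a named entity is used when HTML 4 defines one,
    otherwise a decimal numeric entity.
    """
    out = []
    for ch in text:
        cp = ord(ch)
        if cp < 128:
            out.append(ch)                          # plain ASCII stays as-is
        elif cp in codepoint2name:
            out.append(f"&{codepoint2name[cp]};")   # e.g. &rsquo;
        else:
            out.append(f"&#{cp};")                  # e.g. &#1078;
    return "".join(out)


print(to_ascii_entities("It\u2019s a caf\u00e9"))  # It&rsquo;s a caf&eacute;
```

The result is pure ASCII, so it survives any encoding mix-up, at the cost of longer text.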

Quote:

WTH use UTF-8 for punctuation when ASCII and ordinary character encodings for Windows and other systems have characters like left and right quotes that produce exactly the same visible result? Unicode for standard characters when there's no need is text-bloat.

Replacing a couple thousand left and right unicode double quote marks with the left and right ASCII versions can reduce the file size quite a bit! A UTF-8 code is up to 7 characters, if leading zeroes are used. &#nnnn; One could write a whole text file that way but it'd be six times larger than using plain characters.

I think you are inverting the terms. Real ASCII has only 128 characters; everything else must be represented through (named or numeric) entities. &rsquo; and &#8217; are the "ASCII representations" in this discussion, as they use ASCII characters to represent a character that is not in the ASCII set. This is where text bloat is possible.

Using Unicode characters means using some Unicode encoding to represent the character directly, not through entities like the ones above, so I can just write "é" or "ñ". In UTF-8, any character takes at most 4 bytes, and typically 2 bytes (for accented Latin, Cyrillic or Greek letters) or 3 bytes (for some punctuation, such as curly quotes).
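The sizes are easy to check. A small Python sketch, with helper names (`utf8_len`, `numeric_entity`) that are just for illustration, comparing the UTF-8 byte count of a character with the length of its decimal numeric entity:

```python
def utf8_len(ch: str) -> int:
    """Number of bytes the character occupies when encoded as UTF-8."""
    return len(ch.encode("utf-8"))


def numeric_entity(ch: str) -> str:
    """Decimal numeric entity for the character, e.g. '&#8217;'."""
    return f"&#{ord(ch)};"


# Latin, Cyrillic, Greek, and a right single quotation mark (U+2019)
for ch in ["é", "ñ", "ж", "λ", "\u2019"]:
    entity = numeric_entity(ch)
    print(f"{ch}: {utf8_len(ch)} bytes in UTF-8, "
          f"{len(entity)} bytes as {entity}")
```

For the right single quote this gives 3 bytes in UTF-8 against 7 bytes for &#8217;, so the entity form is the larger one.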

But anyway, in an ePUB the content files are compressed, so the "bloat" introduced by the entities is largely cancelled out (since they are repetitive sequences, they compress efficiently).
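A rough way to see the effect, assuming Deflate (the usual ZIP compression method, which Python's `zlib` also implements) and a made-up sample text; real books will give different numbers, so take this as an illustration rather than a measurement:

```python
import random
import zlib

random.seed(0)  # fixed seed so the made-up sample text is reproducible
WORDS = ["the", "reader", "opened", "a", "book", "and", "said", "nothing",
         "about", "quotes", "or", "encoding", "today"]


def sentence() -> str:
    """One quoted sentence built from random words (sample data only)."""
    words = " ".join(random.choice(WORDS) for _ in range(8))
    return f"\u201c{words},\u201d she said. "


text = "".join(sentence() for _ in range(2000))

# The same text with the curly quotes written as numeric entities.
as_entities = text.replace("\u201c", "&#8220;").replace("\u201d", "&#8221;")

raw_direct = text.encode("utf-8")
raw_entities = as_entities.encode("ascii")

packed_direct = zlib.compress(raw_direct, 9)
packed_entities = zlib.compress(raw_entities, 9)

# Extra bytes caused by the entities, before and after compression.
overhead_raw = len(raw_entities) - len(raw_direct)
overhead_packed = len(packed_entities) - len(packed_direct)

print("uncompressed:", len(raw_direct), "vs", len(raw_entities))
print("compressed:  ", len(packed_direct), "vs", len(packed_entities))
print("entity overhead:", overhead_raw, "bytes raw,",
      overhead_packed, "bytes compressed")
```

Each entity costs 4 extra bytes uncompressed (7 instead of 3), but after compression the repeated "&#8220;" sequences cost far less than that.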