Thread: Typos in ebooks
05-21-2011, 03:31 AM   #201
Jellby
Quote:
Originally Posted by bizzybody View Post
Here's a program that can read in a UTF-8 encoded HTML file and replace the UTF-8 HTML codes with the exact extended ASCII equivalent. https://www.mobileread.com/forums/sho...d.php?t=109996
I use recode, which is very easy:

Code:
recode utf8..html file.html
Quote:
WTH use UTF-8 for punctuation when ASCII and ordinary character encodings for Windows and other systems have characters like left and right quotes that produce exactly the same visible result? Unicode for standard characters when there's no need is text-bloat.

Replacing a couple thousand left and right unicode double quote marks with the left and right ASCII versions can reduce the file size quite a bit! A UTF-8 code is up to 7 characters, if leading zeroes are used. &#nnnn; One could write a whole text file that way but it'd be six times larger than using plain characters.
I think you are inverting the terms. Real ASCII has only 128 characters; everything else must be represented through (named or numerical) entities. &rsquo; and &#8217; are "ASCII representations" in this discussion, as they use ASCII characters to represent another character that is not in the ASCII set. This is where text bloat is possible.

Using Unicode characters means using some Unicode encoding to represent the character directly, not through entities like above, so I can just write "é" or "ñ". These, in UTF-8, take at most 4 bytes, and typically 2 bytes (for accented Latin, Cyrillic or Greek letters) or 3 bytes (for some punctuation, such as curly quotes).
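To make the byte counts concrete, here is a small Python sketch that compares a character written directly in UTF-8 with its named and numerical entity forms. The sample characters and the helper name utf8_len are my own choices for illustration, not something from the tools discussed in this thread.

```python
# Compare the size of a character stored directly as UTF-8 with the
# size of its ASCII-only entity spellings (named and numerical).
samples = {
    "é": ("&eacute;", "&#233;"),
    "ñ": ("&ntilde;", "&#241;"),
    "\u2019": ("&rsquo;", "&#8217;"),  # right single quotation mark
}


def utf8_len(text: str) -> int:
    """Return the number of bytes the text occupies when encoded as UTF-8."""
    return len(text.encode("utf-8"))


for char, (named, numeric) in samples.items():
    print(
        f"{char!r}: direct={utf8_len(char)} bytes, "
        f"{named}={utf8_len(named)} bytes, "
        f"{numeric}={utf8_len(numeric)} bytes"
    )
```

In each of these cases the direct UTF-8 form is shorter than either entity spelling, since the entity needs one ASCII byte per character of its name or number plus the & and ; delimiters.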

But anyway, in ePUB all files are compressed, so the "bloat" introduced by the entities will be largely cancelled (since they are repetitive sequences, they can be more efficiently compressed).
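The compression point can be checked with a rough Python sketch. An ePUB is a ZIP container, and ZIP normally uses Deflate, which is what zlib implements, so zlib.compress is used here as a stand-in for the real packaging step. The sample text, vocabulary and seed are made up for illustration; a real book will give different numbers.

```python
import random
import zlib

rng = random.Random(0)  # fixed seed so the generated text is repeatable
words = ["the", "ship", "sailed", "north", "slowly",
         "captain", "storm", "harbour", "night", "and"]


def sentence() -> str:
    """Build one quoted sentence using curly double quotes."""
    body = " ".join(rng.choice(words) for _ in range(8))
    return f"\u201c{body},\u201d she said. "


# Same text twice: once with the quotes stored directly, once as entities.
direct = "".join(sentence() for _ in range(2000))
entities = direct.replace("\u201c", "&#8220;").replace("\u201d", "&#8221;")


def sizes(text: str) -> tuple[int, int]:
    """Return (raw UTF-8 size, Deflate-compressed size) in bytes."""
    raw = text.encode("utf-8")
    return len(raw), len(zlib.compress(raw, 9))


raw_direct, zip_direct = sizes(direct)
raw_entities, zip_entities = sizes(entities)

print("raw:       ", raw_direct, "vs", raw_entities,
      "difference", raw_entities - raw_direct)
print("compressed:", zip_direct, "vs", zip_entities,
      "difference", zip_entities - zip_direct)
```

The uncompressed difference is 4 bytes per quote mark (7 bytes for the entity against 3 for the UTF-8 character); after compression the gap should shrink considerably, because the entity is always the same repeated byte sequence.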