01-17-2013, 05:06 AM   #352
pirl8
Posts: 204
Karma: 239254
Join Date: Jan 2012
Location: Italy
Device: KT, PW3
Quote:
Originally Posted by JSWolf
Do you know how much larger the eBook would actually be with &shy added to every place possible that a hyphen could be?
Yes, of course I know.

The HTML part is roughly twice the size of the original. That is estimated with a prefix/suffix of 2 characters (which gives the highest amount of hyphenation), an average of 2.5 characters per syllable, and 5 characters to store each "&shy;" entity (which actually compresses easily).
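To make the arithmetic concrete, here is a minimal sketch of the per-word estimate, using the same assumptions as above (2.5 characters per syllable, 5 characters per "&shy;" entity). The function name estimate_increase is only an illustration and is not part of the script in the spoiler.

```shell
#!/bin/bash
# Hypothetical helper: estimated number of extra characters that
# soft hyphens would add to a single word of the given length.
estimate_increase() {
    local word_length=$1
    awk -v len="$word_length" 'BEGIN {
        SL = 2.5   # assumed average characters per syllable
        SHY = 5    # characters in the "&shy;" entity
        print SHY * len / SL
    }'
}

estimate_increase 10   # prints 20
```

So a 10-letter word is counted as 10 / 2.5 = 4 break points, each costing 5 characters. Short words are never hyphenated, which is why the overall figure depends on the word-length mix of the book.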
Spoiler:
Just put the HTML files in a directory and run this script there to get an estimate:
Code:
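#!/bin/bash
# Estimate how much text "&shy;" entities would add to the
# HTML files in the current directory.
shopt -s nullglob              # drop patterns that match no file
FILES=(*.html *.xhtml)
[ ${#FILES[@]} -gt 0 ] || { echo "No HTML files found" >&2; exit 1; }

for F in "${FILES[@]}"
  do lynx --assume-charset UTF8 --dump "$F" |
  tr '[:alpha:]' x |           # turn every letter into "x"
  tr -cs 'x' '[\n*]'           # one word per line
done |
sort |
uniq -c |                      # count the words of each length
sort -n |
grep xxxx |                    # keep words of 4 or more letters
awk 'BEGIN {
        SL=2.5;   # average characters per syllable
        TOT=0;    # estimated number of hyphenation points
        SHY=5;    # characters in the "&shy;" entity
    }
    {
        print $1 " instances of " length($2) " character long words";
        TOT += $1 * (length($2) / SL + 0);
    }
    END {
        print "Estimated text increase: " SHY*TOT
    }'
echo -n "Total text: "
du -c -b "${FILES[@]}" | tail -1 | awk '{print $1}'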
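If the first stage of the script looks cryptic, this small sketch shows what the two tr calls do to a line of text. The function name word_shapes and the sample sentence are just for illustration.

```shell
#!/bin/bash
# Hypothetical helper: replace every letter with "x", then squeeze
# everything that is not an "x" into a single newline, so each
# output line is one word reduced to its length.
word_shapes() {
    tr '[:alpha:]' x | tr -cs 'x' '[\n*]'
}

printf 'Hyphenation is useful\n' | word_shapes
# xxxxxxxxxxx
# xx
# xxxxxx
```

After this step the script only needs sort and uniq -c to count how many words of each length there are, and grep xxxx drops the words that are too short to hyphenate.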
There are no memory concerns: embedding fonts (even with subsetting) or adding oversized images for a better view on the KPW, which has a higher resolution, uses much more memory, and images cannot be compressed like text. At the moment, this is the ONLY way to have hyphenation in AZW3 files on e-ink Kindles.

Last edited by pirl8; 01-17-2013 at 05:23 AM.