Unique characters used


Toxaris
03-14-2012, 05:00 AM
In another thread about embedding fonts, font file size came up, of course. Fonts are usually big, so they add to the ePUB size.
However, it is possible to reduce a font to only the characters you actually use (for headers, notes, foreign text). This reduces the size tremendously.
The usual tools for this are FontSquirrel and FontForge. The biggest problem is usually determining which characters you actually need.

So I created a Word macro that lists the unique characters in a document, either for all fonts or for one specific font. With that list the font can be reduced.
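
For readers without Word, the idea can be sketched in a few lines of Python. This is not the macro itself: it assumes the document has already been split into (font name, text) runs, and the `runs` list and the `unique_chars` helper are hypothetical names used only for illustration. The comparison is case-sensitive, so "a" and "A" count as two characters.

```python
from collections import defaultdict

def unique_chars(runs, font=None):
    """Return the sorted unique non-whitespace characters, optionally for one font."""
    per_font = defaultdict(set)
    for font_name, text in runs:
        per_font[font_name].update(ch for ch in text if not ch.isspace())
    if font is None:
        chars = set().union(*per_font.values())  # all fonts together
    else:
        chars = per_font.get(font, set())        # one specific font
    return "".join(sorted(chars))

# Hypothetical sample data: each run is (font name, text set in that font).
runs = [
    ("Body Serif", "Chapter one begins here."),
    ("Fancy Caps", "CHAPTER ONE"),
]
print(unique_chars(runs, font="Fancy Caps"))  # ACEHNOPRT
```

The resulting string is the character list to give to the subsetting tool.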

Please take care: some (most?) fonts may not be redistributed. Since this only produces a subset, it might be allowed. To be safe, only use it on free fonts.

* Update: fixed a small error.

frostschutz
03-14-2012, 05:27 AM
Well, if the subset is complete enough to cover an entire book, chances are it still won't be allowed. :)

I wasn't too successful with FontForge when it came to removing characters from a font. Somehow what FontForge saved was several times the size of the original font.

Toxaris
03-14-2012, 07:57 AM
As long as you actually delete the glyphs and not clear them, it should definitely be smaller. You can also use FontSquirrel and input the glyphs you need.

Jellby
03-14-2012, 10:44 AM
I guess you can create a single XHTML document, apply "display:none" to everything except the parts that use the font you want, open the file in a browser, copy the text somewhere else and identify the unique characters there.

Trouhel
03-14-2012, 12:41 PM
@frostschutz :

Don't know if this may help, but here is how I subset fonts with fontforge, using a script which:

1) opens the font with Open()
2) unselects all characters with SelectNone()
3) selects all needed characters with SelectMore()
4) inverts the selection with SelectInvert()
5) deletes the selected characters with Clear()
6) creates a new font with Generate()

If some needed characters are references, you need to select both the characters and the referenced characters.

As an example: a TrueType font with 1252 characters (138 KB), subset to 145 characters, then weighs 32 KB.
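
The steps above can be sketched as a small Python helper that writes out such a FontForge script for a given character list. The helper name `build_subset_script` and the file names are hypothetical; the FontForge functions are the ones named in the list, and the `0u...` form is assumed to be FontForge's notation for a Unicode code point. Referenced glyphs are not resolved here, so they would still have to be added to the list by hand.

```python
def build_subset_script(font_in, font_out, chars):
    """Build a FontForge script that keeps only the glyphs for `chars`."""
    lines = [f'Open("{font_in}")', "SelectNone()"]
    for ch in sorted(set(chars)):
        # One SelectMore() per needed character, addressed by code point.
        lines.append(f"SelectMore(0u{ord(ch):04x})")
    # Everything that was NOT selected gets cleared, then the font is written.
    lines += ["SelectInvert()", "Clear()", f'Generate("{font_out}")']
    return "\n".join(lines) + "\n"

# Hypothetical file names, for illustration only.
print(build_subset_script("MyFont.ttf", "MyFont-subset.ttf", "Aé"))
```

Saved to a file, the output should be runnable with something like `fontforge -script subset.pe`.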

SBT
03-14-2012, 04:05 PM
Here's how you can find the characters used in an XHTML file (tags are excluded) in a Unix bash shell:
cat file.xhtml|sed -e 's/<[^>]\+>//g' -e 's/./&\n/g' |sort -u |tr "\n" " "
If you want to just find the characters in headers, you can try:
grep "<h[1-4]" OEBPS/vol1/12.xhtml|sed -e 's/<[^>]\+>//g' -e 's/./&\n/g' |sort -u |tr "\n" " "
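
A rough Python counterpart to the header variant, as a sketch rather than a drop-in replacement: it uses the standard library's `html.parser` to collect only the text inside h1 to h4, so headers that span several lines or contain nested tags are still covered. The class name `HeaderChars`, the function `chars_in_headers` and the sample markup are made up for the example.

```python
from html.parser import HTMLParser

class HeaderChars(HTMLParser):
    """Collect the unique characters that occur inside selected tags."""

    def __init__(self, tags=("h1", "h2", "h3", "h4")):
        super().__init__()  # default convert_charrefs=True: entities arrive decoded
        self.tags = set(tags)
        self.depth = 0      # > 0 while we are inside one of the selected tags
        self.chars = set()

    def handle_starttag(self, tag, attrs):
        if tag in self.tags:
            self.depth += 1

    def handle_endtag(self, tag):
        if tag in self.tags and self.depth:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth:
            self.chars.update(ch for ch in data if not ch.isspace())

def chars_in_headers(xhtml):
    parser = HeaderChars()
    parser.feed(xhtml)
    parser.close()
    return "".join(sorted(parser.chars))

sample = "<body><h1>Caf&eacute; <i>Noir</i></h1><p>ignored text</p></body>"
print(chars_in_headers(sample))  # CNafioré
```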

Toxaris
03-14-2012, 04:15 PM
That will work, but usually I want to know it for a specific font and since Word is already in my process...

BTW, the macro is updated. Stupid VBA is not always case-sensitive.

Jellby
03-15-2012, 04:01 PM
Those one-liners don't decode entities, I'm afraid (although they can be converted beforehand with recode).

SBT
03-16-2012, 03:50 AM
@Jellby: *Sigh* isn't there just always something... thanks for pointing it out. I'll try again:
cat file.xhtml|xml2asc|sed -e 's/<[^>]\+>//g' -e 's/./&\n/g' |sort -u |tr "\n" " "
How many lines do you need to do this properly, I wonder; find which tags use special fonts, extract their content etc.?

Jellby
03-16-2012, 12:36 PM
Well, xml2asc on my box doesn't seem to do what you apparently use it for:

Reads an UTF-8 encoded text from standard input and writes to standard output, converting all non-ASCII characters to &#nnn; entities, so that the result is ASCII-encoded.

Also, consider what to do with an input such as:

Let a &lt; &epsilon; &gt; b

Do you get <, ε, and > in the character list?
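
To illustrate why the order matters, here is a minimal Python sketch (the function name `visible_chars` is made up): if the tags are stripped first and the entities decoded afterwards, the decoded < and > can no longer be mistaken for markup. Decoding first and stripping afterwards would swallow "< ε >" as if it were a tag.

```python
import html
import re

def visible_chars(markup):
    """Strip tags first, then decode entities, then list the unique characters."""
    text = re.sub(r"<[^>]+>", "", markup)  # same tag pattern as the sed one-liners
    text = html.unescape(text)             # &lt; -> <, &epsilon; -> ε, &gt; -> >
    return "".join(sorted(set(text) - {" ", "\n", "\t"}))

print(visible_chars("<p>Let a &lt; &epsilon; &gt; b</p>"))  # <>Labetε
```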

SBT
03-16-2012, 05:28 PM
@Jellby: Right.... Not exactly a one-liner anymore, and the script has all the clarity of white noise. I couldn't be bothered to double-check @font-face in the css, and a tag with a font inside another tag with a font is not handled correctly (the outer tag will include the characters from the inner tag). Typically used as:
./script.sh <xhtml-file> <css-file>
If the css is inline, it's just
./script.sh <xhtml-file>
It tries to detect extra fonts in the css-file and which classes/ids use them, and lists which characters are used for which font.
#!/bin/bash
file=$1
css=${!#}
xmlns=$(grep -o "xmlns=.[^\"']\+" ${file}|cut -c8-180)
# remove comments
awk 'BEGIN{RS="\(<.\-\-\|\-\->\)"} {if ((NR % 2)==1) print;}' $file > tmp
#replace html entities
for x in $(sed 's/&[a-zA-Z0-9]\+;/&\n/g' tmp|grep -o "&[a-zA-Z0-9]\+;")
do sed -i "s/${x}/$(echo $x|recode HTML..UTF-8)/g" tmp
done
# extract inline css
(if grep -Fq '</style' "$css"
then sed -n '/<style/,/<\/style/p' "$css"
else cat "$css"
fi)|\
fi)|\
tr "\n" " " |\
sed 's/[>}]/&\n/g' |\
grep -v "@font-face" |\
sed -n "/font-family: *[\"']/{s/^ *\(.*\) *{.*font-family: *[\"']\([^\"']\+\).*/\1 \2/;p}" |\
sed -e '/^\./s/^/\*#/' -e 's#\(.*\)\.\([^ \]\+\)#a:\1[@class="\2"]#' -e 's/.*#\([^ ]*\)/*[@id="\1"]/'|\
while read line
do
echo
echo "${line#* }: "
echo -e "setns a=${xmlns}\ncat //${line%% *}//text()" |\
xmllint --shell tmp |\
sed -e 1d -e '/^\(\/ >\| -\{7\}$\)/d' -e 's/./&\n/g' |\
sort -u |\
sed '/[ \t]/d' |\
sed -n 'H;${x;s/\n//g;p}'
done
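
The font detection step on its own can be sketched more readably in Python. This is a simplified illustration and not a port of the script above: it uses regular expressions instead of a real CSS parser, so nested blocks such as @media are not handled, and only the first family of each font-family declaration is recorded. The function name `fonts_by_selector` and the sample stylesheet are invented for the example.

```python
import re
from collections import defaultdict

RULE = re.compile(r"([^{}]+)\{([^{}]*)\}")        # selector { declarations }
FONT = re.compile(r"font-family\s*:\s*([^;]+)")   # value up to the next ';'

def fonts_by_selector(css):
    """Map each font family name to the selectors whose rules use it."""
    css = re.sub(r"/\*.*?\*/", "", css, flags=re.S)  # drop CSS comments
    result = defaultdict(list)
    for selector, body in RULE.findall(css):
        selector = selector.strip()
        if selector.startswith("@font-face"):
            continue  # a font declaration, not a usage
        match = FONT.search(body)
        if match:
            first = match.group(1).split(",")[0].strip().strip("\"'")
            result[first].append(selector)
    return dict(result)

css = """
@font-face { font-family: "Fancy Caps"; src: url(fancy.ttf); }
h1, h2 { font-family: "Fancy Caps", serif; }
.note { font-family: 'Tiny Sans'; font-size: 0.8em; }
p { margin: 0; }
"""
print(fonts_by_selector(css))
```

Each selector found this way can then be used to pick out the matching text in the XHTML and collect its characters.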

Toxaris
03-16-2012, 05:47 PM
When I look at the scripts, I'd rather use my Word macro...

SBT
03-16-2012, 06:17 PM
I have to admit it is perhaps not the most aesthetically pleasing program to rest one's weary eyes upon... However, it might come in handy when refurbishing existing epubs, since it operates on xhtml files.

Toxaris
03-18-2012, 03:52 AM
Hmm, someone mentioned smallcaps. It might be an idea to also make it possible for smallcaps. I think I will enhance it further later this week.

Jellby
03-18-2012, 06:02 AM
The problem with smallcaps is you'd like them to match the normal text. If you embed a font for smallcaps and not for the rest, you'll create the possibility of having, for instance, a sans-serif text with serif smallcaps (ugh).

Toxaris
03-18-2012, 08:28 AM
Correct. It is not ideal. But the creator can get around that by defining that the normal text should be serif and then including serif smallcaps. If that is overruled by the reader or the application, all bets are off anyway.

Jellby
03-18-2012, 08:40 AM
The best solution would be for reader devices and applications to properly support smallcaps, for god's sake! :)

Toxaris
03-19-2012, 02:30 AM
Amen to that!