Unique characters used

03-14-2012, 06:00 AM
In another thread about including fonts, font file size inevitably came up. Fonts are usually large, so embedding them adds noticeably to the ePUB size.
However, it is possible to reduce a font to only the characters you actually use (headers, notes, foreign text). This reduces the size tremendously.
The usual tools for this are FontSquirrel and FontForge. The biggest problem is usually determining which characters you actually need.

Therefore I created a Word macro that lists the unique characters in a document, either across all fonts or for one specific font. With that list the font can be reduced.

Please take care: some (most?) fonts may not be redistributed. Since this only produces a subset, it might be allowed, but to be safe, only use it on free fonts.

* Update: fixed a small error.

03-14-2012, 06:27 AM
Well, if the subset is complete enough to cover an entire book, chances are it still won't be allowed. :)

I wasn't too successful with FontForge when it came to removing characters from a font. Somehow what FontForge saved was several times the size of the original font.

03-14-2012, 08:57 AM
As long as you actually delete the glyphs and not clear them, it should definitely be smaller. You can also use FontSquirrel and input the glyphs you need.

03-14-2012, 11:44 AM
I guess you can create a single XHTML document, apply "display:none" to everything except the parts that use the font you want, open the file in a browser, copy the text somewhere else and identify the unique characters there.

03-14-2012, 01:41 PM
@frostschutz :

Don't know if this may help, but here is how I subset fonts with fontforge, using a script which:

1) open the font with Open()
2) unselect all characters with SelectNone()
3) select all needed characters with SelectMore()
4) invert the selection with SelectInvert()
5) delete the selected characters with Clear()
6) create a new font with Generate()

If some needed characters are references, you need to select both the characters and the referenced characters.

As an example: a 138 KB TrueType font with 1252 characters, subset to 145 characters, then weighs 32 KB.
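The six steps above can be written out as a FontForge script. Below is a rough Python sketch that only generates the script text for a given list of characters. It assumes FontForge's legacy scripting language, where `0uXXXX` denotes a Unicode code point and `$1`/`$2` are the script's arguments. The function name `fontforge_subset_script` is made up for illustration. It does not resolve references, so referenced glyphs must still be added to the character list by hand.

```python
def fontforge_subset_script(chars: str) -> str:
    """Return FontForge legacy-script text that keeps only the glyphs for `chars`."""
    # De-duplicate and sort the code points so the generated script is stable.
    codepoints = sorted({ord(c) for c in chars})
    lines = [
        "Open($1)",      # 1) open the font passed as first script argument
        "SelectNone()",  # 2) start with an empty selection
    ]
    # 3) add every needed character to the selection (0uXXXX = Unicode code point)
    lines += [f"SelectMore(0u{cp:04x})" for cp in codepoints]
    lines += [
        "SelectInvert()",  # 4) now everything that is NOT needed is selected
        "Clear()",         # 5) remove the unneeded glyphs
        "Generate($2)",    # 6) write the subset font to the second argument
    ]
    return "\n".join(lines) + "\n"
```

Saving the returned text to a file (say `subset.pe`, a hypothetical name) and running `fontforge -script subset.pe in.ttf out.ttf` should then produce the reduced font.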

03-14-2012, 05:05 PM
Here's how you can find the characters used in an XHTML file (tags are excluded) in a Unix bash shell:
cat file.xhtml|sed -e 's/<[^>]\+>//g' -e 's/./&\n/g' |sort -u |tr "\n" " "
If you want to just find the characters in headers, you can try:
grep "<h[1-4]" OEBPS/vol1/12.xhtml|sed -e 's/<[^>]\+>//g' -e 's/./&\n/g' |sort -u |tr "\n" " "

03-14-2012, 05:15 PM
That will work, but usually I want to know it for a specific font and since Word is already in my process...

BTW, the macro is updated. Stupid VBA is not always case-sensitive.

03-15-2012, 05:01 PM
Those one-liners don't decode entities, I'm afraid (although they can be converted beforehand with recode).

03-16-2012, 04:50 AM
@Jellby: *Sigh* isn't there just always something... Thanks for pointing it out. I'll try again:
cat file.xhtml|xml2asc|sed -e 's/<[^>]\+>//g' -e 's/./&\n/g' |sort -u |tr "\n" " "
How many lines do you need to do this properly, I wonder; find which tags use special fonts, extract their content etc.?

03-16-2012, 01:36 PM
Well, xml2asc on my box doesn't seem to do what you apparently use it for:

Reads an UTF-8 encoded text from standard input and writes to standard output, converting all non-ASCII characters to &#nnn; entities, so that the result is ASCII-encoded.

Also, consider what to do with an input such as:

Let a &lt; &epsilon; &gt; b

Do you get <, ε, and > in the character list?
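For comparison, here is a minimal Python sketch that decodes entities while stripping tags, using only the standard library's `html.parser`. The names `TextCollector` and `unique_chars` are made up for this example. It skips the contents of `<style>` and `<script>` and ignores whitespace, and it makes no attempt to separate fonts.

```python
from html.parser import HTMLParser


class TextCollector(HTMLParser):
    """Collect the distinct non-whitespace characters found in text nodes."""

    def __init__(self):
        # convert_charrefs=True turns &lt;, &epsilon;, &#949; etc. into characters
        super().__init__(convert_charrefs=True)
        self.chars = set()
        self.skip_depth = 0  # > 0 while inside <style> or <script>

    def handle_starttag(self, tag, attrs):
        if tag in ("style", "script"):
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("style", "script") and self.skip_depth:
            self.skip_depth -= 1

    def handle_data(self, data):
        if not self.skip_depth:
            self.chars.update(ch for ch in data if not ch.isspace())


def unique_chars(markup: str) -> str:
    """Return the unique characters of the visible text, sorted by code point."""
    parser = TextCollector()
    parser.feed(markup)
    parser.close()
    return "".join(sorted(parser.chars))
```

With the input above, `unique_chars("<p>Let a &lt; &epsilon; &gt; b</p>")` gives `<>Labetε`, so <, ε and > do end up in the list.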

03-16-2012, 06:28 PM
@Jellby: Right.... Not exactly a one-liner anymore, and the script has all the clarity of white noise. I couldn't be bothered to double-check @font-face in the css, and a tag with a font nested inside another tag with a font is not handled correctly (the outer tag will include the characters from the inner tag). Typically used as:
./ <xhtml-file> <css-file>
If the css is inline, it's just
./ <xhtml-file>
It tries to detect extra fonts in the css-file and which classes/ids use them, then lists which characters are used for each font.
file=$1
css=${2:-$1}   # with inline css, the xhtml file doubles as the css source
xmlns=$(grep -o "xmlns=.[^\"']\+" "${file}"|cut -c8-180)
# remove comments
awk 'BEGIN{RS="<!--|-->"} {if ((NR % 2)==1) print;}' "$file" > tmp
# replace html entities
for x in $(sed 's/&[a-zA-Z0-9]\+;/&\n/g' tmp|grep -o "&[a-zA-Z0-9]\+;")
do sed -i "s/${x}/$(echo $x|recode HTML..UTF-8)/g" tmp
done
# extract inline css
(if grep -q '</style' "$css"
then sed -n '/<style/,/<\/style/p' "$css"
else cat "$css"
fi) |\
tr "\n" " " |\
sed 's/[>}]/&\n/g' |\
grep -v "@font-face" |\
sed -n "/font-family: *[\"']/{s/^ *\(.*\) *{.*font-family: *[\"']\([^\"']\+\).*/\1 \2/;p}" |\
sed -e '/^\./s/^/\*#/' -e 's#\(.*\)\.\([^ \]\+\)#a:\1[@class="\2"]#' -e 's/.*#\([^ ]*\)/*[@id="\1"]/'|\
# for each "xpath font" pair, list the unique characters of the matching text
while read line
do
echo "${line#* }: "
echo -e "setns a=${xmlns}\ncat //${line%% *}//text()" |\
xmllint --shell tmp |\
sed -e 1d -e '/^\(\/ >\| -\{7\}$\)/d' -e 's/./&\n/g' |\
sort -u |\
sed '/[ \t]/d' |\
sed -n 'H;${x;s/\n//g;p}'
done
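
For anyone who finds the shell version hard to follow, the same idea can be sketched in Python with only the standard library. This is a simplified illustration, not a port of the script above. It assumes well-formed XHTML and only understands plain class rules of the form `.name { font-family: "Font" }` (no ids, no @font-face checks). The names `fonts_by_class`, `FontTextCollector` and `chars_per_font` are made up for the example. Text is credited to the innermost element that sets a font, so nested fonts are kept apart.

```python
import re
from collections import defaultdict
from html.parser import HTMLParser

# Matches simple rules such as:  .title { font-family: "Fancy"; }
RULE = re.compile(r"\.([\w-]+)\s*\{[^}]*?font-family\s*:\s*[\"']([^\"']+)[\"']")


def fonts_by_class(css: str) -> dict:
    """Map class name -> font name for plain `.class { font-family: "X" }` rules."""
    return {cls: font for cls, font in RULE.findall(css)}


class FontTextCollector(HTMLParser):
    """Attribute each text node to the font of the innermost styled element."""

    def __init__(self, class_fonts):
        super().__init__(convert_charrefs=True)  # decode entities in text
        self.class_fonts = class_fonts
        self.stack = []                # font in effect for every open element
        self.chars = defaultdict(set)  # font name -> characters seen

    def handle_starttag(self, tag, attrs):
        # Inherit the parent's font unless one of this element's classes sets one.
        font = self.stack[-1] if self.stack else None
        for cls in (dict(attrs).get("class") or "").split():
            font = self.class_fonts.get(cls, font)
        self.stack.append(font)

    def handle_endtag(self, tag):
        if self.stack:
            self.stack.pop()

    def handle_data(self, data):
        font = self.stack[-1] if self.stack else None
        if font is not None:
            self.chars[font].update(c for c in data if not c.isspace())


def chars_per_font(xhtml: str, css: str) -> dict:
    """Return {font name: sorted unique characters} for the given markup and css."""
    collector = FontTextCollector(fonts_by_class(css))
    collector.feed(xhtml)
    collector.close()
    return {font: "".join(sorted(chars)) for font, chars in collector.chars.items()}
```

Each resulting character string can then be fed to whichever subsetting tool is used.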

03-16-2012, 06:47 PM
Looking at those scripts, I'd rather use my Word macro...

03-16-2012, 07:17 PM
I have to admit it is perhaps not the most aesthetically pleasing program to rest one's weary eyes upon... However, it might come in handy when refurbishing existing epubs, since it operates on xhtml files.

03-18-2012, 04:52 AM
Hmm, someone mentioned smallcaps. It might be an idea to make the macro handle smallcaps as well. I think I will enhance it further later this week.

03-18-2012, 07:02 AM
The problem with smallcaps is you'd like them to match the normal text. If you embed a font for smallcaps and not for the rest, you'll create the possibility of having, for instance, a sans-serif text with serif smallcaps (ugh).

03-18-2012, 09:28 AM
Correct. It is not ideal. But the creator can circumvent that by defining that the normal text should be serif and then including serif smallcaps. If that is overruled by the reader or application, all bets are off anyway.

03-18-2012, 09:40 AM
The best solution would be for reader devices and applications to properly support smallcaps, for god's sake! :)

03-19-2012, 03:30 AM
Amen to that!