04-25-2010, 04:23 PM   #1
Cygfrydd
Jackass Crazyfish
Cygfrydd began at the beginning.
 
Posts: 7
Karma: 10
Join Date: Jan 2007
Location: St. Petersburg, Florida, US
Device: Sony Reader PRS-505
ePub Font Subsetting

I'm working on an entirely Python-based ePub build toolchain (I use Subversion for source management), and I have it working quite nicely, including font embedding and obfuscation. However, the resulting ePubs are suffering from bloat, since I'm using fonts with fairly extensive glyph collections, so I need to implement some sort of subsetting.

This turned out to be far more complicated than I initially realised. epub-tools has been mentioned several times as supporting both obfuscation and subsetting; however, it's implemented in Java, and doesn't appear to be able to take an already-compiled ePub and modify it. Subsetting requires, it seems, two rather complex tasks: 1) working out which glyphs are actually used, which means parsing the content of the ePub's component files for all elements that aren't set to display: none (and possibly alt text for images), parsing the embedded and inline styles to generate a computed style for each element, resolving each computed style to an embedded font, and then collecting the characters rendered in that font; and 2) subsetting the font[s] accordingly, which, as I've discovered, isn't as simple as deleting every unneeded glyph (besides .notdef) from the font; apparently just modifying the TrueType 'glyf' table is insufficient.
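To make task 1 concrete, here is a rough sketch of a much-simplified version of it using only the Python standard library. It is an assumption-laden shortcut, not the full solution described above: it looks only at inline style attributes (no stylesheet parsing, no cascade, no computed styles), it does not work out which font each character belongs to, and it assumes the content files are UTF-8. The names TextCollector, used_characters and make_sample_epub are made up for illustration.

```python
import io
import zipfile
from html.parser import HTMLParser


class TextCollector(HTMLParser):
    """Collect characters from text that is not inside a hidden element.

    Simplification: only an inline style="display: none" hides an element;
    external and embedded stylesheets are ignored entirely.
    """

    SKIP_TAGS = {"script", "style"}
    VOID_TAGS = {"br", "hr", "img", "input", "link", "meta"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.chars = set()
        self.stack = []  # one flag per open element: True if it is hidden

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        style = (attrs.get("style") or "").replace(" ", "").lower()
        hidden = tag in self.SKIP_TAGS or "display:none" in style
        if tag == "img" and not hidden and not any(self.stack):
            # alt text may end up rendered, so count it as well
            self.chars.update(attrs.get("alt") or "")
        if tag not in self.VOID_TAGS:
            self.stack.append(hidden)

    def handle_endtag(self, tag):
        if tag not in self.VOID_TAGS and self.stack:
            self.stack.pop()

    def handle_data(self, data):
        if not any(self.stack):
            # keep spaces, drop line breaks and tabs
            self.chars.update(c for c in data if c not in "\n\r\t")


def used_characters(epub_file):
    """Return the set of characters used in the (X)HTML files of an ePub."""
    chars = set()
    with zipfile.ZipFile(epub_file) as zf:
        for name in zf.namelist():
            if name.lower().endswith((".xhtml", ".html", ".htm")):
                collector = TextCollector()
                collector.feed(zf.read(name).decode("utf-8"))
                collector.close()
                chars |= collector.chars
    return chars


def make_sample_epub():
    """Build a tiny in-memory archive, purely for demonstration."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        zf.writestr(
            "OEBPS/ch1.xhtml",
            '<html><body><p>ab<span style="display: none">XYZ</span>c</p>'
            '<img src="x.png" alt="Q"/><p>&#233;</p></body></html>',
        )
        zf.writestr("OEBPS/style.css", "p { color: red }")
    buf.seek(0)
    return buf
```

Calling used_characters(make_sample_epub()) collects the visible text and the alt text but skips the hidden span. The result is one character set for the whole book, so it over-approximates per-font coverage whenever more than one font is embedded.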

I have an extremely ugly solution partially working: I use the Java tool css2xslfo to convert my content into XSL-FO, parse the result to get font information and glyph coverage (drastically easier than trying to parse XHTML+CSS and compute styles myself), and then pass the list of glyphs to a Perl tool, font-optimizer, which does the actual subsetting.

This is ugly, and certainly doesn't meet my goal of doing everything in Python.

Does anyone have any suggestions? I can probably manage to cobble together workable font-subsetting using fonttools, which has a truly lovely roundtripping TTF-to-XML conversion, but the actual parsing of XHTML and associated stylesheets seems to be beyond me (though I find it difficult to believe someone hasn't already implemented this, beyond the basic stuff that cssutils does).
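For reference, a minimal sketch of what the fonttools side might look like. It assumes an installed fontTools release that ships the fontTools.subset module (an assumption about the version in use; older releases only offer the TTF-to-XML roundtrip). The function names to_codepoints and subset_font are hypothetical, and the default options are used untouched.

```python
def to_codepoints(chars):
    """Turn an iterable of characters into a sorted list of code points."""
    return sorted({ord(c) for c in chars})


def subset_font(src_path, dst_path, chars):
    """Write a copy of the font at src_path that covers only `chars`.

    Assumes fontTools with the fontTools.subset module is installed.
    """
    # Imported here so that to_codepoints() stays usable without fontTools.
    from fontTools.subset import Options, Subsetter
    from fontTools.ttLib import TTFont

    font = TTFont(src_path)
    subsetter = Subsetter(options=Options())
    subsetter.populate(unicodes=to_codepoints(chars))
    # The subsetter prunes the dependent tables along with the glyph data,
    # rather than editing 'glyf' in isolation.
    subsetter.subset(font)
    font.save(dst_path)
```

Usage would be along the lines of subset_font("Body.ttf", "Body-subset.ttf", "abc"), where both file names are placeholders. The character set still has to come from somewhere, so this only covers the second of the two tasks.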

So... anyone have any ideas?

—Cyg