ePub Font Subsetting

Cygfrydd · 04-25-2010, 04:23 PM

I'm working on an entirely Python-based ePub build toolchain (I use Subversion for source management), and I have it working quite nicely, including font embedding and obfuscation. However, the resultant ePubs are suffering bloat, since I'm using fonts that have fairly extensive collections of glyphs, so I needed to implement some sort of subsetting.

This turned out to be far more complicated than I initially realised. epub-tools has been mentioned several times as supporting both obfuscation and subsetting; however, it's implemented in Java, and doesn't appear to be able to take an already-compiled ePub and modify it. Subsetting requires, it seems, two rather complex tasks:

1) parsing the content of the component files of the ePub for all elements that aren't set display: none (and possibly alt-text for images), parsing the embedded/inline-set styles to generate a computed style for each element, resolving the computed style to point at an embedded font, and then collecting the used glyphs from that font to decide what needs to be subset, and

2) subsetting the font[s] appropriately, which, as I've discovered, isn't as simple as just deleting all glyphs from the font that aren't needed (besides .notdef); apparently just modifying the TrueType 'glyf' table is insufficient.
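To make the first task concrete, here is a minimal sketch of the character-collection step only, using nothing but the standard library. It is an illustration under strong assumptions, not a solution: it honours only inline style="display: none" attributes, it ignores external and embedded stylesheets entirely, and it does not work out which embedded font each run of text resolves to, which is the genuinely hard part described above. The class name CharCollector and the VOID and SKIP sets are made up for this example, and the input is assumed to be well-formed XHTML.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag, and elements whose text is not rendered.
VOID = {"area", "base", "br", "col", "embed", "hr", "img", "input",
        "link", "meta", "param", "source", "track", "wbr"}
SKIP = {"head", "script", "style"}


class CharCollector(HTMLParser):
    """Collect the characters of visible text (hypothetical helper)."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.chars = set()
        self._stack = []  # one flag per open element: does it hide its content?

    def _hidden(self):
        # Text is hidden if any open ancestor hides its content.
        return any(self._stack)

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        # Assumption: only an inline style attribute is inspected.
        style = (attrs.get("style") or "").replace(" ", "").lower()
        hidden = tag in SKIP or "display:none" in style
        if tag == "img" and not (hidden or self._hidden()):
            self.chars.update(attrs.get("alt") or "")  # alt text may be rendered
        if tag not in VOID:
            self._stack.append(hidden)

    def handle_endtag(self, tag):
        if tag not in VOID and self._stack:
            self._stack.pop()

    def handle_data(self, data):
        if not self._hidden():
            self.chars.update(data)


collector = CharCollector()
collector.feed(
    '<html><head><title>T</title></head><body>'
    '<p>ab<span style="display: none">XY</span>c</p>'
    '<img alt="de" src="x.png"/></body></html>'
)
print(sorted(collector.chars))
```

Feeding every content document of the ePub through one collector gives a single character set for the whole book. A per-font breakdown would still need the computed styles.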

I have an extremely ugly solution partially working: I use the Java tool css2xslfo to convert my content into XSL-FO, parse the results to get font information and glyph coverage (drastically easier than trying to parse XHTML+CSS and get computed styles), and then hand the list of glyphs to a Perl tool, font-optimizer, to actually do the subsetting.

This is ugly, and certainly doesn't meet my goal of doing everything in Python.

Does anyone have any suggestions? I can probably manage to cobble together workable font-subsetting using fonttools, which has a truly lovely roundtripping TTF-to-XML conversion, but the actual parsing of XHTML and associated stylesheets seems to be beyond me (though I find it difficult to believe someone hasn't already implemented this, beyond the basic stuff that cssutils does).
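As a sketch of what the fonttools route might look like, the snippet below assumes a fontTools release that ships the fontTools.subset module; that module is not mentioned anywhere in this thread, so treat it as an assumption rather than what was actually used here. The function name subset_font and its parameters are hypothetical. The subsetter is meant to prune the tables that depend on the glyph set along with 'glyf' itself, which is the reason editing 'glyf' alone is not enough.

```python
def subset_font(src_path, dst_path, text):
    """Write a copy of src_path keeping only the glyphs needed to render text.

    Hypothetical helper: assumes the third-party fontTools package is
    installed and provides fontTools.subset.
    """
    # Imported here so that this sketch can be loaded without fontTools present.
    from fontTools import subset

    options = subset.Options()            # defaults keep the .notdef glyph
    font = subset.load_font(src_path, options)
    subsetter = subset.Subsetter(options=options)
    subsetter.populate(text=text)         # characters that must stay renderable
    subsetter.subset(font)                # prune glyphs and dependent tables
    subset.save_font(font, dst_path, options)
```

A call such as subset_font("Body.ttf", "Body-subset.ttf", "Hello") (file names invented for illustration) would write the reduced font next to the original. Obfuscation would still have to be applied to the subset file afterwards.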

So... anyone have any ideas?

—Cyg

kovidgoyal · 04-25-2010, 05:08 PM

calibre resolves all CSS into simple classes of computed values as part of the conversion pipeline. This is then used for things like font size rescaling. Finding embedded fonts for subsetting should be trivial.

billingd · 08-17-2010, 08:53 AM

sorry for the noise.
