glyphIgo: font minimizer for EPUB ebooks


AlPe
02-16-2013, 08:49 AM
Hi,

I would like to receive feedback and comments on a funny/crazy/useful idea I am working on in my spare time these days.

For my business needs I developed a small, simple but extremely useful Python 2.7 script to do the following things:


$ python glyphIgo.py [ARGUMENTS]

Arguments:
-h, --help : print this usage message and exit
-f, --font <font> : font file, in TTF/OTF/WOFF format
-g, --glyphs <list> : use this list of glyphs instead of opening a font file
-e, --ebook <ebook> : ebook in EPUB format
-p, --plain <ebook> : ebook file, in plain text UTF-8 format
-m, --minimize : retain only the glyphs of <font> that appear in <ebook>
-o, --output <name> : use <name> for the font to be created
-s, --sort : sort output by character count instead of character codepoint
-q, --quiet : quiet output
-v, --verbose : verbose output of Unicode codepoints

Exit codes:

0 = no error / no missing glyphs in the font file
1 = invalid argument(s) error
2 = missing glyphs in the font file to correctly display the given file/ebook
4 = minimization/conversion failed

Examples:

1. Print this usage message
$ python glyphIgo.py -h

2. Print the list of glyphs in font.ttf
$ python glyphIgo.py -f font.ttf

3. Print the list of glyphs in ebook.epub
$ python glyphIgo.py -e ebook.epub

4. Print the list of glyphs in page.xhtml
$ python glyphIgo.py -p page.xhtml

5. Check whether all the glyphs in ebook.epub are available in font.ttf
$ python glyphIgo.py -f font.ttf -e ebook.epub

6. As above, but use font_glyph_list.txt containing a list of decimal codepoints for the font glyphs
$ python glyphIgo.py -g font_glyph_list.txt -e ebook.epub

7. As in Example 5, but sort missing glyphs (if any) by character count (in ebook.epub) instead of by Unicode codepoint
$ python glyphIgo.py -f font.ttf -e ebook.epub -s

8. Create new.font.otf containing only the glyphs of font.ttf that also appear in ebook.epub
$ python glyphIgo.py -m -f font.ttf -e ebook.epub -o new.font.otf

9. Convert font.ttf (TTF) in font.otf (OTF)
$ python glyphIgo.py -f font.ttf -o font.otf
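
For anyone who wants to call the script from another program, here is a minimal sketch of how the exit codes listed above could be interpreted. The mapping is copied from the usage message; the helper names and the assumption that glyphIgo.py sits in the current directory are mine, not part of the tool.

```python
import subprocess

# Meanings copied from the "Exit codes" section of the usage message.
EXIT_MEANINGS = {
    0: "no error / no missing glyphs in the font file",
    1: "invalid argument(s) error",
    2: "missing glyphs in the font file",
    4: "minimization/conversion failed",
}


def describe_exit_code(code):
    """Return a human-readable meaning for a glyphIgo exit code."""
    return EXIT_MEANINGS.get(code, "unknown exit code %d" % code)


def check_font(font_path, ebook_path, script="glyphIgo.py"):
    """Run the check from Example 5 and return (exit code, meaning).

    The script location is an assumption: adjust it to your setup.
    """
    code = subprocess.call(["python", script, "-f", font_path, "-e", ebook_path])
    return code, describe_exit_code(code)
```

A caller would typically treat 0 as "safe to read with this font" and 2 as "expect missing characters".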


I am thinking of creating a web site that exposes the script's functionality to users. The idea is to find out, before you start reading an eBook that might contain non-Latin glyphs (say, the "Upanishads" or the "Haft Peykar"), whether your preferred font can handle all the Unicode characters contained in the eBook.
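
The core of the check can be sketched in a few lines of standard-library Python. This is only an illustration of the idea, not the actual glyphIgo code: it assumes the font's coverage is already available as a set of decimal codepoints (as with the -g option), strips tags with a crude regular expression, and ignores whitespace. All function names are hypothetical.

```python
import html
import re
import zipfile
from collections import Counter

TAG_RE = re.compile(r"<[^>]*>")  # crude tag stripper, good enough for a sketch


def epub_character_counts(epub_path):
    """Count the characters in the (X)HTML files of an EPUB, tags removed."""
    counts = Counter()
    with zipfile.ZipFile(epub_path) as archive:
        for name in archive.namelist():
            if name.lower().endswith((".xhtml", ".html", ".htm")):
                source = archive.read(name).decode("utf-8", errors="replace")
                text = html.unescape(TAG_RE.sub("", source))
                # Assumption: whitespace never needs a glyph, so skip it.
                counts.update(c for c in text if not c.isspace())
    return counts


def missing_glyphs(counts, font_codepoints, by_count=False):
    """Return (codepoint, count) pairs for characters the font lacks.

    Sorted by codepoint by default, or by descending count (like -s).
    """
    missing = [(ord(c), n) for c, n in counts.items()
               if ord(c) not in font_codepoints]
    if by_count:
        return sorted(missing, key=lambda item: (-item[1], item[0]))
    return sorted(missing)
```

An empty result means the font covers every character that appears in the book.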

A "video" of the prototype is here: http://www.albertopettarin.it/glyphIgo/glyphIgo.mov

Currently, I am considering 1) releasing the Python script; 2) setting up a web site to offer the service for free. However, my sysadmin/PHP skills are minimal and I do not have funds for the server (I need PHP exec and the ability to run my Python script, which means a colo server).

I would like to hear your thoughts, especially if they are amazingly crazy or funny :D

dgatwood
02-16-2013, 02:36 PM
I'd love to see this evolve into a proper EPUB font validator, although that would require a slightly more involved approach to working with the HTML than just checking to see whether all the glyphs in a particular book are present in the font.

An ideal implementation would start with the parsed HTML document and CSS declarations, then apply the CSS to determine which styles apply to which words, then check each glyph to make sure it appears in the font that is actually being used to display it. In other words, look at the font family declaration that is active at that point in the DOM tree and check each font sequentially, skipping any font name that isn't embedded.

In addition to printing errors when a glyph would be missing, it should also keep two totals of missing glyphs per font—one in which it includes every error that could potentially occur and one in which it only includes errors that do not result from falling back from another embedded font—and should present those in a summary report at the end, along with a count of the number of unused glyphs in each font.

Oh, and it could also print errors if you specify a font in your CSS and provide it in the bundle but fail to provide a proper @font-face declaration.
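
The fallback walk described above could be sketched roughly as follows. It assumes the hard part (parsing the HTML and resolving the CSS) has already produced a run of text plus its active font-family list, and that each embedded font is represented by a set of codepoints. The function and parameter names are hypothetical, and the sketch only reports missing characters and unused-glyph counts, not the two separate error totals.

```python
def resolve_fonts(text, font_stack, embedded):
    """Walk the font-family stack for each character of a text run.

    text:       the characters rendered with this font-family declaration
    font_stack: font names in CSS order, e.g. ["MyFont", "serif"]
    embedded:   dict mapping embedded font name -> set of codepoints
    Returns (missing codepoints, unused-glyph count per embedded font).
    """
    used = {name: set() for name in embedded}
    missing = set()
    for ch in text:
        cp = ord(ch)
        for name in font_stack:
            glyphs = embedded.get(name)
            if glyphs is None:
                continue  # font name is not embedded: skip it
            if cp in glyphs:
                used[name].add(cp)
                break
        else:
            # No embedded font in the stack has this glyph.
            missing.add(cp)
    unused = {name: len(embedded[name] - used[name]) for name in embedded}
    return missing, unused
```

For a whole book, the function would be called once per styled run and the results merged.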

Start here:

https://github.com/rennat/pynliner

AlPe
02-18-2013, 02:28 PM
Hi, thanks for your suggestion.

I think of this project more from the reader's side than from the eBook production side (which should not need such a validator... but we all know how things are done in this business...).

If I have time and resources, I will try to include the features you suggested.

AlPe
02-21-2013, 11:00 AM
I released glyphIgo through Google Code:

http://code.google.com/p/glyphigo/

Comments are welcome.

AlPe
02-22-2013, 11:15 AM
glyphIgo v. 1.13 released (bug fix, better verbose output), plus I wrote better wiki documentation.

http://code.google.com/p/glyphigo/
http://code.google.com/p/glyphigo/wiki/UsageExamples
http://code.google.com/p/glyphigo/downloads/list

AlPe
02-24-2013, 04:27 AM
glyphIgo v. 1.14 released (better handling of X(HT)ML tags).

http://code.google.com/p/glyphigo/
http://code.google.com/p/glyphigo/wiki/UsageExamples
http://code.google.com/p/glyphigo/downloads/list

Turtle91
02-24-2013, 04:44 AM
Thanks AlPe

AlPe
03-16-2013, 04:18 PM
You are welcome.

glyphIgo v. 1.16 released.

Now you can export the lists of Unicode characters as EPUB files, for a quick check on your eReader. See usage in the wiki page linked below for a longer explanation.

http://code.google.com/p/glyphigo/
http://code.google.com/p/glyphigo/wiki/UsageExamples
http://code.google.com/p/glyphigo/downloads/list

AlPe
03-23-2013, 11:59 AM
glyphIgo v. 1.17 released.

Added a function to retrieve information about a given Unicode character. For example:

$ python glyphIgo.py -l a
[INFO] Lookup results for query 'a'
[INFO] Matched Unicode character 'a'
Name LATIN SMALL LETTER A
Character a
Dec Codepoint 97
Hex Codepoint 0x61
Lowercase a
Uppercase A
Category Ll
Bidirectional L
Mirrored False
NFC a
NFD a
[INFO] === === === === === ===

$ python glyphIgo.py -l "LATIN CAPITAL LETTER A WITH MACRON"
[INFO] Lookup results for query 'LATIN CAPITAL LETTER A WITH MACRON'
[INFO] Matched Unicode character 'Ā'
Name LATIN CAPITAL LETTER A WITH MACRON
Character Ā
Dec Codepoint 256
Hex Codepoint 0x100
Lowercase ā
Uppercase Ā
Category Lu
Bidirectional L
Mirrored False
NFC Ā
NFD Ā
[INFO] === === === === === ===

$ python glyphIgo.py -l d97
[INFO] Lookup results for query 'a'
[INFO] Matched Unicode character 'a'
Name LATIN SMALL LETTER A
Character a
Dec Codepoint 97
Hex Codepoint 0x61
Lowercase a
Uppercase A
Category Ll
Bidirectional L
Mirrored False
NFC a
NFD a
[INFO] === === === === === ===
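
Most of the fields shown above can be obtained from Python's standard unicodedata module. The following is a small sketch of such a lookup, not the actual glyphIgo implementation; the function name and the dictionary layout are my own.

```python
import unicodedata


def character_info(ch):
    """Collect the properties of a single Unicode character."""
    return {
        "Name": unicodedata.name(ch, "<unnamed>"),
        "Character": ch,
        "Dec Codepoint": ord(ch),
        "Hex Codepoint": hex(ord(ch)),
        "Lowercase": ch.lower(),
        "Uppercase": ch.upper(),
        "Category": unicodedata.category(ch),
        "Bidirectional": unicodedata.bidirectional(ch),
        "Mirrored": bool(unicodedata.mirrored(ch)),
        "NFC": unicodedata.normalize("NFC", ch),
        "NFD": unicodedata.normalize("NFD", ch),
    }


# Lookup by character, by name, or by decimal codepoint.
by_char = character_info("a")
by_name = character_info(unicodedata.lookup("LATIN CAPITAL LETTER A WITH MACRON"))
by_decimal = character_info(chr(97))
```

Note that the NFD form of a precomposed letter such as Ā consists of two codepoints (the base letter plus a combining macron), even though it looks identical when printed.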


http://code.google.com/p/glyphigo/
http://code.google.com/p/glyphigo/wiki/UsageExamples
http://code.google.com/p/glyphigo/downloads/list

AlPe
03-24-2013, 01:53 PM
glyphIgo v. 1.18 released.

Added a function to count the number of characters (displayable, not counting XHTML tags) in an EPUB eBook:


$ python glyphIgo.py -e ebook.epub -c
[INFO] Reading characters appearing in 'ebook.epub'...
[INFO] Reading characters appearing in 'ebook.epub'... Done
[INFO] Number of characters appearing in 'ebook.epub'...
1310564
[INFO] Number of characters appearing in 'ebook.epub'... Done

$ python glyphIgo.py -e ebook.epub -c -q
1310564


Please note that the count is somewhat "approximate", as explained in the Technical Notes on the project web page:
http://code.google.com/p/glyphigo/
http://code.google.com/p/glyphigo/wiki/UsageExamples
http://code.google.com/p/glyphigo/downloads/list

JSWolf
03-24-2013, 07:11 PM
Will this tool analyze an ePub and figure out how to subset based on the font's usage like Calibre does? Will it tell us what embedded fonts are not being used? Does it ignore the ePub code in subsetting and just deal with text?

AlPe
03-24-2013, 07:34 PM
What glyphIgo does is explained in the Technical Notes section, at https://code.google.com/p/glyphigo/

However, I will try to answer:
1) I do not know what Calibre does w.r.t. subsetting, so I cannot tell
2) no, the idea of glyphIgo is checking that an external font (like those shipped with eReaders) can display all the characters contained in a given EPUB (the assumption here is that there are no "embedded" fonts in the eBook).
3) yes, that is the default behavior: tags are stripped away, so (roughly) only the displayed text is retained. You can consider the entire source code instead by invoking glyphIgo with the --preserve switch, which turns off tag stripping.
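
To illustrate the difference between the two modes, here is a toy sketch. It is not the glyphIgo code: the regular expression is a crude approximation of tag stripping, resolving entities is my assumption, and the function name is hypothetical.

```python
import html
import re

TAG_RE = re.compile(r"<[^>]*>")  # approximate: ignores comments, CDATA, scripts


def count_characters(source, preserve=False):
    """Count characters in an XHTML source string.

    preserve=False: strip tags and resolve entities first (displayed text).
    preserve=True:  count every character of the raw source.
    """
    if preserve:
        return len(source)
    return len(html.unescape(TAG_RE.sub("", source)))


page = "<p>caf&eacute;</p>"
displayed = count_characters(page)               # counts "café"
raw = count_characters(page, preserve=True)      # counts the whole source
```

Regex-based stripping is one reason a count of this kind can only be approximate.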

JSWolf
03-28-2013, 10:36 AM
What Calibre does is check what each font is going to be displaying. So when it subsets a given font, it only subsets what that font is going to display. So for example, if you have ABCDEF the regular font will contain ABC and the bold font will contain DEF and the italic and bold italic will be removed because they have nothing to display.
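
The ABCDEF example can be sketched like this. It only illustrates the bookkeeping, assuming the text has already been split into runs tagged with the font variant that displays them; it is not Calibre's code, and the names are hypothetical.

```python
def glyphs_per_variant(runs, variants):
    """Work out which characters each font variant must keep.

    runs:     list of (text, variant) pairs, e.g. ("ABC", "regular")
    variants: all embedded variants of the font family
    Returns (characters to keep per used variant, variants to remove).
    """
    needed = {variant: set() for variant in variants}
    for text, variant in runs:
        needed.setdefault(variant, set()).update(text)
    keep = {variant: chars for variant, chars in needed.items() if chars}
    drop = [variant for variant in variants if not needed[variant]]
    return keep, drop


keep, drop = glyphs_per_variant(
    [("ABC", "regular"), ("DEF", "bold")],
    ["regular", "bold", "italic", "bold italic"],
)
```

Here keep maps "regular" to A, B, C and "bold" to D, E, F, while drop lists the italic and bold italic variants.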

Also, does glyphIgo handle ligatures? Some versions of ADE automatically use ligatures when you have the two characters together that make up the ligature such as fi or ff.

AlPe
03-28-2013, 12:51 PM
Oh I see.

No, glyphIgo does not perform such an in-depth analysis, because it was not designed to do that. Although it can be used to subset title fonts, its main goal is to quickly check that a font (external to the eBook, for example a font shipped with an eReader) correctly displays the text, irrespective of embedded fonts, font style, etc.

(The typical use case, as I wrote in the first post: you have an EPUB of the Haft Paykar (http://en.wikipedia.org/wiki/Nizami_Ganjavi#Haft_Paykar). Are all those "strange Unicode characters" going to be properly rendered by your favorite font X or are they going to appear as those nasty empty rectangles? Run glyphIgo with -e ebook.epub and -f font.ttf and you will know.)

Also, glyphIgo does not handle ligatures, if you mean whether it is able to detect them, collapse them and use the appropriate Unicode symbol. However, if a ligature is already specified as a single Unicode character, it is managed properly (as a single Unicode character).
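
As a small illustration of the distinction (using only the standard unicodedata module, not glyphIgo itself): a precomposed ligature such as U+FB01 is one codepoint, different from the two-character sequence "fi", so a codepoint-based check treats the two cases separately.

```python
import unicodedata

ligature = "\ufb01"   # LATIN SMALL LIGATURE FI, a single codepoint
plain = "fi"          # two codepoints: 'f' followed by 'i'

ligature_name = unicodedata.name(ligature)
ligature_codepoints = [ord(c) for c in ligature]
plain_codepoints = [ord(c) for c in plain]

# Compatibility normalization (NFKC) expands the ligature back to "fi".
expanded = unicodedata.normalize("NFKC", ligature)
```

A font therefore needs a glyph for U+FB01 only if the text itself contains that codepoint; whether a renderer substitutes a ligature glyph for "f" + "i" is decided by the font's own tables.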

JSWolf
03-30-2013, 09:09 PM
AlPe wrote:
> Also, glyphIgo does not handle ligatures, if you mean whether it is able to detect them, collapse them and use the appropriate Unicode symbol. However, if a ligature is already specified as a single Unicode character, it is managed properly (as a single Unicode character).

Then glyphIgo needs to be fixed. ADE 2.0 and some versions between 1.7.2 and 2.0 do convert to ligatures and if glyphIgo says everything is good to go in the fonts and it's not, then there could be missing characters.

Toxaris
03-31-2013, 03:43 AM
JSWolf wrote:
> Then glyphIgo needs to be fixed. ADE 2.0 and some versions between 1.7.2 and 2.0 do convert to ligatures and if glyphIgo says everything is good to go in the fonts and it's not, then there could be missing characters.

I think AlPe can decide on his own if he wants to incorporate this. I would not call it 'fixing', since it is actually an enhancement. You want to add characters to the subset which are not used within the document itself, but by the readers. So, it is not a bug but an enhancement request. I think it is rather impolite to ask for an enhancement with the words 'needs to be fixed'.

Jellby
03-31-2013, 03:52 AM
AlPe wrote:
> Also, glyphIgo does not handle ligatures, if you mean whether it is able to detect them, collapse them and use the appropriate Unicode symbol. However, if a ligature is already specified as a single Unicode character, it is managed properly (as a single Unicode character).

Unicode characters only exist for a handful of ligatures (I mean stylistic ligatures such as "ffi", not letter-like ones like or ), and they should not be used! They are only there for historical and compatibility reasons. Ligatures should be handled by the font alone.

It has already been discussed elsewhere, but it should be required for a font subsetter to properly consider ligatures, that is, at least not remove them if they are present in the original font... ideally, it should remove only unused ligatures (and "alternate" glyphs, etc., but that may be difficult to process).

JSWolf wrote:
> Then glyphIgo needs to be fixed. ADE 2.0 and some versions between 1.7.2 and 2.0 do convert to ligatures and if glyphIgo says everything is good to go in the fonts and it's not, then there could be missing characters.

Missing ligatures would not result in missing characters. A problem can only occur if a font with ligatures is subset without taking them into account; then at least two things could happen:

(a) The ligatures are completely removed. The original font may show, for instance, "ffi" or "Th" as ligatures, the subset font will simply show them as their individual characters, just like most renderers (which don't support ligatures) will do anyway.

(b) The subsetter is buggy, the ligatures are removed but their references are not. A renderer that does not support ligatures will not notice. A renderer that does, will show empty blocks or question marks where the ligatures would be.

But, assuming a given font is correct (i.e., it doesn't have references to non-existent characters), there's no way to know whether ligatures would have been used or not, and there's certainly nothing broken.

AlPe
03-31-2013, 05:48 AM
@JSWolf : thanks for the lead, but I think I will not implement that, for the reasons stated by Jellby (thanks for writing them out), and because the whole tool could be made way more "precise" on more cogent levels (e.g., by implementing full EPUB parsing, in particular, style resolution and the like). I just wanted to share with the members of the public a small tool that I coded for my own EPUB reading (& authoring) needs, which grew bigger and bigger while incorporating suggestions by friends and colleagues.

@Toxaris : no offence taken, I am sure JSWolf genuinely wanted to point the issue out. Moreover, I am not a native English speaker, and I know that sometimes I might sound impolite, simply because I do not master idiomatic forms.

Jellby
03-31-2013, 06:21 AM
AlPe wrote:
> @JSWolf : thanks for the lead, but I think I will not implement that, for the reasons stated by Jellby (thanks for writing them out), and because the whole tool could be made way more "precise" on more cogent levels (e.g., by implementing full EPUB parsing, in particular, style resolution and the like).

Just be aware that if you remove all ligatures when subsetting a font, you might be actually removing the very reason why the font was embedded to start with. Think of a calligraphic font, which might have special forms for ligatures or initial or final letters, maybe it doesn't look very good without the ligatures.

AlPe
04-02-2013, 11:18 AM
Thanks for the tip.

In the current version (1.19), glyphIgo simply retains those glyphs belonging to a given list (which is computed by filtering the source of the X(HT)ML files in the given EPUB). I am not sure what happens to the ligatures, I need to check what python-fontforge does in that case.

dgatwood
04-03-2013, 12:54 AM
In my experience, if you remove ligatures that exist in a 'calt' table and do not remove the corresponding 'calt' table entry, most font renderers will mindlessly display a rectangular box or a space where the character should be (i.e. case (b) on Jellby's list). YMMV.

In other words, if you are subsetting a font, you must do one of the following things:


1. Read in the 'calt' table and painstakingly go through all the glyphs to see if any character patterns match the pattern on the left (comparison) side, and if so, add the glyphs on the right (output) side to the list of glyphs to keep,
2. Read in the 'calt' table and keep any glyph that appears on the right (output) side no matter what, or
3. Strip out any 'calt' table entries where the glyph on the right (output) side is being stripped out, which in practice probably means stripping out the 'calt' table entirely.


#1 is most correct. #2 is also correct but results in slightly larger files. #3 is kind of lame, because it will probably strip out all contextual alternates, but at least it won't result in missing letters in your text.
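
The three options can be compared with a toy model. This is a deliberately simplified sketch: real substitution tables work on glyph IDs and contexts, whereas here each rule is just a (character pattern, output glyph name) pair, and all names are hypothetical. In every case the function also drops rules whose output glyph is not kept, so no rule is left pointing at a missing glyph.

```python
def subset_with_rules(text, rules, strategy):
    """Decide which glyphs and substitution rules survive subsetting.

    text:     the characters that appear in the book
    rules:    list of (pattern, output_glyph), e.g. ("fi", "f_i")
    strategy: 1, 2 or 3, matching the numbered options above
    Returns (set of glyphs to keep, list of rules to keep).
    """
    keep = set(text)
    if strategy == 1:
        # Keep only the outputs whose pattern really occurs in the text.
        keep |= {out for pattern, out in rules if pattern in text}
    elif strategy == 2:
        # Keep every output glyph, whether or not it is ever triggered.
        keep |= {out for _, out in rules}
    elif strategy != 3:
        raise ValueError("strategy must be 1, 2 or 3")
    # Option 3 adds nothing; rules with a stripped output are removed.
    kept_rules = [(pattern, out) for pattern, out in rules if out in keep]
    return keep, kept_rules


rules = [("fi", "f_i"), ("Th", "T_h")]
option1 = subset_with_rules("find", rules, 1)
option2 = subset_with_rules("find", rules, 2)
option3 = subset_with_rules("find", rules, 3)
```

With the text "find", option 1 keeps only the "f_i" glyph and its rule, option 2 keeps both ligature glyphs, and option 3 keeps neither.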

AlPe
04-04-2013, 02:26 PM
Wow, thanks for the detailed explanation.

Yesterday I had a quick read through the FontForge Python API, but I failed to find a clear lead about what happens to ligatures when subsetting a font. I need to go through it again with greater attention, but I am quite short on time lately.

roger64
07-02-2013, 11:01 AM
Hi

I was keen to test glyphIgo on LMDE 64 bits. I still have Python 2.7 but I cannot install python-htmlentitydefs and python-unicodedata using the software-manager because it did not find them.

AlPe
07-02-2013, 11:15 AM
Sorry, I do not know how LMDE repos work.

I think that unicodedata is a core module which is automatically installed when you install Python. In Debian there is no python-htmlentitydefs package, I think it gets installed if you install BeautifulSoup (python-beautifulsoup).
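
A quick way to check whether a module is already importable, independently of what the package manager lists, is to ask the interpreter directly. This is a generic sketch with a hypothetical helper name. Note that htmlentitydefs is the Python 2 module name (Python 3 renamed it html.entities), so the result for that name depends on which interpreter runs the check.

```python
import importlib


def module_available(name):
    """Return True if the module can be imported by this interpreter."""
    try:
        importlib.import_module(name)
    except ImportError:
        return False
    return True


# Run this with the same interpreter that will run glyphIgo.
status = {name: module_available(name)
          for name in ("unicodedata", "htmlentitydefs")}
```

If both entries are True under Python 2.7, no extra packages are needed for these two imports.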

mrmikel
07-02-2013, 01:18 PM
AlPe wrote:
> Wow, thanks for the detailed explanation.
>
> Yesterday I had a quick read through the FontForge Python API, but I failed to find a clear lead about what happens to ligatures when subsetting a font. I need to go through it again with greater attention, but I am quite short on time lately.

Based on dgatwood's description of the process, I think I would be short of time if I had a two week vacation!:rofl:

AlPe
03-08-2014, 06:07 AM
I moved glyphIgo to GitHub, and re-released it under the MIT license.

Please download the latest version from:

https://github.com/pettarin/glyphIgo

Enjoy!