Originally Posted by DiapDealer
It'd be nice to eliminate all characters from the script that occur inside html tags. Those wouldn't necessarily need to be a part of any embedded font since they won't be rendered.
My (rather stupid) script expects pure UTF-8 text files. You could get those by converting an epub to txt in calibre (remember to specify UTF-8 as the output encoding). Most authoring software can probably save to txt as well. Formatting doesn't really matter as long as every character is included. Working directly on the epub might have been more convenient, but parsing html is outside of my abilities, and I want those results before making an epub anyway.
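For anyone who would rather skip the txt conversion, here is a rough sketch of how the tag stripping could be done with nothing but Python's standard html.parser module. It is not part of my script: the names VisibleText and visible_text are made up for illustration, and skipping the contents of script and style elements is my own assumption about what counts as "not rendered".

```python
from html.parser import HTMLParser


class VisibleText(HTMLParser):
    """Collect text nodes only; tag names and attributes are never stored."""

    # Assumption: the contents of these elements are never rendered as text.
    SKIPPED = {"script", "style"}

    def __init__(self):
        # convert_charrefs=True turns entities such as &eacute; into characters.
        super().__init__(convert_charrefs=True)
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self.parts.append(data)


def visible_text(markup):
    """Return the text of an (x)html string with all markup removed."""
    parser = VisibleText()
    parser.feed(markup)
    parser.close()  # flush any text still buffered by the parser
    return "".join(parser.parts)


if __name__ == "__main__":
    print(visible_text('<p class="x">Caf&eacute; <b>№</b></p>'))
```

The returned string could be saved as a UTF-8 txt file and fed to the script as usual.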
Since you might only be interested in special characters, you could add the regular characters you don't care about to disallowed = set('') on line 6 of the script, i.e.
disallowed = set('abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789')
This should exclude them from the results. Add as many as you feel like.
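To illustrate the idea, here is a minimal stand-alone sketch of that kind of filtering. It is not the actual script: the function name used_characters is hypothetical, and dropping whitespace from the results is an assumption on my part. The disallowed set built from the string module contains the same characters as the example above.

```python
import string


def used_characters(text, disallowed=frozenset()):
    """Return the unique characters in text, sorted by code point.

    Characters in `disallowed` are left out. Whitespace is also dropped,
    which is an assumption of this sketch.
    """
    return sorted(
        ch for ch in set(text)
        if ch not in disallowed and not ch.isspace()
    )


# Same characters as the example set: a-z, A-Z and 0-9.
disallowed = set(string.ascii_letters + string.digits)

if __name__ == "__main__":
    sample = "Café № 5 … naïve"
    print(used_characters(sample, disallowed))
```

For a real book you would read the text with open(path, encoding='utf-8') and pass the contents to the function in place of the sample string.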
The script wasn't really intended for publication, so it's unfortunately pretty rough, and I don't really have enough experience to improve it. It works for my needs, though.