Old 10-09-2012, 09:18 AM   #7
Man Eating Duck
Addict
Posts: 254
Karma: 69786
Join Date: May 2006
Location: Oslo, Norway
Device: Kobo Aura, Sony PRS-650
Quote:
Originally Posted by DiapDealer
It'd be nice to eliminate all characters from the script that occur inside html tags. Those wouldn't necessarily need to be a part of any embedded font since they won't be rendered.
My (rather stupid) script expects plain UTF-8 text files. You can get those by converting an epub to txt in calibre (remember to specify utf-8 as the output encoding). Most authoring software can probably save as txt as well. Formatting doesn't really matter as long as every character is included. This could have been more convenient, but parsing html is beyond my abilities, and I also want these results before I make an epub.

Since you might only be interested in special characters, you could just add the regular characters you don't care about to disallowed = set('') in line 6, e.g.
Code:
disallowed = set('abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789')
This should exclude them from the results. Add as many as you feel like.
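For anyone who wants to see the idea in isolation, here is a minimal sketch of that kind of filtering. It is not the actual script: the function names special_characters and special_characters_in_file are made up for illustration, and skipping whitespace is my own assumption. Only the disallowed set mirrors the line above.

```python
import string

# Same characters as the disallowed example above: ASCII letters and digits.
disallowed = set(string.ascii_letters + string.digits)


def special_characters(text, disallowed=disallowed):
    """Return the unique characters in text that are not disallowed.

    Hypothetical helper, not part of the original script.
    Whitespace is skipped too (an assumption made for this sketch).
    """
    return sorted(
        ch for ch in set(text)
        if ch not in disallowed and not ch.isspace()
    )


def special_characters_in_file(path):
    """Read a UTF-8 text file and report its special characters."""
    with open(path, encoding="utf-8") as f:
        return special_characters(f.read())


# Small in-memory example: "Café naïve – 42 æbler"
sample = "Caf\u00e9 na\u00efve \u2013 42 \u00e6bler"
found = special_characters(sample)
print(found)
```

To run it on a txt file exported from calibre, call special_characters_in_file with the path to that file. Using a set means each character is reported once, no matter how often it occurs in the text.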

The script wasn't really intended for publication, so it's unfortunately pretty rough, and I don't have enough experience to improve it. It works for my needs, though.