Old 10-14-2012, 03:29 PM   #29
DiapDealer
Because I have much more time than sense, I've done some more work on the script that counts/collects the characters used in files.

Building on the core that Man Eating Duck posted, this script works on a single ePub, (X)HTML, or text file. Besides filtering all of the HTML markup and attributes out of the results, it also converts entities (named or otherwise) to their rendered equivalents.
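
For anyone curious how the tag filtering and entity conversion can be done, here is a minimal sketch. It is not the attached script: it is written for Python 3 (the attached script targets Python 2), and the names `TextCollector` and `characters_used` are made up for illustration. It leans on the standard library's `html.parser`, which decodes entities for you when `convert_charrefs=True`.

```python
from collections import Counter
from html.parser import HTMLParser


class TextCollector(HTMLParser):
    """Hypothetical helper: counts rendered characters, ignoring markup."""

    # Text inside these elements is never rendered, so it is skipped.
    SKIPPED = ("script", "style")

    def __init__(self):
        # convert_charrefs=True decodes entities such as &eacute; or &#8212;
        # before the text reaches handle_data().
        super().__init__(convert_charrefs=True)
        self.counts = Counter()
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Only text nodes arrive here; tag names and attributes never do.
        if not self._skip_depth:
            self.counts.update(data)


def characters_used(markup):
    """Return a Counter mapping each rendered character to its frequency."""
    parser = TextCollector()
    parser.feed(markup)
    parser.close()  # flush any text still buffered by the parser
    return parser.counts


if __name__ == "__main__":
    sample = '<p class="x">caf&eacute; &amp; &#8212;</p>'
    for char, count in sorted(characters_used(sample).items()):
        print(repr(char), count)
```

The same function will also take plain text, since a string without tags is just one big text node (though any literal `&name;` sequences in it would still be decoded).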

It can also limit the results to a single specified CSS class (handy for determining the font subset required for headings or drop caps).
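
One way the class restriction could work is sketched below. Again, this is a Python 3 illustration rather than the attached script, and `ClassTextCollector`, `characters_in_class`, and the `VOID` set are names I am assuming for the example. It assumes reasonably well-formed markup (every non-void element is closed), which is normally true for XHTML inside an ePub.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag in HTML, so they are not tracked.
VOID = {"area", "base", "br", "col", "embed", "hr", "img", "input",
        "link", "meta", "source", "track", "wbr"}


class ClassTextCollector(HTMLParser):
    """Hypothetical helper: collects text inside elements of one CSS class."""

    def __init__(self, css_class):
        super().__init__(convert_charrefs=True)
        self.css_class = css_class
        self.chars = set()
        # One flag per open element: True if it, or an ancestor, has the class.
        self._stack = []

    def handle_starttag(self, tag, attrs):
        if tag in VOID:
            return
        classes = (dict(attrs).get("class") or "").split()
        inherited = bool(self._stack) and self._stack[-1]
        self._stack.append(inherited or self.css_class in classes)

    def handle_endtag(self, tag):
        if tag not in VOID and self._stack:
            self._stack.pop()

    def handle_data(self, data):
        if self._stack and self._stack[-1]:
            self.chars.update(data)


def characters_in_class(markup, css_class):
    """Return the set of characters rendered inside elements of css_class."""
    parser = ClassTextCollector(css_class)
    parser.feed(markup)
    parser.close()
    return parser.chars


if __name__ == "__main__":
    page = '<h1 class="title">Ab<br/>c</h1><p><span class="dropcap">Q</span>uick</p>'
    print(sorted(characters_in_class(page, "dropcap")))
```

Nested elements inherit the flag from their parent, so a `<span>` inside a matching heading is still counted.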

Python often has trouble printing certain Unicode characters to the console on Windows, so Windows users should consider writing the results to a file (the -o option) and then viewing that file in an editor that supports the required character encoding.
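
To show what writing to a file buys you, here is a small Python 3 sketch (not the attached script; `write_characters` is a hypothetical name). The point is that the encoding is chosen explicitly when the file is opened, so the console's code page never gets involved.

```python
def write_characters(chars, path, encoding="utf-8"):
    """Write the unique characters, sorted by code point, on a single line."""
    # An explicit encoding avoids depending on the platform default,
    # which on Windows is often a legacy code page.
    with open(path, "w", encoding=encoding) as handle:
        handle.write("".join(sorted(set(chars))))
        handle.write("\n")


if __name__ == "__main__":
    write_characters("caf\u00e9 \u2014 na\u00efve", "characters.txt")
```

Open the resulting file in an editor set to the same encoding (UTF-8 here) to see the characters rendered properly.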

Should work with Python 2.5 - 2.7 (maybe even earlier).

A sample XHTML file is included for testing/benchmarking.

Spoiler:
Code:
USAGE: characters-used.py [-h] [-o OUTFILE] [-e ENCODING] [-c CSSCLASS] FILE

This script will parse an epub/html/text file and generate a list of
unique characters used in that file.

Positional arguments:
   FILE         Input file (epub html text).

Optional arguments:
   -h, --help           show this help message and exit.
   -o OUTFILE, --outfile OUTFILE
                Output file for unique character list. (default: None)
   -e ENCODING, --encoding ENCODING
                Character encoding of input file. (default: utf-8)
   -c CSSCLASS, --cssclass CSSCLASS
                Restrict results to a specific CSS class. (default: None)
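
For reference, a command line like the one above could be declared roughly as follows with `argparse`. This is a guess at the shape, not the attached script's actual code: `argparse` only entered the standard library in Python 2.7, so a script that runs on 2.5 would need a bundled copy or a different parser. `build_parser` is a hypothetical name.

```python
import argparse


def build_parser():
    """Build a parser mirroring the usage text above (illustrative only)."""
    parser = argparse.ArgumentParser(
        prog="characters-used.py",
        description="This script will parse an epub/html/text file and "
                    "generate a list of unique characters used in that file.",
        # Appends "(default: ...)" to each option's help text.
        formatter_class=argparse.ArgumentDefaultsHelpFormatter,
    )
    parser.add_argument("file", metavar="FILE",
                        help="Input file (epub html text).")
    parser.add_argument("-o", "--outfile", default=None,
                        help="Output file for unique character list.")
    parser.add_argument("-e", "--encoding", default="utf-8",
                        help="Character encoding of input file.")
    parser.add_argument("-c", "--cssclass", default=None,
                        help="Restrict results to a specific CSS class.")
    return parser


if __name__ == "__main__":
    args = build_parser().parse_args()
    print(args.file, args.outfile, args.encoding, args.cssclass)
```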
Attached Files
File Type: zip characters-used.zip (4.5 KB, 248 views)

Last edited by DiapDealer; 10-14-2012 at 03:31 PM.