Since you're using UTF-8 encoding anyway, I suggest replacing the numeric entities with the actual UTF-8 characters. That makes for smaller files and easier parsing.
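Something along these lines might do it. This is only a rough Python sketch, not anything from your scripts: the function name `decode_numeric_entities` and the `KEEP_ESCAPED` set are my own placeholders. It assumes you want to leave the entities for `<`, `>`, `&` and quotes alone, so the markup stays valid.

```python
import re

# Characters that should stay escaped so the HTML remains well-formed.
# (This set is an assumption; adjust it to your markup.)
KEEP_ESCAPED = {"<", ">", "&", '"', "'"}

# Matches decimal (&#233;) and hexadecimal (&#xE9;) numeric entities.
NUMERIC_ENTITY = re.compile(r"&#(?:[xX]([0-9a-fA-F]+)|([0-9]+));")


def decode_numeric_entities(html_text: str) -> str:
    """Replace numeric character references with the characters themselves."""

    def replace(match: re.Match) -> str:
        hex_digits, dec_digits = match.groups()
        code_point = int(hex_digits, 16) if hex_digits else int(dec_digits)
        # Leave NUL and surrogate code points untouched: they cannot be
        # written out as valid UTF-8 text.
        if code_point == 0 or 0xD800 <= code_point <= 0xDFFF:
            return match.group(0)
        try:
            char = chr(code_point)
        except (ValueError, OverflowError):
            # Out-of-range reference: keep the original entity as-is.
            return match.group(0)
        return match.group(0) if char in KEEP_ESCAPED else char

    return NUMERIC_ENTITY.sub(replace, html_text)


if __name__ == "__main__":
    print(decode_numeric_entities("caf&#233;"))  # prints "café"
```

You would read each file as UTF-8, run it through the function, and write it back out as UTF-8. Named entities such as `&eacute;` are not touched by this sketch.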
Are you planning to release the scripts you use to convert the Gutenberg txt markup to HTML? I've found that Gutenberg books vary a lot in their markup. How well does your script handle that?