Quote:
|
Originally Posted by FangornUK
Nice work! Looks very promising.
You seem to be doing some Gutenberg-specific detection. A simple clean-up for the HTML versions is page number stripping; I do that in gutlrf.pl like so:
# non-greedy (.*?) plus /g so that several page-number spans on one line are each removed
$_ =~ s#<span class='pagenum'>.*?</span>##g;
$_ =~ s#<span class="pagenum">.*?</span>##g;
I'll post more bug reports to the Google Code site.
|
Can you point me to a book on Gutenberg that has these page number spans? I was using The Adventures of Sherlock Holmes, but it doesn't have any.
BTW, does Gutenberg have a recommended HTML format that you're aware of, or are they at the mercy of every submitter's idea of what good HTML is?
If I wind up doing a lot of HTML cleanup, I'll probably implement it so that it reads the cleanup patterns (maybe like the two you posted above) from a data file, so that users can add their own.
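
A minimal sketch of that data-file idea, in Python purely for illustration (the thread doesn't say what language the tool is written in). The rule-file format, the names `load_rules` and `clean_html`, and the sample HTML are all assumptions made up for this example; only the two pagenum patterns come from the quoted post, written here in non-greedy form.

```python
import re

# Hypothetical rule file contents: one regular expression per line.
# Blank lines and lines starting with '#' are skipped.
DEFAULT_RULES = """\
# page-number spans, as in the two patterns quoted above
<span class='pagenum'>.*?</span>
<span class="pagenum">.*?</span>
"""

def load_rules(text):
    """Compile one regex per non-blank, non-comment line of the rule text."""
    rules = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        rules.append(re.compile(line))
    return rules

def clean_html(html, rules):
    """Delete every match of every rule from the HTML string."""
    for rule in rules:
        html = rule.sub("", html)
    return html

rules = load_rules(DEFAULT_RULES)
sample = "<p>It was<span class='pagenum'>[Pg 12]</span> a dark night.</p>"
print(clean_html(sample, rules))  # prints: <p>It was a dark night.</p>
```

In a real tool the rule text would come from a user-editable file rather than a string constant, so adding a new cleanup would just mean appending a line. Note that `.` does not match newlines by default, so a span that wraps across lines would need a different pattern.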