Old 12-03-2008, 04:32 PM   #3
mikecook
Enthusiast
Posts: 35
Karma: 10
Join Date: Jun 2007
Location: United Kingdom
Device: iPad Mini, Nexus 7, Sony Reader, Kindle, and others.
Quote:
Originally Posted by kovidgoyal
Since you're using utf-8 encoding anyway, I suggest replacing the numeric entities with utf-8 characters. Makes for smaller file sizes and easier parsing.
Thanks Kovid, I've put that on the list of updates to be made.
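For illustration, a minimal Python sketch of the replacement Kovid suggests. It is not part of the scripts discussed in this thread; the function name `replace_numeric_entities` is hypothetical, and leaving `&`, `<` and `>` escaped is an assumption made so the markup stays well-formed.

```python
import re

# Matches decimal (&#8217;) and hexadecimal (&#x2019;) numeric character references.
NUMERIC_ENTITY = re.compile(r"&#(?:[xX]([0-9A-Fa-f]+)|([0-9]+));")

# Assumption: &, < and > stay escaped so the XML/HTML remains well-formed.
KEEP_ESCAPED = {0x26, 0x3C, 0x3E}


def replace_numeric_entities(text):
    """Replace numeric character references with the characters they name."""
    def _sub(match):
        hex_digits, dec_digits = match.groups()
        code = int(hex_digits, 16) if hex_digits else int(dec_digits)
        # Leave markup-significant, out-of-range and surrogate code points alone.
        if code in KEEP_ESCAPED or code > 0x10FFFF or 0xD800 <= code <= 0xDFFF:
            return match.group(0)
        return chr(code)

    return NUMERIC_ENTITY.sub(_sub, text)
```

The result still has to be written out as UTF-8 (for example with `open(path, "w", encoding="utf-8")`) for the file-size saving to apply.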

Quote:
Originally Posted by kovidgoyal
Are you planning to release your scripts to convert the gutenberg txt markup to HTML?
I actually use two scripts. The first, written in Perl, converts PG TXT to TEI; I then use XSL to convert the TEI into EPUB. My first XSL stylesheet actually produced HTML, but it has moved on quite a lot since then. That said, it shouldn't take too long to update the current stylesheet to output HTML.

The answer to your question is yes... and no. I do hope to release the scripts in the future, but at the moment they are not really very user-friendly or robust. I currently have 90 files converted (my test base), but once my site is live I will start churning out more titles, which means I can start to improve the scripts.

Quote:
Originally Posted by kovidgoyal
I've found that gutenberg books tend to have a lot of variation in their markup. How well does your script handle that?
My pg2tei.pl script does a pretty good job but, like you say, there are a lot of variations. The PG chapter headings are often inconsistent, so sometimes I need to do a little preprocessing: a regex search to find the headings, then adding extra blank lines after each one to help mark them out as chapter headings.
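The preprocessing step described above might look something like the following Python sketch. It is an illustration only, not code from pg2tei.pl: the heading pattern (the word "chapter" followed by a Roman or Arabic numeral) and the name `pad_chapter_headings` are assumptions, and real PG files will need other patterns.

```python
import re

# Assumed heading shape: optional indent, "chapter", then a Roman or Arabic
# numeral, then the rest of the line. Any blank lines that already follow the
# heading are consumed so the padding comes out the same every time.
CHAPTER_HEADING = re.compile(
    r"^([ \t]*chapter[ \t]+(?:[ivxlcdm]+|\d+)\b[^\n]*)\n(?:[ \t]*\n)*",
    re.IGNORECASE | re.MULTILINE,
)


def pad_chapter_headings(text, blank_lines=2):
    """Follow every matched chapter heading with exactly `blank_lines` blank lines."""
    def _pad(match):
        # One newline ends the heading line; the rest are the blank lines.
        return match.group(1) + "\n" * (blank_lines + 1)

    return CHAPTER_HEADING.sub(_pad, text)
```

A prose line that happens to begin with "Chapter I ..." would also match, so the output of a pass like this still needs a quick manual check.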

At the moment my footnote routines only do half their job... thankfully most files only have a few footnotes, so it hasn't been such a big problem. Still, I will improve this soon.

I catch most quotes, but some are missing from the source, and I still confuse single quotes with the apostrophes used in word contractions, e.g. 'nothin' could help 'im save the world'. It shouldn't be too hard to fix most, if not all, of these.
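One way to attack the apostrophe problem is a small rule-based classifier. The Python sketch below is a guess at a workable heuristic, not what pg2tei.pl does: the elision list, the dropped-g rule and the name `classify_single_quotes` are all assumptions.

```python
import re

# Assumed list of words commonly written with a leading apostrophe.
LEADING_ELISIONS = {"im", "em", "er", "tis", "twas", "twill", "cause", "bout"}


def classify_single_quotes(text):
    """Label each ' in text as 'open', 'close' or 'apostrophe', in order."""
    labels = []
    for i, ch in enumerate(text):
        if ch != "'":
            continue
        prev = text[i - 1] if i > 0 else ""
        nxt = text[i + 1] if i + 1 < len(text) else ""
        if prev.isalpha() and nxt.isalpha():
            # Letters on both sides: don't, it's, o'clock.
            labels.append("apostrophe")
        elif nxt.isalpha():
            # Start of a word: an elision such as 'im, otherwise an opening quote.
            word = re.match(r"[A-Za-z]+", text[i + 1:]).group(0).lower()
            labels.append("apostrophe" if word in LEADING_ELISIONS else "open")
        elif prev.isalpha():
            # End of a word: a dropped g (nothin') or a closing quote.
            word = re.search(r"[A-Za-z]+$", text[:i]).group(0).lower()
            dropped_g = len(word) > 3 and word.endswith("in")
            labels.append("apostrophe" if dropped_g else "close")
        else:
            # No letter on either side: decide from the preceding character.
            labels.append("close" if prev and not prev.isspace() else "open")
    return labels
```

On the example sentence above this yields open, apostrophe, apostrophe, close. Words such as "begin" at the end of a quotation would be misread as a dropped g, so a pass like this reduces the manual work rather than removing it.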

There are currently two really big areas that need improvement to speed up conversion.

Frontmatter: I process as much as I can but I still need to include the original frontmatter (between the TEI header and first chapter) so I can double check everything and add any missing info into the teiHeader and front sections.

Images: I can automatically mark up the TEI for images, but the PG txt files don't actually have any filename information. For images I need to go through the HTML version and manually add the filenames to the TEI. I should be able to add functionality to the script to read in the files from disk and populate the TEI file, but this is still prone to errors.
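As a rough Python sketch of automating that step, the image filenames could be pulled from the PG HTML version in document order and dropped into empty placeholders in the TEI. The placeholder form `<graphic url=""/>` and the names `image_sources` and `fill_graphics` are assumptions; the actual markup the script emits is not shown in this thread.

```python
import re
from html.parser import HTMLParser
from xml.sax.saxutils import quoteattr


class ImgCollector(HTMLParser):
    """Collect the src attribute of every <img> tag, in document order."""

    def __init__(self):
        super().__init__()
        self.sources = []

    def handle_starttag(self, tag, attrs):
        # Self-closing <img .../> tags are routed here by the base class too.
        if tag == "img":
            src = dict(attrs).get("src")
            if src:
                self.sources.append(src)


def image_sources(html_text):
    collector = ImgCollector()
    collector.feed(html_text)
    collector.close()
    return collector.sources


# Assumed placeholder that the TXT-to-TEI pass leaves for each image.
EMPTY_GRAPHIC = re.compile(r'<graphic url=""\s*/>')


def fill_graphics(tei_text, sources):
    """Fill empty <graphic> placeholders with filenames, strictly in order."""
    placeholders = EMPTY_GRAPHIC.findall(tei_text)
    if len(placeholders) != len(sources):
        # A mismatch means the TXT and HTML editions disagree; stop for a manual check.
        raise ValueError(
            "%d placeholders but %d images" % (len(placeholders), len(sources))
        )
    remaining = iter(sources)
    return EMPTY_GRAPHIC.sub(
        lambda m: "<graphic url=%s/>" % quoteattr(next(remaining)), tei_text
    )
```

Matching purely by order is the error-prone part: if the HTML edition has a decorative image that the TXT edition never mentions, the counts differ and the function refuses to guess.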

Basically, there is always going to be some manual work needed, but I hope to reduce this to a minimum pretty quickly.