templatemaker (http://code.google.com/p/templatemaker/) looks like the perfect thing for the web2disk utility:
Given a list of text files in a similar format, templatemaker creates a template that can extract data from files in that same format.
The library is written in Python, but the underlying longest-common-substring algorithm is implemented in C for performance.
Check out the example usage!
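
The project's own examples are the authoritative reference. Purely to illustrate the idea, here is a rough sketch of how a template could be learned from two sample strings and then used for extraction. It is not templatemaker's API or its C implementation: the function names learn_template, template_to_regex and extract are made up for this post, and Python's standard difflib.SequenceMatcher stands in for the longest-common-substring step.

```python
import re
from difflib import SequenceMatcher


def learn_template(a, b):
    """Return the literal chunks shared by a and b, in order.

    Hypothetical helper: SequenceMatcher repeatedly picks the longest
    common block and recurses to its left and right, which approximates
    the longest-common-substring approach described above.
    """
    matcher = SequenceMatcher(None, a, b, autojunk=False)
    blocks = matcher.get_matching_blocks()
    # The final block is a zero-length sentinel, so filter it out.
    return [a[m.a:m.a + m.size] for m in blocks if m.size > 0]


def template_to_regex(chunks):
    """Turn the shared chunks into a regex with one "hole" between each pair."""
    pattern = "(.*?)".join(re.escape(chunk) for chunk in chunks)
    return re.compile("^" + pattern + "$", re.DOTALL)


def extract(chunks, text):
    """Return the hole contents for text, or None if text does not fit."""
    match = template_to_regex(chunks).match(text)
    return match.groups() if match else None


# Learn from two samples in the same format, then extract from a third.
chunks = learn_template("<b>this and that</b>", "<b>alex and sue</b>")
result = extract(chunks, "<b>larry and curly</b>")
```

With these inputs the shared chunks are "<b>", " and " and "</b>", so the third string yields the two variable parts. This sketch only places holes between shared chunks and only learns from two samples, so treat it as a toy next to the real library.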
Sam Krupa