Has anyone created or found a decent howto for converting Gutenberg text files to HTML in preparation for conversion to ebooks?
Here's what I've come up with so far as things to watch for (most of these are taken from Gutenberg's website):
- Remove Gutenberg boilerplate
- Escape existing <, >, and & characters
- Encode accented characters
- Convert and encode special characters: en dash, em dash, ellipsis
- Convert straight quotes and apostrophes to curly ones
- Clean up double spaces, remove spaces at end of lines
- Convert multiple blank lines to a rule (from Gutenberg)
- Add in HTML header and footer
- Add in paragraph marks
- Mark up headings
- Clean up special line breaks and indents
- Italics and bold
- Images
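
For the purely mechanical items in the list, something like the Python sketch below might be a starting point. It only covers boilerplate removal, escaping, whitespace cleanup, dashes/ellipses, and paragraph marks. The function names are made up for illustration, and the "*** START OF" / "*** END OF" marker lines are an assumption about how the file is laid out (the wording varies between Gutenberg files, so check by hand). It also treats every "--" as an em dash and joins each paragraph onto one line, which would mangle verse and other special line breaks.

```python
import html
import re

# Assumption: the file has "*** START OF ..." and "*** END OF ..." marker lines.
# The exact wording differs between Gutenberg files, so verify on each text.
START_RE = re.compile(r"^\*\*\* ?START OF .*$", re.MULTILINE)
END_RE = re.compile(r"^\*\*\* ?END OF .*$", re.MULTILINE)


def strip_boilerplate(text):
    """Keep only the text between the START and END marker lines, if present."""
    start = START_RE.search(text)
    if start:
        text = text[start.end():]
    end = END_RE.search(text)
    if end:
        text = text[:end.start()]
    return text


def clean_whitespace(text):
    """Drop spaces at the end of lines and collapse double spaces inside lines."""
    text = re.sub(r"[ \t]+$", "", text, flags=re.MULTILINE)
    return re.sub(r"(?<=\S) {2,}", " ", text)


def encode_dashes(text):
    """Replace ASCII stand-ins with numeric entities (run this after escaping)."""
    text = text.replace("--", "&#8212;")   # em dash
    text = text.replace("...", "&#8230;")  # ellipsis
    return text


def to_paragraphs(text):
    """Wrap each blank-line-separated block in <p> tags, joined onto one line."""
    blocks = re.split(r"\n\s*\n", text.strip())
    return "\n".join(
        "<p>" + " ".join(block.split()) + "</p>" for block in blocks if block.strip()
    )


def convert(text):
    text = strip_boilerplate(text)
    text = html.escape(text, quote=False)  # escapes &, <, > only
    text = clean_whitespace(text)
    text = encode_dashes(text)             # after escaping, so the & survives
    return to_paragraphs(text)
```

The order matters: escaping has to come before the entities are inserted, otherwise the & in each entity would itself get escaped.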
Does anyone have a decent search-and-replace for handling curly quotes? I found a couple of algorithms online, but they all run into trouble with special cases (nested quotes, quotes inside brackets), and they all seem to just give up on something like 5'10" inside a quote...
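
To make the problem concrete, here is a rough Python sketch of the kind of context-based heuristic I mean. It is not a known-good algorithm, just one possible approach under some assumptions: the function name is made up, a digit-quote-digits-doublequote pattern is assumed to always be a feet/inches measurement (converted to prime marks), and a quote is assumed to be an opening quote only at the start of a line or after whitespace, an opening bracket, or a dash. It will still get leading apostrophes wrong ('tis, '90s), so a manual pass is needed afterwards.

```python
import re

LSQ, RSQ = "\u2018", "\u2019"      # left/right single curly quote
LDQ, RDQ = "\u201c", "\u201d"      # left/right double curly quote
PRIME, DPRIME = "\u2032", "\u2033"  # prime and double prime (feet, inches)

# Assumption: digit ' digits " is always a measurement such as 5'10".
MEASURE_RE = re.compile(r"(\d)'(\d+)\"")

# A run of straight quotes at the start of a line, or right after whitespace,
# an opening bracket, a hyphen, or an em dash, is treated as opening quotes.
OPEN_RE = re.compile(r"(^|[\s(\[{\u2014-])([\"']+)", re.MULTILINE)


def _open_quotes(match):
    """Turn every straight quote in the matched run into its opening form."""
    quotes = match.group(2).replace('"', LDQ).replace("'", LSQ)
    return match.group(1) + quotes


def curl_quotes(text):
    # 1. Protect measurements first so their marks are not seen as quotes.
    text = MEASURE_RE.sub(r"\g<1>" + PRIME + r"\g<2>" + DPRIME, text)
    # 2. Quotes in an opening context become left quotes.
    text = OPEN_RE.sub(_open_quotes, text)
    # 3. Everything left is a closing quote or an apostrophe.
    return text.replace('"', RDQ).replace("'", RSQ)
```

With this, a line like "He's 5'10" tall," she said. comes out with prime marks on the measurement and curly quotes around the speech, and nested or bracketed quotes open and close correctly as long as each opening quote follows a space, bracket, or dash.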