Old 04-22-2008, 02:58 AM   #1
cerement
Groupie
Posts: 170
Karma: 2000
Join Date: Apr 2008
Location: San José, CA
Device: Amazon Kindle 1, Sony PRS-300, Amazon Kindle 3
Prepping texts for conversion?

Has anyone created or found a decent howto for converting Gutenberg text files to HTML in preparation for conversion to ebooks?

Here's what I've come up with so far as things to watch for (most of them taken from Gutenberg's website):
  1. Remove Gutenberg boilerplate
  2. Escape any existing <, >, and & characters
  3. Encode accented characters
  4. Convert and encode special characters (en dash, em dash, ellipsis)
  5. Convert straight quotes and apostrophes to curly ones
  6. Clean up double spaces, remove spaces at end of lines
  7. Convert multiple blank lines to a rule (from Gutenberg)
  8. Add in HTML header and footer
  9. Add in paragraph marks
  10. Mark up headings
  11. Clean up special line breaks and indents
  12. Italics and bold
  13. Images
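
To make a few of those steps concrete, here is a rough Python sketch of steps 1 to 4, 6, 7 and 9. It is only a starting point under some assumptions of mine: the "*** START OF" / "*** END OF" marker patterns are guesses (the boilerplate wording varies between Gutenberg files), "multiple blank lines" is taken to mean three or more, and the function names are made up for illustration.

```python
import html
import re

# Assumed marker patterns; Gutenberg boilerplate varies between files,
# so these will need adjusting for some texts.
START_RE = re.compile(r"^\*\*\* ?START OF.*$", re.MULTILINE)
END_RE = re.compile(r"^\*\*\* ?END OF.*$", re.MULTILINE)


def strip_boilerplate(text):
    """Step 1: keep only what lies between the start and end markers."""
    start = START_RE.search(text)
    end = END_RE.search(text)
    begin = start.end() if start else 0
    stop = end.start() if end else len(text)
    return text[begin:stop].strip()


def encode_specials(text):
    """Steps 3-4: turn -- and ... into real characters, then encode
    every non-ASCII character as a numeric character reference."""
    text = text.replace("--", "\u2014")   # double hyphen -> em dash
    text = text.replace("...", "\u2026")  # three dots -> ellipsis
    return text.encode("ascii", "xmlcharrefreplace").decode("ascii")


def text_to_html_body(text):
    text = text.replace("\r\n", "\n")
    text = strip_boilerplate(text)
    # Step 2: escape &, <, > first so later entities are not double-escaped.
    text = html.escape(text, quote=False)
    text = encode_specials(text)
    # Step 6: drop spaces and tabs at the ends of lines.
    text = re.sub(r"[ \t]+$", "", text, flags=re.MULTILINE)
    # Step 7 (assumption): three or more blank lines become a rule.
    text = re.sub(r"\n{4,}", "\n\n<hr />\n\n", text)
    out = []
    for block in re.split(r"\n{2,}", text):
        if block == "<hr />":
            out.append(block)
        elif block.strip():
            # Step 9: rejoin the hard-wrapped lines and wrap in <p>.
            # split()/join also collapses double spaces (step 6).
            out.append("<p>" + " ".join(block.split()) + "</p>")
    return "\n".join(out)
```

Note that joining every block into one paragraph throws away the special line breaks and indents from step 11 (poetry, letters, and so on), so those would still need handling by hand or with extra rules.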

Does anyone have a decent search-and-replace for handling curly quotes? I've found a couple of algorithms online, but they all run into special cases (nested quotes, quotes inside brackets), and they all seem to give up when faced with something like 5'10" inside a quote ...
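
For reference, one possible heuristic looks like the Python sketch below. It is not a complete solution, just an illustration of the usual approach: convert feet-and-inches patterns like 5'10" to prime marks first, then treat a quote as opening if it sits at the start of the text or follows whitespace, an opening bracket, or a dash, and treat everything left over as closing (or an apostrophe). The function name and the exact character classes are my own choices.

```python
import re

# Feet-and-inches such as 5'10" become prime and double prime marks.
FEET_INCHES_RE = re.compile(r"(\d+)'(\d+)\"")
# A quote "opens" at the start of the text or after whitespace,
# an opening bracket, or a dash.
OPEN_DOUBLE_RE = re.compile(r'(^|[\s(\[{\u2014\u2013-])"')
# Single quotes may also open right after an opening double quote.
OPEN_SINGLE_RE = re.compile(r"(^|[\s(\[{\u2014\u2013\u201c-])'")


def curl_quotes(text):
    # Handle measurements before any quote logic sees them.
    text = FEET_INCHES_RE.sub("\\g<1>\u2032\\g<2>\u2033", text)
    # Opening double quotes, then every remaining one is closing.
    text = OPEN_DOUBLE_RE.sub("\\g<1>\u201c", text)
    text = text.replace('"', "\u201d")
    # Opening single quotes, then the rest are closing quotes/apostrophes.
    text = OPEN_SINGLE_RE.sub("\\g<1>\u2018", text)
    text = text.replace("'", "\u2019")
    return text
```

Known gaps: a leading apostrophe ('tis, '90s) gets curled as an opening quote, and a bare measurement like 10" is left to the closing-quote rule because it can't be told apart from a quotation that ends in a number. Those cases probably still need a manual pass.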