Some of the comments, and particularly the special focus on quotation marks, make me wonder whether my original "dream" of a single script that takes a plaintext document from A to Z is misguided. (Even though, as I stated earlier, it seems to me that GutenMark does a reasonable job of that.)
Perhaps a better way would be to write small utilities, each of which focuses on just one aspect of the document cleanup/conversion/fix-up process. I might play around with a quotation mark fixing utility myself when I get a chance over the next week or so.
Some other utilities I can think of:
- metadata recognizer (i.e., figures out title, author, chapter titles, etc.)
- paragraph normalizer (joins manually wrapped lines within a paragraph, keeps only one break between paragraphs)
- emphasis normalizer (converts the myriad ways of indicating emphasis into a single standard markup, ideally one that is simple to parse accurately)
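To make the paragraph normalizer a bit more concrete, here is a minimal sketch in Python. The function name is my own invention, and it assumes that paragraphs are separated by at least one blank line and that nothing in the text (verse, tables, indented quotations) depends on its original line breaks.

```python
import re

def normalize_paragraphs(text: str) -> str:
    """Join hard-wrapped lines; keep exactly one blank line between paragraphs."""
    # Assumption: a run of one or more blank lines marks a paragraph break.
    text = text.replace("\r\n", "\n")
    paragraphs = re.split(r"\n\s*\n", text.strip())
    # Within a paragraph, collapse line breaks and runs of spaces to one space.
    joined = [" ".join(p.split()) for p in paragraphs]
    return "\n\n".join(joined) + "\n"

if __name__ == "__main__":
    sample = "First line\nsecond line.\n\n\n\nNext para\nhere.\n"
    print(normalize_paragraphs(sample), end="")
```

A real tool would need to leave poetry and similar layout-sensitive passages alone, which is exactly the sort of case where a human might have to be asked.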
I think of all of these utilities as command line tools that do as much as possible without human intervention but ask for manual arbitration when they hit a case that requires a judgement call (or, rather, an actual understanding of the text).
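As a rough illustration of the "automatic where possible, ask where not" idea, here is a hypothetical emphasis normalizer sketched in Python. Everything in it is an assumption on my part: the choice of `<em>...</em>` as the target markup, the rule that `*word*` and `_word_` are always emphasis, and the rule that all-caps words are ambiguous (emphasis or acronym) and get referred to a human through the `ask` callback.

```python
import re
from typing import Callable

# Unambiguous (by assumption): text wrapped in *...* or _..._ on a single line.
DELIMITED = re.compile(r"(?<!\w)([*_])(\S(?:.*?\S)?)\1(?!\w)")
# Ambiguous (by assumption): all-caps words, which may be emphasis or acronyms.
CAPS = re.compile(r"\b[A-Z]{2,}\b")

def normalize_emphasis(text: str, ask: Callable[[str], bool]) -> str:
    """Convert emphasis to <em>...</em>, calling ask(word) for judgement calls."""
    # Pass 1: convert delimited emphasis without asking anyone.
    text = DELIMITED.sub(r"<em>\2</em>", text)

    # Pass 2: let the arbiter decide each all-caps word.
    def decide(match: re.Match) -> str:
        word = match.group(0)
        # Lowercasing is a simplification; it loses sentence-initial capitals.
        return f"<em>{word.lower()}</em>" if ask(word) else word

    return CAPS.sub(decide, text)

if __name__ == "__main__":
    def prompt(word: str) -> bool:
        return input(f"Is {word!r} emphasis? [y/N] ").strip().lower().startswith("y")
    print(normalize_emphasis("That is NOT what NASA *really* said.", prompt))
```

Passing the arbiter in as a function keeps the automatic part testable on its own; on the command line it is simply a prompt.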
Does anything like this exist? Is the idea kind of crazy, or kind of sensible?
Sincerely,
AHI