05-17-2009, 03:39 PM   #31
ahi
Some of the comments, particularly the special focus on quotation marks, make me wonder whether my original "dream" of a single script to take a plaintext document from A to Z is misguided. (Even though, as I stated earlier, it seems to me GutenMark does a reasonable job of that.)

Perhaps a better way would be to write small utilities, each of which focuses on just one aspect of the document cleanup/conversion/fix-up process. I myself might play around with a quotation mark fixing utility when I get a chance over the next week or so.

Some other utilities I can think of:

- metadata recognizer (i.e., figures out title, author, chapter titles, etc.)
- paragraph normalizer (removes manual linebreaks between lines, keeps only one blank line between paragraphs)
- emphasis normalizer (converts the myriad ways of indicating emphasis into a single standard markup, ideally one that is simple to parse accurately)
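
To make the last two items concrete, here is a rough Python sketch of what I have in mind. The function names and the choice of *asterisks* as the target emphasis markup are just my assumptions for illustration, and it only recognizes _underscores_ and /slashes/ as input conventions; real texts will certainly have more.

```python
import re

def normalize_paragraphs(text: str) -> str:
    """Join hard-wrapped lines and keep exactly one blank line between paragraphs."""
    # Treat all line-ending styles the same way.
    text = text.replace("\r\n", "\n").replace("\r", "\n")
    # A paragraph break is one or more blank (whitespace-only) lines.
    blocks = re.split(r"\n\s*\n", text.strip())
    paragraphs = [
        " ".join(line.strip() for line in block.splitlines() if line.strip())
        for block in blocks
    ]
    return "\n\n".join(p for p in paragraphs if p)

# Assumed input conventions: _word_ and /word/. The lookarounds keep the
# pattern from firing inside identifiers such as snake_case_name.
EMPHASIS = re.compile(r"(?<!\w)([_/])(?=\S)(.+?)(?<=\S)\1(?!\w)")

def normalize_emphasis(text: str) -> str:
    """Rewrite the assumed emphasis conventions as *asterisk* markup."""
    return EMPHASIS.sub(r"*\2*", text)
```

Running the paragraph pass first means emphasis that was split across a hard line wrap ends up on one line before the emphasis pass sees it. Things like file paths with slashes will still fool the pattern, which is exactly the kind of case I would want flagged for a human.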

All of these utilities I think of as being command line tools that do as much as possible without human intervention but ask for human/manual arbitration when up against a case that requires a judgement call (or, rather, actually understanding the text).
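
As a sketch of that "automatic where possible, ask where not" pattern, here is how a quotation mark fixer might be shaped in Python. The classification rules are deliberately naive and entirely my own guess, and the names (`classify`, `fix_quotes`, `ask_on_terminal`) are hypothetical; the point is only that the arbitration step is a pluggable callback.

```python
from typing import Callable, Optional

OPEN, CLOSE = "\u201c", "\u201d"  # curly double quotes

def classify(text: str, i: int) -> Optional[str]:
    """Guess whether the straight quote at index i opens or closes.

    Returns None when the surrounding characters do not settle it.
    """
    before = text[i - 1] if i > 0 else " "
    after = text[i + 1] if i + 1 < len(text) else " "
    if before.isspace() or before in "([{":
        # Space on both sides: cannot tell, leave it to the human.
        return None if after.isspace() else OPEN
    if after.isspace() or after in ".,;:!?)]}":
        return CLOSE
    return None

def fix_quotes(text: str, ask: Callable[[str, int], str]) -> str:
    """Replace straight double quotes, calling ask(text, i) for unclear cases."""
    out = []
    for i, ch in enumerate(text):
        if ch != '"':
            out.append(ch)
            continue
        choice = classify(text, i)
        out.append(choice if choice is not None else ask(text, i))
    return "".join(out)

def ask_on_terminal(text: str, i: int) -> str:
    """Manual arbitration: show some context and let the user decide."""
    context = text[max(0, i - 30): i + 30]
    answer = input(f"...{context}...\n[o]pen, [c]lose, or keep straight? ")
    return {"o": OPEN, "c": CLOSE}.get(answer.strip().lower(), '"')
```

Calling `fix_quotes(text, ask_on_terminal)` gives the interactive command line behaviour; passing a different callback (say, one that always keeps the straight quote and logs the position) gives a batch mode without touching the rest of the tool.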

Does anything like this exist? Is the idea kind of crazy, or kind of sensible?

Sincerely,

AHI