Next consideration...
... how to architect this for internationalization. My own needs already involve two languages, English and Hungarian, and I would prefer that adding further ones be a reasonably straightforward process.
Part of me wonders whether the easiest thing might be to separate the processing functions out into language-specific Python modules, to be included by the main pacify.py script as needed. But I'm all but certain that's an inelegant, probably substandard approach. What if, after all, a given document has text in two languages? (Both RTF and HTML have language-tagging capacity, so applying the appropriate rules is at least conceivable.)
I'm also aware that the best-case scenario would be to store the rules in some sort of externally loadable data/config file. But I suspect that might ultimately end up being overly complicated...
How do you store a rule like the following in a data file?
Quote:
When encountering an apostrophe, if the document doesn't already contain smart single quotes, check whether a single-quoted section is already open. If it is, and there is a space to the right of the apostrophe, consider it a closing apostrophe, unless it follows the letter S, in which case look further ahead to ascertain whether there is a better "closing single quote" candidate later in the paragraph... et cetera, et cetera, et cetera.
To store that externally, I'd be forcing myself to create a quasi-scripting language, I think, which, given that Python itself is already a scripting language, seems silly.
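That said, the purely character-level pieces of the rules could live in a data file easily enough; it's only the procedural logic that resists it. A minimal sketch of what I mean (the key names and layout are my own invention, nothing pacify.py actually has yet):

Code:
import json

# Hypothetical per-language quote tables: the kind of rule that CAN
# sit in a plain data file. The apostrophe logic quoted above cannot
# be expressed this way.
rules = json.loads("""
{
  "en": {"open_single": "\\u2018", "close_single": "\\u2019",
         "open_double": "\\u201C", "close_double": "\\u201D"},
  "hu": {"open_double": "\\u201E", "close_double": "\\u201D"}
}
""")

print(rules["hu"]["open_double"])  # Hungarian opens double quotes low: „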
Maybe have the main program contain a "skeleton" of all processing functions, which would then (based on command-line options and/or imported metadata) call language-specific versions of those functions on the fly at runtime?
FixParagraphs(text) would load FixParagraphs.hu.py or FixParagraphs.en.py depending on the language and do an eval("FixParagraphs_" + curlang + "(text)").
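Something like this, maybe, though importlib plus getattr would be safer than eval, and dotted filenames like FixParagraphs.hu.py aren't importable as modules anyway. A rough sketch, assuming per-language modules fix_en.py / fix_hu.py (those names are mine) that each define their own FixParagraphs:

Code:
import importlib

def fix_paragraphs(text, lang):
    # Load e.g. fix_en.py or fix_hu.py on demand, then look up its
    # FixParagraphs function by name -- no eval needed.
    mod = importlib.import_module("fix_" + lang)
    return getattr(mod, "FixParagraphs")(text)

# fixed = fix_paragraphs(text, "hu")

That also fails loudly (ImportError or AttributeError) when a language isn't supported yet, instead of eval-ing an arbitrary string.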
I'll continue thinking on it further...
- Ahi