Originally Posted by DuckieTigger
Got a few things about Penelope. I finally downloaded it and looked through your code. It appears to me that you hacked the XML support in afterwards, especially since you parse the file by hand reinventing the wheel. Is that what happened?
For stardict the custom parser might be useful, but for XML it is not. The parser is invoked after you already gobblesmacked the XML file apart. Any useful information there might have been other than key and def is gone. If you are capable of writing your own custom parser, then you should be able to output an XML file with all the information necessary for Penelope. E.g. synonyms as an optional part of an <entry>. I'll rewrite your read_from_xml_format() - maybe you will like it.
Yes, the XML parser was added later. But its philosophy is clear: start from a set of (word, definition) pairs and read them. This is generally the case when you unpack a MOBI dictionary, for example. In that case, the custom parser part is still useful. For example, I might decide to extract synonyms from the definitions (say, if some of my definitions, but not all, contain a <p>SYN: ... </p>).
But I agree that, in general, one can code a more complex "XML" parser. I just wanted it to be quick and dirty, with the bare minimum needed by the remaining code, as I explain on the web page. But if you want to send me your code, I will look at it and integrate it into the tool.
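To make the "extract synonyms from the definitions" idea concrete, here is a minimal sketch of what such a custom parser could look like. It is not Penelope's actual code: the function name parse_entry, the regular expression, the comma-separated SYN list, and the choice of True and an empty list for the include and substitutions fields are all assumptions made for illustration.

```python
import re

# Hypothetical pattern: matches a paragraph such as <p>SYN: foo, bar</p>.
# The real markup in a given dictionary may well differ.
SYN_RE = re.compile(r"<p>\s*SYN:\s*(.*?)\s*</p>", re.IGNORECASE | re.DOTALL)


def parse_entry(word, definition):
    """Return [word, include, synonyms, substitutions, definition].

    Synonyms are collected from every SYN paragraph found in the
    definition; entries without such a paragraph get an empty list.
    """
    synonyms = []
    for match in SYN_RE.finditer(definition):
        # Assumption: synonyms inside the paragraph are comma-separated.
        for candidate in match.group(1).split(","):
            candidate = candidate.strip()
            if candidate and candidate != word:
                synonyms.append(candidate)
    # Assumption: include=True and no substitutions for a plain entry.
    return [word, True, synonyms, [], definition]


entry = parse_entry("abbazia", "<p>abbey</p><p>SYN: abbadia</p>")
print(entry[2])  # ['abbadia']
```

Words whose definition has no SYN paragraph simply come back with an empty synonym list, which matches the "only for some words" case.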
Originally Posted by DuckieTigger
Also I do not yet quite understand what the difference between substitution and synonym is. Wouldn't it make sense to simply add synonyms to your global substitution list and let them be added at the end?
With reference to this data format:
[ word, include, synonyms, substitutions, definition ]
the difference is that synonyms are associated with the current word and extracted while parsing the definition of "word", while substitutions are pairs (word, substitute_with_some_other_word).
The net effect of synonyms is that an entry is created for "word", both in the index and in the definition files, while for each synonym only the index entry is created, pointing at the same definition of "word".
The net effect of substitutions is that an index entry is created for "word", pointing at the definition of "substitute_with_some_other_word".
The functional difference is that substitutions can be resolved only after all the definitions have actually been written to disk; that is why they are accumulated and processed together at the end. Synonyms, on the other hand, can be inserted in the index immediately.
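The two-phase behaviour described above can be sketched roughly as follows. This is only an illustration, not the code in Penelope: build_dictionary is a made-up name, a Python list stands in for the definition file, a dict stands in for the index, and I am assuming that substitutions is a list of (word, substitute_with) pairs and that include=False means "do not write a definition for this word".

```python
def build_dictionary(entries):
    """entries: iterable of [word, include, synonyms, substitutions, definition]."""
    definitions = []  # stands in for the definition file on disk
    index = {}        # headword -> position of its definition
    pending = []      # (word, substitute_with) pairs, resolved at the end

    # Phase 1: write definitions; synonyms can be indexed right away
    # because the position of the current definition is already known.
    for word, include, synonyms, substitutions, definition in entries:
        pending.extend(substitutions)
        if not include:
            continue
        position = len(definitions)
        definitions.append(definition)
        index[word] = position
        for synonym in synonyms:
            index.setdefault(synonym, position)

    # Phase 2: every definition now has a position, so substitutions
    # can be resolved. Unlike the tool described here, this sketch
    # silently skips targets that have no definition.
    for word, target in pending:
        if target in index:
            index.setdefault(word, index[target])

    return index, definitions


entries = [
    ["mice", False, [], [("mice", "mouse")], ""],
    ["mouse", True, [], [], "a small rodent"],
    ["abbazia", True, ["abbadia"], [], "abbey"],
]
index, definitions = build_dictionary(entries)
print(index["mice"] == index["mouse"])       # True
print(index["abbadia"] == index["abbazia"])  # True
```

Here "mice" is seen before "mouse" has a position, so it can only be resolved in the second phase, while "abbadia" is indexed in the same step that writes "abbazia".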
Three motivating examples for this strategy.
1) When you parse stuff like a wiktionary, where you have lots of pages in the form "mice is the plural of mouse". (Two pages: "Mice" and "Mouse")
If you don't want to create a definition of "mice", but still want the definition of "mouse" displayed, you can set up a substitution when processing "mice": the dictionary will not contain a definition for "mice", but when you select "mice" in your document you will get the definition of "mouse". Since you will encounter "mice" before "mouse", you do not yet know which position in the definition file your "mice" index record must point at. So you use a substitution in this case.
(Note that my code does not check that a definition for "Mouse" actually exists)
2) Another example occurs quite often in Italian, where adjectives have suffixes for masculine/feminine and singular/plural (amico, amica, amici, amiche are the four forms corresponding to "friend"). Usually in the dictionary you will find only the masculine singular (amico). But you might want all four forms to point at the same definition: (amico, amica, amici, amiche -> amico), without having defs for "amica", "amici", "amiche".
3) Sometimes a word has more than one spelling. Again, this is particularly true in Italian, where ancient spellings coexist with modern ones (say, "abbazia" and "abbadia" for "abbey"). Usually you will find only the modern term listed in the dictionary, but in its definition you will find something like "ANCIENT SPELLING: ...". In this case, you parse the definition of "abbazia", find out that there is also the ancient spelling "abbadia", and add it to the "abbazia" tuple as a synonym. Doing so will create two entries in the index (one for "abbazia", one for "abbadia") pointing at the same definition.