Quote:
Originally Posted by Markismus
If something is off about the dictionary, would you please let me know?
Only since you're asking. I've looked at it, and there are still a lot of tags and nonstandard entities that make it hard to read. They should all be fixable with some scripting. I started writing some sed substitutions for the xdxf, but at some point I gave up. Among the things to fix:
- HTML entities like ' &#171; &#187;, to be replaced with their Unicode equivalents;
- coded characters. The ones I identified were Greek letters, e.g. a code standing for ε. To be replaced by the Unicode character;
- phonetic/splitting patterns, like [<f>&a;&b;&è;&s;(&e;)&m;&an;</f>], to be replaced with something proper, e.g. [a-b-è-s-(e)-m-an] (be careful, because there are tricky cases w.r.t. doubles, silent vowels, pronunciation equivalents, etc.);
- non-entity splitters, like <f>&os;</f>, <f>&ns;</f>, <f>&oo;</f>. I think they separate the different meanings of a lemma, or an explanation from its example. To be replaced by a proper symbol.
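
For what it's worth, here is a rough sketch of how three of these substitutions could look in Python instead of sed. It is only an illustration, not what I actually ran: the function names and the "◆" replacement glyph are my own placeholders, and the phonetic rule is the naive version that ignores the tricky cases mentioned above. The Greek-letter codes are left out because they need a lookup table built from the dictionary itself.

```python
import re

# Placeholder glyph for the <f>&os;</f>-style splitters (an assumption;
# pick whatever symbol reads best in the final dictionary).
SPLITTER = "◆"


def decode_numeric_entities(text):
    """Turn numeric entities such as &#171; (or '&# 171;') into characters."""
    return re.sub(r"&#\s*(\d+);", lambda m: chr(int(m.group(1))), text)


def fix_phonetic(text):
    """Rewrite [<f>&a;&b;(&e;)</f>] as [a-b-(e)].

    Naive: every &...; token becomes one hyphen-separated unit, and
    anything between the tokens is dropped.
    """
    def repl(match):
        # Each token is an entity, optionally wrapped in parentheses.
        tokens = re.findall(r"\(?&[^&;]+;\)?", match.group(1))
        units = [t.replace("&", "").replace(";", "") for t in tokens]
        return "[" + "-".join(units) + "]"

    return re.sub(r"\[<f>(.*?)</f>\]", repl, text)


def fix_splitters(text):
    """Replace the standalone <f>&os;</f>, <f>&ns;</f>, <f>&oo;</f> markers."""
    return re.sub(r"<f>&(?:os|ns|oo);</f>", SPLITTER, text)


def clean_entry(text):
    # Phonetic patterns go first, so the bracketed ones are consumed
    # before the splitter rule runs.
    text = decode_numeric_entities(text)
    text = fix_phonetic(text)
    return fix_splitters(text)


if __name__ == "__main__":
    sample = "&#171;mot&#187; [<f>&a;&b;&è;&s;(&e;)&m;&an;</f>] <f>&os;</f> sens"
    print(clean_entry(sample))
```

The order of the steps matters: if the splitter rule ran on a bracketed pattern first, the phonetic rule would no longer find it.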