XDXF Manual

What is XDXF?

XDXF is a format to write dictionaries in. It is free and requires no special computing expertise. In fact, if you know how to surf the web and write a simple text document, you already have the necessary skills to start.

Why XDXF?

Printed dictionaries are great for browsing, or for looking up terms whose spelling you know exactly or whose beginning you at least know. But what if you don't know how the word starts? Perhaps you know only how it sounds. Or you're just looking for a rhyming word. Or you may have tried to find the name of a bird or plant in a foreign language by consulting a good monolingual dictionary that even listed the taxonomic names of species, only to find there was no way of searching for them. Well, not without wading through, say, half of the dictionary. But even if you know the exact spelling of the word, you may not find it. (Try to look up 'fusse' in a French dictionary. And that is an inflected form of the most frequently used verb of all: 'être', meaning 'to be'...)

Recently many printed dictionaries have been transferred to computers, and that has solved this kind of problem. You can now search the entire text of the digitized dictionaries, not only the headwords; or you can specify how the word should end; etc. Many of these digitized dictionaries, however, are essentially in plain text. Digitization often strips the text of those lexicographic and typographic cues which help the reader of printed dictionaries understand the special role each word plays. People can still recognize them fairly reliably (for example, the first word is the headword, the one inside square brackets near it is probably its phonetic transcription, etc.), but this recognition process often involves some analytic thinking and heuristics. These are of course things humans excel at (so much so that we usually don't even notice doing them), but they are notoriously difficult to teach computers. So if you ever wanted to present such a dictionary in another format and/or another medium, you would probably have to format each and every entry by hand. Every time you wanted to change something.

Clearly, this is not feasible. In order to delegate such tasks to computers, we need to give them exact, unambiguous signs as to the function of each part of the text. That is what you can do with XDXF. Once the roles are clearly marked, you can convert your dictionary into any format you fancy, be it in print or online on the web. (XDXF, by the way, stands for eXtensible Dictionary eXchange Format.) You can present it in any form any time without having to touch the dictionary file itself. You can search specific sections of the entries (e.g., synonyms or antonyms). Moreover, unlike printed bilingual dictionaries, an XDXF-based dictionary can be presented differently to native speakers of the source or target languages.

Tutorial

Nine small steps to writing the first entry of your dictionary

  1. We must clearly indicate to the computer the function of each part of the text. This we do by putting one tag at the start of the section and another at the end of it. Like so:
    <Tag>section</Tag>
    
    We just created a 'Tag' element. <Tag> is its opening tag, 'section' its content, </Tag> its closing tag.
    Element names are case sensitive.
    Elements may be nested. E.g.:
    <Tag1>section1 <Tag2>section2</Tag2> </Tag1>
    
    Here element 'Tag1' has a child element 'Tag2'. Element 'Tag1' is thus the parent of element 'Tag2'.
    A child element must be fully contained in its parent. No part of it may stick out. The following is therefore invalid:
    <Tag1>section1 <Tag2>section2</Tag1> </Tag2>
    
    This kind of markup schema is called XML (eXtensible Markup Language). XDXF is a specific XML schema.
  2. The whole dictionary itself must be enclosed:
    <xdxf>
      the dictionary
    </xdxf>
    
    The 'xdxf' element has 3 mandatory attributes: 'format', 'lang_from' and 'lang_to'. They are inserted in the opening tag as follows.
    <xdxf format="logical" lang_from="GER" lang_to="ENG">
      the dictionary
    </xdxf>
    
    XDXF comes in two formats: logical and visual. This document describes the logical format only. (The visual format is described in the Visual Format Specification.)

    'lang_from' and 'lang_to' must take values from ISO 639-2.


  3. Dictionary entries in XDXF are called articles, and they are enclosed in 'ar' tags:
    <xdxf format="logical" lang_from="GER" lang_to="ENG">
      <ar>first article</ar>
      <ar>second article</ar>
        ...
    </xdxf>
    
  4. Articles themselves consist of the headword and its definition:
      <ar>
        <head>headword</head>
        <def>definition</def>
      </ar>
    

  5. Words and phrases that can be quickly searched for are sandwiched between 'k' tags (key phrase). (Actually the full text of the dictionary can be searched, but we pick the phrases that are most likely to be searched for, and build an index out of them for quick retrieval.) Obviously, the headword is such a phrase, therefore:
      <ar>
        <head><k>headword</k></head>
        <def>definition</def>
      </ar>
    

  6. The definition may contain the phonetic transcription, grammatical properties, the description(s) or translation(s) and inflected forms (morphological derivatives) of the headword. For example:
      <ar>
        <head><k>be</k></head>
        <def>
          <tr>bɪː</tr>
          <pos>verb</pos>
          <transitivity>intransitive</transitivity>
          <dtrn>exist</dtrn>
          <m>the morphological derivatives</m>
        </def>
      </ar>
    

    See the reference for more grammatical properties and their descriptions.
  7. The 'm' elements (morphological derivative) contain the derived forms with their grammatical properties. For example:
      <ar>
        <head><k>be</k></head>
        <def>
          <tr>bɪː</tr>
          <pos>verb</pos>
          <transitivity>intransitive</transitivity>
          <dtrn>exist</dtrn>
          <m>
            <tense>present</tense>
            <mood>indicative</mood>
            <number>plural</number>
            <k>are</k>
          </m>
          <m>
            <tense>past</tense>
            <mood>subjunctive</mood>
            <k>were</k>
          </m>
        </def>
      </ar>
    
    Note that we tagged the derived forms (are, were) as key phrases (k) because we want a search for them to find this article as well.
  8. A headword can have more than one definition. The 'ar' element may have only one 'def' child, but we may put more 'def' elements inside that single 'def':
      <ar>
        <head><k>be</k></head>
        <def>
          <tr>bɪː</tr>
          <def l="1">
            <pos>verb</pos>
            <transitivity>intransitive</transitivity>
            <def l="a">
              <dtrn>exist</dtrn>
              <ex>I think therefore I am.</ex>
            </def>
            <def l="b">
              <dtrn>equal</dtrn>
              <ex>Seeing is believing.</ex>
            </def>
            <def l="c">
              <dtrn>has the given property</dtrn>
            </def>
          </def>
          <def l="2">
            <pos>auxiliary verb</pos>
            <def l="a">
              <dtrn type="explanation">passive voice auxiliary</dtrn>
              <ex>the die is cast</ex>
            </def>
            <def l="b">
              <dtrn type="explanation">continuous tense auxiliary</dtrn>
              <ex>she is writing</ex>
            </def>
          </def>
          <m>
            <tense>present</tense>
            <mood>indicative</mood>
            <number>plural</number>
            <k>are</k>
          </m>
          <m>
            <tense>past</tense>
            <mood>subjunctive</mood>
            <k>were</k>
          </m>
        </def>
      </ar>
    
    The 'def' element now has two 'def' children, which in turn have a few 'def' children of their own. To make the grouping of definitions clear, we added 'l' (label) attributes to the 'def' elements.
    Descendants inherit the properties of their ancestors, e.g., be in the sense of 'exist' is an intransitive verb pronounced [bɪː].
    The 'ex' element is used for examples.
    The 'dtrn' elements inside the auxiliary verb definition have a 'type' attribute, whose value indicates that it is an explanation rather than a synonym or equivalent phrase.

    Let's have a look at what we have so far. Note that this is just one possible rendering of this article. (If you don't like it, you will find out later how to change it or write your own rendering transformations.)
  9. 'm' elements can also be nested. This comes in handy when we want some property (mood, number, etc.) to apply to a whole group of morphological derivatives:
      <ar>
        <head><k>be</k></head>
        <def>
          <tr>bɪː</tr>
          <def l="1">
            <pos>verb</pos>
            <transitivity>intransitive</transitivity>
            <def l="a">
              <dtrn>exist</dtrn>
              <ex>I think therefore I am.</ex>
            </def>
            <def l="b">
              <dtrn>equal</dtrn>
              <ex>Seeing is believing.</ex>
            </def>
            <def l="c">
              <dtrn>has the given property</dtrn>
            </def>
          </def>
          <def l="2">
            <pos>auxiliary verb</pos>
            <def l="a">
              <dtrn type="explanation">passive voice auxiliary</dtrn>
              <ex>the die is cast</ex>
            </def>
            <def l="b">
              <dtrn type="explanation">continuous tense auxiliary</dtrn>
              <ex>she is writing</ex>
            </def>
          </def>
          <m>
            <tense>present</tense>
            <mood>indicative</mood>
            <m>
              <number>singular</number>
              <m><person>1</person><k>am</k></m>
              <m><person>2</person><k>are</k></m>
              <m><person>3</person><k>is</k></m>
            </m>
            <m>
              <number>plural</number>
              <m><person>1</person><k>are</k></m>
              <m><person>2</person><k>are</k></m>
              <m><person>3</person><k>are</k></m>
            </m>
          </m>
          <m>
            <tense>past</tense>
            <mood>subjunctive</mood>
            <k>were</k>
          </m>
        </def>
      </ar>
    

    Its rendering is here.
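
The quick-retrieval index mentioned in step 5 can be sketched in a few lines of Python. This is only an illustration of the idea, not part of XDXF: it assumes the standard-library xml.etree.ElementTree module, and the function name build_key_index and the shortened sample dictionary are made up for this example.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# A shortened sample dictionary (illustrative only, not a complete XDXF file).
SAMPLE = """<xdxf format="logical" lang_from="ENG" lang_to="ENG">
  <ar>
    <head><k>be</k></head>
    <def>
      <dtrn>exist</dtrn>
      <m><number>plural</number><k>are</k></m>
      <m><tense>past</tense><k>were</k></m>
    </def>
  </ar>
</xdxf>"""


def build_key_index(xdxf_text):
    """Map each key phrase ('k' content) to the headwords of the articles containing it."""
    root = ET.fromstring(xdxf_text)
    index = defaultdict(set)
    for ar in root.iter("ar"):
        headword = ar.findtext("head/k", default="").strip()
        # iter() visits 'k' elements at any depth, including those nested in 'm'.
        for k in ar.iter("k"):
            phrase = "".join(k.itertext()).strip()
            if phrase:
                index[phrase].add(headword)
    return index


index = build_key_index(SAMPLE)
print(sorted(index))    # ['are', 'be', 'were']
print(index["were"])    # {'be'}
```

Because the derived forms are tagged with 'k', looking up "were" leads back to the article for "be".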

Perhaps now is the time to try and write an article of your choice. (Writing XML files may be a bit tedious, but a good XML editor can make life a lot easier.)
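
Once you have typed an article, a quick well-formedness check can catch mis-nested tags such as the invalid example in step 1. The sketch below assumes Python's standard xml.etree.ElementTree module; the helper name is_well_formed is made up for this example. It only checks XML syntax and does not validate against the XDXF schema.

```python
import xml.etree.ElementTree as ET


def is_well_formed(xml_text):
    """Return True if xml_text parses as XML, False otherwise."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False


# The two nesting examples from step 1 of the tutorial.
good = "<Tag1>section1 <Tag2>section2</Tag2> </Tag1>"
bad = "<Tag1>section1 <Tag2>section2</Tag1> </Tag2>"

print(is_well_formed(good))  # True
print(is_well_formed(bad))   # False
```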

Rendering

The easiest way to convert your XDXF article into a human-readable format is with XSLT. The examples above also use XSLT to transform XML into HTML. (HTML is the mark-up language used by web browsers. HTML can also be regarded as a special XML schema, in which the tags describe how to format their text content, whereas XDXF tags describe its function.) The transformation is actually done by the web browser itself.

Every XML document must start with an XML declaration:

<?xml version="1.0" encoding="UTF-8" ?>

Then in order for the XML-to-HTML transformation to take place, a processing instruction must follow:

<?xml version="1.0" encoding="UTF-8" ?>
<?xml-stylesheet type="text/xsl" href="XDXF-draft-logical-05-to-html.xsl"?>

So wrap your article like this:

<?xml version="1.0" encoding="UTF-8" ?>
<?xml-stylesheet type="text/xsl" href="XDXF-draft-logical-05-to-html.xsl"?>

<xdxf format="logical" lang_from="GER" lang_to="ENG">
  <ar>
    your article
  </ar>
</xdxf>

and open this file in your browser.
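
If you prefer to generate this wrapper rather than type it, the following Python sketch shows one way to do it. It assumes the standard-library xml.etree.ElementTree module; the function name wrap_article and the sample article are made up for this example, and the stylesheet name is simply the one used above.

```python
import xml.etree.ElementTree as ET

# Stylesheet name taken from the example above; adjust it to the file you use.
STYLESHEET = "XDXF-draft-logical-05-to-html.xsl"


def wrap_article(article_xml, lang_from="GER", lang_to="ENG", stylesheet=STYLESHEET):
    """Return a complete XDXF document string around one 'ar' element."""
    return (
        '<?xml version="1.0" encoding="UTF-8" ?>\n'
        f'<?xml-stylesheet type="text/xsl" href="{stylesheet}"?>\n'
        f'<xdxf format="logical" lang_from="{lang_from}" lang_to="{lang_to}">\n'
        f"{article_xml}\n"
        "</xdxf>\n"
    )


article = "<ar><head><k>Auto</k></head><def><dtrn>car</dtrn></def></ar>"
document = wrap_article(article)

# Parsing the encoded bytes confirms the result is well-formed XML.
root = ET.fromstring(document.encode("utf-8"))
print(root.tag, root.get("lang_from"), root.get("lang_to"))  # xdxf GER ENG
```

Save the resulting string as a UTF-8 file next to the stylesheet to view it in a browser.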

The XSLT file used here is provided as an example only. You can change it any way you like, or write your own. (Please share it with other XDXF users if you do so.) The transformations are not part of the XDXF standard, which is described by the XDXF schema.

XSLT is a convenient way of transforming XDXF (or, in general, any XML) into HTML. A related formatting language called XSL-FO can be used to produce PDF documents, for example.

Spicing it up

Decent dictionaries often contain information on etymology, usage, synonyms, antonyms, collocations, etc.

Below is an example of an etymology section.

<ar>
  <head><k>Auto</k></head>
  <def>
    <pos>noun</pos>
    <gender>neuter</gender>
    <def l="">
        <dtrn>automobile</dtrn> <dtrn>car</dtrn>
    </def>
    <etym>
      <from>
        <date>1861</date>
        <lang>French</lang>
        <k>automobile</k>
        <dtrn>automobile</dtrn>
        <from>
          <lang>Ancient Greek</lang>
          <k>autos</k><tr s="Greek"><k>αὐτός</k></tr>
          <dtrn>self</dtrn> <dtrn>same</dtrn>
        </from>
        <from>
          <lang>French</lang>
          <k>mobile</k>
          <dtrn>moving</dtrn>
          <from>
            <lang>Latin</lang>
            <k>mobilis</k>
            <dtrn>movable</dtrn>
          </from>
        </from>
      </from>
    </etym>
  </def>
</ar>

It may be rendered like this.

The 'etym' element contains the etymology section. The ancestor word is enclosed in a 'from' element. Note that 'from' elements may be nested, giving rise to a whole "genealogy" tree of the word. The 'from' element may contain the language (lang) of the ancestor word, the date of its first known occurrence (date), its description or translation (dtrn), and its own ancestor (from).
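
Because 'from' elements nest, a renderer can walk them recursively. The Python sketch below prints the genealogy of a shortened version of the 'Auto' etymology as an indented tree. It assumes the standard-library xml.etree.ElementTree module; the function name describe and the output layout are made up for this example.

```python
import xml.etree.ElementTree as ET

# Shortened version of the etymology section shown above.
ETYM = """<etym>
  <from>
    <date>1861</date>
    <lang>French</lang>
    <k>automobile</k>
    <from>
      <lang>Ancient Greek</lang>
      <k>autos</k>
      <dtrn>self</dtrn>
    </from>
    <from>
      <lang>French</lang>
      <k>mobile</k>
      <from>
        <lang>Latin</lang>
        <k>mobilis</k>
      </from>
    </from>
  </from>
</etym>"""


def describe(from_el, depth=0):
    """Return indented lines for one 'from' element and its nested ancestors."""
    lang = from_el.findtext("lang", default="?")
    word = from_el.findtext("k", default="?")
    date = from_el.findtext("date")
    line = "  " * depth + f"{lang}: {word}" + (f" ({date})" if date else "")
    lines = [line]
    # findall() returns direct children only, so each level is visited once.
    for ancestor in from_el.findall("from"):
        lines.extend(describe(ancestor, depth + 1))
    return lines


etym = ET.fromstring(ETYM)
lines = [line for f in etym.findall("from") for line in describe(f)]
print("\n".join(lines))
```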

The 'tr' (transcription) element may have an 's' attribute specifying the script. Phonetic transcriptions may be marked by

<tr s="IPA">fəˈnetɪk ˈskrɪpt</tr>

if the International Phonetic Alphabet is used.

The next example illustrates the proverb section.

<ar>
  <head><k>Mutter</k></head>
  <def>
    <pos>noun</pos>
    <gender>feminine</gender>
    <def l="1">
      <def l="a">
        <dtrn>mother</dtrn>
        <proverb>
          <proverb>
            <k>Mit der Mutter soll beginnen, wer die Tochter will gewinnen.</k>
            <def><dtrn>He that would the daughter win, must with the mother first begin. </dtrn></def>
          </proverb>
          <proverb>
            <k>Einer liebt die Mutter, der andere die Tochter.</k>
            <def>
              <dtrn type="literal">One loves the mother, the other the daughter.</dtrn>
              <dtrn type="explanation">There's no accounting for taste.</dtrn>
            </def>
          </proverb>
        </proverb>
      </def>
      <def l="b">
        <dtrn>source</dtrn> <dtrn>origin</dtrn>
      </def>
      <m>
        <case>nominative</case>
        <number>plural</number>
        <k>Mütter</k>
      </m>
    </def>
    <def l="2">
      <field>engineering</field>
      <dtrn>nut<co>of a bolt or screw</co></dtrn>
      <m>
        <case>nominative</case>
        <number>plural</number>
        <k>Muttern</k>
      </m>
    </def>
  </def>
</ar>

(See it rendered.)

A 'def' element may have only one 'proverb' child, but 'proverb' elements may contain either more 'proverb' elements or a 'k' (key phrase) and a 'def' (definition). 'usage' and 'colloc' (collocation) elements also work this way.

The 'co' element is used for comments.

The article above obviously belongs in a German-English dictionary, which means that both German and English speakers may want to read it. As it stands, however, it is better suited to English speakers, because the additional information (comment, section name, grammatical properties) is presented in English. To make it truly bilingual, we can include this information in German as well, but then we must indicate the language whose speakers it is meant for. We do this with the 'lang_user' attribute:

<ar>
  <head><k>Mutter</k></head>
  <def>
    <pos lang_user="ENG">noun</pos>
    <pos lang_user="GER">Substantiv</pos>
    <gender lang_user="ENG">feminine</gender>
    <gender lang_user="GER">weiblich</gender>
    <def l="1">
	<def l="a">
	  <dtrn>mother</dtrn>
	  <proverb>
	    <proverb>
	      <k>Mit der Mutter soll beginnen, wer die Tochter will gewinnen.</k>
	      <def><dtrn>He that would the daughter win, must with the mother first begin. </dtrn></def>
	    </proverb>
	    <proverb>
	      <k>Einer liebt die Mutter, der andere die Tochter.</k>
	      <def>
		<dtrn type="literal">One loves the mother, the other the daughter.</dtrn>
		<dtrn type="explanation">There's no accounting for taste.</dtrn>
	      </def>
	    </proverb>
	  </proverb>
	</def>
	<def l="b">
	  <dtrn>source</dtrn> <dtrn>origin</dtrn>
	</def>
	<m>
	  <case lang_user="ENG">nominative</case>
	  <case lang_user="GER">Nominativ</case>
	  <number lang_user="ENG">plural</number>
	  <number lang_user="GER">Plural</number>
	  <k>Mütter</k>
	</m>
    </def>
    <def l="2">
	<field lang_user="ENG">engineering</field>
	<field lang_user="GER">Technik</field>
	<co lang_user="GER">von Schraube</co>
	<dtrn>nut<co lang_user="ENG">of a bolt or screw</co></dtrn>
	<m>
	  <case lang_user="ENG">nominative</case>
	  <case lang_user="GER">Nominativ</case>
	  <number lang_user="ENG">plural</number>
	  <number lang_user="GER">Plural</number>
	  <k>Muttern</k>
	</m>
    </def>
  </def>
</ar>

The 'lang_user' attribute may take only the values of the 'xdxf' element's 'lang_to' and 'lang_from' attributes.
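
A renderer can use 'lang_user' to pick what to show to a given reader. The Python sketch below keeps the children that are either untagged or tagged for the reader's language. It assumes the standard-library xml.etree.ElementTree module; the function name for_user and the shortened fragment are made up for this example.

```python
import xml.etree.ElementTree as ET

# Shortened fragment of the bilingual 'Mutter' article.
FRAGMENT = """<def>
  <pos lang_user="ENG">noun</pos>
  <pos lang_user="GER">Substantiv</pos>
  <gender lang_user="ENG">feminine</gender>
  <gender lang_user="GER">weiblich</gender>
  <dtrn>mother</dtrn>
</def>"""


def for_user(element, user_lang):
    """Return the children that have no 'lang_user' or whose 'lang_user' matches."""
    return [
        child
        for child in element
        if child.get("lang_user") in (None, user_lang)
    ]


definition = ET.fromstring(FRAGMENT)
print([c.text for c in for_user(definition, "GER")])  # ['Substantiv', 'weiblich', 'mother']
print([c.text for c in for_user(definition, "ENG")])  # ['noun', 'feminine', 'mother']
```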

Look at its rendering. Click on the Toggle native language button to see those terms alternate between English and German. (Make sure JavaScript is enabled in your browser.)

The contents of those tags which have the 'lang_user' attribute can now be presented in either language. The names that come from a tag name, attribute name or attribute value (e.g., 'proverb' and 'literal' in the above example), however, are still stuck in English, the language of the XDXF tags. In order to be able to present those too in a language other than English, we need to create a small meta-dictionary. It is called 'representations' and must be the child of the 'xdxf' element, placed before the first 'ar':

<xdxf format="logical" lang_from="GER" lang_to="ENG">

  <representations>
    <represent token="proverb"     attribute_of=""     value_of=""     lang_user="ENG">Proverb</represent>
    <represent token="proverb"     attribute_of=""     value_of=""     lang_user="GER">Sprichwort</represent>
    <represent token="literal"     attribute_of="dtrn" value_of="type" lang_user="ENG">literally</represent>
    <represent token="literal"     attribute_of="dtrn" value_of="type" lang_user="GER">wörtlich</represent>
    <represent token="explanation" attribute_of="dtrn" value_of="type" lang_user="ENG">explanation</represent>
    <represent token="explanation" attribute_of="dtrn" value_of="type" lang_user="GER">Erläuterung</represent>
  </representations>

  <ar>
    <head><k>Mutter</k></head>
    <def>
      <pos lang_user="ENG">noun</pos>
      <pos lang_user="GER">Substantiv</pos>
      <gender lang_user="ENG">feminine</gender>
      <gender lang_user="GER">weiblich</gender>
      <def l="1">
  	<def l="a">
  	  <dtrn>mother</dtrn>
  	  <proverb>
  	    <proverb>
  	      <k>Mit der Mutter soll beginnen, wer die Tochter will gewinnen.</k>
  	      <def><dtrn>He that would the daughter win, must with the mother first begin. </dtrn></def>
  	    </proverb>
  	    <proverb>
  	      <k>Einer liebt die Mutter, der andere die Tochter.</k>
  	      <def>
  		<dtrn type="literal">One loves the mother, the other the daughter.</dtrn>
  		<dtrn type="explanation">There's no accounting for taste.</dtrn>
  	      </def>
  	    </proverb>
  	  </proverb>
  	</def>
  	<def l="b">
  	  <dtrn>source</dtrn> <dtrn>origin</dtrn>
  	</def>
  	<m>
  	  <case lang_user="ENG">nominative</case>
  	  <case lang_user="GER">Nominativ</case>
  	  <number lang_user="ENG">plural</number>
  	  <number lang_user="GER">Plural</number>
  	  <k>Mütter</k>
  	</m>
      </def>
      <def l="2">
  	<field lang_user="ENG">engineering</field>
  	<field lang_user="GER">Technik</field>
  	<co lang_user="GER">von Schraube</co>
  	<dtrn>nut<co lang_user="ENG">of a bolt or screw</co></dtrn>
  	<m>
  	  <case lang_user="ENG">nominative</case>
  	  <case lang_user="GER">Nominativ</case>
  	  <number lang_user="ENG">plural</number>
  	  <number lang_user="GER">Plural</number>
  	  <k>Muttern</k>
  	</m>
      </def>
    </def>
  </ar>

</xdxf>

The 'representations' element may contain only 'represent' elements, which specify how to represent the token given as its 'token' attribute. The token may be a tag name, an attribute name or an attribute value.
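
How a renderer might consult this meta-dictionary can be sketched in Python. The example assumes the standard-library xml.etree.ElementTree module; the function names load_representations and label are made up, and as a simplification the lookup is keyed on the token and 'lang_user' only, ignoring 'attribute_of' and 'value_of'.

```python
import xml.etree.ElementTree as ET

# Shortened copy of the 'representations' element shown above.
REPRESENTATIONS = """<representations>
  <represent token="proverb" attribute_of="" value_of="" lang_user="ENG">Proverb</represent>
  <represent token="proverb" attribute_of="" value_of="" lang_user="GER">Sprichwort</represent>
  <represent token="literal" attribute_of="dtrn" value_of="type" lang_user="ENG">literally</represent>
  <represent token="literal" attribute_of="dtrn" value_of="type" lang_user="GER">wörtlich</represent>
</representations>"""


def load_representations(xml_text):
    """Build a {(token, lang_user): text} table from 'represent' elements."""
    table = {}
    for rep in ET.fromstring(xml_text).iter("represent"):
        table[(rep.get("token"), rep.get("lang_user"))] = rep.text or ""
    return table


def label(table, token, user_lang):
    """Return the representation of token, or the raw token if none is defined."""
    return table.get((token, user_lang), token)


table = load_representations(REPRESENTATIONS)
print(label(table, "proverb", "GER"))  # Sprichwort
print(label(table, "literal", "ENG"))  # literally
print(label(table, "usage", "GER"))    # usage (no representation defined)
```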

Check how it works now.

Note that although the 'represent' elements for explanation contain "explanation" and "Erläuterung", no text is displayed for explanation when the article is rendered graphically. This is because the rendering XSLT transformation used in these examples does not write any text for it, but rather encloses its content in angle brackets. We should nevertheless list explanation among the representations because another renderer may want to display the text rather than use delimiters. The idea in XDXF is that the dictionary should contain all information that could possibly be needed, and it is the renderer that decides which part of it to use or display.

Metadata

In addition to its data, a self-contained dictionary should include some information about itself as well.

The name and description can be included in the 'full_name' and 'description' elements. These should come before the first 'ar' (article) element.

Also, if abbreviations are used, they too must be defined. This can be done in the 'abr_def' (abbreviation definition) element. Inside it, the abbreviation itself should be in a 'k' (key phrase), its definition in a 'v' (value) element. (The 'k' tag ensures that a search for the abbreviation will deliver the abbreviation definitions.)

This small dictionary-in-a-dictionary must be enclosed by 'abbreviations' tags, and should also come before the first 'ar' (article) element.

Within the dictionary entries ('ar'), the abbreviations are indicated by 'abr' tags. 'represent' elements can also contain 'abr'.

<full_name>Example</full_name>

<description>
  Just an example showing how to include metadata in the dictionary.
</description>

<abbreviations>
  <abr_def><k>n.</k><v>noun</v></abr_def>
  <abr_def><k>f.</k><v>feminine</v></abr_def>
  <abr_def><k>eng.</k><v>engineering</v></abr_def>
</abbreviations>

<ar>
  <head><k>Mutter</k></head>
  <def>
    <pos><abr>n.</abr></pos>
    <gender><abr>f.</abr></gender>
    <def l="1">
      <def l="a">
        <dtrn>mother</dtrn>
      </def>
      <def l="b">
        <dtrn>source</dtrn> <dtrn>origin</dtrn>
      </def>
    </def>
    <def l="2">
      <field>
        <abr>eng.</abr>
      </field>
      <dtrn>nut<co>of a bolt or screw</co></dtrn>
    </def>
  </def>
</ar>
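
A renderer can expand the abbreviations by looking them up in the 'abbreviations' section. The Python sketch below does this for a shortened version of the example, wrapped in an 'xdxf' element so that it parses as a single document. It assumes the standard-library xml.etree.ElementTree module; the variable names are made up for this example.

```python
import xml.etree.ElementTree as ET

# Shortened version of the metadata example, wrapped in an 'xdxf' element.
DICTIONARY = """<xdxf format="logical" lang_from="GER" lang_to="ENG">
  <abbreviations>
    <abr_def><k>n.</k><v>noun</v></abr_def>
    <abr_def><k>f.</k><v>feminine</v></abr_def>
  </abbreviations>
  <ar>
    <head><k>Mutter</k></head>
    <def>
      <pos><abr>n.</abr></pos>
      <gender><abr>f.</abr></gender>
      <dtrn>mother</dtrn>
    </def>
  </ar>
</xdxf>"""

root = ET.fromstring(DICTIONARY)

# Map each abbreviation ('k') to its definition ('v').
abbreviations = {
    d.findtext("k"): d.findtext("v")
    for d in root.iter("abr_def")
}

# Expand every 'abr' found inside the articles; unknown ones are left as they are.
expanded = [
    abbreviations.get(abr.text, abr.text)
    for ar in root.iter("ar")
    for abr in ar.iter("abr")
]
print(expanded)  # ['noun', 'feminine']
```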

XML editors

XML has been designed to be a trade-off between the exactness demanded by computers and human readability. If you have ever tried to edit an XML document, you may be forgiven for thinking that computers have got a better deal. You can nevertheless improve your position by choosing an XML-aware smart editor.

We recommend Emacs or jEdit (both free), or <oXygen/> (commercial).

Emacs

Emacs is a very versatile editor. It has more than one package for XML; of those, we recommend nXML. In nXML mode your document is validated in real time as you edit it, which helps ensure that you produce a syntactically and semantically correct XML document. (To switch to nXML mode, type M-x nxml-mode, where M-x stands for Alt and x pressed together, or Esc followed by x.)

For semantic validation you can use the Relax NG Compact format, which is a way to describe the schema, i.e., a collection of rules such as "an 'ar' element must have exactly one 'head' element, exactly one 'def', and zero or more 'm' child elements". It is indeed very compact and easy to learn. (You can, for example, easily customize the Relax NG Compact file (extension rnc) that came with this package to enumerate the possible contents of elements like 'mood', 'gender', etc., thus avoiding typos.) You can specify the Relax NG Compact file in nXML mode from the menu bar: XML > Set Schema > File...

Emacs is available as source code or MS Windows executable. Version 23 is recommended for its improved Unicode support. At the time of writing this, version 23 has not yet been released (it is in alpha status), but it is available as source code, and it appears stable.

jEdit

jEdit is written in Java and therefore runs on any platform that has a Java runtime. Like Emacs, it supports real-time XML validation. It validates not against Relax NG Compact but against XML Schema (another way to describe the schema; extension xsd).

Unlike Relax NG files, XML Schema files must be specified in the document itself, by adding the following two attributes to the 'xdxf' element:
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:noNamespaceSchemaLocation="your_XML_Schema_file.xsd"
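
If you would rather add these two attributes programmatically, the following Python sketch shows one way. It assumes the standard-library xml.etree.ElementTree module; the schema file name is the placeholder used above, and the bare 'xdxf' element stands in for a real dictionary.

```python
import xml.etree.ElementTree as ET

XSI = "http://www.w3.org/2001/XMLSchema-instance"

# Ask ElementTree to serialize this namespace with the conventional 'xsi' prefix.
ET.register_namespace("xsi", XSI)

root = ET.fromstring('<xdxf format="logical" lang_from="GER" lang_to="ENG"/>')

# ElementTree addresses namespaced attributes as '{namespace-uri}local-name'.
root.set(f"{{{XSI}}}noNamespaceSchemaLocation", "your_XML_Schema_file.xsd")

serialized = ET.tostring(root, encoding="unicode")
print(serialized)
```

The xmlns:xsi declaration is written out automatically when the element is serialized.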

If you want to customize the XML Schema, you can edit the xsd file itself, or you may want to edit the much more concise and transparent Relax NG Compact (rnc) file and convert it to XML Schema using trang. E.g.:

java -jar /usr/share/trang/lib/trang.jar your_RelaxNG_Compact_file.rnc your_XML_Schema_file.xsd

<oXygen/>

<oXygen/> is an XML editor for professionals. It has many fancy features that a novice XML user may not need, but the advanced user will greatly appreciate. Consider buying it if you want to write your own XSLT transformations.

Unicode

Unicode is a character encoding standard whose character set aims to cover the writing systems of all human languages.

Before Unicode you could write a document, for example, in Japanese and include English or Latin text in it, but including German was already problematic, Arabic or Armenian impossible. With Unicode, these barriers have been eliminated: you can freely mix any scripts in one single document.
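
A small Python sketch illustrates the point. The sample string, made up for this example, mixes Japanese, German, Arabic, Armenian and English words for "dictionary"; it survives a round trip through UTF-8, one of the Unicode encodings, whereas an older single-script encoding such as Latin-1 cannot represent it.

```python
# "Dictionary" in Japanese, German, Arabic, Armenian and English.
text = "辞書 Wörterbuch قاموس բառարան dictionary"

# UTF-8 can encode every character, and decoding restores the original text.
data = text.encode("utf-8")
round_trip_ok = data.decode("utf-8") == text
print(round_trip_ok)  # True

# Latin-1 covers only Western European characters, so encoding fails.
try:
    text.encode("latin-1")
    latin1_ok = True
except UnicodeEncodeError:
    latin1_ok = False
print(latin1_ok)  # False
```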

Unicode is meant to eventually replace all other (mutually incompatible) character encoding schemes. It is therefore highly recommended that you use Unicode for XDXF even if you could do without it.

Reference

The XDXF standard is described by the XDXF Schema.