Proposal: Extending Epub with reference book tags


Nate the great
05-20-2009, 04:06 PM
I make my own reference books in Mobipocket, and I would like to do the same in Epub. Unfortunately, the needed tags don't exist yet in the Epub spec.

I'm going to get the ball rolling by listing the three details I've noticed about Mobipocket dictionaries (if I missed one please point it out):

1, an entry in an ebook's metadata that indicates it's a dictionary (necessary?);

2, two more entries in the OPF that indicate the input and output languages;

3, the set of tags in the content that define the parts of a data entry (idx:entry, idx:orth, idx:key, idx:short, idx:gramgrp, idx:subentry, idx:string, idx:ext-subentry). You can find more about them here (http://www.mobipocket.com/dev/article.asp?BaseFolder=prcgen&File=tagref_mobi.xml).


I don't think all of the tags are necessary. Here is what I would like to propose as a starting point. Also, I'm going to be shameless and simply copy the function and attributes of the existing Mobipocket tags. I've changed some of the names so they are easier to understand.
<idpf:entry> </idpf:entry> (http://www.mobipocket.com/dev/article.asp?BaseFolder=prcgen&File=tagref_mobi.xml#anchor_idx:entry)
<idpf:title> </idpf:title> (http://www.mobipocket.com/dev/article.asp?BaseFolder=prcgen&File=tagref_mobi.xml#anchor_idx:orth)
<idpf:keyword> </idpf:keyword> (http://www.mobipocket.com/dev/article.asp?BaseFolder=prcgen&File=tagref_mobi.xml#anchor_idx:key)
<idpf:stub> </idpf:stub> (http://www.mobipocket.com/dev/article.asp?BaseFolder=prcgen&File=tagref_mobi.xml#anchor_idx:short)
<idpf:subentry> </idpf:subentry> (http://www.mobipocket.com/dev/article.asp?BaseFolder=prcgen&File=tagref_mobi.xml#anchor_idx:subentry)
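
The renaming proposed above is a one-to-one mapping from the Mobipocket names, so it can be written down as a small lookup table. Here is a minimal Python sketch, assuming the tags appear in the markup as literal `idx:`-prefixed names. The `rename_tags` helper is hypothetical and not part of any existing tool:

```python
import re

# Mapping from Mobipocket tag names to the names proposed in this post.
IDX_TO_IDPF = {
    "idx:entry": "idpf:entry",
    "idx:orth": "idpf:title",
    "idx:key": "idpf:keyword",
    "idx:short": "idpf:stub",
    "idx:subentry": "idpf:subentry",
}

# Matches a tag name directly after "<" or "</".
_TAG = re.compile(r"(</?)(idx:[A-Za-z-]+)")


def rename_tags(markup):
    """Rename the idx: tags listed above; leave any other idx: tag untouched."""
    def swap(match):
        name = match.group(2)
        return match.group(1) + IDX_TO_IDPF.get(name, name)
    return _TAG.sub(swap, markup)
```

Tags left out of the proposal, such as idx:gramgrp, pass through unchanged, so a converted file shows at a glance what the shorter tag set does not cover.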


So, what do you think?

DaleDe
05-20-2009, 05:12 PM
What was wrong with idx (index) for these entries? It would make more sense than idpf, I believe.

Dale

wallcraft
05-20-2009, 05:42 PM
Why not use an existing XML-based dictionary format? Perhaps XDXF (http://xdxf.revdanica.com/). The ePub standard has the concept of XML in-line islands, but I don't have a clear idea of what it takes to produce a dictionary that is also a valid ePub document.

DaleDe
05-20-2009, 05:57 PM
Why not use an existing XML-based dictionary format? Perhaps XDXF (http://xdxf.revdanica.com/). The ePub standard has the concept of XML in-line islands, but I don't have a clear idea of what it takes to produce a dictionary that is also a valid ePub document.

A quick look shows that it is mostly compatible with XHTML already. See http://xdxf.revdanica.com/drafts/visual/latest/XDXF-draft-028.txt

It has the advantage of already having dictionaries available. Perhaps there isn't much work to do to make everything work together.

Dale

Nate the great
05-20-2009, 07:15 PM
What was wrong with idx (index) for these entries? It would make more sense than idpf, I believe.

Dale

Ah, is that what "idx" stands for? That makes sense. I'd prefer to use something that will add meaning and be easier to identify. I'm using "idpf" as a placeholder.

Nate the great
05-20-2009, 10:45 PM
Why not use an existing XML-based dictionary format? Perhaps XDXF (http://xdxf.revdanica.com/). The ePub standard has the concept of XML in-line islands, but I don't have a clear idea of what it takes to produce a dictionary that is also a valid ePub document.

Interesting.

One problem is that XFDX isn't a standard yet. It's still in draft form. How can we adhere to something that will change in the near future? Also, I don't like that an article is identified by <ar> tag, and a keyword is identified by <k> tag. I'd prefer to spell out the whole word so it can be read easier.

Here (http://www.w3schools.com/Xml/default.asp) is a pretty good source of information on XML.

Nate the great
05-21-2009, 11:55 AM
One thing I forgot to add last night was that while I don't want to adopt XFXD, I think that it's a good source of ideas. My original goal for this project was to add the tags I wanted to use right now. I've since realized that it might be better to include a larger set of tags so they can be used for more purposes. This lessens the chance that the extension will need to be revised in the future.

DaleDe
05-21-2009, 12:02 PM
Interesting.

One problem is that XFDX isn't a standard yet. It's still in draft form. How can we adhere to something that will change in the near future? Also, I don't like that an article is identified by <ar> tag, and a keyword is identified by <k> tag. I'd prefer to spell out the whole word so it can be read easier.

Here (http://www.w3schools.com/Xml/default.asp) is a pretty good source of information on XML.

I agree with meaningful tags, but short is preferred if you have to type them in. Perhaps you should give input to XFDX; after all, it is still in draft form.

Dale

GeoffC
05-21-2009, 12:21 PM
I don't know enough about how mobi/ePub work, nor how the dictionary function in mobi works - but grateful that someone is taking this on-board, if not for now then for the future.

Thanks....

Nate the great
05-21-2009, 10:37 PM
<idpf:article> </idpf:article> - required; root
<idpf:title> </idpf:title> - required; must be first in article element
<idpf:keyword> </idpf:keyword> - optional; can be nested inside any tag
<idpf:stub> </idpf:stub> - optional; part of article shown in a pop up window; can be anywhere
<idpf:entry> </idpf:entry> - required; has optional name attribute
<idpf:subentry> </idpf:subentry> - optional; has optional name attribute; must be inside entry or subentry
<idpf:data> </idpf:data> - required; has optional name attribute; has required type attribute: number, text, image, link, graph, (table?); must be inside entry or subentry

I wrote them with this (https://www.cia.gov/library/publications/the-world-factbook/geos/af.html) page in mind. It's not the most complex article; but it's up there. Note: XML is only for the data, not the formatting.
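
To make the nesting rules in the list above concrete, here is a rough Python sketch that checks a few of them with the standard library's ElementTree. The namespace URI and the sample content are placeholders, not part of the proposal. "Required" for entry is read here as "at least one per article", which is only one interpretation, and the tentative "table" type is left out:

```python
import xml.etree.ElementTree as ET

# Placeholder namespace URI: an assumption, since no URI was proposed in the thread.
NS = "http://example.org/idpf-reference"
DATA_TYPES = {"number", "text", "image", "link", "graph"}

# Made-up sample content, only meant to show the nesting.
SAMPLE = """<idpf:article xmlns:idpf="http://example.org/idpf-reference">
  <idpf:title>Example Country</idpf:title>
  <idpf:entry name="Geography">
    <idpf:subentry name="Area">
      <idpf:data name="total" type="number">12345</idpf:data>
    </idpf:subentry>
  </idpf:entry>
</idpf:article>"""


def q(name):
    """Return the namespace-qualified tag name ElementTree uses internally."""
    return "{%s}%s" % (NS, name)


def check_article(xml_text):
    """Return a list of rule violations; an empty list means none were found."""
    problems = []
    root = ET.fromstring(xml_text)
    if root.tag != q("article"):
        problems.append("root element must be idpf:article")
    children = list(root)
    if not children or children[0].tag != q("title"):
        problems.append("idpf:title must be the first child of the article")
    if next(root.iter(q("entry")), None) is None:
        problems.append("at least one idpf:entry is required")
    # ElementTree has no parent pointers, so build a child -> parent map.
    parents = {child: parent for parent in root.iter() for child in parent}
    for element in root.iter():
        if element.tag in (q("subentry"), q("data")):
            parent = parents.get(element)
            if parent is None or parent.tag not in (q("entry"), q("subentry")):
                problems.append("%s must be inside an entry or subentry" % element.tag)
        if element.tag == q("data") and element.get("type") not in DATA_TYPES:
            problems.append("idpf:data needs a type from %s" % sorted(DATA_TYPES))
    return problems
```

Calling `check_article(SAMPLE)` returns an empty list. Only a subset of the rules is checked, for example nothing here enforces that every entry ends in a data element.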

jgray
05-22-2009, 12:50 AM
Since MS Reader also does dictionaries (and nicely, too), I was curious what sort of markup Reader used. I downloaded the Dictionary Authoring Kit from here: http://www.microsoft.com/reader/developers/downloads/dak.aspx

It is a self extracting archive, so I just unzipped it and in the Documentation folder, found "dak.chm". It seems that Microsoft uses a subset of TEI tags for Reader dictionaries. Very interesting that they used an existing standard.

I wonder if this existing method would be good to incorporate into epub, rather than reinventing the wheel? Of course, regardless of what method is used, we still need to wait until reading software supports dictionary lookup.

Here is a sample entry that I pasted from the "dak.chm" file:


Sample Dictionary Fragment
A typical EBDICT dictionary fragment might look like this:

<tei-ms:text>
  <tei-ms:body>
    <tei-ms:div0>
      <tei-ms:div1>
        <tei-ms:div2>
          <tei-ms:entry>
            <tei-ms:form>
              <tei-ms:orth>dictionary</tei-ms:orth>
              <tei-ms:syll>dic|tion|ar|y</tei-ms:syll>
            </tei-ms:form>
            <tei-ms:gramGrp><tei-ms:pos>n</tei-ms:pos></tei-ms:gramGrp>
            <tei-ms:sense n="1">
              <tei-ms:def>
                A reference book that contains words listed in alphabetical order and gives explanations of their meanings, often with additional information about grammar, pronunciation, and etymology.
              </tei-ms:def>
            </tei-ms:sense>
            <tei-ms:sense n="2">
              <tei-ms:def>
                A foreign-language reference book of words: a reference book that gives equivalents of words and phrases in two or more languages, often with translations from each language to the other in separate sections.
              </tei-ms:def>
              <tei-ms:eg>A Spanish-English dictionary</tei-ms:eg>
            </tei-ms:sense>
          </tei-ms:entry>
        </tei-ms:div2>
      </tei-ms:div1>
    </tei-ms:div0>
  </tei-ms:body>
</tei-ms:text>

where:

<tei-ms:entry> delimits an entry
<tei-ms:orth> gives the orthographic (written) form of the headword
<tei-ms:syll> gives the syllabification
<tei-ms:pos> specifies the part of speech (in this case, a noun)
<tei-ms:sense> gives information about a particular sense of the word
<tei-ms:def> gives the definition of the word in that sense
<tei-ms:eg> gives an example of the usage of the word in that sense
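
As a rough illustration of how reading software might pull these fields out of an entry, here is a Python sketch using the standard library's ElementTree. The fragment above does not show a namespace declaration for the `tei-ms` prefix, so the URI below is a placeholder assumption, and the definitions are shortened stand-ins for the ones in the sample:

```python
import xml.etree.ElementTree as ET

# Placeholder URI: the real tei-ms namespace is not shown in the excerpt above.
TEI_NS = "http://example.org/tei-ms"

ENTRY = """<tei-ms:entry xmlns:tei-ms="http://example.org/tei-ms">
  <tei-ms:form>
    <tei-ms:orth>dictionary</tei-ms:orth>
    <tei-ms:syll>dic|tion|ar|y</tei-ms:syll>
  </tei-ms:form>
  <tei-ms:gramGrp><tei-ms:pos>n</tei-ms:pos></tei-ms:gramGrp>
  <tei-ms:sense n="1">
    <tei-ms:def>A reference book.</tei-ms:def>
  </tei-ms:sense>
  <tei-ms:sense n="2">
    <tei-ms:def>A foreign-language reference book.</tei-ms:def>
    <tei-ms:eg>A Spanish-English dictionary</tei-ms:eg>
  </tei-ms:sense>
</tei-ms:entry>"""


def parse_entry(xml_text):
    """Extract headword, part of speech, and senses from one entry element."""
    ns = {"t": TEI_NS}
    entry = ET.fromstring(xml_text)
    senses = []
    for sense in entry.findall("t:sense", ns):
        senses.append({
            "n": sense.get("n"),
            "def": sense.findtext("t:def", default="", namespaces=ns).strip(),
            "eg": sense.findtext("t:eg", default=None, namespaces=ns),
        })
    return {
        "headword": entry.findtext("t:form/t:orth", namespaces=ns),
        "pos": entry.findtext("t:gramGrp/t:pos", namespaces=ns),
        "senses": senses,
    }
```

The point of the sketch is that the markup separates the headword from the definitions, which is what a lookup feature needs and what plain XHTML paragraphs would not provide on their own.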


I hope that the folks at IDPF are working on some type of dictionary format for epub. I commented last year over on Teleread that I thought dictionary support was something that was badly needed in epub.

Nate the great
05-27-2009, 12:46 PM
I wonder if this existing method would be good to incorporate into epub, rather than reinventing the wheel? Of course, regardless of what method is used, we still need to wait until reading software supports dictionary lookup.

I hope that the folks at IDPF are working on some type of dictionary format for epub. I commented last year over on Teleread that I thought dictionary support was something that was badly needed in epub.

I don't think the MSReader tags should be adopted, but you do have a good point. I'm now leaning towards recommending the adoption of XFXD tags as an extension to Epub. What does everyone think?

jgray
05-27-2009, 07:15 PM
I don't think the MSReader tags should be adopted, but you do have a good point. I'm now leaning towards recommending the adoption of XFXD tags as an extension to Epub. What does everyone think?

I wasn't saying that the MS tags should be used specifically. However, since MS based their tags on TEI, I was wondering if IDPF couldn't do the same? If not TEI, then some other existing standard. Since epub is already based on existing standards, this would make more sense than starting from scratch for dictionary support.

jgray
05-27-2009, 07:16 PM
BTW, do you have a link to some info about XFXD. Google isn't being helpful.

Nate the great
05-27-2009, 07:22 PM
Wallcraft posted a link early in the thread:
http://xdxf.revdanica.com/

I got the letter order wrong.

Nate the great
05-27-2009, 11:40 PM
I wasn't saying that the MS tags should be used specifically. However, since MS based their tags on TEI, I was wondering if IDPF couldn't do the same? If not TEI, then some other existing standard. Since epub is already based on existing standards, this would make more sense than starting from scratch for dictionary support.

A question occurred to me today that needs to be asked before going further. Given that TEI tags existed long before the Epub spec was finalized, why wasn't it included as a related standard? At the very least, why wasn't a subset of tags included in a manner similar to the preferred HTML vocabulary?

I wonder if there was a good reason for not using an existing standard. We may find out.

Peter Sorotokin
05-28-2009, 12:23 AM
I do not think XDXF would work well because it is a single file. Russian-English dictionary is 100 meg XML file: loading that in handheld memory would be challenging. So, at a minimum, single XML needs to be broken into pieces. Also, it is not an issue how to represent the content: it can be done either by CSS-styled XDXF snippets or CSS-styled XHTML with classes. This part is OK, no changes to the standard are required. The issue is how to build an index that can quickly guide reading system to the appropriate part of the content. Note that a single index file won't cut it - it will likely be too large. Some sort of hierarchical structure broken between several files is needed. That, I think, is an extension to EPUB that needs to be defined (or borrowed).
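
One possible shape for such a hierarchical index, sketched in Python with in-memory lists standing in for the separate files. This is only an illustration of the idea, not a proposed format, and the function names and the two-level layout are assumptions. A small top-level index records the first headword of each leaf index, so a reading system would need to load only the top level plus one leaf per lookup:

```python
from bisect import bisect_right


def build_index(entries, leaf_size):
    """Split sorted (headword, location) pairs into leaves plus a top-level index.

    Each leaf stands in for one small index file; the top level lists the
    first headword of every leaf.
    """
    leaves = [entries[i:i + leaf_size] for i in range(0, len(entries), leaf_size)]
    top = [leaf[0][0] for leaf in leaves]
    return top, leaves


def lookup(word, top, leaves):
    """Return the content location for word, or None if it is not indexed."""
    i = bisect_right(top, word) - 1  # last leaf whose first headword <= word
    if i < 0:
        return None
    leaf = leaves[i]  # a real reader would load only this one file
    keys = [headword for headword, _ in leaf]
    j = bisect_right(keys, word) - 1
    if j >= 0 and keys[j] == word:
        return leaf[j][1]
    return None
```

With deeper nesting the same two steps repeat per level, so the amount of index data in memory at any time stays bounded by the leaf size rather than by the size of the dictionary.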

jgray
05-28-2009, 01:09 AM
Yes, the indexing system would be the most important part. I don't know how MS does their indexing, as it is internal to their Reader software. I do know that however they do it, it is very fast.

It is a shame that MS chose not to participate in IDPF. I think they could have made some useful contributions. But collaboration has never been something they were interested in.

HarryT
05-28-2009, 03:43 AM
What is the formal procedure for proposing an extension to the ePub standard? What is the likelihood that an extension proposed by a "member of the public", as opposed to one of the companies who are on the standard committee, will actually get adopted?

kovidgoyal
05-28-2009, 12:58 PM
You don't really need a modification to the EPUB standard; the following should do the trick:

Split the HTML containing the definitions into sub-files, each containing only the definitions for words starting with a specific pair of letters. There will be 26*26 = 676 such files. In the NCX, add a navPoint for each file, with its text being the two letters that the file covers. Then, in the OPF file, add an entry indicating that the EPUB is a dictionary. When asked for the definition of a word, the reader software then has to do the following:

Parse the 676 entries in the NCX file to find the correct HTML file, then parse that HTML file to find the word.

If two letters results in too large HTML files, use three letters instead.

The HTML files should be designed with minimal in file markup to speed up processing.
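
The scheme above can be sketched in a few lines of Python. This is a simplified model: dictionaries stand in for the HTML files and for the NCX navPoints, and the file naming pattern and function names are made up for illustration:

```python
from collections import defaultdict


def bucket_key(word, prefix_len=2):
    """Return the leading letters that decide which file a word belongs to."""
    return word.lower()[:prefix_len]


def split_into_files(definitions, prefix_len=2):
    """Group word -> definition pairs into one bucket per letter prefix."""
    files = defaultdict(dict)
    for word, text in definitions.items():
        files[bucket_key(word, prefix_len)][word.lower()] = text
    return dict(files)


def build_ncx(files):
    """Stand-in for the NCX: navPoint label -> file name (naming is made up)."""
    return {prefix: "defs_%s.html" % prefix for prefix in sorted(files)}


def lookup(word, ncx, files, prefix_len=2):
    """Find the right file via the NCX labels, then the word inside that file."""
    prefix = bucket_key(word, prefix_len)
    if prefix not in ncx:
        return None
    return files[prefix].get(word.lower())
```

Switching to three letters is just `prefix_len=3`. Note that the 26*26 count assumes words written in plain a-z; one-letter words and other scripts would produce buckets outside that grid.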

Nate the great
05-28-2009, 01:21 PM
I snipped some posts and moved them over here (http://www.mobileread.com/forums/showthread.php?t=47861). I had incorrect information, and took the discussion down the wrong path.

jgray
05-28-2009, 01:25 PM
Nice idea, but we still need whatever method is used to be included in the standard, so we have interoperability. Also, a set of tags specifically for dictionary markup is needed.

Whatever tags are used and whatever indexing/lookup method is chosen probably doesn't matter too much. We just need IDPF to do something, so that reader software will have a standard to follow.

Hey IDPF--how about letting us know if anything is being done about this.

igorsk
05-29-2009, 09:57 AM
One thing that many dictionary formats miss is that one entry can be indexed by many headwords. This is quite important for languages like Japanese. For example, meguriau, 巡り会う, めぐり会う and めぐりあう are all different spellings of the same word and all should match the entry.
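
In index terms this means the lookup table has to be many-to-one: several headword spellings pointing at the same entry. A minimal Python sketch, where the entry id and the data layout are assumptions made for illustration and the definition text is a placeholder:

```python
# One entry with several written forms; the id "e1" is hypothetical.
ENTRIES = {
    "e1": {
        "headwords": ["巡り会う", "めぐり会う", "めぐりあう"],
        "text": "(definition text)",
    },
}


def build_headword_index(entries):
    """Map every spelling to the ids of the entries it should match."""
    index = {}
    for entry_id, entry in entries.items():
        for form in entry["headwords"]:
            index.setdefault(form, []).append(entry_id)
    return index
```

Each value is a list because the reverse also happens: one spelling can belong to more than one entry.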

HarryT
05-30-2009, 02:24 AM
What is the formal procedure for proposing an extension to the ePub standard? What is the likelihood that an extension proposed by a "member of the public", as opposed to one of the companies who are on the standard committee, will actually get adopted?

Can anybody answer these questions, please? Nate?

Jellby
05-30-2009, 04:36 AM
I don't know, but there is a forum at https://www.idpf.org/forums/ ...

jgray
05-30-2009, 05:38 PM
I don't know, but there is a forum at https://www.idpf.org/forums/ ...

Did you notice the last time a post was made on those forums? Also, the number of questions that were never answered? Without someone at IDPF contributing a lot more to those forums, they are dead.

Nate the great
06-04-2009, 02:31 PM
What is the formal procedure for proposing an extension to the ePub standard? What is the likelihood that an extension proposed by a "member of the public", as opposed to one of the companies who are on the standard committee, will actually get adopted?

I don't know, twice. I sent an email to Michael Smith, the IDPF executive director. I have not received a response.

zelda_pinwheel
06-04-2009, 02:35 PM
Hadrien is a member of idpf ; it's possible he could help or at least give some answer. also, garth conboy of eti is very involved and also very friendly ; you might want to contact him.

Nate the great
06-04-2009, 03:15 PM
Nice idea, but we still need whatever method is used to be included in the standard, so we have interoperability. Also, a set of tags specifically for dictionary markup is needed.

Yes and no. Do you really think anyone will want to write code that actually makes use of all the XDXF tags? I'm not so sure. Remember, the tags aren't absolutely necessary simply to have the information. If you want the information, you can use XHTML and simply add it as text.

I do think XDXF should be considered as an extension to Epub, but only after it achieves 1.0 status.

If we adopt this position as part of the proposal, then the current set of tags won't need to duplicate all or even most of the abilities of XDXF. Instead, we can look at this project as a set of reference tags, not dictionary tags.

BTW, the set of tags I show here are enough to provide dictionary lookup similar to Mobipocket.

jgray
06-04-2009, 08:05 PM
I didn't say that the tags had to be XDXF. I don't really care how they are done. XHTML is fine. The point I was making was that some set of tags to markup dictionaries is needed and it needs to be standardized (by IDPF) so that the reader software can support it.

zelda_pinwheel
07-21-2009, 09:22 AM
so what is the latest on this project ? abelturd has started a thread over here (http://www.mobileread.com/forums/showthread.php?t=51579) to draft some kind of group request for idpf, it seems like an excellent complement to this thread.

GeoffC
10-16-2009, 04:56 AM
^ ^ ^ ^ ^ ^ ^ ^ ^
What Zelda said ......