Snipped from Proposal: Extending Epub


Nate the great
05-28-2009, 10:31 AM
NOTE: I moved these from another thread because I made a mistake and took the thread off topic.

I do not think XDXF would work well because it is a single file. A Russian-English dictionary is a 100 MB XML file; loading that into a handheld's memory would be challenging. So, at a minimum, the single XML file needs to be broken into pieces. Also, how to represent the content is not an issue: it can be done either with CSS-styled XDXF snippets or with CSS-styled XHTML with classes. This part is OK; no changes to the standard are required. The issue is how to build an index that can quickly guide the reading system to the appropriate part of the content. Note that a single index file won't cut it: it will likely be too large. Some sort of hierarchical structure split across several files is needed. That, I think, is an extension to EPUB that needs to be defined (or borrowed).

I see the index as a feature of the reader software, not the Epub format. That's how Mobipocket Reader does it. When you look at the index of a dictionary in MobiReader, what you see on the screen was generated on the fly by the software. You're not looking at the contents of a file.

The Kindle does create index files, true. But it indexes all the ebooks on the device, not just the ones with Mobipocket's reference tags. This is a software feature that can be implemented now, without using these tags.

Besides, I thought the purpose of these tags was to remove the need for a separate index file.

tompe
05-28-2009, 11:40 AM
I see the index as a feature of the reader software, not the Epub format. That's how Mobipocket Reader does it.

Are you sure about that? I have always assumed that a MobiPocket dictionary contains an index.

Nate the great
05-28-2009, 11:54 AM
Are you sure about that? I have always assumed that a MobiPocket dictionary contains an index.

I'm fairly certain that it's a result of the reader software, and not part of the ebook.

tompe
05-28-2009, 11:58 AM
I'm fairly certain that it's a result of the reader software, and not part of the ebook.

I am fairly certain that my Cybook does not index my dictionary. But I am a bit unsure about what you mean by indexing.

You realize that the HTML tags on MobiPocket's homepage are not a description of the MobiPocket format? When mobigen is run, the HTML files can be converted to anything and an index can be added to the book.

DaleDe
05-28-2009, 12:29 PM
I am fairly certain that my Cybook does not index my dictionary. But I am a bit unsure about what you mean by indexing.

You realize that the HTML tags on MobiPocket's homepage are not a description of the MobiPocket format? When mobigen is run, the HTML files can be converted to anything and an index can be added to the book.

Exactly, the performance issues clearly dictate that some sort of index be provided. Searching the dictionary as a linear database would take way too long.

Dale

Nate the great
05-28-2009, 12:59 PM
I am fairly certain that my Cybook does not index my dictionary. But I am a bit unsure about what you mean by indexing.

You realize that the HTML tags on MobiPocket's homepage are not a description of the MobiPocket format? When mobigen is run, the HTML files can be converted to anything and an index can be added to the book.

I'm talking about something you can't do with the Cybook. It lacks the software because it's running an early generation of Mobipocket Java code.

DaleDe
05-28-2009, 01:02 PM
I'm talking about something you can't do with the Cybook. It lacks the software because it's running an early generation of Mobipocket Java code.

Actually the Cybook runs the latest generation of Mobipocket Java code and does support dictionaries. It is currently the best implementation of Mobipocket available in any eBook device with the possible exception of the Kindle.

Dale

Nate the great
05-28-2009, 01:08 PM
Actually the Cybook runs the latest generation of Mobipocket Java code and does support dictionaries. It is currently the best implementation of Mobipocket available in any eBook device with the possible exception of the Kindle.

Dale

No. The Hanlin V3 runs the latest version. This has been demonstrated (http://www.mobileread.com/forums/showthread.php?t=43964).

Nate the great
05-28-2009, 01:22 PM
Okay. I went and built my World Fact eBook again. The log shows that the indexes are built into the book. I was wrong.

DaleDe
05-28-2009, 01:35 PM
No. The Hanlin V3 runs the latest version. This has been demonstrated (http://www.mobileread.com/forums/showthread.php?t=43964).

The Hanlin has no dictionary support at all and its font support is quite limited. If it is the latest version then Cybook has done some serious modification. One thing I want to see in the Hanlin mobi support is to decrease the font size on superscript items.

Nate the great
05-28-2009, 01:51 PM
The Hanlin has no dictionary support at all and its font support is quite limited. If it is the latest version then Cybook has done some serious modification. One thing I want to see in the Hanlin mobi support is to decrease the font size on superscript items.

I hadn't known that the V3 doesn't have dictionary support. That's odd. Jinke must have screwed up when porting the software.

Peter Sorotokin
05-28-2009, 05:25 PM
I see the index as a feature of the reader software, not the Epub format. That's how Mobipocket Reader does it. When you look at the index of a dictionary in MobiReader, what you see on the screen was generated on the fly by the software. You're not looking at the contents of a file.

That's a legitimate point of view, of course. The problem is that building the index on the device would be slow and drain the battery, and building it elsewhere would mean that special software needs to be used to transfer the book to the device. I think that support for indexing is just too central for a dictionary to leave it out.

DaleDe
05-28-2009, 07:52 PM
That's a legitimate point of view, of course. The problem is that building the index on the device would be slow and drain the battery, and building it elsewhere would mean that special software needs to be used to transfer the book to the device. I think that support for indexing is just too central for a dictionary to leave it out.

I believe you are right. I suspect the easy way to generate an index is just to develop a linked list of all the words. Simple to navigate and does not need a separate list of words.

Dale

igorsk
05-29-2009, 10:57 AM
One thing that many dictionary formats miss is that one entry can be indexed by many headwords. This is quite important for languages like Japanese. For example, meguriau, 巡り会う, めぐり会う and めぐりあう are all different spellings of the same word and all should match the entry.
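To make the many-headwords point concrete, here is a minimal sketch in Python, assuming an in-memory index that maps each spelling to the same entry. The file name, the entry id and the `lookup` helper are hypothetical; nothing here is part of EPUB or of any existing dictionary format.

```python
# Hypothetical target: the file name and fragment id are made up for illustration.
ENTRY_HREF = "ME.html#meguriau"

# Several headwords (variant spellings) all point at the same entry.
headword_index = {
    "巡り会う": ENTRY_HREF,    # kanji spelling
    "めぐり会う": ENTRY_HREF,  # mixed kana and kanji spelling
    "めぐりあう": ENTRY_HREF,  # hiragana spelling
}


def lookup(word):
    """Return the href of the entry indexed under `word`, or None if absent."""
    return headword_index.get(word)


# Any of the spellings resolves to the one shared entry.
same_entry = lookup("巡り会う") == lookup("めぐりあう")
```

The cost is one index record per spelling rather than one per entry, which matters for the index size concerns raised elsewhere in this thread.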

DaleDe
05-29-2009, 02:30 PM
One thing that many dictionary formats miss is that one entry can be indexed by many headwords. This is quite important for languages like Japanese. For example, meguriau, 巡り会う, めぐり会う and めぐりあう are all different spellings of the same word and all should match the entry.

The entry itself should list all the words that match it IMHO.

Dale

Nate the great
05-29-2009, 04:20 PM
That's a legitimate point of view, of course. The problem is that building the index on the device would be slow and drain the battery, and building it elsewhere would mean that special software needs to be used to transfer the book to the device. I think that support for indexing is just too central for a dictionary to leave it out.

I only wrote that because I misunderstood how Mobipocket handled indexes. I was also hoping this thread would die, but oh well.

I think it's safe to assume that the indexes will be made during the ebook creation process. I would suggest that each index in an ebook be in a separate file (or files). Given that Epub is basically zipped HTML, an index will likely consist of one or more links (that lead to other places in the ebook).

I think we should consider copying the behavior of Mobipocket indexes. A link in the title index doesn't lead to the respective title. Instead, it leads to the beginning of the entry containing the title. (I also agree with Igorsk about the need for multiple headwords.) In the keyword index, each keyword will be listed once and will link to a separate file consisting of links to each of the entries containing the keyword. The links won't lead to the keyword, but to the beginning of the entry.

Peter Sorotokin
05-30-2009, 02:15 AM
I think it's safe to assume that the indexes will be made during the ebook creation process. I would suggest that each index in an ebook be in a separate file (or files). Given that Epub is basically zipped HTML, an index will likely consist of one or more links (that lead to other places in the ebook).

I think we should consider copying the behavior of Mobipocket indexes. A link in the title index doesn't lead to the respective title. Instead, it leads to the beginning of the entry containing the title. (I also agree with Igorsk about the need for multiple headwords.) In the keyword index, each keyword will be listed once and will link to a separate file consisting of links to each of the entries containing the keyword. The links won't lead to the keyword, but to the beginning of the entry.

So what you are saying is that an entry in the index will look like this (assuming specialized XML mark-up for the index):

<entry href="QU.html#queen">queen</entry>
And the corresponding dictionary article would look like this (assuming XHTML for the content):

<dl id="queen">
<dt>queen</dt>
<dd>a female sovereign or monarch</dd>
</dl>
That would work on the syntax level, but I don't think a flat index file containing all words is going to cut it: it is still going to be too big.
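As a rough illustration of how a reading system might follow such an index entry, here is a sketch in Python using the standard `xml.etree.ElementTree` module. The `body` wrapper, the `SECTION` constant and the `resolve` function are assumptions made for the example, and the article markup is used with its dt element properly closed. Real XHTML content would carry the XHTML namespace, which this sketch ignores.

```python
import xml.etree.ElementTree as ET

# The dictionary article from the post, wrapped in a body element so that it
# parses as a single document. The wrapper is an assumption for this sketch.
SECTION = """<body>
<dl id="queen">
<dt>queen</dt>
<dd>a female sovereign or monarch</dd>
</dl>
</body>"""


def resolve(section_xml, fragment):
    """Return (primary term, definition) for the dl whose id equals `fragment`."""
    root = ET.fromstring(section_xml)
    for dl in root.iter("dl"):
        if dl.get("id") == fragment:
            # The first dt inside the dl is treated as the primary term.
            return dl.findtext("dt"), dl.findtext("dd")
    return None


# An index href such as "QU.html#queen" splits into a file name and a fragment.
file_name, _, fragment = "QU.html#queen".partition("#")
print(file_name, resolve(SECTION, fragment))
```

The lookup inside the section is a linear scan here, which is tolerable only because each section is assumed to be small.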

Also, can we use any existing mark-up (e.g. XHTML or perhaps NCX) for the index file? Should we just use an XHTML-based index with some metadata marking it as such?

I'll think a bit more about it.

Nate the great
05-30-2009, 08:28 AM
Actually, I was thinking of straightforward HTML for the index entry:

<a href="dictionary.html#d_somenumberX">dictionary</a><br />

It would have a corresponding link in the body of the ebook, of course.

HarryT
05-30-2009, 08:36 AM
It's certainly no problem having multiple headwords for a single entry in Mobi dictionaries. The Chambers dictionary I have on my Gen3 will, for example, find words with variant spellings - eg "center" or "centre".

Peter Sorotokin
05-30-2009, 02:12 PM
Actually, I was thinking of straightforward HTML for the index entry:

<a href="dictionary.html#d_somenumberX">dictionary</a><br />

It would have a corresponding link in the body of the ebook, of course.

I see. So here is a minimalistic proposal based on the discussion so far:

1. Add metadata tags (exact tags TBD) indicating that the EPUB is a dictionary, an optional "input" language (the language that the dictionary articles are in is indicated by the dc:language element), an optional reference to the index file and an optional collation declaration that describes the order of terms in the dictionary.

2. The dictionary should be split into multiple sections. In addition, an index file can optionally be provided. The index file should have the linear="no" attribute in the spine. If an index is provided, it should be referenced by the metadata.

3. Each entry in the dictionary must be formatted using the XHTML dl tag. The first dt tag inside the dl is considered to be the primary term. Dictionary entries must go in the order specified by the collation, both inside a single section and across all sections as they are referenced in the spine.

4. The index is an XHTML file (exact structure TBD) that lists the sections of the dictionary itself (as opposed to supplementary material) and only the first term for each section. That both allows for efficient search and does not bloat the index.
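The lookup that point 4 implies can be sketched as follows, in Python. This is only an illustration under stated assumptions: the section file names and first terms are invented, and plain string comparison stands in for whatever collation the metadata would declare.

```python
from bisect import bisect_right

# Hypothetical sparse index: one (first term, section file) pair per section,
# listed in collation order. Plain string order stands in for the collation.
SPARSE_INDEX = [
    ("aardvark", "A.html"),
    ("pace", "P.html"),
    ("quack", "QU.html"),
    ("rabbit", "R.html"),
]


def section_for(word):
    """Return the section file whose range of terms would contain `word`."""
    first_terms = [term for term, _ in SPARSE_INDEX]
    # Find the last section whose first term sorts at or before the word.
    pos = bisect_right(first_terms, word) - 1
    if pos < 0:
        return None  # the word sorts before the first section
    return SPARSE_INDEX[pos][1]


print(section_for("queen"))
```

Only the chosen section then has to be loaded and searched, so the index grows with the number of sections rather than with the number of entries.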

Peter

Nate the great
05-30-2009, 03:47 PM
I see. So here is a minimalistic proposal based on the discussion so far:

1. Add metadata tags (exact tags TBD) indicating that the EPUB is a dictionary, an optional "input" language (the language that the dictionary articles are in is indicated by the dc:language element), an optional reference to the index file and an optional collation declaration that describes the order of terms in the dictionary.

2. The dictionary should be split into multiple sections. In addition, an index file can optionally be provided. The index file should have the linear="no" attribute in the spine. If an index is provided, it should be referenced by the metadata.

3. Each entry in the dictionary must be formatted using the XHTML dl tag. The first dt tag inside the dl is considered to be the primary term. Dictionary entries must go in the order specified by the collation, both inside a single section and across all sections as they are referenced in the spine.

4. The index is an XHTML file (exact structure TBD) that lists the sections of the dictionary itself (as opposed to supplementary material) and only the first term for each section. That both allows for efficient search and does not bloat the index.

Peter

Why don't we try to limit this thread to just the discussion of the index?

1, yes.

2, I think an index should be required due to the need for a speedy lookup.

3, The dl tags seem to be duplicating what we are trying to do with the XML tags, and can't achieve the specificity desired. Why use both?

4, Let me expand on what I wrote before.

A dictionary, for example, will have at a minimum a title index (or something that will serve that purpose). It might also have one or more keyword indexes.

The title index will be in its own file that is separate from the rest of the book as well as being separate from the other indexes. Each index will be in a separate file (or files) from the other indexes. If there is more than one type of keyword (example: "famous people" & "famous places"), each type of keyword will have its own index with its own files.

Here is where my explanation wasn't clear before. A keyword index, "famous people" for example, would be in the file "famous people_x.html". The entries would look like this:
<a href="johnny appleseed.html">Johnny Appleseed</a><br />
<a href="Kevin Costner.html">Kevin Costner</a><br />
etc.
The file "johnny appleseed.html" would contain entries something like this:
<a href="dictionary.html#d_somenumberX">an entry</a><br />
<a href="dictionary.html#d_somenumberY">another entry</a><br />
So a keyword index would actually consist of a group of files.
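A small Python sketch of how an ebook creation tool might generate that group of files is shown below. The `build_keyword_index` function, the lower-cased file names and the `d_somenumberZ` target are assumptions for illustration; the markup follows the link-plus-br form shown above, and hrefs are not percent-encoded.

```python
from html import escape


def build_keyword_index(index_name, keywords):
    """Return {file name: HTML text} for a two-level keyword index.

    `keywords` maps each keyword to a list of (link text, href) pairs that
    point at the beginning of each entry containing the keyword.
    """
    files = {}
    top_lines = []
    for keyword, targets in keywords.items():
        keyword_file = keyword.lower() + ".html"  # naming scheme is assumed
        top_lines.append(
            f'<a href="{escape(keyword_file)}">{escape(keyword)}</a><br />'
        )
        # One file per keyword, holding links to the entries that mention it.
        files[keyword_file] = "\n".join(
            f'<a href="{escape(href)}">{escape(text)}</a><br />'
            for text, href in targets
        )
    # The top-level file lists every keyword once.
    files[index_name + "_x.html"] = "\n".join(top_lines)
    return files


files = build_keyword_index(
    "famous people",
    {
        "Johnny Appleseed": [
            ("an entry", "dictionary.html#d_somenumberX"),
            ("another entry", "dictionary.html#d_somenumberY"),
        ],
        # Hypothetical target, not taken from the post.
        "Kevin Costner": [("an entry", "dictionary.html#d_somenumberZ")],
    },
)
print(files["famous people_x.html"])
```

Each keyword costs one extra small file, so the top-level file stays short even when a keyword appears in many entries.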

Peter Sorotokin
05-31-2009, 02:34 PM
Why don't we try to limit this thread to just the discussion of the index?

OK, but we'll need to discuss XHTML vs. TEI or XDXF. I am leaning towards using XHTML.

A dictionary, for example, will have at a minimum a title index (or something that will serve that purpose). It might also have one or more keyword indexes.


While I see a similarity between the title index and the keyword index, for practical purposes they may need to be treated somewhat differently (like in p-word). For foreign language dictionaries, the title index is going to bloat to the same size as the dictionary itself (since definitions are often only as long as the link would be). On the other hand, each individual piece of the dictionary body is already self-indexing, since words go in alphabetical order.

Keyword indices, by contrast, have to list every word (since they cannot rely on the document structure), but typically won't be as large (judging by printed books). Also, in many cases, a keyword index includes a short definition for each term, in addition to the link(s) to the book body. From that perspective, keyword indices are more similar to small dictionaries than to the title index.

Finally, my instinct is to avoid the br tag. Wrap each link in p, li, dt, or whatever, instead.

Peter

Nate the great
06-02-2009, 10:10 PM
OK, but we'll need to discuss XHTML vs. TEI or XDXF. I am leaning towards using XHTML.


I would prefer XHTML because simpler is usually better.

While I see a similarity between the title index and the keyword index, for practical purposes they may need to be treated somewhat differently (like in p-word). For foreign language dictionaries, the title index is going to bloat to the same size as the dictionary itself (since definitions are often only as long as the link would be). On the other hand, each individual piece of the dictionary body is already self-indexing, since words go in alphabetical order.


I disagree about the entry length. I looked at the WordNet Mobi dictionary. The average entry was at least twice as long as the link.

Also, while the entries of a dictionary are alphabetical, having a list of just headwords without the entries means you can look at and discard more entries at a time. This will make finding a word (with uncertain spelling) faster.

Question: would it be possible to build the headword index into the toc.ncx file? If so, could it behave like an index?


Keyword indices, by contrast, have to list every word (since they cannot rely on the document structure), but typically won't be as large (judging by printed books). Also, in many cases, a keyword index includes a short definition for each term, in addition to the link(s) to the book body. From that perspective, keyword indices are more similar to small dictionaries than to the title index.

Peter

A definition would be in the body of the text, not the keyword index. I've never seen a reference title that had an index with definitions. I've seen books with both glossaries and indices, but they were separate entities.

Peter Sorotokin
06-04-2009, 11:04 AM
I disagree about the entry length. I looked at the WordNet Mobi dictionary. The average entry was at least twice as long as the link.

Oh, but 2 is approximately 1 ;-). I bet a full index for a 100M Russian-English dictionary is going to be at least 10M and my gut feeling tells me that's about 10 times more than practical.
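The back-of-the-envelope arithmetic behind that bet can be written out as a short Python sketch. It assumes, purely for illustration, that the dictionary body consists only of entries, that a full index holds exactly one link per entry, and that the ratio of entry length to link length is uniform; the `full_index_size` function is hypothetical.

```python
MB = 1_000_000


def full_index_size(dictionary_size, entry_to_link_ratio):
    """Estimate the size of an index holding one link per dictionary entry."""
    return dictionary_size / entry_to_link_ratio


# Entries twice as long as their links (the ratio quoted above as a lower bound).
print(full_index_size(100 * MB, 2) / MB)   # 50.0
# Entries would need to average ten times the link length to get down to 10M.
print(full_index_size(100 * MB, 10) / MB)  # 10.0
```

Under these assumptions a ratio of 2 gives a 50M index, consistent with the "at least 10M" estimate.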

Also, while the entries of a dictionary are alphabetical, having a list of just headwords without the entries means you can look at and discard more entries at a time. This will make finding a word (with uncertain spelling) faster.

You can think of my proposal as a search tree (although a very shallow one). I think it is better for searches than a flat array in almost all cases.

Question: would it be possible to build the headword index into the toc.ncx file? If so, could it behave like an index?

Per spec, I do not see how, but I'd rather someone else confirm it.

Peter Sorotokin
06-04-2009, 11:10 AM
BTW, judging by the level of interest that this thread generated, people care about dictionaries even less than I thought ;-)

DaleDe
06-04-2009, 12:33 PM
BTW, judging by the level of interest that this thread generated, people care about dictionaries even less than I thought ;-)

Due to the technical nature of this thread I don't see how you can jump to that conclusion. Most people don't care how a dictionary is implemented, they just want one.

Dale

jgray
06-04-2009, 02:25 PM
Due to the technical nature of this thread I don't see how you can jump to that conclusion. Most people don't care how a dictionary is implemented, they just want one.

Dale

I agree with Dale on this. Peter, perhaps you are thinking too much like an engineer and not enough like an average reader? As for interest in this thread, I have been following it closely, as I am sure others have.

Sabardeyn
06-04-2009, 02:38 PM
I've been following this discussion since the beginning, although I have not commented. The discussion is about a format that I don't use (mobi/mobipocket), have no personal knowledge of, and provides detailed discussion of the programming of same.

In other words, I don't see that I can contribute to the discussion in any constructive manner. So... :coffeebreak:

zelda_pinwheel
06-04-2009, 02:43 PM
Due to the technical nature of this thread I don't see how you can jump to that conclusion. Most people don't care how a dictionary is implemented, they just want one.

Dale

I agree with Dale on this. Peter, perhaps you are thinking too much like an engineer and not enough like an average reader? As for interest in this thread, I have been following it closely, as I am sure others have.

I've been following this discussion since the beginning, although I have not commented. The discussion is about a format that I don't use (mobi/mobipocket), have no personal knowledge of, and provides detailed discussion of the programming of same.

In other words, I don't see that I can contribute to the discussion in any constructive manner. So... :coffeebreak:

add my vote and agreement to that. i CARE about dictionaries, a LOT. and i desperately want dictionary support for epub. but i don't know anything about mobipocket to contribute. i have been following a bit though and the discussion interests me. and i'm very glad that nate has decided to tackle the question.

Valloric
06-06-2009, 11:09 PM
BTW, judging by the level of interest that this thread generated, people care about dictionaries even less than I thought ;-)

I would like to chime in with the rest and say that although I haven't added to the discussion, I'm following it closely. Dictionaries are important.

astra
06-07-2009, 08:32 AM
BTW, judging by the level of interest that this thread generated, people care about dictionaries even less than I thought ;-)

Because English is the first language for the great majority of the forum. I don't bother with a dictionary when (on very rare occasions) I read books in Russian.