Old 12-25-2009, 05:33 PM   #1
rogue_ronin
Encoding Dublin Core Metadata into XHTML

Sigil has started to recognize Dublin Core metadata that's encoded into (X)HTML. See brief discussion here.

There are two or three versions of the recommendations from DC on how to do it. I find the latest one pretty damn dry and repetitive; it has errors, and it seems more focused on discussing terminology than on how to actually do it.

So I'm going to start a discussion here, and hope that I can get some help on how to do it properly.

The basic DC metadata is only 15 items: Contributor, Coverage, Creator, Date, Description, Format, Identifier, Language, Publisher, Relation, Rights, Source, Subject, Title, and Type.

You assign a value to each one, and you're good to go.

But of course it's slightly more complicated than that.

EBooks make some specific demands on this, and Section 2.2 of the ePub spec makes some minor changes to the DC metadata spec. ePubs:
  • must have at least the Identifier, Language and Title elements, and
  • the Identifier element must have an id attribute, and
  • the Identifier element may have a scheme attribute, while
  • Creator and Contributor elements may have a role attribute, and
  • Creator and Contributor elements may have a file-as attribute, and
  • the Date element may have an event attribute.

Which are very useful, and make a lot of sense.
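
For reference, here's a rough sketch of what those ePub additions look like when built with Python's standard xml.etree.ElementTree. The namespace URIs are the usual DC 1.1 and OPF ones; the id value BookId, the scheme UUID, the role code aut and all the capitalized values are placeholders I've assumed for illustration, not anything the spec requires.

```python
import xml.etree.ElementTree as ET

DC_NS = "http://purl.org/dc/elements/1.1/"
OPF_NS = "http://www.idpf.org/2007/opf"

# Ask ElementTree to use readable prefixes when serializing.
ET.register_namespace("dc", DC_NS)
ET.register_namespace("opf", OPF_NS)


def dc(tag):
    """Qualified name for a Dublin Core element."""
    return f"{{{DC_NS}}}{tag}"


def opf(name):
    """Qualified name for an OPF element or attribute."""
    return f"{{{OPF_NS}}}{name}"


def build_metadata():
    """Build a metadata block with the three required elements plus
    one example each of the optional role, file-as and event attributes."""
    metadata = ET.Element(opf("metadata"))

    # Required: Identifier, with its mandatory id and optional scheme.
    identifier = ET.SubElement(
        metadata, dc("identifier"), {"id": "BookId", opf("scheme"): "UUID"}
    )
    identifier.text = "UNIQUE-ID"

    # Required: Language and Title.
    ET.SubElement(metadata, dc("language")).text = "LANGUAGE-CODE"
    ET.SubElement(metadata, dc("title")).text = "TITLE"

    # Optional: Creator with role and file-as.
    creator = ET.SubElement(
        metadata, dc("creator"), {opf("role"): "aut", opf("file-as"): "SORTABLE NAME"}
    )
    creator.text = "NAME"

    # Optional: Date with event.
    date = ET.SubElement(metadata, dc("date"), {opf("event"): "publication"})
    date.text = "YYYY-MM-DD"

    return metadata


print(ET.tostring(build_metadata(), encoding="unicode"))
```

Each element carries its own attributes here, which is exactly what lets two Creators or two Dates be told apart.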

But how do you add it to the XHTML? The DC specs don't really care, and they don't deal well with repeated elements. You can have 4 Dates, 3 Contributors and 11 Creators -- and no way to distinguish between them. (Unless I'm misreading it -- let me know!)

Skipping over profiles and namespaces for the moment (I'm sure we'll get back to them), the basic XHTML 1.1 tag is pretty simple:

<meta name="DC.contributor" content="NAME" />
<meta name="DC.coverage" content="TOPIC" />
<meta name="DC.creator" content="NAME" />
<meta name="DC.date" content="YYYY(-MM(-DD))" />
<meta name="DC.description" content="DESCRIPTION/ACCOUNT" />
<meta name="DC.format" content="MEDIUM/FORMAT" />
<meta name="DC.identifier" content="UNIQUE-ID" />
<meta name="DC.language" content="LANGUAGE-CODE" />
<meta name="DC.publisher" content="NAME" />
<meta name="DC.relation" content="RELATED RESOURCE" />
<meta name="DC.rights" content="COPYRIGHT STATEMENT" />
<meta name="DC.source" content="DERIVED FROM" />
<meta name="DC.subject" content="KEYWORDS" />
<meta name="DC.title" content="TITLE" />
<meta name="DC.type" content="GENRE" />
The capitalized words are just placeholders for the actual entries in any given book. The name value might use DCTERMS instead of DC depending on namespaces (more on that later).
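
Here's a minimal sketch of reading these tags back out of a file with Python's standard library. It assumes the XHTML is well-formed XML with no undefined entities (ElementTree will choke on those); the function name collect_dc_meta and the sample document are mine, made up for illustration.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

XHTML_NS = "http://www.w3.org/1999/xhtml"

# Hypothetical sample document; the capitalized values are placeholders.
SAMPLE = """<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<title>TITLE</title>
<meta name="DC.title" content="TITLE" />
<meta name="DC.creator" content="FIRST NAME" />
<meta name="DC.creator" content="SECOND NAME" />
<meta name="DC.language" content="LANGUAGE-CODE" />
<meta name="generator" content="SOME EDITOR" />
</head>
<body><p>TEXT</p></body>
</html>"""


def collect_dc_meta(xhtml_text):
    """Return {element: [values...]} for every DC or DCTERMS meta tag."""
    root = ET.fromstring(xhtml_text)
    found = defaultdict(list)
    for meta in root.iter(f"{{{XHTML_NS}}}meta"):
        name = meta.get("name", "")
        # "DC.creator" -> prefix "DC", element "creator"
        prefix, _, element = name.partition(".")
        if prefix.upper() in ("DC", "DCTERMS") and element:
            found[element.lower()].append(meta.get("content", ""))
    return dict(found)


for element, values in collect_dc_meta(SAMPLE).items():
    print(element, values)
```

Repeated names simply pile up in a list, in document order, with nothing to tell one Creator from the next.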

Overall, it's pretty good. But how to deal with the necessary extensions that ePub added? Let's keep in mind that XHTML eBooks ought to be thought of as an archival source (although they could be displayed on a physical reader) that is converted to whatever format is necessary. The idea is to be able to automate that conversion, such that typing in metadata is unnecessary. (And, as ePub is just wrapped XHTML, that conversion ought to be the simplest.)

The ePub metadata is an XML format -- we're trying to fit the flexibility of multiple tags into a single XHTML tag. It's doable, I think, but requires some thinking. Maybe that thinking has been done already?

There is an older DC-HTML spec, from 2003, that has the concept of "refinements". A refinement simply extended the name attribute with additional dots -- whatever came after the first dotted element name was the "refinement". So you might have:

<meta name="" content="YYYY(-MM(-DD))" />
that allowed you to define the publication date.
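
If dotted names like that were used, splitting them apart and carrying the refinement over to ePub's event attribute would be easy. A sketch in Python, assuming a refinement name of DC.date.publication (my own invention, chosen to line up with ePub's event attribute; I haven't checked it against the 2003 spec) and hypothetical helper names split_name and date_element:

```python
import xml.etree.ElementTree as ET

DC_NS = "http://purl.org/dc/elements/1.1/"
OPF_NS = "http://www.idpf.org/2007/opf"


def split_name(name):
    """Split 'DC.element.refinement...' into (element, [refinements])."""
    parts = name.split(".")
    if len(parts) < 2 or parts[0].upper() != "DC":
        raise ValueError(f"not a DC meta name: {name!r}")
    return parts[1].lower(), parts[2:]


def date_element(name, content):
    """Turn a DC.date meta tag into a dc:date element.

    Assumption: the first refinement, if any, is copied straight into
    the opf:event attribute. That mapping is illustrative only.
    """
    element, refinements = split_name(name)
    if element != "date":
        raise ValueError(f"expected a date element, got {element!r}")
    node = ET.Element(f"{{{DC_NS}}}date")
    node.text = content
    if refinements:
        node.set(f"{{{OPF_NS}}}event", refinements[0])
    return node


# Hypothetical refinement name; the content is a placeholder.
node = date_element("DC.date.publication", "YYYY-MM-DD")
print(split_name("DC.date.publication"))
print(node.attrib)
```

The same split would work for role or file-as on Creator and Contributor, though cramming two attributes into one dotted name gets awkward fast.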

Any thoughts or help?

m a r