Encoding Dublin Core Metadata into XHTML


rogue_ronin
12-25-2009, 06:33 PM
Sigil has started to recognize Dublin Core metadata that's encoded into (X)HTML. See brief discussion here. (http://www.mobileread.com/forums/showthread.php?p=711005#post711005)

There are two or three versions of the recommendations from DC on how to do it. I find that the latest one (http://dublincore.org/documents/dc-html/) is pretty damn dry, repetitive, has errors and seems more focused on discussing terminology than on how to actually do it.

So I'm going to start a discussion here, and hope that I can get some help on how to do it properly.

The basic DC metadata is only 15 items: Contributor, Coverage, Creator, Date, Description, Format, Identifier, Language, Publisher, Relation, Rights, Source, Subject, Title, and Type.

You assign a value to each one, and you're good to go.

But of course it's slightly more complicated than that.

EBooks make some specific demands on this, and Section 2.2 of the ePub spec (http://www.idpf.org/2007/opf/opf2.0/download/#Section2.2) makes some minor changes to the DC metadata spec. ePubs:


must have at least the Identifier, Language and Title elements, and
the Identifier element must have an id attribute, and
the Identifier element may have a scheme attribute, while
Creator and Contributor elements may have a role attribute, and
Creator and Contributor elements may have a file-as attribute, and
the Date element may have an event attribute.


Which are very useful, and make a lot of sense.
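For reference, here is a minimal Python sketch of what those constraints look like on the OPF side. The sample record and the helper name `missing_required` are made up for illustration; the namespace URIs are the standard Dublin Core and OPF ones. The check treats "at least one identifier with an id" as sufficient, which is an assumption about how strictly to read the rule.

```python
import xml.etree.ElementTree as ET

DC_NS = "http://purl.org/dc/elements/1.1/"

# Hypothetical sample record, only for illustration.
SAMPLE_OPF_METADATA = """
<metadata xmlns:dc="http://purl.org/dc/elements/1.1/"
          xmlns:opf="http://www.idpf.org/2007/opf">
  <dc:title>Book Title</dc:title>
  <dc:language>en</dc:language>
  <dc:identifier id="BookId" opf:scheme="ISBN">123456789X</dc:identifier>
  <dc:creator opf:role="aut"
              opf:file-as="LastName, First Middle">First Middle LastName</dc:creator>
  <dc:date opf:event="publication">2009-12</dc:date>
</metadata>
"""


def missing_required(metadata_xml):
    """Return a list of problems with the three required elements."""
    root = ET.fromstring(metadata_xml)
    problems = []
    for element in ("identifier", "language", "title"):
        if root.find("{%s}%s" % (DC_NS, element)) is None:
            problems.append("missing dc:" + element)
    # Assumption: one identifier carrying an id attribute is enough.
    identifiers = root.findall("{%s}identifier" % DC_NS)
    if identifiers and not any(item.get("id") for item in identifiers):
        problems.append("dc:identifier has no id attribute")
    return problems


print(missing_required(SAMPLE_OPF_METADATA))
```

The role, file-as, scheme, and event attributes in the sample are the optional extras from the list; the sketch does not validate them.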

But how to add it to the XHTML? Because the DC specs don't care, really, and don't deal with defining multiple elements well. You can have 4 Dates, 3 Contributors and 11 Creators -- and no way to distinguish between them. (Unless I'm misreading it -- let me know!)

Skipping over profiles and namespaces for the moment (I'm sure we'll get back to them), the basic XHTML 1.1 tag is pretty simple:


<meta name="DC.contributor" content="NAME" />
<meta name="DC.coverage" content="TOPIC" />
<meta name="DC.creator" content="NAME" />
<meta name="DC.date" content="YYYY(-MM(-DD))" />
<meta name="DC.description" content="DESCRIPTION/ACCOUNT" />
<meta name="DC.format" content="MEDIUM/FORMAT" />
<meta name="DC.identifier" content="UNIQUE-ID" />
<meta name="DC.language" content="LANGUAGE-CODE" />
<meta name="DC.publisher" content="NAME" />
<meta name="DC.relation" content="RELATED RESOURCE" />
<meta name="DC.rights" content="COPYRIGHT STATEMENT" />
<meta name="DC.source" content="DERIVED FROM" />
<meta name="DC.subject" content="KEYWORDS" />
<meta name="DC.title" content="TITLE" />
<meta name="DC.type" content="GENRE" />


The capitalized words are just placeholders for the actual entries in any given book. The name value might use DCTERMS instead of DC depending on namespaces (more on that later).

Overall, it's pretty good. But how to deal with the necessary extensions that ePub added? Let's keep in mind that XHTML eBooks ought to be thought of as an archival source (although they could be displayed on a physical reader) that is converted to whatever format is necessary. The idea is to be able to automate that conversion, such that typing in metadata is unnecessary. (And, as ePub is just wrapped XHTML, that conversion ought to be the simplest.)

The ePub metadata is an XML format -- we're trying to fit the flexibility of multiple tags into a single XHTML tag. It's doable, I think, but requires some thinking. Maybe that thinking has been done already?

There is an older DC-HTML spec (http://dublincore.org/documents/2003/11/30/dcq-html/) from 2003 that has the concept of "refinements". Refinements simply extend the name attribute with additional dots -- whatever follows the element name is the "refinement". So you might have:

<meta name="DC.date.publication" content="YYYY(-MM(-DD))" />

that allowed you to define the publication date.
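As a rough illustration of that dotted convention, here is a small Python sketch that splits a name value into prefix, element, and refinement. The function name and the decision to lowercase the prefix and element are assumptions for this example, not something the 2003 document prescribes.

```python
def split_meta_name(name):
    """Split 'DC.date.publication' into (prefix, element, refinement)."""
    parts = name.strip().split(".")
    prefix = parts[0].lower()
    element = parts[1].lower() if len(parts) > 1 else None
    # Everything after the element is kept as a single refinement string.
    refinement = ".".join(parts[2:]) or None
    return prefix, element, refinement


print(split_meta_name("DC.date.publication"))  # ('dc', 'date', 'publication')
print(split_meta_name("DC.title"))             # ('dc', 'title', None)
```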

Any thoughts or help?

m a r

DaleDe
12-25-2009, 08:44 PM
You might want to look at XMP. Here is a wiki on the topic.
http://www.w3.org/2008/WebVideo/Annotations/wiki/XMP

Dale

KevinH
12-26-2009, 02:02 PM
Hi,

I contributed the first version of the dc code to Sigil (with a lot of help from Valloric!) and I agree with your assessment of the 2008 spec. It is completely useless.

It seems to assume that the metadata in an xhtml/html document will always be embedded in a device that has live network access, and as such uses links to abstract classes to implement many of its standards.

This is a huge assumption! It means that no off-line reading will ever be done and that xhtml/html files are always and only served up by webservers.

I, of course, threw out the 2008 document as completely worthless and went with the 2003 document, since in all cases ebooks may be read off line and are certainly not being served up by a webserver (as in the case of epub). The funny thing is that the DC website points out that no one seems to want to implement the 2008 specs and that people are instead using the 2003 specs, and then complains about it. They really are clueless.

So I tried to implement their regular "dc" namespace and the "dcterms" namespace BUT ONLY where it overlaps with the epub standard.

I also assumed that case was not important in the "name" field but that it was relevant in the "content" and "scheme" fields. The problem is that many refinements are used and stored in the name field, so I had to work around that.

I also wanted to support free-form html metadata as generated from pml (eReader "er.pdb" books): Title, Author, Publisher, Copyright, and EISBN, as well as simplistic attempts by others where they overlapped with the epub metadata spec.

To give you some idea of the range of things supported, here is one of my test cases:

<meta name="Title" content="Test Case" />
<meta name="Author" content="Kevin Hendricks" />
<meta name="Copyright" content="Copyright &copy; 2005, 2006" />
<meta name="Publisher" content="My Super PublishingHouse" />
<meta name="EISBN" content="0-06-124666-2" />
<meta name="DC.contributor" content="Another me" />
<meta name="DC.contributor.aut" content="Another me1" />
<meta name="DC.contributor.arc" content="Another me2" />
<meta name="dc.date" content="2009-12-15" />
<meta name="dc.date.modified" content="2009-12-16" />
<meta name="DCTERMS.issued" content="2008-10-22" />
<meta name="dcterms.creator.aut" content="Another me3" />
<meta name="dc.identifier" scheme="ISSN" content="123456789" />
<meta name="dcterms.identifier.doi" content="987654321" />
<meta name="dc.identifier.lccs" content="123-123-123-123" />

Please note, the last line is a valid metadata identifier under DC but it will be ignored by Sigil since it is not one of their supported internal formats for identifiers.

Also note, that like you, I ignored ALL <link> fields since the book may be read off-line.

All of the rest do something.
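To make the test case a bit more concrete, here is a hedged Python sketch of how such tags could be collected with the standard library's html.parser. It is not Sigil's code; `MetaCollector` and `collect_meta` are hypothetical names, the name value is lowercased to reflect the case-insensitive assumption described above, and `<link>` elements are simply skipped.

```python
from html.parser import HTMLParser

SAMPLE_HEAD = """
<meta name="Title" content="Test Case" />
<meta name="DC.contributor.aut" content="Another me1" />
<meta name="dc.identifier" scheme="ISSN" content="123456789" />
<link rel="schema.DC" href="http://purl.org/dc/elements/1.1/" />
"""


class MetaCollector(HTMLParser):
    """Collect (name, scheme, content) from every named <meta> tag."""

    def __init__(self):
        super().__init__()
        self.entries = []

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return  # <link> and everything else is ignored
        attributes = dict(attrs)
        name = attributes.get("name")
        if name is None:
            return  # e.g. http-equiv metas carry no name
        # name is lowercased; scheme and content keep their case
        self.entries.append(
            (name.lower(), attributes.get("scheme"), attributes.get("content"))
        )


def collect_meta(markup):
    collector = MetaCollector()
    collector.feed(markup)
    collector.close()
    return collector.entries


for entry in collect_meta(SAMPLE_HEAD):
    print(entry)
```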

Another thing I have not supported yet (again because there was no place for it in the internal Sigil structure) is "refinements" on the "Relation" field such as "IsPartOf", "IsVersionOf", "IsFormatOf", "IsReferencedBy", "IsBasisFor", "IsBasedOn", and "Requires".

The **internal** structure of Sigil supports the following data items - Please note that everything from the metadata must be mapped to one of these to be supported. If not, it will be ignored since there was no place to store it internally inside Sigil (which focused specifically on the official epub standard for metadata)


Title
Author (or *any* of the Marc relator codes)
Subject
Descriptions
Publisher
Date of publication
Date of creation
Date of modification
Type
Format
Relation
Coverage
Rights
ID (must be one of DOI, ISBN, ISSN, or CustomID)

It sounds like Sigil has decided to add "published" as a dcterm to augment "issued", which makes a lot of sense, but it is not one of the official dcterms.

Please ask me and I can tell you what is supported and I would be happy to offer a patch to Sigil to support something that is very important to you as long as it fits with the epub metadata spec - and of course it is acceptable by the author of Sigil!!!

Hope this helps,

KevinH

rogue_ronin
12-26-2009, 07:32 PM
Kevin --

It's great to be able to comm with you. You've answered a bunch of questions, and of course I have more.

The 2008 document is an unreadable mess, full of jargon and ridiculous assumptions. Glad to hear about your use of the 2003 document, and your somewhat "hybridized" approach to solving the problem.

Beyond the DC, you seem to be supporting basic tags that you can assume folks might have: Publisher, Author, Editor, etc. Is that right? What tags are you supporting? I use my own set, that work specifically with my macro library and are automatically updated, but may not be universal enough -- example:


<!-- BEGIN: EBOOK META INFORMATION -->

<meta name="FileName" content="BookTitle.html" />
<meta name="FileID" content="xBook_00000099" />
<meta name="FileCreationDate" content="2009-12-09" />
<meta name="FileVersion" content="0.10" />
<meta name="FileRevisionDate" content="2009-12-09" />
<meta name="FileSource" content="Found Web Text" />
<meta name="FileScanner" content="Unknown" />
<meta name="FileProofer" content="FoundText" />
<meta name="FileComment" content="Created with xBook NoteTab Clip Library" />

<meta name="Title" content="Book Title" />
<meta name="SubTitle" content="Subtitle" />
<meta name="Series" content="Examples" />
<meta name="SeriesNumber" content="16" />

<meta name="Author" content="First·Middle·LastName·" />
<meta name="Illustrator" content="First·Middle·LastName·" />
<meta name="Editor" content="First·Middle·LastName·" />
<meta name="CoverDesigner" content="Name" />

<meta name="Genre" content="Howto" />
<meta name="ISBN" content="123456789X" />
<meta name="Language" content="en" />
<meta name="Description" content="An example document for discussing metadata." />
<meta name="Keywords" content="metadata,example" />

<meta name="Publisher" content="FoundText" />
<meta name="PublicationDate" content="2009-12" />
<meta name="PublicationCity" content="Honolulu" />

<meta name="CopyrightHolder" content="none" />
<meta name="CopyrightDate" content="none" />
<meta name="CopyrightLicense" content="Public Domain" />

<!-- END: EBOOK META INFORMATION -->



Which of these would you catch with Sigil?

Now, even if you catch some of it, I don't expect to be able to use it as is. I've already explored some changes:


<!-- BEGIN: EBOOK META INFORMATION -->

<meta name="FileName" content="BookTitle.html" />
<meta name="DC.identifier" scheme="xBook" content="xBook_00000099" />
<meta name="DC.date.created" content="2009-12-09" />
<meta name="FileVersion" content="0.10" />
<meta name="DC.date.modified" content="2009-12-09" />
<meta name="DC.source" content="Found Web Text" />
<meta name="FileScanner" content="Unknown" />
<meta name="DC.contributor.pfr" content="FoundText" />
<meta name="FileComment" content="Created with xBook NoteTab Clip Library" />

<meta name="DC.title" content="Book Title" />
<meta name="SubTitle" content="Subtitle" />
<meta name="Series" content="Examples" />
<meta name="SeriesNumber" content="16" />

<meta name="DC.creator.aut" content="First·Middle·LastName·" />
<meta name="DC.contributor.ill" content="First·Middle·LastName·" />
<meta name="DC.contributor.edt" content="First·Middle·LastName·" />
<meta name="DC.contributor.cov" content="Name" />

<meta name="DC.type" content="Howto" />
<meta name="DC.identifier" scheme="ISBN" content="123456789X" />
<meta name="DC.language" content="en" />
<meta name="DC.description" content="An example document for discussing metadata." />
<meta name="DC.subject" content="metadata,example" />

<meta name="DC.publisher" content="FoundText" />
<meta name="DC.date.published" content="2009-12" />
<meta name="PublicationCity" content="Honolulu" />

<meta name="CopyrightHolder" content="none" />
<meta name="DC.date.copyrighted" content="none" />
<meta name="DC.rights" content="Public Domain" />

<!-- END: EBOOK META INFORMATION -->


Which should mostly work with Sigil, right? The only meta-conflict I can see is DC.date.copyrighted which is not mentioned in the DC-HTML docs, but seems useful to me.

The name layout also might not be good -- I use an explicit marker for separating out first-middle-last. I do this so that I can generate names in the format LastName, First Middle or (as my macro library does file management) to create folders named LastName_First_Middle. How do you support the "file-as" attribute of the ePub spec?
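For what it's worth, the two conversions described here are only a few lines each. This is a Python sketch rather than the NoteTab macros themselves; the function names are made up, and it assumes the separator is the middle dot (U+00B7) and that the last marked part is always the family name.

```python
SEPARATOR = "\u00b7"  # the middle dot used as an explicit name-part marker


def name_parts(marked):
    """Split a marked name into parts, dropping the empty trailing piece."""
    return [part for part in marked.split(SEPARATOR) if part]


def file_as(marked):
    """'First.Middle.LastName.' (dot-marked) -> 'LastName, First Middle'."""
    parts = name_parts(marked)
    if len(parts) < 2:
        return " ".join(parts)
    return "%s, %s" % (parts[-1], " ".join(parts[:-1]))


def folder_name(marked):
    """Dot-marked name -> 'LastName_First_Middle' for use as a folder."""
    parts = name_parts(marked)
    return "_".join(parts[-1:] + parts[:-1])


sample = SEPARATOR.join(["First", "Middle", "LastName", ""])
print(file_as(sample))      # LastName, First Middle
print(folder_name(sample))  # LastName_First_Middle
```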

I'm also of a mind that versioning info might be good. You'll note that I keep info about the file itself, and info about the document contents, somewhat distinguished. May not suit Sigil though...

What do you think about something like:


<meta name="DC.relation.series" content="Examples" />
<meta name="DC.relation.series.number" content="16" />


or


<meta name="DC.title.sub" content="Subtitle" />
?

The two DC terms that you don't say you support are: Language, and Source. Would you consider them? Or are they unnecessary?

Also, you wrote:
> ID (must be one of DOI, ISBN, ISSN, or CustomID)

What is CustomID? Would my scheme="xBook" be okay?

And finally, the namespace can be either DC or DCTERMS, correct?

Hope that's not too much, thanks!

m a r

Valloric
12-26-2009, 08:20 PM
> Another thing I have not supported yet (again because there was no place for it in the internal Sigil structure) is "refinements" on the "Relation" field such as "IsPartOf", "IsVersionOf", "IsFormatOf", "IsReferencedBy", "IsBasisFor", "IsBasedOn", and "Requires".

All of these will not be implemented in Sigil since there's no way to export them to epub [see EDIT]. The spec doesn't support them.

And no, I don't like the idea of custom meta elements. EDIT: I still don't like this idea... but on further reflection I believe it may be wise to allow pass-through of unsupported <meta> elements and their attributes.

So if there's a meta element in an OPF or HTML that Sigil can't map to something epub-specific, it should be stored as such and then exported in the final epub's OPF without harm.
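A pass-through along those lines could look roughly like the following Python sketch. It is only an illustration of the idea, not Sigil's implementation: `partition_meta`, `opf_meta_line`, and the set of recognised names are all hypothetical.

```python
from xml.sax.saxutils import quoteattr


def partition_meta(entries, recognised_names):
    """Split (name, content) pairs into mapped and pass-through lists."""
    mapped, passthrough = [], []
    for name, content in entries:
        if name.lower() in recognised_names:
            mapped.append((name, content))
        else:
            passthrough.append((name, content))
    return mapped, passthrough


def opf_meta_line(name, content):
    """Render one pass-through entry as a generic OPF <meta> element."""
    return "<meta name=%s content=%s />" % (quoteattr(name), quoteattr(content))


# Hypothetical recognised set; a real tool would have a much longer one.
RECOGNISED = {"dc.title", "dc.language"}

entries = [("DC.title", "Book Title"), ("SubTitle", "Subtitle"), ("Series", "Examples")]
mapped, passthrough = partition_meta(entries, RECOGNISED)
for name, content in passthrough:
    print(opf_meta_line(name, content))
```

Unrecognised entries keep their original name and content, so nothing is lost on export even though nothing interprets them.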


> The **internal** structure of Sigil supports the following data items - Please note that everything from the metadata must be mapped to one of these to be supported. If not, it will be ignored since there was no place to store it internally inside Sigil (which focused specifically on the official epub standard for metadata)

Again, only the metadata that can be represented in an epub document is supported by Sigil. Anything else is unsupported. [see above]


> It sounds like Sigil has decided to add "published" as a dcterm to augment "issued", which makes a lot of sense, but it is not one of the official dcterms.

Yes, I know, but someone's going to type it in eventually and expect it to be recognized. And you can't blame them: "created" for creation, "modified" for modification but "issued" for publication?

Someone's going to use it, it will fail, and then I'll get an issue report. May as well just preempt them.

KevinH
12-26-2009, 10:07 PM
Kevin --

> Beyond the DC, you seem to be supporting basic tags that you can assume folks might have: Publisher, Author, Editor, etc. Is that right?

Actually, I only support a few of the most basic, and these must map to the epub standard that Sigil uses internally.

The free-form ones I handle are:

"Title"
"Author"
"Subject"
"Description"
"Publisher"
"Date of publication"
"Date of creation"
"Date of modification"
"Type"
"Format"
"Relation"
"Coverage"
"Rights"
"Copyright"
"EISBN"
"ISSN"
"ISBN"
"CustomID"
"DOI"


The only reason these are supported and not others is that they map to the epub standard directly.
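To show what "map directly" could mean in practice, here is a small Python lookup sketch. The table is an assumption pieced together from the names discussed in this thread, it covers only part of the list, and it is not taken from Sigil's source; `to_dc_name` is a hypothetical helper, and treating the free-form names as case-insensitive is also an assumption.

```python
# Assumed mapping from free-form names to dotted DC names (partial).
FREE_FORM_TO_DC = {
    "title": "DC.title",
    "author": "DC.creator.aut",
    "subject": "DC.subject",
    "description": "DC.description",
    "publisher": "DC.publisher",
    "rights": "DC.rights",
    "copyright": "DC.rights",
    "isbn": "DC.identifier.ISBN",
    "issn": "DC.identifier.ISSN",
    "doi": "DC.identifier.DOI",
}


def to_dc_name(free_form_name):
    """Return the dotted DC name for a free-form name, or None if unknown."""
    return FREE_FORM_TO_DC.get(free_form_name.strip().lower())


print(to_dc_name("Author"))
print(to_dc_name("FileScanner"))
```

Anything that returns None here is the kind of entry that would currently be ignored.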

Your example has a much larger set, most of which would be ignored as things currently stand.



> Now, even if you catch some of it, I don't expect to be able to use it as is. I've already explored some changes:

<meta name="DC.date.created" content="2009-12-09" />
<meta name="DC.date.modified" content="2009-12-09" />
<meta name="DC.contributor.pfr" content="FoundText" />
<meta name="DC.title" content="Book Title" />
<meta name="DC.creator.aut" content="First·Middle·LastName·" />
<meta name="DC.contributor.ill" content="First·Middle·LastName·" />
<meta name="DC.contributor.edt" content="First·Middle·LastName·" />
<meta name="DC.contributor.cov" content="Name" />

<meta name="DC.type" content="Howto" />
<meta name="DC.identifier" scheme="ISBN" content="123456789X" />
<meta name="DC.language" content="en" />
<meta name="DC.description" content="An example document for discussing metadata." />
<meta name="DC.subject" content="metadata,example" />

<meta name="DC.publisher" content="FoundText" />
<meta name="DC.rights" content="Public Domain" />

Those are the ones in your second example that work now (actually the "published" one would work too, given Valloric's change).

> How do you support the "file-as" attribute of the ePub spec?

It is simply used to replace the person's name in the epub spec if it is present. It is not supported in the html dc set since it is not part of the dc spec.

> What do you think about this


<meta name="DC.relation.series" content="Examples" />
<meta name="DC.relation.series.number" content="16" />


or


<meta name="DC.title.sub" content="Subtitle" />
?


I only handle single level refinements right now. Multiple layers of refinements are not part of the spec as far as I could determine.

> The two DC terms that you don't say you support are: Language, and Source. Would you consider them? Or are they unnecessary?

dc.language is supported, it is basic and handled in a pull down in Sigil.

> What is CustomID? Would my scheme="xBook" be okay?

I think those are actually ignored in the final run but Valloric would be the best one to ask that.

> And finally, the namespace can be either DC or DCTERMS, correct?

Yes, where they overlap, but if it is something specific to the DCTERMS namespace it is better to prefix it with DCTERMS (for example, dcterms.modified for the date of modification).


m a r

You have a long list and many of the items you have would be interesting extensions. Right now, I have only focused on handling the epub spec.

The epub spec does allow for non dc to be passed through but handling that would be completely up to Valloric to decide if and how he wants it.

If he says "yes" I would be happy to take a shot at working with you to come up with a reasonable list of non-dc and extended dc terms to handle.

Take care,

KevinH

rogue_ronin
12-26-2009, 10:35 PM
> All of these will not be implemented in Sigil since there's no way to export them to epub [see EDIT]. The spec doesn't support them.

> And no, I don't like the idea of custom meta elements. EDIT: I still don't like this idea... but on further reflection I believe it may be wise to allow pass-through of unsupported <meta> elements and their attributes.

> So if there's a meta element in an OPF or HTML that Sigil can't map to something epub-specific, it should be stored as such and then exported in the final epub's OPF without harm.

I think I've been focused on how an ePub file fits into a collection of such files, and how other software might use the metadata in ways that are unnecessary to the simple (!) production of an individual ebook.

Sigil doesn't need to support that metadata; it's irrelevant. But is there a way to encode that extra metadata in a way that is consistent with the DC terms? The ePub spec uses both <dc:term> and <opf:term>; is there another set of terms that could be used to code these other metadata?

What Kevin wrote also seems to suggest that there are many "common" non-spec meta elements that can be mapped to the spec. It seems to me that something like using the relation tag for series is within the spec (at least as given in 2003).

I'm both thinking "out loud" here about how to encode the other metadata, and trying to nail down what code/terms exactly Sigil is currently supporting because it seems to be the first application that is taking such metadata in XHTML into account.

An explicit list of such terms would be useful to those of us who encode primarily in XHTML first and then convert. At some point, converters like Calibre may be able to find it too.

m a r

ps: Yesterday, for example, I imported 181 Doc Savage HTML files into Calibre as an experiment. It didn't even search the <title> tag to find the title. Just took the title from the file name. Now there may be a setting for that, I'll have to look again, but it seems obvious to me to look for metadata when importing HTML.

rogue_ronin
12-27-2009, 12:08 AM
>The free-form ones I handle are:

"Title"
"Author"
"Subject"
"Description"
"Publisher"
"Date of publication"
"Date of creation"
"Date of modification"
"Type"
"Format"
"Relation"
"Coverage"
"Rights"
"Copyright"
"EISBN"
"ISSN"
"ISBN"
"CustomID"
"DOI"

> The only reason these are supported and not others is that they map to the epub standard directly.

Sure thing. And they at least somewhat match what any lay-person might put into their metadata.

I assume that they're in the standard <meta> format.

> Your example has a much larger set, most of which would be ignored as things currently stand.

Yeah, I'm not surprised. I tend to overdo it. Some of the metadata that I store is for use in a local flat-file database, and a library structure. Storing it in the document itself seems the most logical -- it lets me keep the master list much smaller and more manageable (holding only the ebook location, the UniqueID, the title and the author.) Whenever I open a book to edit, I just parse out the information from the head.


<meta name="DC.date.created" content="2009-12-09" />
<meta name="DC.date.modified" content="2009-12-09" />
<meta name="DC.contributor.pfr" content="FoundText" />
<meta name="DC.title" content="Book Title" />
<meta name="DC.creator.aut" content="First·Middle·LastName·" />
<meta name="DC.contributor.ill" content="First·Middle·LastName·" />
<meta name="DC.contributor.edt" content="First·Middle·LastName·" />
<meta name="DC.contributor.cov" content="Name" />
<meta name="DC.type" content="Howto" />
<meta name="DC.identifier" scheme="ISBN" content="123456789X" />
<meta name="DC.language" content="en" />
<meta name="DC.description" content="An example document for discussing metadata." />
<meta name="DC.subject" content="metadata,example" />
<meta name="DC.publisher" content="FoundText" />
<meta name="DC.rights" content="Public Domain" />

> Those are the ones in your second example that work now (actually the "published" one would work too, given Valloric's change).

So,

<meta name="DC.date.published" content="2009-12" />

is okay now too. Good.

As for,

<meta name="DC.date.copyrighted" content="none" />

I get it -- it is valid by the 2003 DC-HTML spec, but doesn't map to the ePub spec.

But,

<meta name="DC.source" content="Found Web Text" />

should be okay, shouldn't it? It's both DC and ePub spec, right?

And

<meta name="DC.identifier" scheme="xBook" content="xBook_00000099" />

is a CustomID isn't it? Or are you hardcoding for the text CustomID?


> How do you support the "file-as" attribute of the ePub spec?

> It is simply used to replace the person's name in the epub spec if it is present. It is not supported in the html dc set since it is not part of the dc spec.


I think it's there to pre-process complicated names, so that machines can sort properly. It doesn't fully replace the name, I think. But I see that there seems to be no obvious way to encode it in DC terms. Should you support the OPF terms of the ePub spec? Maybe something like:

<meta name="DC.creator" opf:role="aut" opf:file-as="LastName, First Middle" content="First Middle LastName" />

It'd simplify the "refinements" thing, maybe? Does it break XHTML?

> I only handle single level refinements right now. Multiple layers of refinements are not part of the spec as far as I could determine.

It's mentioned somewhere, in the 2008 spec I think, that any further refinements simply become part of the first refinement -- but who knows for sure, as the liquid was sucked from my body while reading it.

As I think about this, it seems that the limitations on both the DC and ePub specs keep coming into play. It'd be nice if someone had a more thorough and more specifically ebook spec.

> What is CustomID? Would my scheme="xBook" be okay?

> I think those are actually ignored in the final run but Valloric would be the best one to ask that.

I haven't yet managed to successfully build an ePub by hand (well, by macro), but I think that the Unique ID can be anything.

> And finally, the namespace can be either DC or DCTERMS, correct?

> Yes, where they overlap, but if it is something specific to the DCTERMS namespace it is better to prefix it with DCTERMS (for example, dcterms.modified for the date of modification).

Hmmm, what are the extra terms? (I ask you so that I can retain my liquid essence by avoiding the 2008 spec. I'm selfish like that.)

> You have a long list and many of the items you have would be interesting extensions. Right now, I have only focused on handling the epub spec.

> The epub spec does allow for non dc to be passed through but handling that would be completely up to Valloric to decide if and how he wants it.

> If he says "yes" I would be happy to take a shot at working with you to come up with a reasonable list of non-dc and extended dc terms to handle.

Be glad to help if it ends up being a good idea. You've certainly helped me to think out my stuff.

Next post, I'll try to assemble a simple list of valid DC meta terms that can be used when hand-coding prior to importing to Sigil.

Thanks!

m a r

rogue_ronin
12-27-2009, 12:54 AM
Title
<meta name="Title" content="TITLE" />
<meta name="DC.title" content="TITLE" />
<meta name="DCTERMS.title" content="TITLE" />

Author
<meta name="Author" content="NAME" />
<meta name="DC.creator.aut" content="NAME" />
<meta name="DCTERMS.creator.aut" content="NAME" />

Subject
<meta name="Subject" content="KEYWORD(S)" />
<meta name="DC.subject" content="KEYWORD(S)" />
<meta name="DCTERMS.subject" content="KEYWORD(S)" />

Description
<meta name="Description" content="DESCRIPTION OF CONTENT" />
<meta name="DC.description" content="DESCRIPTION OF CONTENT" />
<meta name="DCTERMS.description" content="DESCRIPTION OF CONTENT" />

Publisher
<meta name="Publisher" content="PUBLISHER DATA" />
<meta name="DC.publisher" content="PUBLISHER DATA" />
<meta name="DCTERMS.publisher" content="PUBLISHER DATA" />

Publication Date
<meta name="Date of publication" content="YYYY(-MM(-DD))" />
<meta name="DC.date.published" content="YYYY(-MM(-DD))" />
<meta name="DC.date.publication" content="YYYY(-MM(-DD))" />
<meta name="DC.date.issued" content="YYYY(-MM(-DD))" />
<meta name="DCTERMS.issued" content="YYYY(-MM(-DD))" />

Creation Date
<meta name="Date of creation" content="YYYY(-MM(-DD))" />
<meta name="DC.date.created" content="YYYY(-MM(-DD))" />
<meta name="DC.date.creation" content="YYYY(-MM(-DD))" />
<meta name="DCTERMS.created" content="YYYY(-MM(-DD))" />

Modification Date
<meta name="Date of modification" content="YYYY(-MM(-DD))" />
<meta name="DC.date.modified" content="YYYY(-MM(-DD))" />
<meta name="DC.date.modification" content="YYYY(-MM(-DD))" />
<meta name="DCTERMS.modified" content="YYYY(-MM(-DD))" />

Type
<meta name="Type" content="GENRE or CLASSIFICATION" />
<meta name="DC.type" content="GENRE or CLASSIFICATION" />
<meta name="DCTERMS.type" content="GENRE or CLASSIFICATION" />

Format
<meta name="Format" content="MEDIA/FILE TYPE" />
<meta name="DC.format" content="MEDIA/FILE TYPE" />
<meta name="DCTERMS.format" content="MEDIA/FILE TYPE" />

Relation
<meta name="Relation" content="RELATED RESOURCE" />
<meta name="DC.relation" content="RELATED RESOURCE" />
<meta name="DCTERMS.relation" content="RELATED RESOURCE" />

Coverage
<meta name="Coverage" content="TIME, SPACE or OTHER SPAN" />
<meta name="DC.coverage" content="TIME, SPACE or OTHER SPAN" />
<meta name="DCTERMS.coverage" content="TIME, SPACE, or OTHER SPAN" />

Rights
<meta name="Rights" content="COPYRIGHT STATUS" />
<meta name="Copyright" content="COPYRIGHT STATUS" />
<meta name="DC.rights" content="COPYRIGHT STATUS" />
<meta name="DCTERMS.rights" content="COPYRIGHT STATUS" />

Language
<meta name="DC.language" content="TWO-LETTER LANGUAGE CODE" />
<meta name="DCTERMS.language" content="TWO-LETTER LANGUAGE CODE" />

Source
<meta name="Source" content="SOURCE DERIVED FROM" />
<meta name="DC.source" content="SOURCE DERIVED FROM" />
<meta name="DCTERMS.source" content="SOURCE DERIVED FROM" />

EISBN
<meta name="EISBN" content="EISBN CODE" />
<meta name="DC.identifier" scheme="EISBN" content="EISBN CODE" />
<meta name="DC.identifier.EISBN" content="EISBN CODE" />
<meta name="DCTERMS.identifier.EISBN" content="EISBN CODE" />
<meta name="DCTERMS.identifier" scheme="EISBN" content="EISBN CODE" />

ISSN
<meta name="ISSN" content="ISSN CODE" />
<meta name="DC.identifier" scheme="ISSN" content="ISSN CODE" />
<meta name="DC.identifier.ISSN" content="ISSN CODE" />
<meta name="DCTERMS.identifier.ISSN" content="ISSN CODE" />
<meta name="DCTERMS.identifier" scheme="ISSN" content="ISSN CODE" />

ISBN
<meta name="ISBN" content="ISBN CODE" />
<meta name="DC.identifier" scheme="ISBN" content="ISBN CODE" />
<meta name="DC.identifier.ISBN" content="ISBN CODE" />
<meta name="DCTERMS.identifier.ISBN" content="ISBN CODE" />
<meta name="DCTERMS.identifier" scheme="ISBN" content="ISBN CODE" />

CustomID
<meta name="CustomID" content="CustomID CODE" />
<meta name="DC.identifier" scheme="CustomID" content="CustomID CODE" />
<meta name="DC.identifier.CustomID" content="CustomID CODE" />
<meta name="DCTERMS.identifier.CustomID" content="CustomID CODE" />
<meta name="DCTERMS.identifier" scheme="CustomID" content="CustomID CODE" />

DOI
<meta name="DOI" content="DOI CODE" />
<meta name="DC.identifier" scheme="DOI" content="DOI CODE" />
<meta name="DC.identifier.DOI" content="DOI CODE" />
<meta name="DCTERMS.identifier.DOI" content="DOI CODE" />
<meta name="DCTERMS.identifier" scheme="DOI" content="DOI CODE" />

==================

Any additional creator or contributor may be added using one of the more than 200 MARC Relator Codes (http://www.loc.gov/marc/relators/relacode.html):

Illustrator
<meta name="DC.creator.ill" content="NAME" />

Proofreader
<meta name="DC.contributor.pfr" content="NAME" />

Editor
<meta name="DC.contributor.edt" content="NAME" />

Cover Designer
<meta name="DC.contributor.cov" content="NAME" />

==================

Please comment and correct: I will update this entry.

Thanks,

m a r

KevinH
12-27-2009, 01:06 AM
Hi,

Some quick answers to your questions before I head to sleep ...

Yes, I forgot about "Source" and yes it is supported both as a free-form tag and as DC.source.

Language is only supported as a DC.language and not as free-form right now.

Under xhtml the meta tag only has the following allowed fields as far as I know: "name", "content", "scheme", and "http-equiv" along with the following attributes: "dir", "lang", and "xml:lang" so I do not think it is proper xhtml to use opf:role or any of the other field values inside a meta tag and that is why DC came up with refinements.
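A quick way to spot attributes outside that set is sketched below in Python. It uses the tolerant html.parser from the standard library, so it is not real XHTML validation, and the allowed set is just the list given above taken at face value; the class and function names are hypothetical.

```python
from html.parser import HTMLParser

# The attribute list from the paragraph above, taken at face value.
ALLOWED_META_ATTRIBUTES = {
    "name", "content", "scheme", "http-equiv", "dir", "lang", "xml:lang",
}


class MetaAttributeChecker(HTMLParser):
    """Record every <meta> attribute that is not in the allowed set."""

    def __init__(self):
        super().__init__()
        self.unexpected = []

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        for attribute, _value in attrs:
            if attribute not in ALLOWED_META_ATTRIBUTES:
                self.unexpected.append(attribute)


def unexpected_meta_attributes(markup):
    checker = MetaAttributeChecker()
    checker.feed(markup)
    checker.close()
    return checker.unexpected


tag = ('<meta name="DC.creator" opf:role="aut" '
       'opf:file-as="LastName, First Middle" content="First Middle LastName" />')
print(unexpected_meta_attributes(tag))
```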

The full set of DCTERMS is given by:
http://dublincore.org/documents/dcmi-terms/

So when you design your approach, we should probably try to map as many things to official dc. and dcterms. as possible since those could be passed through using the same dublin core schema and link.

"CustomID" is the exact string that must be used in the scheme of the Identifier. Using refinements in xhtml it might look like the following:

name="DC.identifier.CustomID" content="123456789101234"

inside the meta tag for example.

Hope something here helps.

Kevin

rogue_ronin
12-27-2009, 01:44 AM
Cool. I've updated the post, hopefully someone will find it useful.

Note that I've used both the verb and the noun form of create, modify and publish. In the Sigil forum, Valloric indicated he would support both. If that's no longer correct, let me know.
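Accepting both forms amounts to an alias table. Here is a Python sketch of the idea; the canonical event names ("publication", "creation", "modification") and the function name `date_event` are assumptions for illustration, not a statement of how Sigil stores them.

```python
# Verb and noun forms, plus the official DCTERMS "issued", fold together.
DATE_EVENT_ALIASES = {
    "published": "publication",
    "publication": "publication",
    "issued": "publication",
    "created": "creation",
    "creation": "creation",
    "modified": "modification",
    "modification": "modification",
}


def date_event(name):
    """Map a meta name such as 'DC.date.published' to a canonical event."""
    parts = name.lower().split(".")
    if parts[0] == "dcterms" and len(parts) == 2:
        return DATE_EVENT_ALIASES.get(parts[1])
    if parts[:2] == ["dc", "date"] and len(parts) == 3:
        return DATE_EVENT_ALIASES.get(parts[2])
    return None  # plain dc.date, or not a date at all


for sample in ("DC.date.published", "DC.date.publication", "DCTERMS.issued", "dc.date"):
    print(sample, "->", date_event(sample))
```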

I checked the <meta> tag at w3schools right after posting; you're correct. No way to sneak that in as an attribute.

But how about this?


<meta name="OPF.file-as" content="LastName, First Middle" />


You'd only need one, presumably, matching the primary creator. It's not perfectly to spec, though, as each creator and contributor is allowed to have that attribute.

Or could OPF become a scheme? ie:

<meta name="DC.creator.aut" content="First Middle LastName" />
<meta name="DC.creator.aut" scheme="OPF:file-as" content="LastName First Middle" />


Since file-as and name display are unlikely to get mixed up, you could have a double-entry for each creator or contributor.

At this point, we've drifted beyond Sigil. I'm getting a little loopy, too.

m a r

Valloric
12-27-2009, 08:40 AM
Cool. I've updated the post, hopefully someone will find it useful.

I find it very useful. Do you mind if I use this one day when I find the time to write a manual for Sigil?


Note that I've used both the verb and the noun form of create, modify and publish. In the Sigil forum, Valloric indicated he would support both. If that's no longer correct, let me know.

This is already (http://code.google.com/p/sigil/source/detail?r=741ab1c414816159d405d845527b39536759a6e2# ) in the trunk.


I checked the <meta> tag at w3schools right after posting; you're correct. No way to sneak that in as an attribute.

But how about this?


<meta name="OPF.file-as" content="LastName, First Middle" />
You'd only need one, presumably, matching the primary creator. It's not perfectly to spec, though, as each creator and contributor is allowed to have that attribute.


Yeah, this can't work. We need to have creator or contributor specified. No guessing.


Or could OPF become a scheme? ie:

<meta name="DC.creator.aut" content="First Middle LastName" />
<meta name="DC.creator.aut" scheme="OPF:file-as" content="LastName First Middle" />
Since file-as and name display are unlikely to get mixed up, you could have a double-entry for each creator or contributor.

This too is a bad idea. I could make Sigil recognize that these two tags actually specify one author, but every Reading System on the planet is going to read it as two. Using just the second meta tag could work though.

In any event, as I've said, Sigil should store any meta tags it can't recognize and convert to DC. These should then be exported to the OPF as they were, that is, bare meta tags (the spec supports this).
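
The store-and-pass-through idea above can be sketched roughly as follows. This is a minimal Python illustration, assuming a small made-up RECOGNIZED set and a hypothetical MetaCollector class; it is not how Sigil is actually implemented.

```python
from html.parser import HTMLParser

# Assumption: a tiny illustrative subset of names an editor might recognize.
RECOGNIZED = {"dc.title", "dc.creator.aut", "dc.language", "dc.identifier"}


class MetaCollector(HTMLParser):
    """Collect <meta name=... content=...> pairs from an (X)HTML head."""

    def __init__(self):
        super().__init__()
        self.recognized = []    # tags that would be converted to DC elements
        self.passthrough = []   # tags kept as-is and written out as bare meta

    def handle_starttag(self, tag, attrs):
        # Self-closing tags such as <meta ... /> also arrive here by default.
        if tag != "meta":
            return
        attr = dict(attrs)
        name = attr.get("name")
        if name is None:
            return  # e.g. http-equiv tags, which are out of scope here
        pair = (name, attr.get("content"))
        if name.lower() in RECOGNIZED:
            self.recognized.append(pair)
        else:
            self.passthrough.append(pair)


collector = MetaCollector()
collector.feed(
    '<head><meta name="DC.title" content="TITLE" />'
    '<meta name="OPF.file-as" content="LastName, First Middle" /></head>'
)
print(collector.recognized)
print(collector.passthrough)
```

Nothing is dropped in this sketch: anything that is not recognized simply lands in the pass-through list unchanged.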

KevinH
12-27-2009, 12:17 PM
Title
<meta name="Title" content="TITLE" />
<meta name="DC.title" content="TITLE" />
<meta name="DCTERMS.title" content="TITLE" />


Author
<meta name="Author" content="NAME" />
<meta name="DC.creator.aut" content="NAME" />
<meta name="DCTERMS.creator.aut" content="NAME" />

Subject
<meta name="Subject" content="KEYWORD(S)" />
<meta name="DC.subject" content="KEYWORD(S)" />
<meta name="DCTERMS.subject" content="KEYWORD(S)" />

Description
<meta name="Description" content="DESCRIPTION OF CONTENT" />
<meta name="DC.description" content="DESCRIPTION OF CONTENT" />
<meta name="DCTERMS.description" content="DESCRIPTION OF CONTENT" />

Publisher
<meta name="Publisher" content="PUBLISHER DATA" />
<meta name="DC.publisher" content="PUBLISHER DATA" />
<meta name="DCTERMS.publisher" content="PUBLISHER DATA" />

Publication Date
<meta name="Date of publication" content="YYYY(-MM(-DD))" />
<meta name="DC.date.published" content="YYYY(-MM(-DD))" />
<meta name="DC.date.publication" content="YYYY(-MM(-DD))" />
<meta name="DC.date.issued" content="YYYY(-MM(-DD))" />
<meta name="DCTERMS.issued" content="YYYY(-MM(-DD))" />

Creation Date
<meta name="Date of creation" content="YYYY(-MM(-DD))" />
<meta name="DC.date.created" content="YYYY(-MM(-DD))" />
<meta name="DC.date.creation" content="YYYY(-MM(-DD))" />
<meta name="DCTERMS.created" content="YYYY(-MM(-DD))" />

Modification Date
<meta name="Date of modification" content="YYYY(-MM(-DD))" />
<meta name="DC.date.modified" content="YYYY(-MM(-DD))" />
<meta name="DC.date.modification" content="YYYY(-MM(-DD))" />
<meta name="DCTERMS.modified" content="YYYY(-MM(-DD))" />

Type
<meta name="Type" content="GENRE or CLASSIFICATION" />
<meta name="DC.type" content="GENRE or CLASSIFICATION" />
<meta name="DCTERMS.type" content="GENRE or CLASSIFICATION" />

Format
<meta name="Format" content="MEDIA/FILE TYPE" />
<meta name="DC.format" content="MEDIA/FILE TYPE" />
<meta name="DCTERMS.format" content="MEDIA/FILE TYPE" />

Relation
<meta name="Relation" content="RELATED RESOURCE" />
<meta name="DC.relation" content="RELATED RESOURCE" />
<meta name="DCTERMS.relation" content="RELATED RESOURCE" />

Coverage
<meta name="Coverage" content="TIME, SPACE or OTHER SPAN" />
<meta name="DC.coverage" content="TIME, SPACE or OTHER SPAN" />
<meta name="DCTERMS.coverage" content="TIME, SPACE, or OTHER SPAN" />

Rights
<meta name="Rights" content="COPYRIGHT STATUS" />
<meta name="Copyright" content="COPYRIGHT STATUS" />
<meta name="DC.rights" content="COPYRIGHT STATUS" />
<meta name="DCTERMS.rights" content="COPYRIGHT STATUS" />

Language
<meta name="DC.language" content="TWO-LETTER LANGUAGE CODE" />
<meta name="DCTERMS.language" content="TWO-LETTER LANGUAGE CODE" />

Source
<meta name="Source" content="SOURCE DERIVED FROM" />
<meta name="DC.source" content="SOURCE DERIVED FROM" />
<meta name="DCTERMS.source" content="SOURCE DERIVED FROM" />

EISBN
<meta name="EISBN" content="EISBN CODE" />
<meta name="DC.identifier" scheme="EISBN" content="EISBN CODE" />
<meta name="DC.identifier.EISBN" content="EISBN CODE" />
<meta name="DCTERMS.identifier.EISBN" content="EISBN CODE" />
<meta name="DCTERMS.identifier" scheme="EISBN" content="EISBN CODE" />

ISSN
<meta name="ISSN" content="ISSN CODE" />
<meta name="DC.identifier" scheme="ISSN" content="ISSN CODE" />
<meta name="DC.identifier.ISSN" content="ISSN CODE" />
<meta name="DCTERMS.identifier.ISSN" content="ISSN CODE" />
<meta name="DCTERMS.identifier" scheme="ISSN" content="ISSN CODE" />

ISBN
<meta name="ISBN" content="ISBN CODE" />
<meta name="DC.identifier" scheme="ISBN" content="ISBN CODE" />
<meta name="DC.identifier.ISBN" content="ISBN CODE" />
<meta name="DCTERMS.identifier.ISBN" content="ISBN CODE" />
<meta name="DCTERMS.identifier" scheme="ISBN" content="ISBN CODE" />

CustomID
<meta name="CustomID" content="CustomID CODE" />
<meta name="DC.identifier" scheme="CustomID" content="CustomID CODE" />
<meta name="DC.identifier.CustomID" content="CustomID CODE" />
<meta name="DCTERMS.identifier.CustomID" content="CustomID CODE" />
<meta name="DCTERMS.identifier" scheme="CustomID" content="CustomID CODE" />

DOI
<meta name="DOI" content="DOI CODE" />
<meta name="DC.identifier" scheme="DOI" content="DOI CODE" />
<meta name="DC.identifier.DOI" content="DOI CODE" />
<meta name="DCTERMS.identifier.DOI" content="DOI CODE" />
<meta name="DCTERMS.identifier" scheme="DOI" content="DOI CODE" />

==================

Any additional creator or contributor may be added using the over 200 MARC Relator Codes (http://www.loc.gov/marc/relators/relacode.html):

Illustrator
<meta name="DC.creator.ill" content="NAME" />

Proofreader
<meta name="DC.contributor.pfr" content="NAME" />

Editor
<meta name="DC.contributor.edt" content="NAME" />

Cover Designer
<meta name="DC.contributor.cov" content="NAME" />

==================

Please comment and correct: I will update this entry.

Thanks,

m a r

Hi,

I edited the list above to include the DCTERMS. namespace where it overlaps with the epub spec as well as use of refinements to hold schemes for identifiers.

As it stands now, EISBN is mapped to ISBN internally so using both at the same time would probably not be a good idea.

Hope this helps,

Kevin
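
Since the list above gives several spellings for the same piece of metadata, a lookup table is one way to fold them together. The following Python sketch is only an illustration: the ALIASES table covers just a few of the entries, the (element, event) labels and the case-insensitive matching are assumptions, and canonical is a hypothetical helper name.

```python
# Assumption: (element, event) pairs are illustrative labels, loosely modeled
# on the DC element plus the optional event attribute for dates.
ALIASES = {
    ("title", None): ["Title", "DC.title", "DCTERMS.title"],
    ("rights", None): ["Rights", "Copyright", "DC.rights", "DCTERMS.rights"],
    ("date", "publication"): [
        "Date of publication",
        "DC.date.published",
        "DC.date.publication",
        "DC.date.issued",
        "DCTERMS.issued",
    ],
    ("date", "creation"): [
        "Date of creation",
        "DC.date.created",
        "DC.date.creation",
        "DCTERMS.created",
    ],
    ("date", "modification"): [
        "Date of modification",
        "DC.date.modified",
        "DC.date.modification",
        "DCTERMS.modified",
    ],
}

# Flatten to a single lookup; matching ignores case (an assumption).
LOOKUP = {
    alias.lower(): key
    for key, aliases in ALIASES.items()
    for alias in aliases
}


def canonical(name):
    """Return the (element, event) pair for a meta name, or None if unknown."""
    return LOOKUP.get(name.strip().lower())


print(canonical("DCTERMS.issued"))
print(canonical("FileName"))
```

Names that return None here are the ones that would have to be passed through untouched.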

rogue_ronin
12-27-2009, 05:47 PM
I find it very useful. Do you mind if I use this one day when I find the time to write a manual for Sigil?

Of course not, use it at will! You might want to break it into "free-form", DC and DCTERMS (see the additions that Kevin made.)

><meta name="OPF.file-as" content="LastName, First Middle" />
Yeah, this can't work. We need to have creator or contributor specified. No guessing.


I was thinking that it would be in addition to the DC tag -- you'd still have the definite creators/contributors. Just a way to mark the sort.

><meta name="DC.creator.aut" content="First Middle LastName" />
<meta name="DC.creator.aut" scheme="OPF:file-as" content="LastName First Middle" />
This too is a bad idea. I could make Sigil recognize that these two tags actually specify one author, but every Reading System on the planet is going to read it as two. Using just the second meta tag could work though.

Yeah. I get it. It's going to have to be something beyond the DC or DCTERMS spec. Which is why I suggested the OPF above.

In any event, as I've said, Sigil should store any meta tags it can't recognize and convert to DC. These should then be exported to the OPF as they were, that is, bare meta tags (the spec supports this).

Right. So coming up with a good, referenced set of tags for XHTML should give a good set of tags within any ePub created by Sigil. I can see where file-as really means nothing within the editor. It's only beyond the editor that it begins to have meaning.

I'm using what Sigil does as a jumping-off point to come up with a set of "best practices" with XHTML for myself and anyone interested.

m a r

rogue_ronin
12-27-2009, 06:14 PM
Hi,

I edited the list above to include the DCTERMS. namespace where it overlaps with the epub spec as well as use of refinements to hold schemes for identifiers.

As it stands now, EISBN is mapped to ISBN internally so using both at the same time would probably not be a good idea.

Hope this helps,

Kevin

Cool, that's great. Is there anything else that's currently possible, but not listed?

From your comment, I went and looked again at the DCTERMS list.

In my original list of custom metadata terms there were the following that didn't map to the DC list:


<meta name="FileName" content="FILENAME.EXT" />
<meta name="FileVersion" content="VERSION NUMBER/NAME" />
<meta name="FileScanner" content="NAME" />
<meta name="FileComment" content="COMMENT" />
<meta name="SubTitle" content="SUBTITLE" />
<meta name="Series" content="SERIES NAME" />
<meta name="SeriesNumber" content="SERIES SEQUENCE NUMBER" />
<meta name="PublicationCity" content="CITY NAME" />
<meta name="CopyrightHolder" content="NAME" />


And this one didn't map to the ePub spec:

<meta name="DC.date.copyrighted" content="YYYY(-MM(-DD))" />



What do you think of the following?


<meta name="DCTERMS.alternative" content="SUBTITLE" />
<meta name="DCTERMS.isPartOf" content="SERIES NAME" />
<meta name="DCTERMS.rightsHolder" content="NAME" />
<meta name="DCTERMS.dateCopyrighted" content="YYYY(-MM(-DD))" />


And maybe changing

<meta name="DC.rights" content="LICENSE" />

to

<meta name="DCTERMS.license" content="LICENSE" />
?

Still leaves me out in the cold with the following though:

<meta name="FileName" content="FILENAME.EXT" />
<meta name="FileVersion" content="VERSION NUMBER/NAME" />
<meta name="FileScanner" content="NAME" />
<meta name="FileComment" content="COMMENT" />
<meta name="SeriesNumber" content="SERIES SEQUENCE NUMBER" />
<meta name="PublicationCity" content="CITY NAME" />


m a r

rogue_ronin
12-27-2009, 06:17 PM
In any event, as I've said, Sigil should store any meta tags it can't recognize and convert to DC. These should then be exported to the OPF as they were, that is, bare meta tags (the spec supports this).

Does Sigil auto-generate the opf:file-as attribute when creating the OPF file?

m a r

Valloric
12-27-2009, 09:06 PM
Does Sigil auto-generate the opf:file-as attribute when creating the OPF file?

If you write for instance author as "Doe, John" then that will be used as file-as and "John Doe" will be used as the standard value. But notice the comma.

And if your epub file has creator/contributor file-as, then that is loaded instead of the value.

It's rudimentary, but supports about 90% of use cases.
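
The comma rule described above could look something like this. It is a minimal Python sketch of the idea only, not Sigil's code; split_author is a hypothetical name, and returning no file-as value when there is no comma is an assumption.

```python
def split_author(entered):
    """Return (display_name, file_as) for a name typed by the user.

    "Doe, John" is treated as the sort form and flipped for display.
    Without a comma nothing is guessed and file_as is None (an assumption).
    """
    text = entered.strip()
    if "," in text:
        last, first = (part.strip() for part in text.split(",", 1))
        display = f"{first} {last}".strip()
        return (display, text)
    return (text, None)


print(split_author("Doe, John"))
print(split_author("John Doe"))
```

A rule this simple will stumble on suffixes such as "Lastname, First, Jr.", which is consistent with it covering most but not all cases.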

rogue_ronin
12-27-2009, 11:14 PM
Got it. Don't think that will work for a general-case XHTML file, though.

Have to keep thinking on it.

m a r

rogue_ronin
12-28-2009, 05:06 PM
I've updated the Sigil list with KevinH's additions, here. (http://www.mobileread.com/forums/showpost.php?p=712544&postcount=9)

Still looking for suggestions and guidance on XHTML metadata encoding... :D

m a r

KevinH
12-29-2009, 01:51 PM
Hi,

Some thoughts, for what they are worth.

At first pass, I would try to stick with dc as much as possible when looking at extensions. Implementing an additional subset from the dcterms. namespace might be a good thing to do because the epub spec may grow to include them someday. And using the DCTERMS. namespace means they can be easily mapped to "name", "content" pairs that can be stored and then passed through Sigil to the output opf file without really having to process them.

These might include things like:

DCTERMS.abstract - "a short summary of the resource"
although this may be superseded by DC.description

DCTERMS.alternative - "alternative title"


DCTERMS.audience - "audience the resource is intended for" (i.e. children vs. adult, or PG-13, or Teens, or ...)


Additional date events to record:

DCTERMS.dateAccepted - "date of acceptance of the resource"

DCTERMS.dateCopyrighted - "date of copyright"

DCTERMS.dateSubmitted - "date of submission" (i.e. for a thesis or dissertation)



Additional license qualifiers:

DCTERMS.license - "license to use the resource" (public domain, etc)

DCTERMS.provenance - "statement of changes in ownership"

DCTERMS.rightsHolder - "person or org who owns the rights"

DCTERMS.accessRights - "who can access it, security status"



And the following two fields that qualify "Coverage":

DCTERMS.spatial - "spatial or geographic coverage"

DCTERMS.temporal - "time period covered"

Although you could argue that the standard DC.coverage is enough



And only the most basic "Relation" qualifiers:

DCTERMS.hasPart, DCTERMS.isPartOf

The hasPart can be the number in the Series, and isPartOf can be the "Series" name itself


DCTERMS.hasVersion, DCTERMS.isVersionOf

To allow support for different versions of the same book. Think "Jules Verne's Journey to the Centre of the Earth" - the original English translation versus a modern translation to English directly from French. They are very very different - in fact many older translators took "extreme liberties" to "enhance" the book they were translating for their audience.


All of the above would only be "passed through" Sigil and not processed for editing and things.

Then I think we have to go outside the DCTERMS namespace but only for the select few that really matter most:

Something along the lines of

YOURNAMESPACEHERE.name content="something".

I think the goal should be the smallest, most relevant subset, not an attempt to replace all of the information from the card catalog system.

My 2 cents,

...

KevinH

KevinH
12-29-2009, 02:13 PM
Hi,

One other thing... I do not think we should keep track of concurrent versioning information in the meta data of the ebook.

So for example:

<meta name="FileName" content="FILENAME.EXT" />
<meta name="FileVersion" content="VERSION NUMBER/NAME" />
<meta name="FileScanner" content="NAME" />
<meta name="FileComment" content="COMMENT" />

These should not be in "released" eBooks. They should in fact be tracked by the concurrent versioning system used to keep track of editing changes and things *before* a version of the book is released.

For example, CVS, Mercurial, etc are all source code versioning systems that can be adapted to support concurrent editing and versioning of ebooks being worked on. That system would keep track of editorial changes, who made them, when they were made, the files changed, etc.

That information need not be part of the metadata of an "official release" of an eBook, in much the same way that the specific changes made to software, by whom, and when are not actually part of the information published when the software is released; they are kept internally only.


That said, I can see that many different organizations may make their own releases of the exact same public domain book, and as such, we do need to see the group doing the release.

If this fits under "DC.Publisher" then all is fine. If not, then we should probably add a specific "Generator" meta element to capture this information.

So something like:

name="Generator" content="org or person making the release"

as free form metadata, or

YOURNAMESPACEHERE.generator style.

Again, all of this is my 2 cents, feel free to ignore all of it.

Take care,

KevinH

rogue_ronin
12-29-2009, 11:40 PM
...At first pass, I would try to stick with dc as much as possible when looking at extensions. Implementing an additional subset from the dcterms. namespace might be a good thing to do because the epub spec may grow to include them someday. And using the DCTERMS. namespace means they can be easily mapped to "name", "content" pairs that can be stored and then passed through Sigil to the output opf file without really having to process them...

Hmmm... well, the DCTERMS namespace contains equivalents of all the DC namespace terms, so it sorta makes sense to migrate to that, doesn't it? (And as you are grabbing all the common DC/DCTERMS terms in Sigil, too, it wouldn't be incompatible.) Any conversion to ePub would be super-easy if there were only the DCTERMS... and am I mistaken in thinking that if I were to do a proper namespace declaration in the head, that I could include both? (Probably necessitating redundancy, but that's not a big deal.)

I'm saying this, however, in an effort to come to the simplest, coherent list -- thus preferring to stick with one namespace, probably DCTERMS. Some of your suggestions below support this thought.

(I'm going to take some of your response out of order, here...)


One other thing... I do not think we should keep track of concurrent versioning information in the meta data of the ebook.

So for example:

<meta name="FileName" content="FILENAME.EXT" />
<meta name="FileVersion" content="VERSION NUMBER/NAME" />
<meta name="FileScanner" content="NAME" />
<meta name="FileComment" content="COMMENT" />

These should not be in "released" eBooks. They should in fact be tracked by the concurrent versioning system used to keep track of editing changes and things *before* a version of the book is released.


I can see your argument regarding versioning. It does make sense to keep the version info in the content management software. I keep it in my files because it is easy to grab that info along with everything else when I open a project.

And that's probably because I don't keep old versions, though I do keep a list of former versions and modifications in the metadata. I also include a "version guide" in the metadata, that shows what the versions actually mean (so that they don't just have an arbitrary, undefined "improved" value.)

EG:

<!-- BEGIN: FILE HISTORY -->

<!-- Created on 2009-12-28 -->
<!-- Revision # 0.10 on 2009-12-28 -->
<!-- Current Revision # 0.50 on 2009-12-28 -->

<!-- END: FILE HISTORY -->

<!-- BEGIN: REVISION GUIDELINE -->

<!--
0.10 :: Initial Conversion
0.20 :: Cover and Frontispiece
0.30 :: Sections, Chapters and TOC
0.40 :: Endnotes and/or Blockquotes
0.50 :: Initial Spellcheck
0.60 :: M-Dashes, Hyphens and Ellipses
0.70 :: Italics, Bold, and Pre-Formatted Text
0.80 :: Reading Proof
0.90 :: Checked Against Canonical Source
1.00 :: Final Version = Optimal
1.++ :: Minor Error Corrections
-->

<!-- END: REVISION GUIDELINE -->


Still, having a modified date in the metadata is kind of the same thing, isn't it -- it just doesn't give you a sense of progress, or perfectedness, does it?

If one's use-model includes people sharing files and improving them (as mine does), modified dates may not be enough though, to indicate the relative value of individual files. Sort of like software version numbers -- those are a ready measure of a type of value. It's also something that happens "in the wild."
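
To make the "ready measure of value" idea concrete, the revision guideline listed earlier in this post can be turned into a lookup. This is only a Python sketch: stage_for is a hypothetical helper, and the rule that an in-between number (say 0.55) reports the last completed stage is an assumption, not part of the guideline.

```python
from decimal import Decimal

# Thresholds copied from the revision guideline; "1.++" is handled separately.
STAGES = [
    (Decimal("0.10"), "Initial Conversion"),
    (Decimal("0.20"), "Cover and Frontispiece"),
    (Decimal("0.30"), "Sections, Chapters and TOC"),
    (Decimal("0.40"), "Endnotes and/or Blockquotes"),
    (Decimal("0.50"), "Initial Spellcheck"),
    (Decimal("0.60"), "M-Dashes, Hyphens and Ellipses"),
    (Decimal("0.70"), "Italics, Bold, and Pre-Formatted Text"),
    (Decimal("0.80"), "Reading Proof"),
    (Decimal("0.90"), "Checked Against Canonical Source"),
    (Decimal("1.00"), "Final Version = Optimal"),
]


def stage_for(revision):
    """Map a numeric revision string such as "0.50" to its guideline stage.

    Returns None for revisions below 0.10. Anything above 1.00 is treated
    as the "1.++" row of the guideline.
    """
    rev = Decimal(revision)
    if rev > Decimal("1.00"):
        return "Minor Error Corrections"
    label = None
    for threshold, name in STAGES:
        if rev >= threshold:
            label = name  # assumption: keep the last stage reached
    return label


print(stage_for("0.50"))
```

Decimal is used instead of float so that "0.10" and "0.1" compare equal without rounding surprises.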

FileNames may not matter -- it can usually be determined by a system call of some sort (but is there a case for knowing the original filename? It might allow for auto-renaming of related files such as images... Too speculative?)

FileScanner will probably have to wait for a MARC Relator Code to catch up to reality. :D

And FileComment -- I guess I keep thinking how and/or why a file/book has been created might be interesting or relevant (to research, or somesuch.) It's something that I use to auto-generate a Colophon, too, where such info is often expected. Don't Project Gutenberg texts often include such comments?

Note that a lot of my reasoning has to do with the file being encountered by someone Not-The-Producer, and suggesting to that N-T-P ways to keep a good, comprehensible accounting of their own work, as well as giving them as excellent a context as possible for understanding the current file.

Regarding your observations on the DCTERMS namespace:

These might include things like:

DCTERMS.abstract - "a short summary of the resource"
although this may be superseded by DC.description

DCTERMS.alternative - "alternative title"

DCTERMS.audience - "audience the resource is intended for" (i.e. children vs. adult, or PG-13, or Teens, or ...)

Additional date events to record:

DCTERMS.dateAccepted - "date of acceptance of the resource"

DCTERMS.dateCopyrighted - "date of copyright"

DCTERMS.dateSubmitted - "date of submission" (i.e. for a thesis or dissertation)

Additional license qualifiers:

DCTERMS.license - "license to use the resource" (public domain, etc)

DCTERMS.provenance - "statement of changes in ownership"

DCTERMS.rightsHolder - "person or org who owns the rights"

DCTERMS.accessRights - "who can access it, security status"

And the following two fields that qualify "Coverage":

DCTERMS.spatial - "spatial or geographic coverage"

DCTERMS.temporal - "time period covered"

Although you could argue that the standard DC.coverage is enough

And only the most basic "Relation" qualifiers:

DCTERMS.hasPart, DCTERMS.isPartOf

The hasPart can be the number in the Series, and isPartOf can be the "Series" name itself

DCTERMS.hasVersion, DCTERMS.isVersionOf

To allow support for different versions of the same book. Think "Jules Verne's Journey to the Centre of the Earth" - the original English translation versus a modern translation to English directly from French. They are very very different - in fact many older translators took "extreme liberties" to "enhance" the book they were translating for their audience.

All of the above would only be "passed through" Sigil and not processed for editing and things.

I think this is really good stuff. I'm going to take the above ideas and see if I can generate a set of (relatively) simple meta tags in the next post I make.

I'll separate it into a variant of my current scheme, and some possible additions based on your suggestions here.

Then I think we have to go outside the DCTERMS namespace but only for the select few that really matter most:

Something along the lines of

YOURNAMESPACEHERE.name content="something".

I think the goal should be the smallest, most relevant subset, not an attempt to replace all of the information from the card catalog system.


I'm with you there, but making our own namespace! I just started XHTML a few months ago... :) And if we're going to make our own namespace, we should just do everything to our own satisfaction, and ditch the DC stuff (or rather, steal freely and scratch off the serial numbers.)

From my list there's not much remaining, though, so it probably isn't necessary, unless you have some further suggestions. I don't want to recreate the entire card catalog, either, but we probably already have done most of one via DC/DCTERMS.

That file-as attribute is still a kneebiter, though.

Give me your comments on the next post if you still have interest! (That post will be up in an hour or two, I think.)

m a r

rogue_ronin
12-30-2009, 02:31 AM
These are the ones that I currently think are important enough to include:

Identifier
<meta name="DCTERMS.identifier" scheme="SCHEME NAME" content="SCHEME CODE" />

Title
<meta name="DCTERMS.title" content="TITLE" />

Author
<meta name="DCTERMS.creator.aut" content="NAME" />

Series Name
<meta name="DCTERMS.isPartOf" content="SERIES NAME" />

Series Number
<meta name="DCTERMS.hasPart" content="SERIES NUMBER" />

Type
<meta name="DCTERMS.type" content="GENRE or CLASSIFICATION" />

Subject
<meta name="DCTERMS.subject" content="KEYWORD(S)" />

Description
<meta name="DCTERMS.description" content="DESCRIPTION OF CONTENT" />

Publisher
<meta name="DCTERMS.publisher" content="PUBLISHER DATA" />

Publication Date
<meta name="DCTERMS.issued" content="YYYY(-MM(-DD))" />

Creation Date
<meta name="DCTERMS.created" content="YYYY(-MM(-DD))" />

Modification Date
<meta name="DCTERMS.modified" content="YYYY(-MM(-DD))" />

Copyright Date
<meta name="DCTERMS.dateCopyrighted" content="YYYY(-MM(-DD))" />

Copyright Holder
<meta name="DCTERMS.rightsHolder" content="NAME/ORG." />

Copyright Status
<meta name="DCTERMS.license" content="LICENSE/STATUS" />

Language
<meta name="DCTERMS.language" content="TWO-LETTER LANGUAGE CODE" />

Source
<meta name="DCTERMS.source" content="SOURCE DERIVED FROM" />

==================

Any additional creator or contributor may be added using the over 200 MARC Relator Codes (http://www.loc.gov/marc/relators/relacode.html):

Illustrator
<meta name="DCTERMS.creator.ill" content="NAME" />

Proofreader
<meta name="DCTERMS.contributor.pfr" content="NAME" />

Editor
<meta name="DCTERMS.contributor.edt" content="NAME" />

Cover Designer
<meta name="DCTERMS.contributor.cov" content="NAME" />

==================

Extensions that don't meet the DC spec, but do meet the ePub spec:

File-As
<meta name="DC.creator.aut" scheme="FileAs:Lastname, First Middle" content="Dr. First Middle Lastname, Esq." />
-- Part of the ePub spec, but generally useful to define document sorting. The scheme attribute will be ignored by any parser as an unknown scheme.
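
A rough Python sketch of how a tool that chose to understand this File-As convention might read it back out. Everything here is hypothetical: the "FileAs:" prefix comes from the proposal above, while the helper names and the dictionary shape are made up for the illustration.

```python
# Assumption: the sort name is carried after this prefix in the scheme value,
# as in the File-As proposal. No existing tool is known to read it this way.
FILE_AS_PREFIX = "FileAs:"


def file_as_from_scheme(scheme):
    """Return the sort name carried in a scheme value, or None."""
    if scheme and scheme.startswith(FILE_AS_PREFIX):
        return scheme[len(FILE_AS_PREFIX):].strip()
    return None


def creator_entry(attrs):
    """Build a small record from the attributes of one creator meta tag."""
    return {
        "name": attrs.get("content"),
        "file_as": file_as_from_scheme(attrs.get("scheme")),
    }


entry = creator_entry({
    "name": "DC.creator.aut",
    "scheme": "FileAs:Lastname, First Middle",
    "content": "Dr. First Middle Lastname, Esq.",
})
print(entry)
```

A parser that does not know the prefix would just see an unfamiliar scheme value, which is the behavior the proposal relies on.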

==================

Others that maybe SHOULD be included (please make an argument against):

==

Abstract
<meta name="DCTERMS.abstract" content="SUMMARY OF CONTENT" />

-- I could be talked into this one. More useful for non-fiction.

==

Alternative Title
<meta name="DCTERMS.alternative" content="TITLE" />

-- Alternate Title, like a foreign name, or earlier (maybe offensive) name. This is more common than I originally thought.

==

Audience
<meta name="DCTERMS.audience" content="INTENDED AUDIENCE" />

-- Like age-ranges, or... something else? "Young Adult" is a really popular category at the moment.

==

==================

Others that should maybe NOT be included (please make an argument in favor):

Format
<meta name="DCTERMS.format" content="MEDIA/FILE TYPE" />

-- I'm of a mind that it being an eBook, you're already pretty sure of the media and/or filetype.

==

Relation
<meta name="DCTERMS.relation" content="RELATED RESOURCE" />

-- The two refinements of this that allow us to keep Series Name and Series Number seem adequate.

==

Coverage
<meta name="DCTERMS.coverage" content="TIME, SPACE, or OTHER SPAN" />

-- Meh. Subject seems enough.

==

Provenance
<meta name="DCTERMS.provenance" content="OWNERSHIP HISTORY" />

-- I think this is about the actual, physical resource. I'm not sure it's relevant to an ebook.

==

Access Rights
<meta name="DCTERMS.accessRights" content="PERMISSION(S) TO ACCESS" />

-- Things like age restrictions, etc. Talk me into it. It'll be hard, I'm against most restrictions, even normal ones.

==

Date of Acceptance
<meta name="DCTERMS.dateAccepted" content="YYYY(-MM(-DD))" />

-- Some certifying authority acknowledges receipt/acceptance of a document. Meh.

==

Date of Submission
<meta name="DCTERMS.dateSubmitted" content="YYYY(-MM(-DD))" />

-- Some certifying authority is given a document. Double-meh.

==

Geographical/Spatial Coverage
<meta name="DCTERMS.spatial" content="SPATIAL RANGE" />

-- Seems unnecessarily redundant of the Coverage tag.

==

Date/Temporal Coverage
<meta name="DCTERMS.temporal" content="TEMPORAL RANGE" />

-- Also seems unnecessarily redundant of the Coverage tag.

==

Has Version
<meta name="DCTERMS.hasVersion" content="TITLE/NAME" />

-- Indicates another resource that is adapted from this one.

==

Is Version Of
<meta name="DCTERMS.isVersionOf" content="TITLE/NAME" />

-- Indicates a resource that this resource was adapted from.

==================

Undefinable by DCTERMS, but possibly desired metadata:

File Name
-- The original name of the eBook file.

File Version
-- Using a defined versioning scheme. It's also a bit like a "#th Printing" statement.

File Comment
-- Information about how/why the ebook file was created.

Sub-title
-- Lots of books have these.

Publication City
-- Commonly used. Might be growing less relevant in the digital age.

==================

I'm always open to input, corrections and suggestions!

There are other ways to code this, but I'm looking for a relatively simple, consistent method that covers most everything. The DCTERMS namespace seems to be that method, as the DC namespace is more limited and requires a somewhat vague extension ("refinements").

Also, all the DCTERMS can be defined this way in XHTML, but the questions here are: What is generally useful for eBooks? What are absolutely necessary, what are not?

I'll update this post as it gets better defined.

m a r

ps: huge props to KevinH!

KevinH
12-30-2009, 12:34 PM
Hi,

Nice list ...

Two things. First, the DC namespace is what the main epub spec is built on, and DCTERMS., DC. and refinements are already supported where they overlap with the epub spec, so your first list of valid metadata recognized now is still the main one.

So all we need think about is what to **add** to the epub spec and it looks like you agree with me that using DCTERMS. as the main source of these additions is the way to go.

Second, if we just look at the additions to what is already covered by Sigil and the epub spec, you are suggesting the following, is that correct?

Series Name
<meta name="DCTERMS.isPartOf" content="SERIES NAME" />

Series Number
<meta name="DCTERMS.hasPart" content="SERIES NUMBER" />

Copyright Date
<meta name="DCTERMS.dateCopyrighted" content="YYYY(-MM(-DD))" />

Copyright Holder
<meta name="DCTERMS.rightsHolder" content="NAME/ORG." />

Copyright Status
<meta name="DCTERMS.license" content="LICENSE/STATUS" />


and then from non-dc / non-dcterms you are suggesting we add the following:

File Name
-- The original name of the eBook file.

File Version
-- Using a defined versioning scheme. It's also a bit like a "#th Printing" statement.

File Comment
-- Information about how/why the ebook file was created.

File-As
-- Part of the ePub spec, but generally useful to define document sorting.

Sub-title
-- Lots of books have these.

Publication City
-- Commonly used. Might be growing less relevant in the digital age.


Is that correct?

If so, I think that is a good list. Perhaps we could encourage others to add their two cents and see what they think.

I wish there was a way to "advertise" this topic to all people interested in book metadata on this forum to get more input.

Thanks,

Kevin

Valloric
12-30-2009, 01:01 PM
So all we need think about is what to **add** to the epub spec and it looks like you agree with me that using DCTERMS. as the main source of these additions is the way to go.

I just want to say that anything you want to *add* has to be already valid as per the spec. If you merely want to add <meta> tags to the OPF, that's fine by me since the spec says they can have whatever format they like (key--value pairs).

But anything beyond that I don't support.

KevinH
12-30-2009, 01:23 PM
Hi Valloric,

Understood. By **add** I only meant over and above what Sigil/epub already supports, **not** that we would add additional things to the epub spec.

The plan is that, after getting more input on what is useful, I would submit changes for you to approve that simply pass all of these additions through to the opf file (and read them back in when Sigil loads an epub), so that they are not lost or ignored as they would be now.

So then the docs would eventually highlight the metadata that is fully supported by Sigil and the epub spec (see the earlier post), followed by a set of recommended additions that will only be passed through to the opf file so that they are not lost or ignored.

Then people who create metadata inside html files and for ebooks can at least know what will be supported and what will simply be retained versus what will be ignored or lost.
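
A minimal sketch of that split, written in Python rather than taken from Sigil's own code: it reads the <meta> tags from an XHTML head and separates the names an application maps onto ePub elements from the names that are simply carried through. The MAPPED set, the sample document and the function name split_meta are assumptions made up for illustration, not a description of what Sigil actually does.

```python
import xml.etree.ElementTree as ET

XHTML_NS = "http://www.w3.org/1999/xhtml"

# Hypothetical sample document, used only for this sketch.
SAMPLE = """<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<title>Example</title>
<meta name="DCTERMS.title" content="An Example Book" />
<meta name="DCTERMS.isPartOf" content="SERIES NAME" />
<meta name="FileName" content="example.html" />
</head>
<body></body>
</html>"""

# Assumed set of terms that get mapped onto ePub metadata elements.
MAPPED = {"title", "creator", "contributor", "identifier", "language", "date"}

def split_meta(xhtml_text):
    """Return (mapped, passed_through) lists of (name, content) pairs."""
    root = ET.fromstring(xhtml_text)
    mapped, passed = [], []
    for meta in root.iter("{%s}meta" % XHTML_NS):
        name = meta.get("name")
        content = meta.get("content")
        if name is None or content is None:
            continue  # skip http-equiv and other non name/content tags
        prefix, _, term = name.partition(".")
        base = term.split(".")[0].lower()
        if prefix.upper() in ("DC", "DCTERMS") and base in MAPPED:
            mapped.append((name, content))
        else:
            passed.append((name, content))  # retained as-is, not interpreted
    return mapped, passed

mapped, passed = split_meta(SAMPLE)
```

Here DCTERMS.title lands in the mapped list, while DCTERMS.isPartOf and FileName land in the passed-through list, so nothing is dropped.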

Sound good?

KevinH

Valloric
12-30-2009, 01:27 PM
Then people who create metadata inside html files and for ebooks can at least know what will be supported and what will simply be retained versus what will be ignored or lost.

Sound good?

Very good, yes.

rogue_ronin
12-30-2009, 05:05 PM
@Kevin: Yeah, those are the basic additions I'm looking at. There's also some stuff in the second section that I think might be good to use as well, but I'm not certain. I'd like it if more folks chimed in, too, but I suspect it takes a rare type of OCD to work on this stuff!

I'm trying to work out a set of basic, but reasonably thorough, XHTML metadata, preferably in a Dublin Core format, that is consistent with what Sigil uses or recognizes (because Sigil is the first app to take such a thing seriously.)

Technically, Sigil supports (or will support) the entire DC, because it will pass through all valid <meta> tags. So, technically, there's nothing to discuss in that area. But, of course, figuring out what is actually useful when creating an eBook, and putting together a simple list of what to use (from the myriad possibilities) is where this thread should work itself out. This XHTML eBook metadata stands somewhat apart from whatever form it may take later (particularly in a Sigil ePub.)

The most recent list is using entirely DCTERMS because it's consistent, is a superset of the DC namespace, and enables us to encode a larger set of metadata in a more specific way. The suggestions you made were spot-on; all I did was package them up nicely.

Since Sigil looks for DCTERMS as well as DC, there's no reason to mix different namespaces in this recommendation/spec. While Sigil's output will be only valid ePub spec, and thus may use the DC namespace, there's no reason to limit the input to that space since there is logic built into it to recognize a larger set of metadata -- and the resulting XHTML is simpler, more readable and consistent. (Makes it look like some actual thought went into it!)

As you've recognized, what Sigil understands on input, and what might be available in the metadata, are different lists. Someone could make a simple list of free-form terms to use; in fact, for everything that matches the ePub spec, it would be nice if there were a Sigil-specific free-form list.

Now, as to the stuff that cannot be matched to DCTERMS: simple enough, really -- just turn 'em into basic <meta> tags...

File Name
<meta name="FileName" content="FILENAME.EXT" />

File Version
<meta name="FileVersion" content="VERSION NUMBER" />

File Comment
<meta name="FileComment" content="COMMENT" />

File-As
<meta name="FileAs" content="LASTNAME, FIRST MIDDLE" />

Sub-title
<meta name="SubTitle" content="SUBTITLE" />

Publication City
<meta name="PublicationCity" content="CITY NAME" />
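
For anyone generating these tags from a script, here is a small Python sketch that writes a self-closing <meta> tag with the attribute values escaped, so an ampersand in a sub-title doesn't break the XHTML. The helper name meta_tag is made up for this example.

```python
from xml.sax.saxutils import quoteattr

def meta_tag(name, content):
    """Build a self-closing XHTML <meta> tag with escaped, quoted attributes."""
    return "<meta name=%s content=%s />" % (quoteattr(name), quoteattr(content))

tag = meta_tag("SubTitle", "A & B")
print(tag)
```

quoteattr comes from the Python standard library; it adds the surrounding quotes and escapes &, < and > itself.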

(I think we're getting a new ePub spec this year -- maybe some of these will be included. I'm hoping for "Sub-title.")

I'd love to hear if someone can think of a way to map these to the DCTERMS. I'm also open to further arguments against them. I may be married to FileName, for instance, because I'm using it in my process so much.

@Valloric: Of the above, Sigil will largely just pass them through to the OPF: the only question is, is it reasonable that Sigil should recognize the File-As tag (much as it recognizes Author or Title)? There should only be one such tag, so it could sensibly be mapped to the primary creator.

On the other DCTERMS in the prior list: I'd love to hear some arguments, particularly for Abstract, Alternative Title and Audience. I tend to come from a fiction-book perspective, and might need some schooling on non-fiction.

m a r

Valloric
12-30-2009, 06:31 PM
the only question is, is it reasonable that Sigil should recognize the File-As tag (much as it recognizes Author or Title)? There should only be one such tag, so it could sensibly be mapped to the primary creator.

Who says? I'm sorry, but you can't guess.

rogue_ronin
12-30-2009, 07:01 PM
<shatner>Damn you, File-As!!!</shatner>

:angry:

m a r

rogue_ronin
12-31-2009, 11:27 PM
Hmm...

Any thoughts on this:


<meta name="Dr. First Middle Lastname, Esq." scheme="FileAs" content="Lastname, First Middle" />


It's the world's simplest scheme.

m a r

KevinH
01-01-2010, 01:38 AM
Hi,

The problem is that you cannot see what role Dr. First Middle Lastname, Esq. plays. Is he an author? Is he a contributor?

How about the following instead:

<meta name="DC.creator.aut" scheme="FileAs:Lastname, First Middle" content="Dr. First Middle Lastname, Esq." />

I know this is not in the DC spec, but it could be used to create metadata that follows the ePub spec, which is what is important here.

Checking for scheme on creator and contributor, searching that string for FileAs: and removing that prefix could be easily done.
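
A rough Python sketch of that check, only to show how little is involved. It is not Sigil code; the function name parse_person_meta and the shape of the returned dictionary are assumptions for illustration.

```python
import xml.etree.ElementTree as ET

FILE_AS_PREFIX = "FileAs:"

def parse_person_meta(tag_text):
    """Parse one creator/contributor <meta> tag; return None for other tags."""
    meta = ET.fromstring(tag_text)
    parts = meta.get("name", "").split(".")  # e.g. ["DC", "creator", "aut"]
    if len(parts) < 2 or parts[1].lower() not in ("creator", "contributor"):
        return None
    result = {
        "element": parts[1].lower(),
        "role": parts[2] if len(parts) > 2 else None,
        "name": meta.get("content"),
        "file_as": None,
    }
    scheme = meta.get("scheme", "")
    if scheme.startswith(FILE_AS_PREFIX):
        # Strip the prefix; the remainder is the sort form of the name.
        result["file_as"] = scheme[len(FILE_AS_PREFIX):].strip()
    return result

person = parse_person_meta(
    '<meta name="DC.creator.aut" scheme="FileAs:Lastname, First Middle" '
    'content="Dr. First Middle Lastname, Esq." />'
)
```

The role, the display name and the file-as value all come out of a single tag, so nothing has to be matched up afterwards.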

But this would have to be Valloric's call.

My 2 cents,

KevinH

rogue_ronin
01-01-2010, 02:25 AM
Hi,

The problem is that you cannot see what role Dr. First Middle Lastname, Esq. plays. Is he an author? Is he a contributor?

Shouldn't matter. I see it as information about how to sort the name no matter what role you find it in. Let's say you have the following in your <head>:

Author
<meta name="DCTERMS.creator.aut" content="Dr. First Middle Lastname, Esq." />

Illustrator
<meta name="DCTERMS.creator.ill" content="Dr. Foist Myrtle LostName, M.D." />

Proofreader
<meta name="DCTERMS.contributor.pfr" content="Herr Doktor Fear St. Muddle Las Nymen" />

Editor
<meta name="DCTERMS.contributor.edt" content="Prof. F. M. Lass-Gnome" />

Cover Designer
<meta name="DCTERMS.creator.cvr" content="Dr. First Middle Lastname, Esq." />


And you also have what I suggested above:


<meta name="Dr. First Middle Lastname, Esq." scheme="FileAs" content="Lastname, First Middle" />


Then you have the file-as attribute for the first and last -- the author and the cover designer (who either have exactly the same name or are the same person) because they are the only ones that match. It's basically metadata about metadata.
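
As a quick Python sketch of that exact-match idea (the data structures and the function name attach_file_as are made up for illustration, and the tags are assumed to be parsed into pairs already):

```python
# Hypothetical in-memory form of the role tags: (name attribute, content attribute).
people = [
    ("DCTERMS.creator.aut", "Dr. First Middle Lastname, Esq."),
    ("DCTERMS.creator.ill", "Dr. Foist Myrtle LostName, M.D."),
    ("DCTERMS.creator.cvr", "Dr. First Middle Lastname, Esq."),
]

# The scheme="FileAs" tags, keyed by their name attribute.
file_as = {"Dr. First Middle Lastname, Esq.": "Lastname, First Middle"}

def attach_file_as(people, file_as):
    """Pair each person with a file-as value when the display name matches exactly."""
    return [(name, display, file_as.get(display)) for name, display in people]

matched = attach_file_as(people, file_as)
```

Only the author and the cover designer pick up a file-as value; the illustrator gets None. The match is an exact string comparison, so it is case-sensitive: "LastName" and "Lastname" would not match.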

How about the following instead:

<meta name="DC.creator.aut" scheme="FileAs:Lastname, First Middle" content="Dr. First Middle Lastname, Esq." />

I know this is not in the DC spec, but it could be used to create metadata that follows the ePub spec, which is what is important here.

Checking for scheme on creator and contributor, searching that string for FileAs: and removing that prefix could be easily done.

But this would have to be Valloric's call.


Hmm... He's already parsing for inverted names. I assume that what you're proposing is sort of sneaking in a second content field?

Is it valid XHTML? And I'm not sure that we should break the DCTERMS spec -- at least where terms that occur in the spec are concerned. Anywhere else is fine.

Isn't it fun being a dork on New Year's Eve?

m a r

Valloric
01-01-2010, 10:10 AM
The problem is that you cannot see what role Dr. First Middle Lastname, Esq. plays. Is he an author? Is he a contributor?

We don't know, and we would have to guess, and no guessing.

<meta name="DC.creator.aut" scheme="FileAs:Lastname, First Middle" content="Dr. First Middle Lastname, Esq." />

I know this is not in the DC spec, but it could be used to create metadata that follows the ePub spec, which is what is important here.

I like this. It doesn't have to be DC spec, it's still valid in an epub document even as a meta tag, and it's also valid XHTML. Any other DC parser would just use the name and content attributes and ignore the scheme since it couldn't recognize it.

This solution seems rather perfect.

rogue_ronin
01-01-2010, 05:33 PM
By guessing, do you mean comparing? I was proposing exact matches only.

Nonetheless, if it's valid XHTML, I can pretty easily go along with Kevin's proposal.

I'll mod the list(s) later today. Will it be parsed in Sigil?

m a r

Valloric
01-01-2010, 05:36 PM
By guessing, do you mean comparing? I was proposing exact matches only.

But you have to take into account the other applications besides Sigil. They will all see it as different metadata, and Kevin's proposal would be parsed as one instance of metadata.

Valloric
01-01-2010, 05:38 PM
I'll mod the list(s) later today. Will it be parsed in Sigil?

The lists you wrote before outlining what currently works in Sigil should stay. This new proposal is still merely a proposal and would need to be implemented.

I'm guessing Kevin would like to do this when he finds the time, but that's entirely up to him.

rogue_ronin
01-01-2010, 06:40 PM
But you have to take into account the other applications besides Sigil. They will all see it as different metadata, and Kevin's proposal would be parsed as one instance of metadata.

I thought I had. It would simply be custom metadata, in (presumably) proper XHTML.

Other applications might not recognize it, just as with the example Kevin proposed, but it wouldn't break anything. I actually thought it was sort of elegant -- but I'm no expert.

I think, though, that I understand your implied point: since it won't be recognized either way, just use a single tag.

I'll update the proposed list; let me know when the Sigil list needs updating, and I'll take care of it.

Thanks!

m a r

LARdT
10-11-2010, 09:23 PM
Thanks for the list; it is very useful. I don't have problems with the syntax, but sometimes with the "meaning" of a metadata field. I mean, what kind of information properly belongs in each field, because I sometimes find the definitions somewhat confusing.

For example, I keep version tracking for my ePUBs, which I try to improve over time.
Which field would be the most appropriate for that version info?

I use the date as the version number: 20101011 (made this year, on October the eleventh)