Filename to Metadata


simobk
05-20-2013, 06:11 AM
Hi all,

OK, I've been using PDF as my ebook format until now. I read a bit and decided to start using ePub from now on.

I name ALL my ebooks : Author - Title (year).pdf

Is there any way to generate the metadata from the file name?

For example, I had written this little script to do this with my PDFs:

// Get the current file name
var fullName = this.documentFileName;
// Locate the separators once; use the last " (" so a title may contain parentheses
var dash = fullName.indexOf(" - ");
var paren = fullName.lastIndexOf(" (");
// Extract author, title and year
var author = fullName.slice(0, dash);
var title = fullName.slice(dash + 3, paren);
var year = fullName.slice(paren + 2, fullName.lastIndexOf(")"));
// Insert metadata (the year is extracted but not written to any field here)
this.info.Author = author;
this.info.Title = title;
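
Outside Acrobat, the same split could be sketched as a plain function that works on any file name string. This is only an illustration under the "Author - Title (year).ext" naming convention described above; parseBookFileName is a hypothetical name, and a four-digit year is an assumption.

```javascript
// Hypothetical helper: split "Author - Title (year).ext" into its parts.
// Returns null when the name does not follow that convention.
function parseBookFileName(fileName) {
  // author: lazy, up to the first " - "; title: greedy, up to the last " (";
  // year: four digits inside the final parentheses; then any extension.
  const match = /^(.+?) - (.+) \((\d{4})\)\.[^.]+$/.exec(fileName);
  if (match === null) {
    return null;
  }
  return { author: match[1], title: match[2], year: match[3] };
}

const parsed = parseBookFileName("Arthur Conan Doyle - A Study in Scarlet (1887).epub");
console.log(parsed.author, "|", parsed.title, "|", parsed.year);
```

Returning null for names that do not match makes it easy to list the files that need manual attention instead of writing wrong metadata into them.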


In case there is no such tool, I played a bit with the ePub files and realized they are actually ZIP files. After extracting one, I found the content.opf file, which is actually an XML file.

I should be able to write myself a little script that changes the metadata for me. I just want confirmation from the more experienced users about this:

1. Is content.opf the only file to edit?
2. Is all the metadata contained in the <metadata> tag?
3. I am under the impression that the only "standard" tags are the ones starting with <dc:...> and that everything else is editor-specific. Can you confirm?
4. Is there always a cover.jpg inside the files?


Thanks for any and all help!

Simo

Toxaris
05-20-2013, 06:52 AM
Ad 1: yes
Ad 2: yes
Ad 3: not quite; there are more. The dc: elements come from Dublin Core, so check that for the full specs. You can also check the IDPF site for the official ePUB specs. I would advise the ePUB2 version, as that is the one generally used.
Ad 4: no, that is not required. Only title, identifier and language are strictly required.
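
One caveat on "Ad 1", going by the ePUB container layout rather than anything stated in this thread: the package file is not guaranteed to be called content.opf. Its path is declared in META-INF/container.xml, so a script would probably want to look it up instead of hard-coding the name. A minimal sketch, where findOpfPath and the sample XML are made up for illustration:

```javascript
// Hypothetical helper: pull the package (OPF) path out of the text of
// META-INF/container.xml. A regex is a shortcut here, not a real XML parser.
function findOpfPath(containerXml) {
  // \b keeps this from matching the enclosing <rootfiles> element.
  const rootfile = /<rootfile\b[^>]*>/.exec(containerXml);
  if (rootfile === null) return null;
  const fullPath = /\bfull-path\s*=\s*(["'])(.*?)\1/.exec(rootfile[0]);
  return fullPath === null ? null : fullPath[2];
}

const sampleContainer = `<?xml version="1.0"?>
<container version="1.0" xmlns="urn:oasis:names:tc:opendocument:xmlns:container">
  <rootfiles>
    <rootfile full-path="OEBPS/content.opf" media-type="application/oebps-package+xml"/>
  </rootfiles>
</container>`;

console.log(findOpfPath(sampleContainer)); // OEBPS/content.opf
```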

Be aware that it is a special zip file. Packing must be done according to certain rules, or the result will no longer be an ePUB file.
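
The rules are not spelled out in this thread. As far as the OCF container spec goes, the main one is that the first entry in the archive must be a file named mimetype, stored without compression, containing exactly application/epub+zip. A rough sketch of a check for that, reading the fixed offsets of the ZIP local file header; hasEpubMimetypeEntry is a hypothetical name, and the function only inspects the first entry, not the rest of the archive:

```javascript
// Check the first ZIP entry of an ePub: it should be an uncompressed file
// named "mimetype" whose content is "application/epub+zip".
// `bytes` is a Uint8Array (or Node.js Buffer) holding the start of the file.
function hasEpubMimetypeEntry(bytes) {
  const expectedName = "mimetype";
  const expectedType = "application/epub+zip";
  if (bytes.length < 30 + expectedName.length + expectedType.length) return false;

  const view = new DataView(bytes.buffer, bytes.byteOffset, bytes.byteLength);
  const ascii = (start, length) =>
    String.fromCharCode(...bytes.subarray(start, start + length));

  const isLocalHeader = view.getUint32(0, true) === 0x04034b50; // "PK\x03\x04"
  const isStored = view.getUint16(8, true) === 0; // method 0 = no compression
  const nameLength = view.getUint16(26, true);
  const extraLength = view.getUint16(28, true);
  const dataStart = 30 + nameLength + extraLength; // file data follows the header

  return isLocalHeader && isStored &&
    ascii(30, nameLength) === expectedName &&
    ascii(dataStart, expectedType.length) === expectedType;
}
```

In Node.js the bytes could come from reading the .epub file into a Buffer and passing it in directly. A script that repacks an extracted book would need its ZIP tool to write mimetype first and uncompressed for this check to pass.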

simobk
05-20-2013, 07:25 AM
Hi Toxaris and thanks for the answer. :)

First of all, I guess the fact that you're answering these questions means that there is no automated tool yet :)

I googled a bit and read a bit more (http://www.hxa.name/articles/content/epub-guide_hxa7241_2007.html#contentopf). This is what I came up with as "correct" metadata containing the fields I am interested in:
<metadata xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:dcterms="http://purl.org/dc/terms/" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:opf="http://www.idpf.org/2007/opf">
<dc:creator opf:role="aut" opf:file-as="Arthur Conan Doyle">Arthur Conan Doyle</dc:creator>
<dc:title opf:file-as="A Study in Scarlet">A Study in Scarlet</dc:title>
<dc:date>1887</dc:date>
<dc:subject>Detective, Crime, Mystery, Novel</dc:subject>
<dc:description>A Study in Scarlet is a detective mystery novel written by Sir Arthur Conan Doyle, introducing his new character of Sherlock Holmes, who later became one of the most famous literary detective characters.</dc:description>
<dc:language>en</dc:language>
<dc:identifier id="BookId">urn:uuid:9ef8ecb0-c134-11e2-8b8b-0800200c9a66</dc:identifier>
</metadata>

Does that look like correct metadata to you? I'm especially wondering whether I am putting the right things in subject and description. I also couldn't find a "standard" separator for the subjects.
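
On the separator question, the OPF 2.0.1 spec allows dc:subject to be repeated, so one element per keyword is an alternative to a comma-separated string. The sketch below generates a block along the lines of the one above from a plain object; buildMetadata, escapeXml and the shape of the book object are assumptions for illustration, and the output keeps only the fields discussed here.

```javascript
// Escape the characters that are special in XML text and attribute values.
function escapeXml(text) {
  return String(text)
    .replace(/&/g, "&amp;")
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;")
    .replace(/"/g, "&quot;");
}

// Hypothetical helper: build a <metadata> block from a plain object.
// `book.subjects` is an array; each entry becomes its own <dc:subject>.
function buildMetadata(book) {
  const lines = [
    '<metadata xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:opf="http://www.idpf.org/2007/opf">',
    `<dc:creator opf:role="aut">${escapeXml(book.author)}</dc:creator>`,
    `<dc:title>${escapeXml(book.title)}</dc:title>`,
    `<dc:date>${escapeXml(book.year)}</dc:date>`,
    ...book.subjects.map((s) => `<dc:subject>${escapeXml(s)}</dc:subject>`),
    `<dc:language>${escapeXml(book.language)}</dc:language>`,
    `<dc:identifier id="BookId">${escapeXml(book.identifier)}</dc:identifier>`,
    "</metadata>",
  ];
  return lines.join("\n");
}

console.log(buildMetadata({
  author: "Arthur Conan Doyle",
  title: "A Study in Scarlet",
  year: "1887",
  subjects: ["Detective", "Crime", "Mystery", "Novel"],
  language: "en",
  identifier: "urn:uuid:9ef8ecb0-c134-11e2-8b8b-0800200c9a66",
}));
```

Escaping matters because titles taken from file names can contain an ampersand, which would otherwise make the OPF file invalid XML.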

I will take a look later at the ePub 2.0.1 specs (http://www.idpf.org/epub/20/spec/OPF_2.0.1_draft.htm). I'll try to find a more "to the point" source though! :D

"Ad 4: no, that is not required. Only title, identifier and language are strictly required."
I guess that means I'll have to use a conditional: if there's a "cover.jpg", use it as the cover; otherwise, use the first image that appears. I'll read more about it later, as I am under the impression that there needs to be an associated XHTML file.
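
That conditional could be sketched roughly as below. pickCover is a hypothetical helper that takes the list of file paths found inside the archive; "first image" here means first in the order of that list, which is an assumption and may differ from the first image in reading order.

```javascript
// Hypothetical helper: choose a cover from the paths inside the archive.
// Prefers a file named cover.jpg (any folder, any letter case), otherwise
// falls back to the first image in the order given, or null if there is none.
function pickCover(paths) {
  const baseName = (p) => p.slice(p.lastIndexOf("/") + 1).toLowerCase();
  const named = paths.find((p) => baseName(p) === "cover.jpg");
  if (named !== undefined) return named;
  const firstImage = paths.find((p) => /\.(jpe?g|png|gif)$/i.test(p));
  return firstImage === undefined ? null : firstImage;
}

console.log(pickCover(["OEBPS/ch1.xhtml", "OEBPS/images/fig1.png", "OEBPS/images/cover.jpg"]));
// OEBPS/images/cover.jpg
```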

"Packing must be done according to certain rules, or the result will no longer be an ePUB file."
Care to elaborate?

There goes my hope for a "quick" solution! :D

pdurrant
05-20-2013, 08:36 AM
I'm not quite certain what it is you want to do. If your current books are PDFs, converting them to ePubs is going to be error-prone. The metadata will be the least of your worries!

However, I believe that it might be possible to configure calibre to extract metadata from imported file names.

But if you want to create your own ePubs using a tool that you write, you're going to need to delve into the specifics of the ePub format, and the best place for that is the IDPF website, since they originated the specifications for ePub.

Toxaris
05-20-2013, 10:00 AM
You can always load your ePUB in Sigil. It has a simple metadata editor, and it also lets you set the cover correctly. If no cover is specified, a lot of readers will take the first page as the cover.

Pdurrant is right: there is no good tool to convert from PDF to ePUB, only a lot of mediocre ones. Depending on the PDF, expect a lot of post-processing to clean everything up.

simobk
05-20-2013, 10:07 AM
I do understand the difference between the formats, so no, I am not converting PDFs to ePubs. I am redownloading all of them little by little (most of them are books over 100 years old that are no longer under copyright).

Thank you for your answers... I played with some files, and I realized it is way too complicated for a script, as not all the metadata can be in the file name. I guess I will end up doing it manually over a few weeks. :)