Configure metadata from pdf import by Calibre

iostrym · 04-02-2015, 06:06 PM

When a PDF is added to calibre, I would like a specific metadata field from the PDF to be read and stored in the ISBN metadata in calibre.
This is already done for EPUB, where the EPUB's ISBN is copied into calibre's ISBN metadata. I don't think this can be configured in calibre...
I'm using the ISBN as a version number to track the revisions of my PDFs and EPUBs. But maybe there is a better solution.

By the way, what rule does calibre currently follow when importing PDF metadata?

I noticed that:

the first word of the PDF subject plus the words in the PDF keywords are used as calibre tags. It is quite strange that the first word of the PDF subject is taken like that. Also, I can't manage to put several words in the PDF "keywords" field, because they are seen as a single keyword, even when separated with "," or ";" or " ".

ex:
toto, titi
toto; titi
etc... each time the corresponding calibre tag is "toto, titi" or "toto; titi"

[EDIT] My fault, it is my PDF editor (PDF-XChange Viewer) that adds " " before and after all keywords... I don't know why.

Is calibre's official behavior when importing PDF metadata documented somewhere?

Thanks a lot for your help.

BetterRed · 04-02-2015, 06:55 PM

@iostrym - you could use this optional plug-in ==>> Extract ISBN

The ISBN is not a discrete metadata property like Title, Author and Tags/Keywords/Subjects etc., so it's not easily obtained during the add process. The Extract ISBN plugin searches the book contents (IIRC the first few and last few pages) for something that resembles an ISBN. And I think it will search more than one format.

BR

Moderator Notice
I moved the thread to a more suitable sub-forum - BR

iostrym · 04-02-2015, 07:00 PM

Thanks for your answer. But what I'm actually trying to do is store a version number like "V1" in a metadata field of my PDF, and have calibre display that version number V1 in the ISBN column when it imports the PDF,
because I'm using the ISBN metadata as a version number.
For EPUB there is no problem, because the ISBN metadata is correctly imported from the EPUB, but for PDF this metadata field doesn't exist...

As for my keyword problem, it would be great if calibre could be configured so that not only "," but also ";" or other symbols can separate keywords, because my ### PDF editor will only separate keywords with ";"... but maybe there is a plugin for this...
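
To make the keyword request concrete, here is a minimal Python sketch of the kind of normalisation being asked for: splitting a PDF Keywords string on either "," or ";" and trimming each entry. It is not calibre code and says nothing about how calibre actually parses keywords. The function name `split_keywords` is hypothetical, and stripping stray double quotes is an assumption about what a PDF editor might add around keywords.

```python
import re
from typing import List


def split_keywords(raw: str) -> List[str]:
    """Split a PDF Keywords string on ',' or ';' and tidy each entry."""
    cleaned: List[str] = []
    for part in re.split(r"[,;]", raw):
        # Assumption: some editors pad keywords with spaces or double quotes.
        tag = part.strip().strip('"').strip()
        # Skip empty pieces and keep only the first occurrence of each tag.
        if tag and tag not in cleaned:
            cleaned.append(tag)
    return cleaned


print(split_keywords("toto; titi"))
print(split_keywords('" toto, titi "'))
```

A script like this could be run over the keywords before the PDF is added, so that whatever separator the PDF editor insists on ends up as plain comma-separated tags.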

BetterRed · 04-02-2015, 07:17 PM

The ISBN column is not a discrete column - it's an entry in the IDs column, which is a list. In that list an ISBN looks something like 'isbn:1234567890123', an Amazon ID looks something like 'asin:123456789', etc.

As I understand matters, the 'only sensible' way to get values into the IDs column is via a metadata download plugin, and the calibre IDs column implements the Dublin Core identifier element. I am not sure that PDF metadata is DC compliant unless it's embedded as XMP, and I suspect PDF-XChange doesn't 'support' XMP encoded metadata - at least not the free version.

Why don't you create a custom column for your version number? Using the built-in columns for other than their designated purposes (especially the IDs column) is a) not a good idea - if for no other reason than that it's usually a poor design decision - and b) unnecessary, given how easy it is to create your own custom columns.

BR

iostrym · 04-02-2015, 07:41 PM

Thanks a lot for taking the time to answer me. It seems I wasn't clear enough. My problem has nothing to do with downloading metadata from an ISBN value.

I'm editing PDFs and EPUBs outside calibre. When an EPUB or PDF is modified, I have to delete the old one in calibre and import the new one. So I would like all the metadata from the new PDF (tags, author and so on) to be automatically reimported into calibre.
In the EPUBs I write, I use the ISBN to store a version number, so that the ISBN from my EPUB is recognized in calibre. And it works.
Is there a way for calibre to do the same for PDF, given that there is no official ISBN metadata in PDF?
I can't find any documentation on how calibre imports metadata from a PDF.
And on the internet everyone is interested in getting calibre to write its metadata into a PDF, whereas I want to do exactly the opposite: configure the way calibre imports metadata from a PDF...

Is that clearer?

Regards

Regards

theducks · 04-02-2015, 10:01 PM

I think you can roll your own ID

Ver:100

The colon is the ID type separator.
You can only have one of any type in a book, but you can have many types assigned.
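
The "one value per type, many types per book" rule can be pictured as a simple mapping from type to value. The sketch below is only an illustration of that idea in Python, not calibre's implementation. The function name `parse_identifiers` is hypothetical, and lower-casing the type is an assumption made here so that "Ver" and "ver" count as the same type.

```python
from typing import Dict


def parse_identifiers(text: str) -> Dict[str, str]:
    """Parse comma-separated 'type:value' pairs into a dict keyed by type."""
    identifiers: Dict[str, str] = {}
    for item in text.split(","):
        # Split on the first colon only, so a value may itself contain colons.
        id_type, sep, value = item.strip().partition(":")
        if not sep:
            continue  # no colon, so this is not a type:value pair
        # Keying by type means a later entry replaces an earlier one of the same type.
        identifiers[id_type.strip().lower()] = value.strip()
    return identifiers


print(parse_identifiers("isbn:1234567890123, Ver:100, ver:101"))
```

Because the mapping is keyed by type, adding a second "ver" entry replaces the first, while "isbn" and "ver" coexist.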

BetterRed · 04-02-2015, 10:18 PM

@iostrym - I need to apologise to both yourself and Tracker Systems. PDF-XChange (free) does support XMP metadata - in the sense that it can display it. I didn't click enough buttons.

I checked a number of my PDFs, and none have any dc:identifier elements - although they have other dc elements such as creator, language, title.

Try this - in Edit Metadata select the PDF format and click the blue button. It does update other columns, but as I don't have any PDFs with identifiers...

[Attached image: Capture.JPG]

Also see ==>> ebook-meta and calibredb set_metadata commands.

Curious - what are you using to update the dc:identifiers in the PDF?

BR

iostrym · 04-03-2015, 03:30 AM

I don't use DC identifier metadata in my PDFs. Are you saying that I could automatically fill a column in calibre from DC identifier metadata (XMP metadata) written by a PDF editor?
Currently I put the version (e.g. v1) in the PDF's subject metadata (first line of the subject), then import the PDF into calibre and manually cut and paste the version (v1) from the subject into the identifier metadata in calibre (isbn:v1).
Just to be clear, my final goal is to have nothing to do manually in the calibre GUI after importing all my PDFs and EPUBs, and to display a column with a version number.

BetterRed · 04-03-2015, 05:26 AM

I don't understand why you want to carry it as a badly formed ISBN in the IDs - what are the advantages?

ISBNs have an internationally agreed format, they are issued by the appropriate national authorities, and if you click on them in Book Details they do a lookup on the Online Computer Library Center's WorldCat site.

If you want your version number represented as an identifier, then why not label it as myversion:? But I can't see the advantages of that over a simple custom column called myversion - it's much easier to display in the Book List, book jacket, catalogue etc. than using a template every time to extract it from the identifiers list.

IMO the only ways you might achieve your long-term goals are a) to use the command line programs and some scripting, or b) to create your own version of calibre from source.
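
As a rough idea of what option a) could look like, here is a minimal Python sketch that takes the first line of a PDF's Subject text (where the version is kept in this thread) and builds a `calibredb set_custom` call to write it into a custom column. Several things are assumptions rather than anything confirmed in the thread: the custom column is assumed to have the lookup label `myversion`, the version is assumed to look like `v1` or `V2.3`, and the helper names are hypothetical. How the Subject text is read out of the PDF, and how the calibre book id is obtained after adding, is left to the surrounding script.

```python
import re
import subprocess
from typing import List, Optional

# Assumption: versions look like "v1", "V2" or "v2.3".
VERSION_RE = re.compile(r"[vV]\d+(?:\.\d+)*")


def version_from_subject(subject: str) -> Optional[str]:
    """Return the first line of a PDF Subject if it looks like a version tag."""
    lines = subject.strip().splitlines()
    first_line = lines[0].strip() if lines else ""
    return first_line if VERSION_RE.fullmatch(first_line) else None


def set_custom_command(book_id: int, version: str, column: str = "myversion") -> List[str]:
    """Build the calibredb call that writes the version into a custom column."""
    return ["calibredb", "set_custom", column, str(book_id), version]


def apply_version(book_id: int, subject: str) -> bool:
    """Run calibredb for one book; returns False when no version was found."""
    version = version_from_subject(subject)
    if version is None:
        return False
    subprocess.run(set_custom_command(book_id, version), check=True)
    return True


print(version_from_subject("v1\nA guide to something"))
print(set_custom_command(42, "v1"))
```

Only `apply_version` touches calibre. The other two functions are pure, so the parsing and the command line can be checked before anything is written to the library.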

BR

iostrym · 04-03-2015, 10:53 AM

Thanks for helping.

I use the ISBN because when creating an EPUB with write2epub I can write the version in the ISBN metadata, and it is automatically recognized by calibre if I display the ISBN column. Also, these are my own books and I don't use the ISBN feature at all. That's why I was focused on ISBN metadata for PDF.

But I could use something else (a simple custom column called myversion) if I knew how to have calibre fill this column automatically when importing an EPUB or a PDF (from a version stored in the PDF or EPUB metadata).

- I won't create a new version of calibre.

- Why not use the command line programs and some scripting? For example: I add a lot of PDFs and then run a Tcl script that refreshes the calibre metadata from the imported PDFs.

Let's ignore the EPUB version because it works simply using the ISBN.

For PDF you sent me some TCL commands, but I don't know if I can use them because:
ebook-meta: reads only a selection of metadata, so no "custom" metadata
set_metadata: writes metadata to the OPF file (calibre metadata), but this feature is already included in the ebook-meta command.

You were speaking about XMP metadata in PDF. Can calibre read it, even if it is custom metadata?

BetterRed · 04-03-2015, 05:01 PM

I mentioned the 2 command line programs to remind you of their existence - only you can determine if they're relevant to your workflow.

As far as I can tell PDF-XChange can display the XMP metadata, see File->Document Properties->Additional Metadata->Advanced. The DC elements are in the http://purl.org/dc/elements/1.1 section. But I don't think it can change or add new elements directly. Given PDF and XMP are Adobe creations, I would assume one or more of their tools can populate any DC element in the PDF.

I'm not sure if calibre reads the XMP data in the PDF when it's adding books. Why don't you do a test:

add some Identifiers to a library book that has a PDF,
use calibre Embed Metadata (ctrl/e) to embed the metadata in the book,
verify it's there with PDF XChange,
save the PDF somewhere,
delete the book from calibre library,
add the saved PDF with Add options set to extract metadata from file

Are the Identifiers there?
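
For the "verify it's there" step, a scripted check is also possible once the XMP packet has been pulled out of the PDF as text. The following is a minimal Python sketch that lists any dc:identifier values in such a packet. The sample packet is invented for illustration and is not what calibre or any particular PDF editor writes, and `dc_identifiers` is a hypothetical helper name. Whether calibre reads these values back on import is exactly what the test above is meant to find out.

```python
import xml.etree.ElementTree as ET
from typing import List

# Dublin Core element namespace, as shown in the XMP viewer.
DC_NS = "http://purl.org/dc/elements/1.1/"

# Hypothetical XMP packet, made up for this example.
SAMPLE_XMP = """<x:xmpmeta xmlns:x="adobe:ns:meta/">
  <rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
    <rdf:Description rdf:about="" xmlns:dc="http://purl.org/dc/elements/1.1/">
      <dc:title><rdf:Alt><rdf:li xml:lang="x-default">Robinson Crusoe</rdf:li></rdf:Alt></dc:title>
      <dc:identifier>myversion:v1</dc:identifier>
    </rdf:Description>
  </rdf:RDF>
</x:xmpmeta>"""


def dc_identifiers(xmp_packet: str) -> List[str]:
    """Return the text of every dc:identifier element in an XMP packet."""
    root = ET.fromstring(xmp_packet)
    tag = "{%s}identifier" % DC_NS
    return [el.text.strip() for el in root.iter(tag) if el.text and el.text.strip()]


print(dc_identifiers(SAMPLE_XMP))
```

An empty list means the packet has no dc:identifier element at all.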

I think calibre will embed custom columns as XMP in a PDF, in a calibre section. But I don't know if it will read them back when a book is added - I suspect not, because everyone's custom columns are different. Again, why don't you try it out?

If you were to put your version number into say the Title eg "Robinson Crusoe : Version 123" then after adding books you could use Bulk Metadata Edit: Search and Replace: Regular Expression facilities to extract the version number from the Title, put it in a custom column and remove it from Title - S&R definitions can be saved, to facilitate reuse. That's how many (most) people tackle this sort of issue.
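
To show what such a regular expression might look like, here is a small Python sketch that splits a title of the form "Robinson Crusoe : Version 123" into the bare title and the version number. The pattern is only one possible choice and assumes the " : Version N" suffix is always written this way. It has not been tried in the Search and Replace dialog itself, and `split_title` is a hypothetical helper used only to exercise the pattern.

```python
import re
from typing import Optional, Tuple

# Lazy title, then " : Version <digits>" anchored at the end of the string.
TITLE_WITH_VERSION = re.compile(
    r"^(?P<title>.*?)\s*:\s*Version\s+(?P<version>\d+)\s*$"
)


def split_title(raw_title: str) -> Tuple[str, Optional[str]]:
    """Return (title, version); version is None when no suffix is present."""
    match = TITLE_WITH_VERSION.match(raw_title)
    if match is None:
        return raw_title, None
    return match.group("title"), match.group("version")


print(split_title("Robinson Crusoe : Version 123"))
print(split_title("Robinson Crusoe"))
```

Anchoring the version at the end of the string keeps titles that contain their own colon intact, since only the final " : Version N" part is peeled off.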

BR

iostrym · 04-04-2015, 02:12 AM

Thanks a lot. I will look into your title solution. Maybe it will work without search and replace, as for ISBN:v1, where calibre is able to extract v1 automatically into an ISBN column. I will try.

The XMP proposal is also interesting, I will have a look.

By the way, I posted here because I tried a lot of PDF imports (without XMP) and noticed strange behavior, especially with tags, and I was wondering how the import is done and whether it is configurable or documented somewhere. For example, tags separated by ';' are not recognized, the first line of the subject is also seen as a tag, etc...

As calibre is able to detect 'lines' in the subject of a PDF, it would be awesome if the first line were the version, the second line the publication date, the other lines the real subject, etc... This way standard PDF metadata would be used, editable everywhere and easily visible from Windows...

BetterRed · 04-04-2015, 02:42 AM

@iostrym - calibre has a vast array of options so it's impossible to remember them all, even Kovid Goyal sometimes misremembers or forgets something.

I just discovered that if you're prepared to put the version number in the file name, then you can probably put it in the ISBN without using Bulk S&R

[Attached image: Capture.JPG]
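
The attached screenshot is not reproduced here, so the exact pattern used is unknown. As a hedged illustration of the idea, the Python sketch below pulls a title, an author and a version out of a file name with named groups. The file-name layout "Title - Author - v3.pdf", the group name `isbn` for the version, and the helper `metadata_from_filename` are all assumptions made for this example, and whether calibre accepts a value like "v3" in an ISBN group has not been verified.

```python
import re
from pathlib import PurePath
from typing import Dict

# Assumed layout: "<title> - <author> - <version>", version like "v3".
FILENAME_PATTERN = re.compile(
    r"^(?P<title>.+?) - (?P<author>.+?) - (?P<isbn>[vV]\d+)$"
)


def metadata_from_filename(path: str) -> Dict[str, str]:
    """Return the named groups matched in the file name (without extension)."""
    stem = PurePath(path).stem
    match = FILENAME_PATTERN.match(stem)
    return match.groupdict() if match else {}


print(metadata_from_filename("My Book - Jane Doe - v3.pdf"))
```

Testing a pattern like this on a handful of real file names first avoids importing a batch of books with the wrong fields filled in.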

BR

iostrym · 04-04-2015, 05:41 PM

Hi,

I did some tests with exiftool, calibre, Adobe Reader and PDF-XChange.

The strange behavior I had with tags is related to PDF-XChange. If I open a PDF-XChange PDF with Adobe Reader and save it, then the import into calibre is OK.

No tag problem.

I did a diff between the XMP metadata (extracted with exiftool) and the result is:

PDF that imports badly: linearized (no) and XMP Toolkit = XMP Core 4.1.1
PDF that imports OK: linearized (yes) and XMP Toolkit = Adobe XMP Core 5.4

By the way, the solution using the version in the filename is not great, because then the metadata are no longer used. So it is not a good solution.

Just to be sure: could custom "http://calibre-ebook.com/xmp-namespace" metadata added to the PDF be read back by calibre during import? That was your proposal, but it seems hard to do with a free PDF reader.

BetterRed · 04-04-2015, 06:40 PM

@iostrym - thanks for doing the tests. Sadly I think I'm out of ideas.

If it's any consolation, there have been a number of queries recently that could have been addressed more easily if metadata could be extracted from the file name via a regular expression AND from within the format file, where the file name extraction would take precedence.

If you want to make a formal request for such an enhancement you can do that here ==>> Bugs : calibre. If you do, it might be useful to include a link to this thread.

I can't assess if such an enhancement is feasible. However, when a CBZ is added the textual metadata can only be extracted from the file name, but the cover is always extracted from the file (1st image). So the rudiments would appear to be there.

Re your last point - it's probably unrealistic to expect calibre metadata to be written by anything but calibre - unless you want to do it by hand in a text editor.

BR
