Extra metadata import from ODT - Page 2

olig · 07-27-2012, 01:17 PM

Something does not work... looking into it.

olig · 07-27-2012, 01:18 PM

Ah, easy, you killed the 'return mi' at the end of get_metadata, perhaps because you put it into read_cover.

olig · 07-27-2012, 01:44 PM

I fixed it in my branch and optimized it a bit (no read_cover if opf.nocover).

kovidgoyal · 07-27-2012, 03:12 PM

merged.

olig · 07-28-2012, 03:12 AM

I started to work on conversion quirks. One particular problem right now is that a conversion duplicates the cover image.

My call:

Code:

ebook-convert book.odt book.epub --output-profile sony300 \
  --preserve-cover-aspect-ratio \
  --no-svg-cover --no-default-epub-cover \
  --filter-css margin-top,margin-right,margin-left,margin-bottom,position,top,width

This results in an extra titlepage.xhtml generated by the conversion, right in front of the title page that is already in the converted XHTML of the ODT document.

I suspect I caused this by returning the cover image in get_metadata. The only logical way to work around this would be to strip the cover image markup from the source in the input process. This would imply that a cover image must not be in the markup of the book.

Or is there something I am missing to work around it?

kovidgoyal · 07-28-2012, 03:30 AM

There is no general fix for this. You can always end up with duplicated images when converting formats that don't have the concept of a cover. I have committed an (untested) fix to change the ODT input plugin to not use the first image as cover. However, you can still get duplicated images if, for instance, a user adds an ODT to calibre, which takes its first image as the cover, and then converts, which sets that extracted first image as a cover. For these kinds of problems, there is the --remove-first-image option.

olig · 07-28-2012, 03:45 AM

Hmmm... if I understand it right, your change does call get_metadata with extract_cover set to False. But this still returns the cover href (as I programmed odt get_metadata) and only does not return cover_data. A test also shows that it changes nothing.

Here the first question is: if extract_cover is False, should get_metadata return neither cover nor cover_data? If yes, I need to change this.

But: I want it to have the cover href, so it is set as cover in the content.opf. So inhibiting the detection might be the wrong way. Except if metadata import/convert is separated from content import/convert (sorry, I'm not yet that deep into the code).

I still think selective removal of the detected cover image would be the better way to do it. I can also exactly identify this image in the ODT source, so I could also remove exactly this image.

olig · 07-28-2012, 03:56 AM

The option text is interesting:

Quote:

--remove-first-image

Remove the first image from the input ebook. Useful if the input document has a cover image that is not identified as a cover. In this case, if you set a cover in calibre, the output document will end up with two cover images if you do not specify this option.

So that implies that if the cover image is detected (it is, at least in get_metadata) there should be no duplication. So the question is: if the cover image is detected, who removes it from the resulting XHTML in the input process (if it is not just meta information but real markup)?

olig · 07-28-2012, 04:28 AM

Quote:

Originally Posted by olig

Hmmm... if I understand it right, your change does call get_metadata with extract_cover set to False. But this still returns the cover href (as I programmed odt get_metadata) and only does not return cover_data. A test also shows that it changes nothing.

I read some more code... and it still looks wrong: as I understand it, it removes the cover from the data that is given to OPFCreator. As a result, I suspect it disappears from the OPF, not from the markup. This is exactly what I don't want.

The option that would help me is --dont-create-extra-titlepage. This gives much better control over what the result should be than --remove-first-image.

In retrospect I don't even know what the transform stage of the conversion pipeline expects to get from the input plugin. As I understand it, it should be defined whether a cover image belongs in two places or only in one (the two places are the metadata and the actual markup).

As it is currently in two places, it is no wonder that the process creates another cover, as it seems to think: hey, there is a cover image in the metadata, it is certainly not in the markup, so I need to add it. From this it seems to me that the convert pipeline does not expect the cover to be in the markup if it is in the metadata.

On the other hand, this takes away control, because you can't put your cover on the second page (I don't really know if somebody wants to do this, it's hypothetical). So in my eyes you need to be able to forbid the creation of a titlepage altogether.
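
A toy model of the behavior described above, to make the reasoning concrete. This is only a sketch of how the pipeline appears to behave from the outside, not calibre's code: the `Book` class, `add_titlepage` and the `create_titlepage` switch are all hypothetical names made up for illustration.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Book:
    """Hypothetical stand-in for what an input plugin hands to the pipeline."""
    spine: List[str] = field(default_factory=list)  # pages in reading order
    cover_href: Optional[str] = None                 # cover named in the metadata


def add_titlepage(book: Book, create_titlepage: bool = True) -> Book:
    """Prepend a generated title page whenever the metadata names a cover.

    The markup is never inspected, so a cover that is also embedded in the
    first content page ends up being shown twice.
    """
    if book.cover_href and create_titlepage:
        book.spine.insert(0, "titlepage.xhtml")
    return book


book = add_titlepage(Book(spine=["index.xhtml"], cover_href="Pictures/cover.jpg"))
print(book.spine)
```

With a cover in the metadata the generated page is always added in front, whether or not index.xhtml already shows the same image. The hypothetical `create_titlepage=False` corresponds to the kind of switch asked for above.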

olig · 07-28-2012, 05:32 AM

FYI: I think there is something missing in customize.builtins or metadata.odt directly so that the quick_metadata hack works.

But I still don't think that this is the solution.

Looking into oeb.transform.cover it seems to me that as soon as there is a cover href set in the metadata, the titlepage will be generated. And without insert_cover being executed, there will be no OPF meta cover set to the item that is the cover image. Is this correct?

I see a difference between marking an image as cover in the metadata and adding content to a document.

BTW: I'm just rattling off my thoughts here. Please don't see this as urging you to fix something, I'm happy to do the coding as soon as I'm sure how it is supposed to be.

olig · 07-28-2012, 07:50 AM

A bit more code reading: a conversion to ePub works differently than, for example, one to MOBI (hah, surprise!).

So there is not really one place to remove the extra titlepage (as I understand it, MOBI always displays the cover first, without extra markup, and this titlepage.xhtml is only generated in the ePub generation).

So removing the Image (Frame, Paragraph) from the ODT seems to be the most compatible way of handling this.

I added code for the image removal. It's a very easy solution, I just need to remove the parent text element from the document. After this everything looks like I expect it both with MOBI and ePub.

But it's not committed yet, I have to do some more tests to make this more robust.

One question: is it ok to use the mi object to pass the frame id from get_metadata back to the odt import? I would set it as mi._odf_cover_frame

kovidgoyal · 07-28-2012, 09:07 AM

The conversion pipeline expects the input plugin to supply it a book with a cover that is not part of the spine. For formats like ODT where there is no concept of cover, the input plugin has to guess. In theory the input plugin could set the cover and remove the image from the content, but this is wrong, because when it gets it wrong (i.e. the first image is not a cover) it can result in data loss instead of simple duplication. Which is why there is a --remove-first-image option, which the user can set after verifying manually that the first image is indeed a cover and should be removed.

So, in short, there is no way to build a robust automated solution for the general case. In your specific case, you can have the input plugin remove the first image if it is specifically identified as the cover via the custom opf.cover metadata. This is the approach that the epub input plugin takes. EPUB is also in the situation where it may or may not have a well defined cover.
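
A minimal sketch of that conservative rule, using only the Python standard library and not calibre's or odfpy's actual code. It assumes the cover frame is a `draw:frame` with a known `draw:name` sitting directly inside a `text:p`; the function name `remove_cover_frame` is made up for illustration. Nothing is removed unless a frame name was explicitly supplied.

```python
import xml.etree.ElementTree as ET
from typing import Optional

# Standard ODF namespace URIs.
DRAW = "urn:oasis:names:tc:opendocument:xmlns:drawing:1.0"
TEXT = "urn:oasis:names:tc:opendocument:xmlns:text:1.0"


def remove_cover_frame(content_xml: str, frame_name: Optional[str]) -> str:
    """Drop the paragraph that holds the frame called frame_name.

    With frame_name=None (no explicit cover metadata) the input is returned
    untouched, so a wrong guess can never delete content.
    """
    if frame_name is None:
        return content_xml
    root = ET.fromstring(content_xml)
    # ElementTree has no parent pointers, so build a child -> parent map.
    parent_of = {child: parent for parent in root.iter() for child in parent}
    for frame in root.iter(f"{{{DRAW}}}frame"):
        if frame.get(f"{{{DRAW}}}name") != frame_name:
            continue
        para = parent_of.get(frame)
        if para is None or para.tag != f"{{{TEXT}}}p":
            break  # unexpected structure: leave the document alone
        container = parent_of.get(para)
        if container is not None:
            container.remove(para)
        break  # at most one frame is removed
    return ET.tostring(root, encoding="unicode")
```

Note that removing the whole paragraph also drops any other text it contains, which is why more tests with mixed paragraphs would be needed before relying on something like this.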

olig · 07-28-2012, 12:38 PM

Ok, sounds reasonable.

I need a clean way to transport the frame name from get_metadata to the odt input function. But custom attrs in Metadata are not copied by smart_update in meta._get_metadata. Is there a way I do not see?

kovidgoyal · 07-28-2012, 12:47 PM

Use get_metadata from the metadata/odt.py directly instead of the full get_metadata(). Note that if you do this, make sure you set the title and author of the mi object to some reasonable values if they are not set by get_metadata().
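
The "set reasonable values" part could look roughly like this. It is only a sketch with a stand-in object: `Meta` is not calibre's metadata class, and `fill_defaults`, the `odf_cover_frame` attribute and the "Unknown" fallback are assumptions made up for illustration.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Meta:
    """Hypothetical stand-in for the metadata object; not calibre's class."""
    title: Optional[str] = None
    authors: List[str] = field(default_factory=list)
    odf_cover_frame: Optional[str] = None  # extra attribute carried along


def fill_defaults(mi: Meta, fallback_title: str) -> Meta:
    """Make sure title and authors are usable when the reader left them unset."""
    if not mi.title or not mi.title.strip():
        mi.title = fallback_title
    if not mi.authors:
        mi.authors = ["Unknown"]  # assumed placeholder value
    return mi


# The format-specific reader returned only the frame name in this example.
mi = fill_defaults(Meta(odf_cover_frame="cover"), fallback_title="book")
print(mi.title, mi.authors, mi.odf_cover_frame)
```

Because the object comes straight from the format-specific reader and is not copied, the extra attribute survives and can be read by the input code.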

olig · 07-28-2012, 01:59 PM

Ok. It works, but I need to do some more tests, with different paragraph mixes, to be sure to catch all cases.

BTW: While doing my tests the most annoying thing was lots of 'Unknown' lines in the MOBI conversions. Line 318 in ebooks/mobi/utils.py is the reason that every empty paragraph in my source gets replaced by 'Unknown'. Well, at least this beast does the replacing; there has to be a place that calls it for every string. Perhaps there should be a switch for this, so that Unknown is only substituted for meaningful tags like title?

Edit: it's the call to utf8_text in ebooks/mobi/writer2/serializer.py:383

Edit 2: As I read the commit that changed this (#12785), it is not about empty strings, but only about accented characters. So perhaps the best solution would be to add an empty keyword to utf8_text that defaults to False and make the replacement with Unknown depend on it.

Edit 3: Fixed this in my branch with rev 12795.
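
A sketch of what such a keyword could look like. This is a simplified stand-in written for Python 3, not the function from ebooks/mobi/utils.py and not the change from the branch; the NFC normalization and the plain "Unknown" literal are assumptions.

```python
import unicodedata


def utf8_text(text, empty=False):
    """Return normalized UTF-8 bytes for text.

    With empty=False (the default) blank or missing input becomes b"Unknown".
    With empty=True blank or missing input becomes b"".
    """
    if text and text.strip():
        if isinstance(text, bytes):
            text = text.decode("utf-8", "replace")
        return unicodedata.normalize("NFC", text.strip()).encode("utf-8")
    return b"" if empty else "Unknown".encode("utf-8")


print(utf8_text(""))              # placeholder for fields like title
print(utf8_text("", empty=True))  # empty paragraphs stay empty
```

Callers that serialize body text would pass empty=True, while metadata fields such as the title would keep the default.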
