Issue importing html zip archives and metadata parsing

KevinH · 12-24-2010, 01:29 PM

Hi,

I am using the latest calibre (downloaded and installed today). I seem to be having trouble when importing a zip archive that has the following contents

zip archive contents:

book.html
style.css
img/*.jpeg

When I import it as a .zip archive I get no metadata read from the html file at all.

When I rename the .zip to .htmlz, I again get no metadata read from the html file at all.

If I unzip it manually and then import book.html, everything works just fine (the metadata is recognized).

I am designing a file conversion import plugin and I was trying to pass the output of the file plugin as a zip archive and wanted to manually test what happens when I do that.

Is there some format or special file names I need to use in creating a zip archive so that upon importing it the html file is parsed properly for metadata?

Thanks,

KevinH

kovidgoyal · 12-24-2010, 01:31 PM

Add an OPF file to the zip with the metadata.
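A minimal sketch of that suggestion in Python, assuming the OPF is simply written into the existing archive under the name metadata.opf (the name used later in this thread). The helpers build_opf and add_opf_to_zip are hypothetical names for illustration, not calibre functions, and only the standard library is used.

```python
import zipfile
from xml.sax.saxutils import escape

# Bare-bones OPF holding only metadata; the element set is an assumption
# modelled on the metadata.opf posted later in this thread.
OPF_TEMPLATE = """<?xml version='1.0' encoding='utf-8'?>
<package xmlns="http://www.idpf.org/2007/opf" unique-identifier="book_id" version="2.0">
   <metadata xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:opf="http://www.idpf.org/2007/opf">
      <dc:identifier opf:scheme="GUID" id="book_id">{ident}</dc:identifier>
      <dc:title>{title}</dc:title>
      <dc:creator opf:role="aut">{author}</dc:creator>
      <dc:language>{language}</dc:language>
   </metadata>
</package>
"""


def build_opf(title, author, ident, language="en"):
    """Return OPF text with the given values XML-escaped."""
    return OPF_TEMPLATE.format(
        title=escape(title),
        author=escape(author),
        ident=escape(ident),
        language=escape(language),
    )


def add_opf_to_zip(zip_path, opf_text, opf_name="metadata.opf"):
    """Append the OPF to an existing zip, refusing to add a second copy."""
    with zipfile.ZipFile(zip_path, "a", zipfile.ZIP_DEFLATED) as zf:
        if opf_name in zf.namelist():
            raise ValueError("%s is already in the archive" % opf_name)
        zf.writestr(opf_name, opf_text)
```

For example, add_opf_to_zip("book.zip", build_opf("My Title", "An Author", "some-guid")) leaves book.html and the other files untouched and adds metadata.opf next to them.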

KevinH · 12-24-2010, 02:55 PM

Hi,

Okay, I added the following metadata.opf and all of my metadata was properly parsed except for the cover.

Is there something I am doing wrong in my metadata.opf when it comes to setting a cover image upon import?

The contents of the zip archive are:

book.html
style.css
cover.jpg
metadata.opf
img/*.jpg

Here is my generated metadata.opf file:

Code:

<?xml version='1.0' encoding='utf-8'?>
<package xmlns="http://www.idpf.org/2007/opf" unique-identifier="guid_id">
   <metadata xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:opf="http://www.idpf.org/2007/opf">
      <dc:identifier opf:scheme="GUID" id="guid_id">4f3807e13649d56d9cfa5e91beca6765</dc:identifier>
      <dc:identifier opf:scheme="ASIN">B001U3YDJK</dc:identifier>
      <dc:identifier opf:scheme="oASIN">0253342112</dc:identifier>
      <dc:title>Tank Driver: With the 11th Armored from the Battle of the Bulge to VE Day</dc:title>
      <dc:creator opf:role="aut">J. Ted Hartman;Ted J. Hartman</dc:creator>
      <dc:language>en</dc:language>
      <dc:date>20090126T20:24</dc:date>
   </metadata>
   <guide>
      <reference href="cover.jpg" type="cover" title="Cover"/>
   </guide>
</package>

The book.html file does not directly reference cover.jpg (it references instead img/img0000.jpg) but I tried using href="img/img0000.jpg" in the guide element to no avail.

Sorry to be so thick here but I am stumped.

Thanks,

KevinH

kovidgoyal · 12-24-2010, 05:07 PM

I don't think covers are ever read from zip files, unless they are identified as comics.

KevinH · 12-24-2010, 10:00 PM

Hi,

Is there any way to change this? I would very much like to take the book in html format and allow it to be imported properly.

Would it be possible to use a special extension such as .htmlz or .bookz or something that would indicate to calibre to look for a cover by parsing the OPF?

If I extend the metadata.opf to include a full manifest listing the cover.jpg, would that help? It just seems sad to leave the cover unidentified upon import when it is well known by the file conversion process.

Also, immediately after import, if I try to convert the book to pdf, I get a "missing spine" error if I have the metadata.opf file in the zip archive. If I remove it (and lose all of the metadata), the conversion proceeds without issue.

Thanks again for answering my questions.

KevinH

kovidgoyal · 12-24-2010, 10:16 PM

Use .epub; you're almost there already. All you need is to add <manifest> and <spine> to the OPF.
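A rough sketch of what that could look like in Python, using only the standard library. The helper names, the item ids, the fixed content.opf name, and the media-type table are all assumptions for illustration. No NCX table of contents is generated, so strict EPUB validators would likely complain, and whether calibre accepts the result as-is is not confirmed here.

```python
import posixpath
import zipfile
from xml.sax.saxutils import quoteattr

# Assumed media types; plain (non-XHTML) HTML may strictly need text/html.
MEDIA_TYPES = {
    ".html": "application/xhtml+xml",
    ".css": "text/css",
    ".jpg": "image/jpeg",
    ".jpeg": "image/jpeg",
}

# Standard EPUB container file pointing at the OPF in the archive root.
CONTAINER_XML = """<?xml version="1.0" encoding="UTF-8"?>
<container version="1.0" xmlns="urn:oasis:names:tc:opendocument:xmlns:container">
   <rootfiles>
      <rootfile full-path="content.opf" media-type="application/oebps-package+xml"/>
   </rootfiles>
</container>
"""


def add_manifest_and_spine(opf_text, files, spine_files):
    """Insert <manifest> and <spine> right after </metadata> in opf_text."""
    if "</metadata>" not in opf_text:
        raise ValueError("OPF has no </metadata> to anchor on")
    ids = {name: "item%d" % n for n, name in enumerate(files, 1)}
    items = "\n".join(
        '      <item id="%s" href=%s media-type="%s"/>'
        % (ids[name], quoteattr(name),
           MEDIA_TYPES[posixpath.splitext(name)[1].lower()])
        for name in files
    )
    refs = "\n".join(
        '      <itemref idref="%s"/>' % ids[name] for name in spine_files
    )
    block = (
        "\n   <manifest>\n%s\n   </manifest>"
        "\n   <spine>\n%s\n   </spine>" % (items, refs)
    )
    return opf_text.replace("</metadata>", "</metadata>" + block, 1)


def write_epub(epub_path, opf_text, contents):
    """Write an EPUB: stored mimetype first, then container, OPF, files.

    contents maps archive names (e.g. "book.html") to bytes.
    """
    with zipfile.ZipFile(epub_path, "w") as zf:
        # The mimetype entry must come first and must not be compressed.
        zf.writestr("mimetype", "application/epub+zip",
                    compress_type=zipfile.ZIP_STORED)
        zf.writestr("META-INF/container.xml", CONTAINER_XML,
                    compress_type=zipfile.ZIP_DEFLATED)
        zf.writestr("content.opf", opf_text,
                    compress_type=zipfile.ZIP_DEFLATED)
        for name, data in contents.items():
            zf.writestr(name, data, compress_type=zipfile.ZIP_DEFLATED)
```

With the archive described above, files would list book.html, style.css, cover.jpg and the images, while spine_files would contain only book.html, since the spine names the reading order of the text.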

KevinH · 12-24-2010, 10:35 PM

Hi,

Okay, I can go for epub, but I typically would have a single html file that is huge and I would rather not rewrite all of the code for detecting and splitting chapters, updating links, etc.

I just wouldn't want anyone to take the .epub format I give to Calibre and write it to disk and try to load it on a Sony eReader and end up with one big "page Error".

That was why I was hoping an ".htmlz" with an opf would act like a poor man's epub that forced people to convert it via calibre before trying to load it on their device.

Thanks,

Kevin

kovidgoyal · 12-24-2010, 11:12 PM

You don't need to write all that code; you can convert epub to epub in calibre.
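The conversion meant here is the normal one run inside calibre after import. For scripting the same step, a small Python sketch that shells out to calibre's ebook-convert command-line tool, which takes an input path and an output path and infers the formats from the extensions. The wrapper function names are hypothetical, and whether the default settings split a huge single HTML file the way a given reader needs is not verified here.

```python
import shutil
import subprocess


def epub_to_epub_command(src, dst, executable="ebook-convert"):
    """Return the argument list for an EPUB to EPUB conversion."""
    if not (src.lower().endswith(".epub") and dst.lower().endswith(".epub")):
        raise ValueError("both paths must end in .epub")
    if src == dst:
        raise ValueError("input and output must be different files")
    return [executable, src, dst]


def convert_epub_to_epub(src, dst):
    """Run the conversion; requires calibre's tools on the PATH."""
    exe = shutil.which("ebook-convert")
    if exe is None:
        raise FileNotFoundError("ebook-convert not found; is calibre installed?")
    # check=True raises CalledProcessError if the conversion fails.
    subprocess.run(epub_to_epub_command(src, dst, exe), check=True)
```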

KevinH · 12-25-2010, 07:42 AM

Hi,

So then how does a calibre plugin trigger a calibre epub to epub conversion after the run() method has completed? Is there a post-run callback of some sort?

Thanks,

Kevin

kovidgoyal · 12-25-2010, 10:53 AM

I'm confused, why does your plugin need to do an epub to epub conversion? You can just do that conversion as normal in calibre after the import has completed.

KevinH · 12-25-2010, 02:29 PM

Quote:

Originally Posted by kovidgoyal

I'm confused, why does your plugin need to do an epub to epub conversion? You can just do that conversion as normal in calibre after the import has completed.

Hi,

This is a plugin for a common Kindle format book that can not be deciphered today. I assume the plugin will be used by many people not all of whom will remember they have to do an "epub to epub" conversion before exporting the book or syncing it with their reader of choice. If they sync it to their Sony reader as is, they will end up with a single giant html file and a "Page Error". I was hoping for a seamless file type plugin that would use all of the information available in the original book format and pass it nicely to calibre.

I assumed an .htmlz or .zip archive with the proper files would be the best way to pass things through from the plugin to calibre itself.

I will play around and see if I can figure something else out.

Thanks,

Kevin

kovidgoyal · 12-25-2010, 02:35 PM

Do you mean the Topaz format? In that case, why not just implement a conversion plugin for de-DRMed Topaz? That can be made part of calibre.

KevinH · 12-25-2010, 04:25 PM

Hi,

Although non-DRM'd Topaz files could exist, they do not seem to exist in the wild. The current "tools" do not take that approach, since a non-DRM Topaz file originally could not easily be converted: it is internally a binary-encoded data file that is really a poor man's version of an image-only PDF with some OCR info added to make searching possible.

So a non-DRM Topaz file was really only good for sharing/piracy, as having it changes nothing for the owner: they could read it only on Kindles before and could only read it on Kindles after. All the DRM removal accomplished was to allow owners to post/share the file with others (something the tools' authors did not want to support).

The binary-encoded data file itself needs to be converted to an incompletely reverse-engineered XML using a dictionary lookup procedure, the custom XML then needs to be parsed, and the information which describes the image of the page needs to be combined with the internal OCR info to create something that is html based but unfortunately imperfect (the internal OCR can be horrible, and all italics and most bolding are lost). The same binary data files can also be converted to a set of SVG images of the pages (perfect and scalable but not reflowable, unless you have an algorithm to reflow individual glyphs, which need not map to any specific letter on the screen).

So are you saying that if the "tools" were somehow reverted to do nothing other than generate non-DRM Topaz files, we could move all of the reverse-engineered Python code that was added later, which handles the conversion of the file to html and a set of SVG images, right into calibre itself?

I was not sure you would allow Calibre to host code that was reverse engineered. If so, we could certainly take that approach.

My original idea was to create a file plugin that handled the "non-drm part" and the detailed conversion behind the scenes, and then handed calibre the results of the conversion as one nice package of some sort, say a .tpzZ (Z for zip) file. That way nothing internal to calibre need change except for adding support for a pseudo-file type (.tpzZ), which I was going to write and contribute to calibre, so that no reverse-engineered code need be included.

If you really would like to host the internal conversion code, I would be happy to contribute it and the authors of the standalone "tools" could revert to just creating non-drm Topaz files.

It is really your choice. If you are interested, I can take the latest versions, strip out the DRM removal code, and send you just a working converter program (pure Python) for you to play around with. I just thought that "all in a plugin" would be the safest approach.

Take care,

KevinH

kovidgoyal · 12-25-2010, 05:01 PM

I am perfectly fine with adding code to convert non-DRMed Topaz. calibre's MOBI conversion code is also reverse engineered. I just don't want any DRM removal code in calibre, as that would violate the DMCA.

So if you write an InputPlugin to convert non-DRMed Topaz files, I will be happy to merge it with the calibre code base.

KevinH · 12-25-2010, 05:29 PM

Hi,

Okay, I will grab the calibre source and look into doing just that.

Thanks,

Kevin
