#1 |
Sigil Developer
Posts: 8,507
Karma: 5703586
Join Date: Nov 2009
Device: many
|
Issue importing html zip archives and metadata parsing
Hi,
I am using the latest calibre (downloaded and installed today). I am having trouble importing a zip archive with the following contents:

    book.html
    style.css
    img/*.jpeg

When I import it as a .zip archive, no metadata is read from the html file at all. When I rename the .zip to .htmlz, again no metadata is read from the html file. If I unzip it manually and then import book.html, everything works just fine (the metadata is recognized).

I am designing a file conversion import plugin. I was trying to pass the output of the file plugin as a zip archive and wanted to test manually what happens when I do that.

Is there some format or special set of file names I need to use when creating a zip archive so that, upon import, the html file is parsed properly for metadata?

Thanks,
KevinH |
#2 |
creator of calibre
Posts: 45,221
Karma: 27110894
Join Date: Oct 2006
Location: Mumbai, India
Device: Various
|
Add an OPF file containing the metadata to the zip.
|
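The suggestion above can be sketched roughly as follows, using only Python's standard library. The function name `build_import_zip`, the template, and the chosen metadata fields are hypothetical, and which fields calibre actually reads from a bundled OPF is not established in this thread; treat it as an illustration, not as calibre's own method.

```python
import io
import zipfile
from xml.sax.saxutils import escape

# Hypothetical minimal OPF template; only title, author and language are filled in.
OPF_TEMPLATE = """<?xml version='1.0' encoding='utf-8'?>
<package xmlns="http://www.idpf.org/2007/opf" unique-identifier="guid_id">
  <metadata xmlns:dc="http://purl.org/dc/elements/1.1/"
            xmlns:opf="http://www.idpf.org/2007/opf">
    <dc:identifier opf:scheme="GUID" id="guid_id">{guid}</dc:identifier>
    <dc:title>{title}</dc:title>
    <dc:creator opf:role="aut">{author}</dc:creator>
    <dc:language>{language}</dc:language>
  </metadata>
</package>
"""


def build_import_zip(files, title, author, language="en", guid="unknown"):
    """Return the bytes of a zip holding `files` plus a metadata.opf.

    `files` maps archive paths (e.g. "book.html") to their byte content.
    """
    opf = OPF_TEMPLATE.format(
        guid=escape(guid),
        title=escape(title),      # escape &, < and > so the OPF stays well-formed
        author=escape(author),
        language=escape(language),
    )
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as archive:
        for name, content in files.items():
            archive.writestr(name, content)
        archive.writestr("metadata.opf", opf.encode("utf-8"))
    return buffer.getvalue()
```

Writing the returned bytes to a file with a .zip extension gives an archive laid out like the one described in the next post.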
#3 |
Sigil Developer
Posts: 8,507
Karma: 5703586
Join Date: Nov 2009
Device: many
|
Hi,
Okay, I added the following metadata.opf and all of my metadata was parsed properly except for the cover. Is there something I am doing wrong in my metadata.opf when it comes to setting a cover image upon import?

The contents of the zip archive are:

    book.html
    style.css
    cover.jpg
    metadata.opf
    img/*.jpg

Here is my generated metadata.opf file:

Code:

    <?xml version='1.0' encoding='utf-8'?>
    <package xmlns="http://www.idpf.org/2007/opf" unique-identifier="guid_id">
      <metadata xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:opf="http://www.idpf.org/2007/opf">
        <dc:identifier opf:scheme="GUID" id="guid_id">4f3807e13649d56d9cfa5e91beca6765</dc:identifier>
        <dc:identifier opf:scheme="ASIN">B001U3YDJK</dc:identifier>
        <dc:identifier opf:scheme="oASIN">0253342112</dc:identifier>
        <dc:title>Tank Driver: With the 11th Armored from the Battle of the Bulge to VE Day</dc:title>
        <dc:creator opf:role="aut">J. Ted Hartman;Ted J. Hartman</dc:creator>
        <dc:language>en</dc:language>
        <dc:date>20090126T20:24</dc:date>
      </metadata>
      <guide>
        <reference href="cover.jpg" type="cover" title="Cover"/>
      </guide>
    </package>

Sorry to be so thick here, but I am stumped.

Thanks,
KevinH |
#4 |
creator of calibre
Posts: 45,221
Karma: 27110894
Join Date: Oct 2006
Location: Mumbai, India
Device: Various
|
I don't think covers are ever read from zip files, unless they are identified as comics.
|
#5 |
Sigil Developer
Posts: 8,507
Karma: 5703586
Join Date: Nov 2009
Device: many
|
Hi,
Is there any way to change this? I would very much like to take the book in html format and allow it to be imported properly. Would it be possible to use a special extension such as .htmlz or bookz or something that would tell calibre to look for a cover by parsing the opf? If I extend the metadata.opf to include a full manifest listing the cover.jpg, would that help? It just seems sad to leave the cover unidentified upon import when it is well known to the file conversion process.

Also, immediately after import, if I try to convert the book to pdf, I get a missing "Spine" error if I have the metadata.opf file in the zip archive. If I remove it (and lose all of the metadata), the conversion proceeds without issue.

Thanks again for answering my questions.

KevinH |
#6 |
creator of calibre
Posts: 45,221
Karma: 27110894
Join Date: Oct 2006
Location: Mumbai, India
Device: Various
|
Use .epub; you're almost there already. All you need is to add <manifest> and <spine> to the OPF.
|
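As a rough illustration of what adding <manifest> and <spine> could look like, here is a sketch that builds such an OPF with Python's standard xml.etree.ElementTree. The function `build_opf`, the generated item ids, and the media-type table are assumptions made for this example, not anything calibre prescribes. A complete EPUB also needs a `mimetype` file and `META-INF/container.xml`, which this sketch does not produce.

```python
import os
import xml.etree.ElementTree as ET

OPF_NS = "http://www.idpf.org/2007/opf"
DC_NS = "http://purl.org/dc/elements/1.1/"

# Serialize OPF elements without a prefix and Dublin Core elements as dc:*.
ET.register_namespace("", OPF_NS)
ET.register_namespace("dc", DC_NS)

# Hypothetical media-type table; extend it for other file types.
MEDIA_TYPES = {
    ".html": "text/html",
    ".css": "text/css",
    ".jpg": "image/jpeg",
    ".jpeg": "image/jpeg",
}


def q(namespace, tag):
    """Return a namespace-qualified tag in ElementTree's {uri}tag form."""
    return "{%s}%s" % (namespace, tag)


def build_opf(title, guid, files, spine_files, cover_href=None):
    """Return OPF bytes with <metadata>, <manifest>, <spine> and an optional <guide>."""
    package = ET.Element(q(OPF_NS, "package"),
                         {"version": "2.0", "unique-identifier": "guid_id"})

    metadata = ET.SubElement(package, q(OPF_NS, "metadata"))
    ET.SubElement(metadata, q(DC_NS, "identifier"), {"id": "guid_id"}).text = guid
    ET.SubElement(metadata, q(DC_NS, "title")).text = title

    # The manifest lists every file in the package and gives each one an id.
    manifest = ET.SubElement(package, q(OPF_NS, "manifest"))
    ids = {}
    for number, href in enumerate(files, start=1):
        ids[href] = "item%d" % number
        extension = os.path.splitext(href)[1].lower()
        ET.SubElement(manifest, q(OPF_NS, "item"), {
            "id": ids[href],
            "href": href,
            "media-type": MEDIA_TYPES.get(extension, "application/octet-stream"),
        })

    # The spine gives the reading order by referring to manifest ids.
    spine = ET.SubElement(package, q(OPF_NS, "spine"))
    for href in spine_files:
        ET.SubElement(spine, q(OPF_NS, "itemref"), {"idref": ids[href]})

    if cover_href is not None:
        guide = ET.SubElement(package, q(OPF_NS, "guide"))
        ET.SubElement(guide, q(OPF_NS, "reference"),
                      {"href": cover_href, "type": "cover", "title": "Cover"})

    declaration = b"<?xml version='1.0' encoding='utf-8'?>\n"
    return declaration + ET.tostring(package, encoding="utf-8")
```

For the archive discussed in this thread, a call might look like `build_opf(title, guid, ["book.html", "style.css", "cover.jpg"], ["book.html"], cover_href="cover.jpg")`.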
#7 |
Sigil Developer
Posts: 8,507
Karma: 5703586
Join Date: Nov 2009
Device: many
|
Hi,
Okay, I can go for epub, but I typically would have a single html file that is huge, and I would rather not rewrite all of the code for detecting and splitting chapters, updating links, etc.

I just wouldn't want anyone to take the .epub I give to calibre, write it to disk, try to load it on a Sony eReader, and end up with one big "page Error". That was why I was hoping an ".htmlz" with an opf would act like a poor man's epub that forced people to convert it via calibre before trying to load it on their device.

Thanks,
Kevin |
#8 |
creator of calibre
Posts: 45,221
Karma: 27110894
Join Date: Oct 2006
Location: Mumbai, India
Device: Various
|
You don't need to write all that code; you can convert epub to epub in calibre.
|
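Outside the GUI, one way to run such an epub to epub conversion is calibre's `ebook-convert` command-line tool, which takes an input file and an output file. The Python wrapper below is only a sketch: the helper names are hypothetical, it assumes `ebook-convert` is on the PATH, and it passes no conversion options.

```python
import shutil
import subprocess


def epub_to_epub_command(source, destination):
    """Build the ebook-convert argument list for an epub to epub conversion."""
    return ["ebook-convert", source, destination]


def convert_epub_to_epub(source, destination):
    """Run the conversion, failing early if calibre's CLI tool is not installed."""
    if shutil.which("ebook-convert") is None:
        raise RuntimeError("ebook-convert not found on PATH; is calibre installed?")
    # check=True raises CalledProcessError if the conversion exits with an error.
    subprocess.run(epub_to_epub_command(source, destination), check=True)
```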
#9 |
Sigil Developer
Posts: 8,507
Karma: 5703586
Join Date: Nov 2009
Device: many
|
Hi,
So then how does a calibre plugin trigger a calibre epub to epub conversion after the run() method has completed? Is there a post-run callback of some sort? Thanks, Kevin |
#10 |
creator of calibre
Posts: 45,221
Karma: 27110894
Join Date: Oct 2006
Location: Mumbai, India
Device: Various
|
I'm confused. Why does your plugin need to do an epub to epub conversion? You can just do that conversion as normal in calibre after the import has completed.
|
#11 | |
Sigil Developer
Posts: 8,507
Karma: 5703586
Join Date: Nov 2009
Device: many
|
Quote:
This is a plugin for a common Kindle format book that cannot be deciphered today. I assume the plugin will be used by many people, not all of whom will remember that they have to do an "epub to epub" conversion before exporting the book or syncing it with their reader of choice. If they sync it to their Sony reader as is, they will end up with a single giant html file and a "Page Error".

I was hoping for a seamless file type plugin that would use all of the information available in the original book format and pass it nicely to calibre. I assumed an .htmlz or .zip archive with the proper files would be the best way to pass things through from the plugin to calibre itself.

I will play around and see if I can figure something else out.

Thanks,
Kevin |
|
#12 |
creator of calibre
Posts: 45,221
Karma: 27110894
Join Date: Oct 2006
Location: Mumbai, India
Device: Various
|
Do you mean the Topaz format? In that case, why not just implement a conversion plugin for de-DRMed Topaz? That can be made part of calibre.
|
#13 |
Sigil Developer
Posts: 8,507
Karma: 5703586
Join Date: Nov 2009
Device: many
|
Hi,
Although non-DRM'd Topaz files could exist, they do not seem to in the wild. The current "tools" do not take that approach, since a non-DRM Topaz file originally could not easily be converted: internally it is a binary encoded data file that is really a poor man's version of an image-only pdf file with some OCR info added to make searching possible. So a non-DRM Topaz file was really only good for sharing/piracy, as having it changes nothing for the owner. They could read it only on Kindles before and could read it only on Kindles after. All the DRM removal accomplished was to allow owners to post/share the file with others (something the tools' authors did not want to support).

The binary encoded data file itself needs to be converted to an incompletely reverse-engineered xml using a dictionary lookup procedure. The custom xml then needs to be parsed, and the information which describes the image of the page needs to be combined with the internal OCR info to create something that is html based but unfortunately imperfect (the internal OCR can be horrible, and all italics and most bolding are lost). The same binary data files can also be converted to a set of svg images of the page (perfect and scalable but not reflowable, unless you have an algorithm to reflow individual glyphs, which need not map to any specific letter on the screen).

So are you saying that if the "tools" were somehow reverted to do nothing other than generate non-DRM Topaz files, we could move all of the reverse-engineered python code that was added later, which handles the conversion of the file to html and a set of svg images, right into calibre itself? I was not sure you would allow calibre to host code that was reverse engineered. If so, we could certainly take that approach.

My original idea was to create a file plugin that handled the "non-drm part" and the detailed conversion behind the scenes and then handed calibre the results of the conversion as one nice package of some sort, say a .tpzZ (for zip) file. Nothing internal to calibre would need to change except for adding support for a pseudo-file type (.tpzZ), which I was going to write and contribute to calibre so that no reverse-engineered code need be included.

If you really would like to host the internal conversion code, I would be happy to contribute it, and the authors of the standalone "tools" could revert to just creating non-DRM Topaz files. It is really your choice. If you are interested, I can take the latest versions, strip out the DRM removal code pieces, and send you just a working converter program to play around with (pure python). I just thought that "all in a plugin" would be the safest approach.

Take care,
KevinH |
#14 |
creator of calibre
Posts: 45,221
Karma: 27110894
Join Date: Oct 2006
Location: Mumbai, India
Device: Various
|
I am perfectly fine with adding code to convert non-DRMed Topaz. calibre's MOBI conversion code is also reverse engineered. I just don't want any DRM removal code in calibre, as that would violate the DMCA.
So if you write an InputPlugin to convert non-DRMed Topaz files, I will be happy to merge it into the calibre code base. |
#15 |
Sigil Developer
Posts: 8,507
Karma: 5703586
Join Date: Nov 2009
Device: many
|
Hi,
Okay, I will grab the calibre source and look into doing just that. Thanks, Kevin |