Issue importing html zip archives and metadata parsing - Page 2

KevinH · 12-25-2010, 07:13 PM

Hi,

FYI: a Topaz input plugin would be a one-way street. There really is no way to convert any other xhtml-based file to the Topaz file format. The inputs it requires are scanning based: a list of glyphs and the paths to draw each glyph, the x,y position of each glyph on the page (these glyphs are not like font glyphs, as they have no baselines), OCR info, page continuation info, dehyphenation info, fixed page format, etc.

So I actually think it would be easier to create a .tpzZ input plugin that takes the archive I already generate (html, cover.jpeg, opf) and handles things from there. The generated html is xhtml, which can be fed directly into an lxml tree without tidy or Beautiful Soup, so it is already very close to calibre's internal format as it stands right now.
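As a rough illustration of why well-formed xhtml needs no cleanup pass, the sketch below parses a small document straight into an element tree. It uses the standard library's xml.etree.ElementTree as a stand-in for lxml; the sample markup and the helper name parse_xhtml are made up for this example and are not taken from the plugin.

```python
import xml.etree.ElementTree as ET

XHTML_NS = "http://www.w3.org/1999/xhtml"

# Sample input invented for this sketch; real plugin output would be larger.
SAMPLE = """<?xml version="1.0" encoding="utf-8"?>
<html xmlns="http://www.w3.org/1999/xhtml">
  <head><title>Example Book</title></head>
  <body><p>First paragraph.</p><p>Second paragraph.</p></body>
</html>"""


def parse_xhtml(text):
    """Parse well-formed XHTML directly; raises ET.ParseError on bad markup."""
    return ET.fromstring(text.encode("utf-8"))


root = parse_xhtml(SAMPLE)
# Elements live in the XHTML namespace, so lookups must be namespace-qualified.
title = root.find(f".//{{{XHTML_NS}}}title").text
paragraphs = [p.text for p in root.iter(f"{{{XHTML_NS}}}p")]
print(title, len(paragraphs))
```

Markup that is not well-formed XML would raise a parse error here, which is the case where a repair step such as tidy would normally come in.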

It is almost as if we need a new zip-based file type extension called ".calibre" that represents calibre's internal format. I could modify the plugin code I have now to write to that standard, and calibre could write that format as well when exporting to disk.

Kevin

kovidgoyal · 12-25-2010, 07:44 PM

just use zip in that case. If you want the cover to import, modify the zip metadata reader to read covers from OPF files, IIRC the code is in metadata.archive

KevinH · 12-25-2010, 10:02 PM

Hi,

Forgive me if I am messed up as I am new to this code but I think metadata.archive falls back to metadata.zip if there is no known type inside the archive (unless a comic).

And it looks like metadata/zip.py already detects an .opf file inside the zip and invokes meta.get_metadata(stream, 'opf'), which in turn invokes opf_metadata(path), which parses it properly, including the cover information from the manifest and the guide.
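The lookup described above can be sketched with the standard library alone. This is not calibre's code: the function names cover_href_from_opf and cover_from_zip, the sample OPF, and the fake image bytes are all invented for illustration, and xml.etree.ElementTree stands in for whatever parser calibre uses.

```python
import io
import posixpath
import xml.etree.ElementTree as ET
import zipfile

OPF_NS = "http://www.idpf.org/2007/opf"

# A tiny OPF invented for this sketch.
SAMPLE_OPF = """<?xml version="1.0" encoding="utf-8"?>
<package xmlns="http://www.idpf.org/2007/opf" version="2.0">
  <metadata><meta name="cover" content="cover-img"/></metadata>
  <manifest>
    <item id="cover-img" href="cover.jpeg" media-type="image/jpeg"/>
    <item id="text" href="book.html" media-type="application/xhtml+xml"/>
  </manifest>
  <guide><reference type="cover" href="cover.jpeg"/></guide>
</package>"""


def cover_href_from_opf(opf_bytes):
    """Return the cover href named via the manifest, else via the guide."""
    root = ET.fromstring(opf_bytes)

    def q(tag):
        return f"{{{OPF_NS}}}{tag}"

    # 1) <meta name="cover" content="ID"/> pointing at a manifest item
    for meta in root.iter(q("meta")):
        if meta.get("name") == "cover":
            for item in root.iter(q("item")):
                if item.get("id") == meta.get("content"):
                    return item.get("href")
    # 2) fall back to <reference type="cover"/> in the guide
    for ref in root.iter(q("reference")):
        if ref.get("type") == "cover":
            return ref.get("href")
    return None


def cover_from_zip(stream):
    """Find the first .opf in the archive; return (path, bytes) of its cover."""
    with zipfile.ZipFile(stream) as zf:
        for name in zf.namelist():
            if name.lower().endswith(".opf"):
                href = cover_href_from_opf(zf.read(name))
                if href is None:
                    return None
                # hrefs are relative to the directory holding the OPF
                path = posixpath.normpath(
                    posixpath.join(posixpath.dirname(name), href))
                return path, zf.read(path)
    return None


# Build a small archive in memory and read the cover back out of it.
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as zf:
    zf.writestr("book.opf", SAMPLE_OPF)
    zf.writestr("cover.jpeg", b"\xff\xd8fake-jpeg-bytes")
    zf.writestr("book.html", "<html/>")
buf.seek(0)
result = cover_from_zip(buf)
print(result[0], len(result[1]))
```

The sketch does not handle an OPF that names a cover file missing from the archive; zf.read would raise KeyError in that case.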

The issue is that, since libprs is not forced to be True and application_id is None (this book is not part of calibre yet), meta.get_metadata(stream, stream_type) will not return the opf information just collected. Instead it tries to pull metadata from the filename and, just before returning, does a base.smart_update(opf) with the opf metadata.

The second issue is that smart_update() will not update the cover or cover_data attributes, as they are not on the list of attributes it tries to smartly update.

Unless this would mess you up, the easiest fix would be to do something along these lines (I think)

--- opf2.py 2010-12-23 16:39:20.000000000 -0500
+++ opf2_new.py 2010-12-25 20:47:37.000000000 -0500
@@ -990,7 +990,7 @@
         for attr in ('title', 'authors', 'author_sort', 'title_sort',
                      'publisher', 'series', 'series_index', 'rating',
                      'isbn', 'tags', 'category', 'comments',
-                     'pubdate'):
+                     'pubdate', 'cover', 'cover_data'):
             val = getattr(mi, attr, None)
             if val is not None and val != [] and val != (None, None):
                 setattr(self, attr, val)

Am I understanding this correctly?
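To make the effect of the attribute list concrete, here is a minimal standalone sketch of a whitelist-based merge. The Meta class, the shortened BEFORE and AFTER tuples, and the free-standing smart_update function are hypothetical stand-ins written for this example; only the loop body mirrors the lines shown in the diff. It illustrates the mechanism in isolation and says nothing about whether this code path is actually reached inside calibre.

```python
class Meta:
    """Hypothetical stand-in for a metadata record."""

    def __init__(self, **fields):
        self.__dict__.update(fields)


# Shortened attribute lists, for illustration only.
BEFORE = ('title', 'authors', 'pubdate')
AFTER = BEFORE + ('cover', 'cover_data')


def smart_update(target, source, attrs):
    """Copy only whitelisted, non-empty attributes from source to target."""
    for attr in attrs:
        val = getattr(source, attr, None)
        if val is not None and val != [] and val != (None, None):
            setattr(target, attr, val)
    return target


opf = Meta(title='Real Title', authors=['A. Writer'],
           cover='cover.jpeg', cover_data=('jpeg', b'\xff\xd8'))

# Same filename-derived starting record, merged with each whitelist.
old = smart_update(Meta(title='book', cover=None), opf, BEFORE)
new = smart_update(Meta(title='book', cover=None), opf, AFTER)
print(old.cover, new.cover)
```

With the shorter list the title is taken from the OPF but the cover is silently dropped; adding 'cover' and 'cover_data' to the list lets both through.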

kovidgoyal · 12-25-2010, 10:04 PM

open a ticket for it as I am travelling for the next few days so this post will get lost.

KevinH · 12-25-2010, 10:16 PM

Hi,

Will do. Have fun on your travels.

Thanks again for all of your help.

Kevin

KevinH · 12-26-2010, 12:57 AM

Hi,

Please ignore my previous attempt at a patch. The code I thought was running never actually ran, because it is guarded by a check that application_id is not None, and that check had to be worked around.

I created a new patch to metadata/zip.py that I tested and it does do what I wanted.
As requested I have created a bug tracker issue with the patch attached.

http://bugs.calibre-ebook.com/ticket/8066

Thanks again for all of your help.

KevinH


