MobileRead Forums - View Single Post - Issue importing html zip archives and metadata parsing

KevinH · 12-25-2010, 04:25 PM

Hi,

Although non-DRM'd Topaz files could exist, they do not seem to exist in the wild. The current "tools" do not take that approach, since a non-DRM Topaz file originally could not easily be converted: internally it is a binary-encoded data file that is really a poor man's version of an image-only PDF file with some OCR info added to make searching possible.

So a non-DRM Topaz file was really only good for sharing/piracy, as having it changed nothing for the owner: they could read it only on Kindles before and only on Kindles after. All the DRM removal accomplished was to allow owners to post/share the file with others (something the tools' authors did not want to support).

The binary-encoded data file itself needs to be converted to an incompletely reverse-engineered XML using a dictionary lookup procedure. The custom XML then needs to be parsed, and the information that describes the image of the page needs to be combined with the internal OCR info to create something that is HTML-based but unfortunately imperfect (the internal OCR can be horrible, and all italics and most bolding are lost). The same binary data files can also be converted to a set of SVG images of the pages (perfect and scalable but not reflowable, unless you have an algorithm to reflow individual glyphs, which need not map to any specific letter on the screen).
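To make the "dictionary lookup" idea concrete, here is a toy sketch in Python. It is not the actual Topaz format or the real converter code: the function names, the integer codes, and the dictionary contents are all made up for illustration. It only shows the general shape of turning a stream of numeric codes into named XML elements, with a placeholder for codes the dictionary does not cover (which is roughly what "incompletely reverse-engineered" means in practice).

```python
def decode_tokens(codes, dictionary):
    """Map integer codes to names via a lookup table.

    Hypothetical helper: codes outside the table get a placeholder name
    instead of raising, so a partly understood file can still be decoded.
    """
    names = []
    for code in codes:
        if 0 <= code < len(dictionary):
            names.append(dictionary[code])
        else:
            names.append(f"unknown_{code}")
    return names


def to_xml(names):
    """Emit one empty XML element per decoded name (illustrative only)."""
    return "".join(f"<{name}/>" for name in names)


# Made-up dictionary and code stream, purely for demonstration.
dictionary = ["page", "region", "word"]
decoded = decode_tokens([0, 2, 7], dictionary)
xml_text = to_xml(decoded)
```

A real decoder would also have to carry attributes and nesting, which this sketch leaves out entirely.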

So are you saying that if the "tools" were somehow reverted to do nothing other than generate non-DRM Topaz files, we could move all of the reverse-engineered Python code that was added later, which handles the conversion of the file to HTML and a set of SVG images, right into calibre itself?

I was not sure you would allow calibre to host code that was reverse-engineered. If so, we could certainly take that approach.

My original idea was to create a file plugin that handled the "non-DRM part" and the detailed conversion behind the scenes, and then handed calibre the results of the conversion as one nice package of some sort, say a .tpzZ (for zip) file. That way nothing internal to calibre would need to change except for adding support for a pseudo file type (.tpzZ), which I was going to write and contribute to calibre, so that no reverse-engineered code need be included.
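As a rough sketch of the "one nice package" step only, the snippet below zips an already converted output folder into a single .tpzZ archive. The function name `package_conversion` and the folder layout (one HTML file plus SVG page images) are assumptions for illustration; this is neither the plugin itself nor a calibre API, and it uses only Python's standard `zipfile` and `pathlib` modules.

```python
import zipfile
from pathlib import Path


def package_conversion(output_dir, archive_path):
    """Zip every file under output_dir into archive_path (hypothetical helper).

    Paths inside the archive are stored relative to output_dir, so the
    package unpacks to the same layout the converter produced.
    """
    output_dir = Path(output_dir)
    archive_path = Path(archive_path)
    with zipfile.ZipFile(archive_path, "w", zipfile.ZIP_DEFLATED) as zf:
        for path in sorted(output_dir.rglob("*")):
            # Skip directories, and skip the archive itself if it happens
            # to live inside the folder being packaged.
            if not path.is_file() or path.resolve() == archive_path.resolve():
                continue
            zf.write(path, path.relative_to(output_dir).as_posix())
    return archive_path
```

Calling `package_conversion("book_out", "book.tpzZ")` would then give calibre a single file to import, assuming `book_out` already holds the converted HTML and SVG files.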

If you really would like to host the internal conversion code, I would be happy to contribute it and the authors of the standalone "tools" could revert to just creating non-drm Topaz files.

It is really your choice. If you are interested, I can take the latest versions, strip out the DRM removal code pieces, and send you just a working converter program to play around with (pure Python). I just thought that "all in a plugin" would be the safest approach.

Take care,

KevinH
