MobileRead Forums - View Single Post - KindleUnpack (MobiUnpack): Extracts text, images and metadata from Kindle/Mobi files

KevinH · 06-18-2014, 12:22 PM

Hi tkeo,

I looked at your patch from v067 to add epub3 support and your comments.

I have a few things that I don't understand and therefore would like to discuss:

- why do you want to move writing of files back to process_all_mobi_headers?
In general, an object should know how and where it should write itself. Therefore, NAV, OPF, NCX etc should all know where and how to create themselves from the passed in data and the "files" object.

In fact, my inclination is to move the writing of the text/html files out of even processMobiX and out to header specific routines.

- your partslist[] simply duplicates much of what is in k8proc so I don't understand why it is needed. We can simply pass the k8proc object along if you need access to that information.

- you seem to duplicate information from k8proc to pull into k8resc. It would simply be easier to pass in the k8proc object is and where you need that information.

- your datalist[] simply duplicates much of what imgnames is used for and you never store any raw data in that list anyway. Why is the data offset information there if the data itself is never stored? Why do you need the section number? The directory and type information can easily be deduced from the file name extension, and metadata for (offsets).

- why do we need to know the width and height of the images? Is this needed to create svg based cover pages?

In general, I think it is easy in the opf to know where something is when we know what name and extension it has because we pass in our unpack structure files object and imgnames list.

Now if you really think you must have additional information, lets focus on reusing the data structures in k8proc for the text, css, and svg pieces and not add or use partslist[].

We could also simply rename and imgnames to resource_names or something similar since they can deal with fonts and the like, although a simple directory listing of the output file structure can tell you everything you need to know as well.

At minimum, resource_names would need, filename with extension (which is what it has now) and from the extension, the opf would know where it should be located and what it is.

But if you need more we should go for something as minimal as possible and not reinvent yet another data structure to pass around. From what I could see I simply do not think we need the section number, data offset, or data itself, ever. So let's figure out the very minimum needed and use that.

- Also do we really need to rename the cover image as "cover"?

- Also Can't we simply use the EmptyImagePlaceholders and the kindleembed string from the "kind" section and info from CONT metadata to overwrite the the corresponding non-HD image file with the correct HDImage but keep its original name so nothing in the OPF needs to care.

The CONT and following sections up to the container boundary is a simple one-to-one mapping from the first image to the last HDImage with empty place holders being used to indicate images that do not have HD replacements (so they only need to keep track of things up to the last HDImage.

Please let me know what you think. We can discuss this further via personal messaging on this site so as to not spam the list if you so desire.

Take care and Thanks for all of your hard work!

KevinH

Quote:

Originally Posted by tkeo

Kevin,

I would like to ask you for a modification about refactoring.

Could you allow me

to remove the imgnames pamareter and change the return value from imgnames to imginfo whose structure is [dir, imgname, type, secno, dataoffset, data(=None)], to functions listed below, and appending it in process_all_mobi_headers(),

or

to change the parameter from imgnames to imglist(= list of the imginfo)?

The list of functions to modify:

processSRCS(), processPAGE(), processCMET(), processFONT(), pocessCRES(), processCONT(), processkind(), processRESC(), processImage().

Because I am considering to move all calling of write() except for DUMP into process_all_mobi_headers() in order to make easier to understand and writing files to mobi8 folder directly instead of copying files from mobi7 folder, in addition, creating epub files from the imglist.
I think it will make easy to support HD images. Since to make the epub for the HD images, recreating XHTML files or renaming the file names of the HD images are required.
I attach the newest preview version I have, as a reference.

Thanks,