Old 07-19-2011, 07:26 PM   #76
KevinH
Sigil Developer
Posts: 7,469
Karma: 5432724
Join Date: Nov 2009
Device: many
Hi Steffen,

Okay, here is a slightly revised version of what you did. I must admit my image name replacement is slower than yours, but it is still much faster than the old version. If need be, we can make this code conditional on whether we are processing a dictionary, and add your fixed-image-file-extension version back in purely for speed.

I called it version v0.28 to differentiate it. If it works okay for you, we can then integrate it into your git repository.
Attached Files
File Type: zip mobiunpack_v0.28.zip (12.3 KB, 414 views)

Last edited by KevinH; 07-19-2011 at 07:27 PM. Reason: fix typos
Old 07-20-2011, 01:44 AM   #77
pdurrant
The Grand Mouse 高貴的老鼠
Posts: 71,367
Karma: 305065800
Join Date: Jul 2007
Location: Norfolk, England
Device: Kindle Voyage
If anyone has a good suggestion for how to fix the problem of loss of multiple metadata entries, I'd love to hear it. (i.e. if there's more than one author listed, we only save and write out one of them.)
Old 07-20-2011, 05:19 AM   #78
siebert
Developer
Posts: 155
Karma: 280
Join Date: Nov 2010
Device: Kindle 3 (Keyboard) 3G / iPad 9 WiFi / Google Pixel 6a (Android)
Quote:
Originally Posted by pdurrant View Post
If anyone has a good suggestion for how to fix the problem of loss of multiple metadata entries, I'd love to hear it. (i.e. if there's more than one author listed, we only save and write out one of them.)
I'm not sure how multiple metadata entries are stored in the mobi file, but I would assume that it has just multiple entries with the same id?

Then it should be as easy as storing a list of strings instead of a single string in metadata[name] and for the output just iterate over the list.
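
A minimal sketch of that idea, assuming the records arrive as (name, value) pairs. The helper names `collect_metadata` and `format_metadata` and the sample records are hypothetical, not taken from mobiunpack:

```python
from xml.sax.saxutils import escape


def collect_metadata(records):
    """Group (name, value) pairs so that repeated names keep every value."""
    metadata = {}
    for name, value in records:
        # setdefault creates the list the first time a name is seen.
        metadata.setdefault(name, []).append(value)
    return metadata


def format_metadata(metadata):
    """Return one XML-escaped output line per stored value."""
    lines = []
    for name, values in metadata.items():
        for value in values:
            lines.append("<dc:%s>%s</dc:%s>" % (name, escape(value), name))
    return lines


# Hypothetical sample input: two entries share the name "creator".
records = [
    ("creator", "Author One"),
    ("title", "Some Book"),
    ("creator", "Author Two & Co."),
]
metadata = collect_metadata(records)
for line in format_metadata(metadata):
    print(line)
```

Every key maps to a list, even when there is only one value, so the output loop needs no special case.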

Ciao,
Steffen
Old 07-20-2011, 05:30 AM   #79
pdurrant
The Grand Mouse 高貴的老鼠
Posts: 71,367
Karma: 305065800
Join Date: Jul 2007
Location: Norfolk, England
Device: Kindle Voyage
Quote:
Originally Posted by siebert View Post
I'm not sure how multiple metadata entries are stored in the mobi file, but I would assume that it has just multiple entries with the same id?

Then it should be as easy as storing a list of strings instead of a single string in metadata[name] and for the output just iterate over the list.

Ciao,
Steffen
Oh, of course! Perhaps we could just store all metadata as a list, rather than a single object in the map.

Anyone better than me at Python like to give it a go?
Old 07-20-2011, 05:33 AM   #80
siebert
Developer
Posts: 155
Karma: 280
Join Date: Nov 2010
Device: Kindle 3 (Keyboard) 3G / iPad 9 WiFi / Google Pixel 6a (Android)
Quote:
Originally Posted by KevinH View Post
Hi Steffen,
I called it version v0.28 to differentiate it. If it works okay for you, we can then integrate it into your git repository
I'm not so happy that you disabled skipping over sections which have been recognized as non-image sections.

As far as I know these non-image sections appear only at the end after all images, never between images, so it should not confuse the image name index if we just skip over them.

In that case we should also handle image sections where we can't determine the type as an error and print some message to stdout.

I think it might be even faster to just search all <img> tags in the source and then merge the source with the replaced <img> tags like we do the merge with in the "apply dictionary metadata and anchors" section.

Of course we have to skip over the original <img> tags in this merge, but as the match objects contain the position information where the found string was located, it should be very easy.

Since we have only the list of <img> tags in memory instead of all the split source file data, I expect that solution to be faster than your implementation.
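
A rough sketch of that merge, assuming the tags carry a recindex attribute and that a fixed .jpg extension is acceptable for illustration. `rewrite_tag` and `merge_img_tags` are hypothetical names, and the output is streamed to a writable object so that only the match objects are held alongside the source:

```python
import io
import re

IMG_TAG = re.compile(r"<img[^>]*>", re.IGNORECASE)
RECINDEX = re.compile(r'recindex="(\d+)"')


def rewrite_tag(tag):
    """Hypothetical rewrite: turn recindex="N" into a src attribute."""
    return RECINDEX.sub(r'src="images/image\1.jpg"', tag)


def merge_img_tags(source, out):
    """Write source to out, swapping each <img> tag for its rewritten form."""
    pos = 0
    for match in IMG_TAG.finditer(source):
        out.write(source[pos:match.start()])    # text before the tag
        out.write(rewrite_tag(match.group(0)))  # replacement tag
        pos = match.end()                       # skip over the original tag
    out.write(source[pos:])                     # text after the last tag


source = 'a<img recindex="00001">b<img recindex="00002"/>c'
out = io.StringIO()
merge_img_tags(source, out)
result = out.getvalue()
print(result)
```

The match positions (`match.start()` and `match.end()`) are what make it possible to skip the original tags without searching the source a second time.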

Ciao,
Steffen
Old 07-20-2011, 05:47 AM   #81
pdurrant
The Grand Mouse 高貴的老鼠
Posts: 71,367
Karma: 305065800
Join Date: Jul 2007
Location: Norfolk, England
Device: Kindle Voyage
Quote:
Originally Posted by siebert View Post
As far as I know these non-image sections appear only at the end after all images, never between images, so it should not confuse the image name index if we just skip over them.
Some Mobipocket files that have been edited with the Perl tools may have images after the non-image bits at the end.
Old 07-20-2011, 10:11 AM   #82
KevinH
Sigil Developer
Posts: 7,469
Karma: 5432724
Join Date: Nov 2009
Device: many
Hi Steffen,

> I'm not so happy that you disabled skipping over sections which have been recognized as non-image sections.

I am not sure what your concern here is. A file is only created if it is a known image type. The remaining code in the loop simply invokes imghdr, which just looks at select byte strings near the front of the data string (very much like what you are doing, so it should be very fast), and then appends a placeholder to a list. Nothing here will impact processing time much, if at all, versus your version.
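
The loop described above could look roughly like this. It is only a sketch: the signature checks are hand-written stand-ins for imghdr (which has been removed from the standard library in Python 3.13), and `sniff_image_type`, `name_images` and the sample sections are hypothetical:

```python
def sniff_image_type(data):
    """Guess the image type from the first bytes, or return None."""
    if data.startswith(b"\xff\xd8\xff"):
        return "jpeg"
    if data.startswith(b"\x89PNG\r\n\x1a\n"):
        return "png"
    if data[:6] in (b"GIF87a", b"GIF89a"):
        return "gif"
    return None


def name_images(sections):
    """Return one entry per section: a file name, or None as a placeholder."""
    names = []
    for index, data in enumerate(sections):
        kind = sniff_image_type(data)
        if kind is None:
            # The placeholder keeps list positions aligned with section numbers.
            names.append(None)
        else:
            ext = "jpg" if kind == "jpeg" else kind
            names.append("image%05d.%s" % (index + 1, ext))
    return names


# Hypothetical sections: a JPEG, a non-image section, then a GIF.
sections = [
    b"\xff\xd8\xff\xe0" + b"\x00" * 4,
    b"FLIS" + b"\x00" * 4,
    b"GIF89a" + b"\x00" * 4,
]
names = name_images(sections)
print(names)
```

Because the non-image section still occupies a slot, an image that follows it keeps the index it would be referenced by.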

> As far as I know these non-image sections appear only at the end after all images, never between images, so it should not confuse the image name index if we just skip over them.

As Paul indicated, this may not be the case so this version is safer.

> In that case we should also handle image sections where we can't determine the type as an error and print some message to stdout.

Feel free to add that if you like. My main concern was properly adding the image filename extensions so that later post-processing to xhtml works properly (i.e. for those not using kindlegen or Mobipocket Creator).

> I think it might be even faster to just search all <img> tags in the source and then merge the source with the replaced <img> tags like we do the merge with in the "apply dictionary metadata and anchors" section.

That is similar to what is happening here. A regular expression with a capturing group is used to split the source, which breaks the string into a list of segments in which the odd-indexed pieces (1, 3, 5, 7, ...) are the img tags and the even-indexed pieces are everything else before or after them.

Then when we do replacements, all we are doing is replacing an element of the list, and we only process the img tags themselves. So there is no need to create and delete 26 MB to 100 MB copies all of the time. At the end you simply put it back together using join.
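
As a sketch of the split-and-join technique (not the actual v0.28 code), assuming the tags use a 1-based recindex attribute and that `image_names` holds one entry per image section with None for non-images. `fix_image_links` is a hypothetical name:

```python
import re

# The capturing group makes re.split keep the matched tags in the result.
IMG_SPLIT = re.compile(r"(<img[^>]*>)", re.IGNORECASE)
RECINDEX = re.compile(r'recindex="(\d+)"')


def fix_image_links(source, image_names):
    """Rewrite recindex attributes to src attributes, touching only img tags."""
    def to_src(match):
        index = int(match.group(1)) - 1
        if 0 <= index < len(image_names) and image_names[index]:
            return 'src="images/%s"' % image_names[index]
        return match.group(0)  # leave unresolvable references untouched

    pieces = IMG_SPLIT.split(source)
    # Odd indexes hold the img tags; even indexes hold the text between them.
    for i in range(1, len(pieces), 2):
        pieces[i] = RECINDEX.sub(to_src, pieces[i])
    return "".join(pieces)


image_names = ["image00001.jpg", None, "image00003.gif"]
source = '<p>x<img recindex="00001"/>y<img recindex="00003">z<img recindex="00002"></p>'
fixed = fix_image_links(source, image_names)
print(fixed)
```

Only the short tag strings are ever rewritten; the large text segments are carried through the list unchanged until the final join.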

> Of course we have to skip over the original <img> tags in this merge, but as the match objects contain the position information where the found string was located, it should be very easy.

> Since we have only the list of <img> tags in memory instead of all the split source file data, I expect that solution to be faster than your implementation.

Makes sense. Please feel free to make any changes you like. I only have one old dictionary to test with and so can't really fine tune it much. If your way is faster and keeps the proper image file name extensions, I am all for it.

Once we have that stable, I am going to time FastConcat with hugeFile set against FastConcat without it, to see how much of a penalty there is for doing everything in memory with lists of string segments rather than one huge string that is constantly being added to.

Take care,

Kevin
Old 07-20-2011, 12:50 PM   #83
KevinH
Sigil Developer
Posts: 7,469
Karma: 5432724
Join Date: Nov 2009
Device: many
Hi,

For fun ... I ran mobiunpack_v0.28.py on my one dictionary (file size is 27,585,020 bytes) and timed it (clock time from date in shell script both before and after mobiunpack) and then hard coded hugeFile to False and re-ran.

With hugeFile set to True (uses file I/O to temporary files):

Run  Start     Stop      Elapsed Time
1    12:25:21  12:26:39  1 minute 18 seconds
2    12:26:45  12:28:02  1 minute 17 seconds

With hugeFile set to False (uses lists of strings and "".join(strlist)):

Run  Start     Stop      Elapsed Time
1    12:29:18  12:30:32  1 minute 14 seconds
2    12:30:38  12:31:53  1 minute 15 seconds

It was as I expected. There is no "memory issue" when using lists of strings.
On most OSes, file I/O has overhead and typically writes data to large memory buffers (buffered I/O), and does not actually flush them to disk unless forced to or until the file is closed. So any slight savings in memory use is offset by the disk overhead.

So it appears there is no real advantage for using temporary file IO over using lists of strings and a final join.
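
For anyone who wants to repeat the comparison outside mobiunpack, here is a small stand-alone sketch of the two strategies. The function names and the generated chunks are made up for illustration, and the timings it prints will depend on the machine:

```python
import tempfile
import time


def build_with_join(chunks):
    """Collect the pieces in a list and join them once at the end."""
    parts = []
    for chunk in chunks:
        parts.append(chunk)
    return "".join(parts)


def build_with_tempfile(chunks):
    """Write the pieces to a temporary file and read the result back."""
    with tempfile.TemporaryFile(mode="w+", encoding="utf-8", newline="") as tmp:
        for chunk in chunks:
            tmp.write(chunk)
        tmp.seek(0)
        return tmp.read()


# Hypothetical workload standing in for the dictionary text.
chunks = ["<p>entry %d</p>\n" % i for i in range(100000)]
for build in (build_with_join, build_with_tempfile):
    start = time.perf_counter()
    result = build(chunks)
    elapsed = time.perf_counter() - start
    print("%s: %d characters in %.3f s" % (build.__name__, len(result), elapsed))
```

Both functions return the same string, so either can be swapped in without changing the output.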

Please try the same thing with your dictionaries and see if you get the same results. If so, we can probably remove the file I/O approach and remove FastConcat and just go with the string list approach.

Thanks,

Kevin
Old 07-20-2011, 03:45 PM   #84
siebert
Developer
Posts: 155
Karma: 280
Join Date: Nov 2010
Device: Kindle 3 (Keyboard) 3G / iPad 9 WiFi / Google Pixel 6a (Android)
Quote:
Originally Posted by KevinH View Post
Please try the same thing with your dictionaries and see if you get the same results. If so, we can probably remove the file I/O approach and remove FastConcat and just go with the string list approach.
Will do. I've also implemented my proposed handling of the image tags. I'm no longer sure that it will be faster than your implementation, but I'll do some measurements on this variant, too.

Ciao,
Steffen
Old 07-20-2011, 03:50 PM   #85
siebert
Developer
Posts: 155
Karma: 280
Join Date: Nov 2010
Device: Kindle 3 (Keyboard) 3G / iPad 9 WiFi / Google Pixel 6a (Android)
Quote:
Originally Posted by KevinH View Post
I am not sure what your concern here is. A file is only created if it is a known image type. The remaining code in the loop simply invokes imghdr, which just looks at select byte strings near the front of the data string (very much like what you are doing, so it should be very fast), and then appends a placeholder to a list. Nothing here will impact processing time much, if at all, versus your version.
My concern wasn't speed but that it's not very elegant to search for image headers in sections already known to not contain images.

Quote:
As Paul indicated, this may not be the case so this version is safer.
I would consider such files to be broken, especially if the additional image sections occur after the EOF section. But if such files exist we should be able to decode them, of course.

So I've changed the code to skip non-image sections again but still work for such broken files.

Ciao,
Steffen
Old 07-20-2011, 03:54 PM   #86
pdurrant
The Grand Mouse 高貴的老鼠
Posts: 71,367
Karma: 305065800
Join Date: Jul 2007
Location: Norfolk, England
Device: Kindle Voyage
Quote:
Originally Posted by siebert View Post
So I've changed the code to skip non-image sections again but still work for such broken files.
That does seem more elegant. I'm really pleased to see work being done on this useful script again.
Old 07-20-2011, 04:05 PM   #87
siebert
Developer
Posts: 155
Karma: 280
Join Date: Nov 2010
Device: Kindle 3 (Keyboard) 3G / iPad 9 WiFi / Google Pixel 6a (Android)
Quote:
Originally Posted by pdurrant View Post
Oh, of course! Perhaps we could just store all metadata as a list, rather than a single object in the map.

Anyone better than me at Python like to give it a go?
I've started to implement storing metadata as lists, and while it's not pretty it seems to work, though I haven't yet tested a file which actually contains duplicate metadata tags.

But I've noticed that several tags are currently not handled by mobiunpack (e.g. 202-209, 300).

I would like to get some input about how mobiunpack should handle them. I doubt that mobigen/kindlegen supports all these tags (if any), but there are already tags that are exported to the opf file even though they are ignored by mobigen/kindlegen (e.g. the ASIN).

Are there other tools which actually support these tags and use the values or are they just for information?

In the latter case I would like to mark them as such (for example by putting them into a comment section) to make clear that their value won't affect the generated mobi.

Another solution would be to define a new list of ignored tags, so it's clear that we are aware of those tags but deliberately don't include them in the opf file.

Ciao,
Steffen
Old 07-20-2011, 04:16 PM   #88
pdurrant
The Grand Mouse 高貴的老鼠
Posts: 71,367
Karma: 305065800
Join Date: Jul 2007
Location: Norfolk, England
Device: Kindle Voyage
Quote:
Originally Posted by siebert View Post
But I've noticed that several tags are currently not handled by mobiunpack (e.g. 202-209, 300).

I would like to get some input about how mobiunpack should handle them. I doubt that mobigen/kindlegen supports all these tags (if any), but there are already tags that are exported to the opf file even though they are ignored by mobigen/kindlegen (e.g. the ASIN).

Are there other tools which actually support these tags and use the values or are they just for information?

In the latter case I would like to mark them as such (for example by putting them into a comment section) to make clear that their value won't affect the generated mobi.

Another solution would be to define a new list of ignored tags, so it's clear that we are aware of those tags but deliberately don't include them in the opf file.
I think that idea of exporting all the information in the EXTH tags, even if only as comments, is a very good one.

We could have a list of tags for export as comments, where we have some idea of what the tags mean, and then also do a simple dump into comments of any completely unknown tags.
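
Something along these lines might do it. This is only a sketch: the tag names in `INFO_TAGS` are illustrative guesses rather than a confirmed mapping, and `opf_comment` and `exth_comments` are hypothetical helpers. The one hard constraint is that an XML comment must not contain "--":

```python
import re

# Illustrative labels only; the real meaning of each tag is an assumption.
INFO_TAGS = {204: "Creator Software", 208: "Watermark"}


def opf_comment(text):
    """Wrap text in an XML comment, breaking up any '--' sequences."""
    safe = re.sub(r"-(?=-)", "- ", text)
    return "<!-- %s -->" % safe


def exth_comments(records):
    """Return one comment line per (tag, data) record."""
    lines = []
    for tag, data in records:
        if tag in INFO_TAGS:
            value = data.decode("utf-8", "replace")
            text = "%s (EXTH %d): %s" % (INFO_TAGS[tag], tag, value)
        else:
            # Completely unknown tags are dumped as hex so nothing is lost.
            text = "Unknown EXTH %d: 0x%s" % (tag, data.hex())
        lines.append(opf_comment(text))
    return lines


comments = exth_comments([(208, b"a--b"), (999, b"\x00\xff")])
for line in comments:
    print(line)
```

Dumping unknown values as hex sidesteps any question of whether the bytes are printable.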

The plan (if it can be called that) behind the opf generation was to add as much info from the EXTH as possible that was valid in an OPF file, whether or not KindleGen would use it.

I'm looking forward to seeing what you come up with. I do have some test files with multiple authors.
Old 07-22-2011, 12:12 PM   #89
KevinH
Sigil Developer
Posts: 7,469
Karma: 5432724
Join Date: Nov 2009
Device: many
Hi,

Instead of making all metadata elements lists, which is a bit messy code-wise (especially for something that is not a common event), it may be easier and cleaner to check whether a value with that key already exists and, if so, append a string delimiter (it can be any unique identifier string we want, "&#$%" or whatever) and then add the new data to the end. That way, whether there is only 1 author or many authors, all data is stored in a simple string in the metadata dictionary.

This is clean and easy to do using .get(key, "") on the key to return either the current value for that key or the empty string. If it is not empty you append the string delimiter, then you just append the new value for the key. It also works with encoding to utf-8 quite easily.

When we go to write it out, simply split on the string delimiter and write out each piece. If there is no delimiter present in the string, you will only write out 1.
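
In code, the delimiter idea would be roughly the following. `add_value` and `values_for` are hypothetical helper names, and the sketch assumes that the delimiter never occurs inside a value and that values are never empty:

```python
DELIM = "&#$%"  # any string that cannot appear in real metadata values


def add_value(metadata, key, value):
    """Append value to the string stored under key, delimiter-separated."""
    current = metadata.get(key, "")
    if current:
        current += DELIM
    metadata[key] = current + value


def values_for(metadata, key):
    """Split the stored string back into its individual values."""
    return metadata[key].split(DELIM)


metadata = {}
add_value(metadata, "Creator", "Author One")
add_value(metadata, "Creator", "Author Two")
add_value(metadata, "Title", "Some Book")
print(values_for(metadata, "Creator"))
print(values_for(metadata, "Title"))
```

A key with a single value splits into a one-element list, so the writing code is the same in both cases.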

As for keeping all values for metadata, I am for that, but we need to be careful in that some mobi files will have binary data in some metadata values (left over from keys previously used for DRM, etc.) and we can run into byte sequences that are not valid utf-8. So we may want to hex or base64 encode these values if you want to maintain them in some way.
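
One possible shape for that fallback, as a sketch only. `printable_value` and the "hex:" prefix are made up for illustration:

```python
def printable_value(data):
    """Return data as text if it is clean UTF-8, otherwise as a hex string."""
    try:
        text = data.decode("utf-8")
    except UnicodeDecodeError:
        return "hex:" + data.hex()
    # Valid UTF-8 can still contain control characters that XML forbids.
    if any(ord(ch) < 0x20 and ch not in "\t\n\r" for ch in text):
        return "hex:" + data.hex()
    return text


print(printable_value("Zo\u00eb".encode("utf-8")))
print(printable_value(b"\xff\xfe\x00"))
```

`base64.b64encode(data)` from the standard library would work in place of `data.hex()` if a more compact encoding is preferred.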

My two cents,

Kevin
Old 07-22-2011, 12:56 PM   #90
siebert
Developer
Posts: 155
Karma: 280
Join Date: Nov 2010
Device: Kindle 3 (Keyboard) 3G / iPad 9 WiFi / Google Pixel 6a (Android)
Quote:
Originally Posted by KevinH View Post
Hi,

Instead of making all metadata elements lists, which is a bit messy code-wise (especially for something that is not a common event), it may be easier and cleaner to check whether a value with that key already exists and, if so, append a string delimiter (it can be any unique identifier string we want, "&#$%" or whatever) and then add the new data to the end.
Sorry, but using strings with delimiters would be a very unpythonic solution.

One might implement a solution to use strings for single values and a list of strings only if multiple values exist and use type() to distinguish both cases, but I've refactored my all-list solution already to be usable.

I'm almost done (the temporary file code was also removed). Do you want me to just publish it when it's finished, or do you want to take a look first (let me know your email address then)?

Quote:
As for keeping all values for metadata, I am for that, but we need to be careful in that some mobi files will have binary data in some metadata values (left over from keys previously used for DRM, etc.) and we can run into byte sequences that are not valid utf-8. So we may want to hex or base64 encode these values if you want to maintain them in some way.
I decided to have a list of types to ignore (so far 209, 300 and 403), as the content is unprintable and of very little interest. The values of all other supported tags are supposed to be printable.

By having a list for them the code can now warn about any unknown tag it might encounter.
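
In outline, the handling looks something like the sketch below. The ignored tag numbers are the ones named above, but `KNOWN_TAGS` is just an illustrative subset and `sort_exth` is a hypothetical name, not the function in the script:

```python
IGNORED_TAGS = {209, 300, 403}                   # unprintable, deliberately skipped
KNOWN_TAGS = {100: "creator", 101: "publisher"}  # illustrative subset only


def sort_exth(records):
    """Split (tag, data) records into metadata lists and warning messages."""
    metadata = {}
    warnings = []
    for tag, data in records:
        if tag in IGNORED_TAGS:
            continue
        name = KNOWN_TAGS.get(tag)
        if name is None:
            warnings.append("Warning: unknown EXTH tag %d (%d bytes)" % (tag, len(data)))
            continue
        metadata.setdefault(name, []).append(data.decode("utf-8", "replace"))
    return metadata, warnings


records = [(100, b"Author One"), (100, b"Author Two"), (300, b"\x00\x01"), (999, b"x")]
metadata, warnings = sort_exth(records)
print(metadata)
for message in warnings:
    print(message)
```

Keeping the ignored tags in their own set is what separates "seen and skipped on purpose" from "never seen before".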

Ciao,
Steffen