07-20-2011, 10:11 AM   #82
KevinH
Sigil Developer
Hi Steffen,

> I'm not so happy that you disabled skipping over sections which have been recognized as non-image sections.

I am not sure what your concern is here. A file is only created if it is a known image type. The remaining code in the loop simply invokes imghdr, which just looks at a few selected byte strings near the front of the data (very much like what you are doing, so it should be very fast) and then appends a placeholder to a list. Nothing here should impact processing time much, if at all, compared with your version.
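
Roughly, what that per-section check amounts to is the following (a minimal sketch of the idea, not the exact code from the script):

Code:
import imghdr

def image_extension(data):
    # imghdr.what() only inspects a few signature bytes at the start
    # of the buffer, so this is cheap even for large image sections.
    imgtype = imghdr.what(None, data)
    if imgtype is None:
        return None           # unknown type: caller appends a placeholder
    if imgtype == 'jpeg':
        return '.jpg'         # normalize imghdr's 'jpeg' to the usual '.jpg'
    return '.' + imgtype      # e.g. '.png', '.gif', '.bmp'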

> As far as I know, these non-image sections appear only at the end, after all images, never between images, so it should not confuse the image name index if we just skip over them.

As Paul indicated, this may not be the case so this version is safer.

> In that case we should also treat image sections whose type we can't determine as an error and print a message to stdout.

Feel free to add that if you like. My main concern was adding the correct image filename extensions so that later post-processing to xhtml works properly (i.e., for those not using kindlegen or Mobipocket Creator).

> I think it might be even faster to just search for all <img> tags in the source and then merge the source with the replaced <img> tags, like the merge we do in the "apply dictionary metadata and anchors" section.

That is similar to what is happening here. A regular expression is used to split the source, which breaks the string up into segments: because the pattern uses a capturing group, the odd-indexed pieces (1, 3, 5, 7, ...) are the <img> tags and the even-indexed pieces are everything else before, between, or after them.

Then, when we do replacements, all we are doing is swapping out an element of that list, so we only ever process the <img> tags themselves. There is no need to create and delete 26 MB to 100 MB copies of the whole string each time, and at the end you simply put it back together using join.
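
In outline, that split/replace/join pattern looks like this (a minimal sketch; the actual expression and tag rewriting in the script differ):

Code:
import re

img_pattern = re.compile(r'(<img[^>]*>)', re.IGNORECASE)

def fix_img_tags(source, rewrite_tag):
    # Splitting on a *capturing* group keeps the matched tags in the
    # result: pieces[1], pieces[3], pieces[5], ... are the <img> tags,
    # the even-indexed pieces are the untouched text around them.
    pieces = img_pattern.split(source)
    for i in range(1, len(pieces), 2):
        # Swap one small list element; the big even-indexed chunks are
        # never copied or modified.
        pieces[i] = rewrite_tag(pieces[i])
    return ''.join(pieces)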

> Of course we have to skip over the original <img> tags in this merge, but since the match objects contain the position information of where each found string was located, it should be very easy.

> Since we would have only the list of <img> tags in memory instead of all the split-up source file data, I expect that solution to be faster than your implementation.

Makes sense. Please feel free to make any changes you like. I only have one old dictionary to test with, so I can't really fine-tune it much. If your way is faster and keeps the proper image file name extensions, I am all for it.
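
For reference, here is a minimal sketch of the finditer()-based merge you are describing (the pattern and names are illustrative):

Code:
import re

img_pattern = re.compile(r'<img[^>]*>', re.IGNORECASE)

def fix_img_tags_by_position(source, rewrite_tag):
    result = []
    pos = 0
    for m in img_pattern.finditer(source):
        # m.start() and m.end() give the position of the original tag,
        # so we can copy the untouched text before it and skip the tag.
        result.append(source[pos:m.start()])
        result.append(rewrite_tag(m.group()))
        pos = m.end()
    result.append(source[pos:])
    return ''.join(result)

Only the match objects are held as separate data; the bulk of the source stays in the one original string until the final join.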

Once we have that stable, I am going to run timing comparisons of FastConcat with hugeFile set against FastConcat without it, to see how much of a penalty there is for doing everything in memory with lists of string segments instead of one huge string that is constantly being appended to.
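
Schematically, the two in-memory patterns being contrasted are these (just an illustration, not the actual FastConcat code):

Code:
def build_one_huge_string(chunks):
    text = ''
    for c in chunks:
        text += c          # each += may reallocate and copy the whole string
    return text

def build_segment_list(chunks):
    segments = []
    for c in chunks:
        segments.append(c) # cheap append; no large copies until the end
    return ''.join(segments)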

Take care,

Kevin