MobileRead Forums - View Single Post - KindleUnpack (MobiUnpack): Extracts text, images and metadata from Kindle/Mobi files

KevinH · 12-28-2012, 12:37 PM

Hi Sergey,

Your version is a bit older than my version as line numbers do not match up.
Did you use Mobi_Unpack v59 or an earlier version?

> 1. PalmdocReader misses the case where c == 0

Doesn't the case c < 128 handle this? What am I missing?
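For reference, a minimal Python 3 sketch of PalmDoc decompression, assuming the usual byte ranges for that format. This is not the actual PalmdocReader code, and the name palmdoc_unpack is made up for illustration. It shows that c == 0 falls into the c < 128 branch and is emitted as a literal byte:

```python
def palmdoc_unpack(data: bytes) -> bytes:
    """Illustrative PalmDoc decompressor (hypothetical helper, not Mobi_Unpack's)."""
    out = bytearray()
    p = 0
    while p < len(data):
        c = data[p]
        p += 1
        if 1 <= c <= 8:
            # copy the next c bytes verbatim
            out += data[p:p + c]
            p += c
        elif c < 128:
            # literal byte; this branch also covers c == 0
            out.append(c)
        elif c >= 192:
            # a space followed by the character c ^ 0x80
            out.append(0x20)
            out.append(c ^ 0x80)
        elif p < len(data):
            # two-byte back reference: 11-bit distance, 3-bit length (+3)
            pair = (c << 8) | data[p]
            p += 1
            dist = (pair >> 3) & 0x07FF
            length = (pair & 0x07) + 3
            if 0 < dist <= len(out):
                for _ in range(length):
                    out.append(out[-dist])
    return bytes(out)
```

If the version you looked at tests c >= 1 first and has no later branch that catches 0, that would explain the difference.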

> 2. MobiHeader.__init__() assigns self.othidx = 0xfffffff instead of 0xffffffff

fixed
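A quick sketch of why the missing digit matters, assuming the field is read from the file as a 4-byte big-endian unsigned value (the names SHORT, FULL and value are only for illustration): a field filled with 0xff bytes never compares equal to the seven-digit constant.

```python
import struct

SHORT = 0xfffffff    # seven f's: 2**28 - 1
FULL = 0xffffffff    # eight f's: 2**32 - 1

# an "unused" 32-bit field as it would appear in the raw header bytes
raw = b"\xff\xff\xff\xff"
(value,) = struct.unpack(">L", raw)

print(value == FULL)   # the sentinel test that was intended
print(value == SHORT)  # the test the typo actually performs
```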

> 3. getLanguage 26 has two entries.

fixed: merged into single table entry

> 4. mobi_unpack.py:727 "# bytes 19 - 23: start of xor string" => "# bytes 20 - 23: start of xor string"

typo fixed

> 5. mobi_k8proc.__init__() adds 0xfffffff (instead of 0xffffffff) to the end of self.fdsttbl list

this was already fixed in my version

> 6. getVariableWidthValue() and readTagSection() are defined twice: in mobi_utils.py and mobi_index.py

moved to mobi_index and removed from mobi_utils

> 7. countSetBits() is defined twice: in mobi_index.py and mobi_dict.py. It doesn't need to be a member function.

changed to non-member function in mobi_index and removed from mobi_dict

> 8. the same with getTagMap().

changed to non-member function in mobi_index and removed from mobi_dict

> 9. This is not the most optimal way to write countSetBits(). See below (sorry for C#).

The mask-and-shift version is the normal, easily understood way to do this and it works well. It is not the bottleneck in execution speed.
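For anyone following along, a minimal sketch of the mask-and-shift approach as a plain (non-member) function. The name count_set_bits and the default of 8 bits are assumptions for this example, not necessarily what the released code uses:

```python
def count_set_bits(value, bits=8):
    """Count the set bits among the low `bits` bits of value."""
    count = 0
    for _ in range(bits):
        if value & 0x01:      # mask off the lowest bit
            count += 1
        value >>= 1           # shift the next bit into place
    return count
```

For values that fit in `bits` bits this gives the same answer as bin(value).count("1"), and the loop is short enough that it should not show up in a profile.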

> 10. num += 1 at the end of parseNCX() is redundant

fixed: removed last line

> 11. re: '''<[^<>]+filepos=['"]{0,1}(\d+)[^<>]*>'''. What about the closing quote or apos?

This is fine: we are only capturing the digits, and any closing ['"] is matched by the trailing [^<>]*
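A small check of that claim using the pattern exactly as quoted. The helper name filepos_values and the sample tags are invented for the example:

```python
import re

# pattern as quoted in item 11
FILEPOS_RE = re.compile(r'''<[^<>]+filepos=['"]{0,1}(\d+)[^<>]*>''')

def filepos_values(markup):
    """Return the digit strings captured from filepos attributes."""
    return FILEPOS_RE.findall(markup)

# made-up sample tags: double-quoted, single-quoted, and unquoted values
print(filepos_values('<a filepos="0000123">x</a>'))
print(filepos_values("<a filepos='45' class=\"c\">"))
print(filepos_values('<a filepos=678>'))
```

In each case only the digits end up in the group; the closing quote and any attributes after it are consumed by [^<>]* before the final >.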

> 12. re: join removing empty anchors in a single substitution. The existing re doesn't handle all possible WS.

I am not sure about this one. Exactly which file and which line are you talking about here? Can you give me more specifics?

> 13. mobi_opf.py:127. print format parameters are missing.
> 14. mobi_opf.py:51 escape() function is for escaping HTML text nodes, not attribute values. It doesn't escape " to &quot;

mobi_opf has recently been rewritten to properly escape things so I think this has been taken care of.
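To illustrate the point in item 14, a short sketch using the standard library's xml.sax.saxutils.escape. This is an assumption about the kind of escape() involved, not a copy of the mobi_opf code, and escape_attr is a hypothetical helper name:

```python
from xml.sax.saxutils import escape

def escape_attr(value):
    """Escape text for use inside a double-quoted attribute value."""
    # the extra mapping makes escape() also replace double quotes
    return escape(value, {'"': "&quot;"})

title = 'Tom & "Jerry"'
print(escape(title))       # quotes are left alone: fine for text nodes only
print(escape_attr(title))  # quotes replaced: safe inside attr="..."
```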

> 15. mobi_opf.py:222 tries to find 'StartOffset' in the metadata. This is hopeless because all keys (including 'StartOffset') were deleted at line 154.

Yes, as was noted in the code. This one is fixed by adding self.starting_offset, which is initialized to None and set the first time through processing, so it is available later.

So I think all of these changes have been made except for your number 12 and possibly your number 1. Can you add some more detail on those two?

Thanks,

KevinH
