KindleUnpack (MobiUnpack): Extracts text, images and metadata from Kindle/Mobi files - Page 32

KevinH · 12-31-2012, 08:48 PM

Hi Sergey,

If you are not seeing the correct characters in the Log window when running the GUI, please try replacing the QueuedStream class in Mobi_Unpack.pyw with the following:

Code:

# Wrap a stream so that output gets appended to shared queue
# using utf-8 encoding
class QueuedStream:
    def __init__(self, stream, q):
        self.stream = stream
        self.encoding = stream.encoding
        self.q = q
        if self.encoding is None:
            self.encoding = 'utf-8'
    def write(self, data):
        if isinstance(data,unicode):
            data = data.encode('utf-8',"replace")
        elif self.encoding != 'utf-8':
            udata = data.decode(self.encoding)
            data = udata.encode('utf-8', "replace")
        self.q.put(data)
    def __getattr__(self, attr):
        return getattr(self.stream, attr)

This should decode the stdout from the mobi_unpack.py (which will be in your local Russian code page) and encode it into utf-8 so that it should get written properly to the Log window (hopefully).
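For readers on Python 3, the same decode-then-re-encode idea can be sketched as below. This is only an illustration, not the code shipped in Mobi_Unpack.pyw: the names QueuedStreamSketch and FakeStream are hypothetical, and the cp1251 encoding simply stands in for a Russian code page.

```python
import queue


class QueuedStreamSketch:
    """Hypothetical Python 3 adaptation: forward writes to a queue as UTF-8 bytes."""

    def __init__(self, stream, q):
        self.stream = stream
        self.q = q
        # Some streams report no encoding; fall back to UTF-8 as the post does.
        self.encoding = getattr(stream, "encoding", None) or "utf-8"

    def write(self, data):
        if isinstance(data, str):
            # Text is encoded straight to UTF-8.
            data = data.encode("utf-8", "replace")
        elif self.encoding.lower() not in ("utf-8", "utf8"):
            # Bytes in the local code page are decoded first, then re-encoded.
            data = data.decode(self.encoding, "replace").encode("utf-8", "replace")
        self.q.put(data)

    def __getattr__(self, attr):
        # Everything else is delegated to the wrapped stream.
        return getattr(self.stream, attr)


class FakeStream:
    """Stand-in for a stdout that reports a Russian code page (assumption)."""
    encoding = "cp1251"


q = queue.Queue()
out = QueuedStreamSketch(FakeStream(), q)
out.write("Привет".encode("cp1251"))  # bytes in the local code page
out.write("мир")                      # already text
first = q.get()
second = q.get()
```

Both queue entries end up as UTF-8 bytes, which is what the Log window consumer would then decode.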

Please let me know if this helps.

Thanks,

KevinH

Sergey Dubinets · 01-01-2013, 07:19 PM

To: DiapDealer about quoteattr().
quoteattr() doesn't swap " and ' inside the attribute value, if that is what you mean. If the value contains no ", quoteattr() wraps it in "" without any additional encoding. If it contains " but no ', it wraps it in '' instead. If both ' and " are present in the value, quoteattr() replaces each " with &quot; and wraps the value in "".

If for some reason you always want attribute values in "", you can escape " yourself. There is no need to escape ' in that case.
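A small sketch of the behaviour described above, using xml.sax.saxutils from the standard library. The helper always_double_quoted is a hypothetical name for the "always use double quotes" variant, not something taken from the scripts.

```python
from xml.sax.saxutils import escape, quoteattr

# No quote characters in the value: wrapped in double quotes.
plain = quoteattr("plain")

# Only double quotes in the value: wrapped in single quotes, value unchanged.
has_double = quoteattr('say "hi"')

# Only single quotes in the value: wrapped in double quotes, value unchanged.
has_single = quoteattr("it's")

# Both kinds present: each " becomes &quot; and the value is wrapped in double quotes.
has_both = quoteattr("5' 10\"")


def always_double_quoted(value):
    """Hypothetical helper: always wrap in double quotes, escaping only the " character."""
    return '"' + escape(value, {'"': "&quot;"}) + '"'


for result in (plain, has_double, has_single, has_both, always_double_quoted('say "hi"')):
    print(result)
```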

Sergey Dubinets · 01-01-2013, 07:39 PM

To DiapDealer about double unescaping.
It is not as innocent as it may appear.
Of course, if the value doesn't contain any '&', additional unescaping does no harm.
The problem arises when the unescaped value contains a known entity. For example, suppose the title of the article is "Don't double unescape &amp; in metadata".
The escaped string would be "Don't double unescape &amp;amp; in metadata".
If you unescape it twice, or unescape the original string, you get "Don't double unescape & in metadata", which is not what the original title was.

In short: double unescaping is a bug and it results in "data loss".
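The point can be reproduced in a few lines. This sketch uses Python 3's html.unescape as a stand-in for the HTMLParser.unescape() call discussed in the thread (an assumption about equivalence for this simple case), and the variable names are illustrative only.

```python
import html
from xml.sax.saxutils import escape

# The literal text of the title, which itself contains the five characters "&amp;".
title = "Don't double unescape &amp; in metadata"

# Escaping it once for storage turns the "&" into "&amp;".
stored = escape(title)

# Unescaping once recovers the title; unescaping again silently changes it.
once = html.unescape(stored)
twice = html.unescape(once)

print(stored)
print(once)
print(twice)
```

After the second unescape the "&amp;" in the title has collapsed to a bare "&", so the original text can no longer be recovered.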

DiapDealer · 01-01-2013, 10:10 PM

Quote:

Originally Posted by Sergey Dubinets

To DiapDealer about double unescaping.
It is not as innocent as it may appear.
Of course, if the value doesn't contain any '&', additional unescaping does no harm.
The problem arises when the unescaped value contains a known entity. For example, suppose the title of the article is "Don't double unescape &amp; in metadata".
The escaped string would be "Don't double unescape &amp;amp; in metadata".
If you unescape it twice, or unescape the original string, you get "Don't double unescape & in metadata", which is not what the original title was.

In short: double unescaping is a bug and it results in "data loss".

I think you still might be mistaken. I'm not "double unescaping" anything. I'm first unescaping everything and then re-escaping only the three characters that must be escaped. Under the current code:

Code:

xmlescape(self.h.unescape(value))

HTMLParser.unescape() first takes: "Don't double unescape &amp; in metadata".

And makes it: "Don't double unescape & in metadata".

Then saxutils.escape() makes it: "Don't double unescape &amp; in metadata".

No data loss. And you can't create a mobi with kindlegen that preserves and displays the literal text "&amp;" in the title anyway.

Your example is a perfect illustration of why I've chosen to do it the way I have. Without HTMLParser's initial unescape(), using the saxutils escape() method alone (which is required to handle any html tags or unescaped ampersands) would result in a valid "&amp;" being turned into "&amp;amp;". Just like you described.

The current method will preserve all pre-existing &lt; &gt; and &amp; entities while converting any other entities encountered to their character representations and properly escaping any html tags and naked ampersands.
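A rough Python 3 sketch of the unescape-then-escape approach described here. html.unescape stands in for HTMLParser.unescape(), and normalize_metadata is a hypothetical name, not a function from the scripts.

```python
import html
from xml.sax.saxutils import escape


def normalize_metadata(value):
    """Hypothetical helper: decode every HTML entity, then re-escape only &, < and >."""
    return escape(html.unescape(value))


samples = [
    "Tom &amp; Jerry",   # pre-existing &amp; survives unchanged
    "1 &lt; 2",          # pre-existing &lt; survives unchanged
    "Caf&eacute;",       # other entities become real characters
    "A & B",             # a naked ampersand gets escaped
    "<b>Bold</b>",       # html tags get escaped
]

for sample in samples:
    print(sample, "->", normalize_metadata(sample))
```

The trade-off raised earlier in the thread still applies: if the stored value was never escaped and really contains the literal text "&amp;", this pipeline cannot tell the difference and writes it out as a plain ampersand.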

Sergey Dubinets · 01-02-2013, 12:32 AM

If the value in the mobi file is HTML-escaped, you need to unescape it using HTML rules for processing and then escape it according to XML rules when writing to the XML file. As you do.

My statement was: you can't unescape "just in case" (on the grounds that no harm is done).
If the metadata has escaped strings, we have to unescape them.
If it has non-escaped strings, we shouldn't do this.

pdurrant · 01-02-2013, 03:07 AM

Quote:

Originally Posted by Sergey Dubinets

My statement was: you can't unescape "just in case" (on the grounds that no harm is done).
If the metadata has escaped strings, we have to unescape them.
If it has non-escaped strings, we shouldn't do this.

The problem we have in Mobiunpack is that there is no metadata in the mobipocket file to tell us whether the string has been escaped or not.

All we can do is pick the least bad option. Which, IMO, is to unescape the text. It is far, far more common that the text has been escaped than that the text is unescaped but with some apparently escaped entities.

KevinH · 01-02-2013, 09:17 PM

Hi,

Here is a new version of Mobi_Unpack (experimental v2) which has all known bug fixes in place.

Its primary new feature is a more robust GUI interface (Mobi_Unpack.pyw) that should better support international users on Windows with improved full unicode support for all file paths and file names.

This should be considered beta level software. I would really appreciate hearing back about any successes or failures. If it works well this version should become Mobi_Unpack_v061 final.

Thanks,

KevinH

Loceka · 01-07-2013, 02:17 PM

Hello,

I'm new to the forum and I've found that wonderful script(s) but I can't find out how to launch the mobi_dict.py script.

I've seen no specific option in the GUI, and launching it using python (with or without arguments) has no effect:

Code:

python mobi_dict.py <in> <out>

So what's the way to use it ?

Thanks,
Loceka.

Doitsu · 01-07-2013, 02:48 PM

AFAIK, mobi_dict.py is a module that is automatically called by the main .pyw if a dictionary .mobi file is detected.
I.e. if you want to decompile a dictionary simply execute Mobi_Unpack.pyw. It'll automatically execute mobi_dict.py with the correct parameters.

DiapDealer · 01-07-2013, 02:55 PM

Quote:

Originally Posted by Loceka

Hello,

I'm new to the forum and I've found that wonderful script(s) but I can't find out how to launch the mobi_dict.py script.

I've seen no specific option in the GUI, and launching it using python (with or without arguments) has no effect:

Code:

python mobi_dict.py <in> <out>

So what's the way to use it ?

Thanks,
Loceka.

The mobi_dict.py script is not meant to be invoked directly. The main script imports the necessary methods and classes from mobi_dict.py when it encounters a dictionary-type mobi. Note however, that dictionary support has always been very limited and quite fragile. There's a very good possibility that even if you get your dictionary successfully unpacked, you might never get it rebuilt into a functioning mobi dictionary again.

Loceka · 01-08-2013, 02:31 PM

Thank you both for your answers.
I was misled by the file name and thought it was meant to convert Mobi dictionaries to the DICT format, my bad.

The dictionaries I tried to extract were mostly extracted successfully despite the errors in the logs:

Code:

Error: Dictionary contains multiple inflection index sections, which is not yet supported
Error: Dictionary uses obsolete inflection rule scheme which is not yet supported

I'm not quite sure about what they really mean (if something will be messed up in the extracted HTML file or not) but the output HTML file is not empty and seems ok.

Loceka · 01-09-2013, 05:05 PM

Well thank you all again for those scripts.

I've made one of my own (in Perl) that converts a mobi dictionary into a Kobo format dictionary.

Actually it may not be really useful because the ones I tried did not match the default Kobo dictionaries, but still it worked for me.

As for the script itself, it must be launched as:

Code:

perl mobi2kobo.pl -i <input file> -o <output dir>

For the moment it expects a cp1252 (WinLatin1) encoded HTML Mobi file as the input file. If the input file is UTF-8 encoded, the encoding must be changed in the source code.
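One way to avoid editing the source for each input, shown here in Python rather than Perl and not taken from mobi2kobo.pl, is to try UTF-8 first and fall back to cp1252. The function name decode_html is hypothetical, and the sketch assumes the input is one of those two encodings.

```python
def decode_html(raw):
    """Hypothetical helper: decode bytes as UTF-8, falling back to cp1252."""
    try:
        # Valid UTF-8 is accepted as-is.
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        # Otherwise assume WinLatin1; undefined bytes become replacement characters.
        return raw.decode("cp1252", errors="replace")


print(decode_html("café".encode("utf-8")))
print(decode_html("café".encode("cp1252")))
```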

It also requires some third-party programs:

marisa to build the index
gzip to compress the output HTML files
zip to create the final archive

Therefore, for the moment it only works on Linux (possibly Mac OS) but it should not be too hard to have it working on Windows too.

holdit · 01-15-2013, 09:46 AM

Thx to all!

Hit

pdurrant · 01-17-2013, 03:05 AM

Version 0.61 has now been uploaded to the first post of the thread.

This includes all recent fixes for the scripts, and now should fully support the use of unicode file names, thanks to lots of work by KevinH.

With version 0.61 the name of the script has been changed to KindleUnpack, since almost all Mobipocket files are now actually Kindle files from Amazon, and the script certainly handles files that are not Mobipocket at all (KF8 and .azw4).

JSWolf · 01-17-2013, 07:51 PM

Now we just need the Calibre plugin updated.
