Old 12-31-2012, 08:48 PM   #466
KevinH
Sigil Developer
Posts: 9,070
Karma: 6361556
Join Date: Nov 2009
Device: many
Hi Sergey,

If you are not seeing the correct characters in the Log window when running the GUI, please try replacing the QueuedStream class in Mobi_Unpack.pyw with the following:
Code:
# Wrap a stream so that output gets appended to a shared queue
# using utf-8 encoding
class QueuedStream:
    def __init__(self, stream, q):
        self.stream = stream
        self.encoding = stream.encoding
        self.q = q
        # Streams that report no encoding are treated as utf-8
        if self.encoding is None:
            self.encoding = 'utf-8'
    def write(self, data):
        if isinstance(data, unicode):
            # Unicode text: encode straight to utf-8
            data = data.encode('utf-8', "replace")
        elif self.encoding != 'utf-8':
            # Byte string in the local code page: decode, then re-encode as utf-8
            udata = data.decode(self.encoding)
            data = udata.encode('utf-8', "replace")
        self.q.put(data)
    def __getattr__(self, attr):
        # Delegate everything else to the wrapped stream
        return getattr(self.stream, attr)

This should decode the stdout of mobi_unpack.py (which will be in your local Russian code page) and re-encode it as utf-8, so that it is (hopefully) written properly to the Log window.
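
For readers on Python 3, the same decode-then-re-encode idea can be sketched as follows. This is only an illustration, not code from Mobi_Unpack: the class name `QueuedTextStream` is hypothetical, and `cp1251` is assumed here as an example of a Russian code page (the actual console encoding may differ).

```python
import queue


class QueuedTextStream:
    """Hypothetical Python 3 analogue: push everything onto a queue as utf-8 bytes."""

    def __init__(self, q, source_encoding=None):
        self.q = q
        # Fall back to utf-8 when the source encoding is unknown
        self.source_encoding = source_encoding or "utf-8"

    def write(self, data):
        if isinstance(data, bytes):
            # Bytes arrive in the local code page: decode them to text first
            data = data.decode(self.source_encoding, "replace")
        # Text is always queued as utf-8 bytes
        self.q.put(data.encode("utf-8", "replace"))


q = queue.Queue()
stream = QueuedTextStream(q, "cp1251")   # cp1251 is an assumed code page
stream.write("Привет".encode("cp1251"))  # bytes in the local code page
stream.write("мир")                      # already text
result = [q.get(), q.get()]
print([chunk.decode("utf-8") for chunk in result])
```

Whatever the input form, the consumer of the queue only ever sees utf-8 bytes, which is the property the Log window relies on.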

Please let me know if this helps.

Thanks,

KevinH
Old 01-01-2013, 07:19 PM   #467
Sergey Dubinets
Junior Member
Posts: 6
Karma: 10
Join Date: Dec 2012
Device: Kindle
To DiapDealer, about quoteattr():
quoteattr() doesn't swap " for ' (or back) inside the attribute value, if that is what you mean. If the value contains no ", quoteattr() wraps it in "" without any additional encoding. Likewise, if the value contains " but no ', it wraps it in ''. If both ' and " are present in the value, quoteattr() replaces each " with &quot; and wraps the value in "".

If for some reason you always want attribute values wrapped in "", you can escape " yourself. There is no need to escape ' in that case.
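
The quoting rules can be checked with a short sketch using `xml.sax.saxutils.quoteattr` from the Python 3 standard library. The sample values are made up for illustration.

```python
from xml.sax.saxutils import quoteattr

# Hypothetical sample values, one per quoting case
no_quotes = quoteattr('plain')        # no quote characters at all
only_double = quoteattr('say "hi"')   # contains " only
only_single = quoteattr("it's")       # contains ' only
both = quoteattr('it\'s "x"')         # contains both ' and "

print(no_quotes)    # "plain"
print(only_double)  # 'say "hi"'
print(only_single)  # "it's"
print(both)         # "it's &quot;x&quot;"
```

Only the last case changes the value itself; in the other three the text is merely wrapped in whichever quote character it does not contain.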

Last edited by Sergey Dubinets; 01-01-2013 at 07:23 PM.
Old 01-01-2013, 07:39 PM   #468
Sergey Dubinets
Junior Member
Posts: 6
Karma: 10
Join Date: Dec 2012
Device: Kindle
To DiapDealer, about double unescaping:
It is not as innocent as it may appear.
Of course, if the value doesn't contain any '&', an additional unescaping does no harm.
The problem arises when the unescaped value contains a known entity. For example, suppose the title of the article is "Don't double unescape &amp; in metadata".
The escaped string would be "Don't double unescape &amp;amp; in metadata".
If you unescape it twice, or unescape the original string, you get "Don't double unescape & in metadata", which is not the original title.

In short: double unescaping is a bug, and it results in data loss.
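
A minimal sketch of this data loss, assuming Python 3 and its standard `html.unescape` and `xml.sax.saxutils.escape` (the thread's own code is Python 2 and uses `HTMLParser.unescape()`). The title is assumed to contain the literal text `&amp;`.

```python
import html
from xml.sax.saxutils import escape

# The real title contains the literal five characters "&amp;"
title = "Don't double unescape &amp; in metadata"

stored = escape(title)        # what a correctly escaped file would hold
once = html.unescape(stored)  # one unescape recovers the title
twice = html.unescape(once)   # a second unescape destroys the entity text

print(stored)  # Don't double unescape &amp;amp; in metadata
print(once)    # Don't double unescape &amp; in metadata
print(twice)   # Don't double unescape & in metadata
```

After the second pass there is no way to tell whether the title originally said `&` or `&amp;`.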
Old 01-01-2013, 10:10 PM   #469
DiapDealer
Grand Sorcerer
Posts: 28,867
Karma: 207000000
Join Date: Jan 2010
Device: Nexus 7, Kindle Fire HD
Quote:
Originally Posted by Sergey Dubinets
To DiapDealer, about double unescaping:
It is not as innocent as it may appear.
Of course, if the value doesn't contain any '&', an additional unescaping does no harm.
The problem arises when the unescaped value contains a known entity. For example, suppose the title of the article is "Don't double unescape &amp; in metadata".
The escaped string would be "Don't double unescape &amp;amp; in metadata".
If you unescape it twice, or unescape the original string, you get "Don't double unescape & in metadata", which is not the original title.

In short: double unescaping is a bug, and it results in data loss.
I think you still might be mistaken. I'm not "double unescaping" anything. I'm first unescaping everything and then re-escaping only the three characters that must be escaped. Under the current code:

Code:
xmlescape(self.h.unescape(value))
HTMLParser.unescape() first takes: "Don't double unescape &amp; in metadata".

And makes it: "Don't double unescape & in metadata".

Then saxutils.escape() makes it: "Don't double unescape &amp; in metadata".

No data loss. And you can't create a mobi with kindlegen that preserves and displays the literal text "&amp;" in the title anyway.

Your example is a perfect illustration of why I've chosen to do it the way I have. Without HTMLParser's initial unescape(), using the saxutils escape() method alone (which is required to handle any html tags or unescaped ampersands) would result in a valid "&amp;" being turned into "&amp;amp;". Just like you described.

The current method will preserve all pre-existing &lt; &gt; and &amp; entities while converting any other entities encountered to their character representations and properly escaping any html tags and naked ampersands.
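
The unescape-then-escape approach can be sketched like this. It is an approximation rather than the script's actual code: the helper name `normalize` is hypothetical, and Python 3's `html.unescape` stands in for the Python 2 `HTMLParser.unescape()` call.

```python
import html
from xml.sax.saxutils import escape


def normalize(value):
    """Unescape all HTML entities, then re-escape only &, < and > for XML."""
    return escape(html.unescape(value))


# Hypothetical metadata values covering the cases discussed in the thread
samples = [
    "Tom &amp; Jerry",   # already-escaped ampersand is preserved
    "Tom & Jerry",       # naked ampersand gets escaped
    "1 &lt; 2",          # pre-existing &lt; is preserved
    "caf&eacute;",       # other entities become plain characters
    "<b>bold</b>",       # html tags get escaped
]
results = {s: normalize(s) for s in samples}
for source, fixed in results.items():
    print(repr(source), "->", repr(fixed))
```

Both forms of the ampersand end up as `&amp;`, so the output is always valid XML, at the cost of not being able to represent a title that literally contains an entity name.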

Last edited by DiapDealer; 01-01-2013 at 10:32 PM.
Old 01-02-2013, 12:32 AM   #470
Sergey Dubinets
Junior Member
Posts: 6
Karma: 10
Join Date: Dec 2012
Device: Kindle
If the value in the mobi file is HTML-escaped, you need to unescape it using HTML rules for processing and then escape it according to XML rules when writing it to the XML file. As you do.

My statement was: you can't unescape "just in case" (on the grounds that no harm is done).
If the metadata holds escaped strings, we have to unescape them.
If it holds unescaped strings, we shouldn't.
Old 01-02-2013, 03:07 AM   #471
pdurrant
The Grand Mouse 高貴的老鼠
Posts: 74,412
Karma: 318076944
Join Date: Jul 2007
Location: Norfolk, England
Device: Kindle Oasis
Quote:
Originally Posted by Sergey Dubinets
My statement was: you can't unescape "just in case" (on the grounds that no harm is done).
If the metadata holds escaped strings, we have to unescape them.
If it holds unescaped strings, we shouldn't.
The problem we have in Mobiunpack is that there is no metadata in the mobipocket file to tell us whether the string has been escaped or not.

All we can do is pick the least bad option. Which, IMO, is to unescape the text. It is far, far more common that the text has been escaped than that the text is unescaped but with some apparently escaped entities.
Old 01-02-2013, 09:17 PM   #472
KevinH
Sigil Developer
Posts: 9,070
Karma: 6361556
Join Date: Nov 2009
Device: many
Mobi_Unpack_experimental V2

Hi,

Here is a new version of Mobi_Unpack (experimental v2) which has all known bug fixes in place.

Its primary new feature is a more robust GUI (Mobi_Unpack.pyw) that should better support international users on Windows, with improved Unicode support for all file paths and file names.

This should be considered beta level software. I would really appreciate hearing back about any successes or failures. If it works well this version should become Mobi_Unpack_v061 final.

Thanks,

KevinH
Attached Files
File Type: zip Mobi_Unpack_experimental_v2.zip (56.6 KB, 459 views)
Old 01-07-2013, 02:17 PM   #473
Loceka
Member
Posts: 24
Karma: 10
Join Date: Jan 2013
Device: Kobo Glo
Hello,

I'm new to the forum and I've found these wonderful scripts, but I can't work out how to launch the mobi_dict.py script.

I've seen no specific option in the GUI, and launching it with python (with or without arguments) has no effect:
Code:
python mobi_dict.py <in> <out>
So what's the way to use it?

Thanks,
Loceka.
Old 01-07-2013, 02:48 PM   #474
Doitsu
Grand Sorcerer
Posts: 5,763
Karma: 24088559
Join Date: Dec 2010
Device: Kindle PW2
AFAIK, mobi_dict.py is a module that is automatically called by the main .pyw if a dictionary .mobi file is detected.
I.e. if you want to decompile a dictionary simply execute Mobi_Unpack.pyw. It'll automatically execute mobi_dict.py with the correct parameters.
Old 01-07-2013, 02:55 PM   #475
DiapDealer
Grand Sorcerer
Posts: 28,867
Karma: 207000000
Join Date: Jan 2010
Device: Nexus 7, Kindle Fire HD
Quote:
Originally Posted by Loceka
Hello,

I'm new to the forum and I've found these wonderful scripts, but I can't work out how to launch the mobi_dict.py script.

I've seen no specific option in the GUI, and launching it with python (with or without arguments) has no effect:
Code:
python mobi_dict.py <in> <out>
So what's the way to use it?

Thanks,
Loceka.
The mobi_dict.py script is not meant to be invoked directly. The main script imports the necessary methods and classes from mobi_dict.py when it encounters a dictionary-type mobi. Note however, that dictionary support has always been very limited and quite fragile. There's a very good possibility that even if you get your dictionary successfully unpacked, you might never get it rebuilt into a functioning mobi dictionary again.
Old 01-08-2013, 02:31 PM   #476
Loceka
Member
Posts: 24
Karma: 10
Join Date: Jan 2013
Device: Kobo Glo
Thank you both for your answers.
I was misled by the file name and thought it was meant to convert Mobi dictionaries to the DICT format, my bad.

The dictionaries I tried to extract were mostly extracted successfully, despite these errors in the logs:
Code:
Error: Dictionary contains multiple inflection index sections, which is not yet supported
Error: Dictionary uses obsolete inflection rule scheme which is not yet supported
I'm not quite sure about what they really mean (if something will be messed up in the extracted HTML file or not) but the output HTML file is not empty and seems ok.
Old 01-09-2013, 05:05 PM   #477
Loceka
Member
Posts: 24
Karma: 10
Join Date: Jan 2013
Device: Kobo Glo
Well thank you all again for those scripts.

I've made one of my own (in Perl) that converts a mobi dictionary into a Kobo format dictionary.

Actually it may not be really useful because the ones I tried did not match the default Kobo dictionaries, but still it worked for me.

The script itself must be launched as:
Code:
perl mobi2kobo.pl -i <input file> -o <output dir>
For the moment it expects a cp1252 (WinLatin1) encoded HTML Mobi file as its input file. If the input file is UTF-8 encoded, the expected encoding has to be changed in the source code.

It also requires some third-party programs:
  • marisa, to build the index
  • gzip, to compress the output HTML files
  • zip, to create the final archive
Therefore, for the moment it only works on Linux (and possibly Mac OS), but it should not be too hard to get it working on Windows too.
Attached Files
File Type: pl mobi2kobo.pl (8.4 KB, 802 views)
Old 01-15-2013, 09:46 AM   #478
holdit
Connoisseur
Posts: 86
Karma: 470352
Join Date: Dec 2012
Device: Kindle Fire, IPad
Thx to all!

Hit
Old 01-17-2013, 03:05 AM   #479
pdurrant
The Grand Mouse 高貴的老鼠
Posts: 74,412
Karma: 318076944
Join Date: Jul 2007
Location: Norfolk, England
Device: Kindle Oasis
Version 0.61 has now been uploaded to the first post of the thread.

This includes all recent fixes for the scripts, and now should fully support the use of unicode file names, thanks to lots of work by KevinH.

With version 0.61 the name of the script has been changed to KindleUnpack, since almost all Mobipocket files are now actually Kindle files from Amazon, and the script certainly handles files that are not Mobipocket at all (KF8 and .azw4).
Old 01-17-2013, 07:51 PM   #480
JSWolf
Resident Curmudgeon
Posts: 80,685
Karma: 150249619
Join Date: Nov 2006
Location: Roslindale, Massachusetts
Device: Kobo Libra 2, Kobo Aura H2O, PRS-650, PRS-T1, nook STR, PW3
Now we just need the Calibre plugin updated.
All times are GMT -4.