09-10-2011, 04:04 PM | #151 | |
Developer
Posts: 155
Karma: 280
Join Date: Nov 2010
Device: Kindle 3 (Keyboard) 3G / iPad 9 WiFi / Google Pixel 6a (Android)
|
Quote:
in my version I've already introduced a list of "known unknown" metadata (means that we know that these values exist, but we don't know the meaning) and mobiunpack complains only if an unknown value isn't in this list. I hope I'll find time to release my version soon Ciao, Steffen |
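A minimal sketch of the "known unknown" idea described above, not siebert's actual code. The id numbers, table names, and the `classify_metadata` / `collect_warnings` helpers are all hypothetical and only illustrate the lookup order: complain only when an id is in neither table.

```python
# All ids below are placeholders for illustration, not real findings.
KNOWN_IDS = {1: "title", 2: "author"}   # ids whose meaning is understood
KNOWN_UNKNOWN_IDS = {9001, 9002}        # ids seen in files, meaning still unknown


def classify_metadata(record_id):
    """Return 'known', 'known-unknown', or 'unknown' for a metadata id."""
    if record_id in KNOWN_IDS:
        return "known"
    if record_id in KNOWN_UNKNOWN_IDS:
        return "known-unknown"
    return "unknown"


def collect_warnings(record_ids):
    """Build warnings only for ids that appear in neither table."""
    return [
        "Warning: unknown metadata id %d" % rid
        for rid in record_ids
        if classify_metadata(rid) == "unknown"
    ]
```

With this layout, an id that has been observed before stays quiet until someone works out what it means and moves it into the first table.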
|
09-10-2011, 04:11 PM | #152 |
Junior Member
Posts: 2
Karma: 10
Join Date: Sep 2011
Device: iPad
|
Edited! Ok to delete this post as I found another thread that better answers my question
Last edited by Anjelous; 09-10-2011 at 05:02 PM. |
09-12-2011, 09:05 AM | #153 |
Member
Posts: 11
Karma: 10
Join Date: Sep 2011
Device: kindle 3
|
mobiunpack modifications
Hello everybody.
First I'd like to thank the community for all the good work. Without the homebrew tools, my experience with the mobi file format and the kindle as a whole wouldn't have been nearly as nice!

Back on topic: I first came here to ask if somebody is maintaining mobiunpack.py / accepting patches, but reading the last few posts it would seem that both pdurrant and siebert are working on a branch, am I right? If so, could I contribute? ...

Also, in the last few posts there was talk about extracting the NCX from mobi files. It just so happens that's the very feature I've been toying with this weekend, and it is what made me come here today. At this time I have a (pretty awful) proof-of-concept code that can extract a flat "chapter only" NCX. I got the necessary clues from the "writer" part of the calibre mobi module; I could elaborate on that if somebody's interested...

Apart from that, I made some corrections (like the encoding header in the html, which appears to be in siebert's branch) and also have an alternate "Adding anchors..." code that reconstructs all anchors, even when they're not referenced, and should avoid adding anchors in the <head> (a bug I encountered with some files). I was also interested in refactoring the code to be more readable / workable (this also appears to be in siebert's plans).

I started with the (pdurrant's?) version @ http://code.google.com/, but wouldn't mind switching... |
09-12-2011, 09:16 AM | #154 |
Sigil Developer
Posts: 7,650
Karma: 5433388
Join Date: Nov 2009
Device: many
|
Hi,
Great! The more the merrier. If you look through this topic you will find links to later versions than what we (pdurrant and I) hosted on code.google.com; we have not bothered to update that site lately.

Yes, you are right: siebert has added support for dictionaries and made some major speed improvements. I have added code to spit out more of the metadata so that the tool can be used to investigate what each metadata value means (for example, we recently found what we think is the expiration date), and pdurrant has added support for non-DRM versions of the .azw4 format.

Simply walk through this thread, grab the very latest version of mobiunpack.py that you see, and use that as your starting point. I believe you want mobiunpack.py version 0.31, posted by pdurrant a few days ago to this thread. siebert may have an even newer version but I don't think he has posted it yet. Let me know if you can't find it and I will post it again for you.

KevinH |
09-12-2011, 09:16 AM | #155 |
The Grand Mouse 高貴的老鼠
Posts: 71,510
Karma: 306214458
Join Date: Jul 2007
Location: Norfolk, England
Device: Kindle Voyage
|
I'm afraid the google code version is very much out of date. The current version is 0.31, which can be found in this thread here.
While we really ought to be using version control software, in a clever shared manner, at present we seem to be posting updates here, which I copy back to the fifth post in this thread. Some ncx generation code would be welcome. I posted a sample of the binary data representing an ncx, along with the source ncx file, here. Any other changes would be interesting to see too. |
09-12-2011, 09:17 AM | #156 |
Grand Sorcerer
Posts: 27,552
Karma: 193191846
Join Date: Jan 2010
Device: Nexus 7, Kindle Fire HD
|
I'd be very interested in seeing the NCX extraction code you've come up with. I use calibre to convert epubs to mobi, and then feed the output of mobiunpack to kindlegen... so not having to rebuild the NCX by hand each time would be very welcome indeed.
|
09-12-2011, 11:54 AM | #157 | |
Developer
Posts: 155
Karma: 280
Join Date: Nov 2010
Device: Kindle 3 (Keyboard) 3G / iPad 9 WiFi / Google Pixel 6a (Android)
|
Quote:
Kovid removed that option a few days ago because he doesn't like me and my request to make that option selectable via the gui (though I'm obviously not the only person liking that feature), but if you are willing to use either an older or a modified version of calibre you don't need the mobiunpack step. Ciao, Steffen |
|
09-12-2011, 12:14 PM | #158 |
Member
Posts: 11
Karma: 10
Join Date: Sep 2011
Device: kindle 3
|
here it is
Thank you very much for all your quick replies !
I just downloaded v31 from the link you provided and finished retro-fitting the modifications I had made to v23. I'll try to explain why / how I did those changes later, but first, as code speaks louder than words, here's the file. ....

Just a few words:
* I just finished merging; it's not tested.
* The NCX part is really a proof of concept. It does however produce an acceptable output on my test files with flat NCX. It consists of:
  - a code block with 3 methods just before unpackBook
  - a main code block in unpackBook, enclosed by "#TEST NCX"
  - a small mod to the OPF code, to add a ref to the NCX
* Other than that, there are some "empirical" changes I made while testing some files:
  - FILEPOS_ON_ALL_ANCHORS: an option to use an alternate code that processes all empty anchors instead of focusing on existing links...
  - replaced a " " by "\s+" in the "Insert hrefs into html" rx...
  - alternate way to set the html file encoding

EDIT: sorry, the file I uploaded contained several fatal errors I failed to spot.
EDIT: the new file should work at least with calibre-generated mobis...
EDIT2: added a text file describing what I gathered of the NCX equivalent in MOBI
EDIT3: basic fixes to the code...

Last edited by fandrieu; 09-12-2011 at 06:03 PM. |
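For readers who have not seen the attachment, here is a rough sketch of what writing a flat, "chapter only" NCX could look like once the chapter labels and their file positions are known. It is not fandrieu's code: `build_flat_ncx`, the `book.html` file name, and the `filepos<N>` anchor naming are assumptions made for illustration, and a complete NCX would also need a `<head>` block with the book's uid.

```python
from xml.sax.saxutils import escape


def build_flat_ncx(book_title, chapters, html_name="book.html"):
    """Build a flat NCX document.

    chapters is a list of (label, filepos) tuples; each navPoint points at a
    hypothetical '#filepos<N>' anchor inside html_name.
    """
    lines = [
        '<?xml version="1.0" encoding="utf-8"?>',
        '<ncx xmlns="http://www.daisy.org/z3986/2005/ncx/" version="2005-1">',
        '<docTitle><text>%s</text></docTitle>' % escape(book_title),
        '<navMap>',
    ]
    # One navPoint per chapter, numbered in reading order starting at 1.
    for order, (label, filepos) in enumerate(chapters, start=1):
        lines.append('<navPoint id="np%d" playOrder="%d">' % (order, order))
        lines.append('<navLabel><text>%s</text></navLabel>' % escape(label))
        lines.append('<content src="%s#filepos%d"/>' % (html_name, filepos))
        lines.append('</navPoint>')
    lines += ['</navMap>', '</ncx>']
    return "\n".join(lines)
```

The OPF side of the change would then only need a manifest item for the NCX file and a `toc` attribute on the spine that refers to it.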
09-12-2011, 12:48 PM | #159 | |
Grand Sorcerer
Posts: 27,552
Karma: 193191846
Join Date: Jan 2010
Device: Nexus 7, Kindle Fire HD
|
Quote:
|
|
09-12-2011, 01:58 PM | #160 | |
Sigil Developer
Posts: 7,650
Karma: 5433388
Join Date: Nov 2009
Device: many
|
Hi,
Thanks for posting! I grabbed it and tried it on a bunch of mobis I had and, unfortunately, the anchors for many of the internal links in the document no longer work. I tested it with mobiunpack version 31 without your changes and all internal links worked. So somehow your changes have broken some of the internal links. I will try to track this down.

I did get some form of NCX file, but it was incomplete and there were error messages:

Write html
ERROR: last byte not 0x80
ERROR: text not found 1354424
Wite ncx
Write opf

I will keep playing with it to see if I can get the internal links working again. Thanks for getting this ncx stuff going!

KevinH
|
|
09-12-2011, 02:03 PM | #161 | |
The Grand Mouse 高貴的老鼠
Posts: 71,510
Karma: 306214458
Join Date: Jul 2007
Location: Norfolk, England
Device: Kindle Voyage
|
Quote:
|
|
09-12-2011, 02:29 PM | #162 |
Grand Sorcerer
Posts: 27,552
Karma: 193191846
Join Date: Jan 2010
Device: Nexus 7, Kindle Fire HD
|
|
09-12-2011, 02:32 PM | #163 | |
Sigil Developer
Posts: 7,650
Karma: 5433388
Join Date: Nov 2009
Device: many
|
Hi,
Your link_pattern, used if FILEPOS_ON_ALL_ANCHORS is True, seems to be a bit broken. For example, here is what the rawml says for one link:

<a filepos=0000006414 >M<span><font size="2">APS</font></span></a>

but this link is never properly detected or processed by your link pattern:

link_pattern = re.compile(r'''<a\s*(></a>|/>)''', re.IGNORECASE)

So you might want to take another look at your link patterns to make sure rawml of this type gets processed properly.

Hope this helps,
KevinH
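The mismatch can be reproduced in a few lines. The first pattern is the one quoted in the post; the second, `filepos_link_pattern`, and the helper `find_filepos_targets` are hypothetical and only sketch one way a `filepos` link could be picked up. They are not the fix that ended up in mobiunpack.

```python
import re

# Raw markup for one link, as quoted in the post.
sample = '<a filepos=0000006414 >M<span><font size="2">APS</font></span></a>'

# Pattern from the post: matches only empty anchors such as <a></a> or <a />.
empty_anchor_pattern = re.compile(r'''<a\s*(></a>|/>)''', re.IGNORECASE)

# Hypothetical alternative: any opening <a ...> tag carrying a filepos attribute.
filepos_link_pattern = re.compile(
    r'''<a\s+[^>]*?filepos=['"]?(\d+)[^>]*>''', re.IGNORECASE
)


def find_filepos_targets(rawml):
    """Return the integer filepos value of every link found in rawml."""
    return [int(m.group(1)) for m in filepos_link_pattern.finditer(rawml)]


print(empty_anchor_pattern.search(sample))   # the quoted pattern finds nothing
print(find_filepos_targets(sample))          # the alternative finds the target
```

The quoted pattern fails here because, after `<a` and optional whitespace, it requires the tag to close immediately, so any anchor with attributes or content is skipped.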
|
|
09-12-2011, 04:50 PM | #164 |
Sigil Developer
Posts: 7,650
Karma: 5433388
Join Date: Nov 2009
Device: many
|
index support
Hi,
Okay, I looked more at this index material. It appears the "type" information is key to understanding how to read in the indx information. For example, to correctly parse the indx entries, I had to do something like the following:

    if type == 0x1f:
        # handle next two variable width unknowns
        pos, unk1 = getVariableWidthValue(navdata,offset)
        offset += pos
        print "unknown 1 is ", unk1
        pos, unk2 = getVariableWidthValue(navdata,offset)
        offset += pos
        print "unknown 2 is ", unk2
    if type == 0xdf:
        # handle next four variable width unknowns
        pos, unk1 = getVariableWidthValue(navdata,offset)
        offset += pos
        print "unknown 1 is ", unk1
        pos, unk2 = getVariableWidthValue(navdata,offset)
        offset += pos
        print "unknown 2 is ", unk2
        pos, unk3 = getVariableWidthValue(navdata,offset)
        offset += pos
        print "unknown 3 is ", unk3
        pos, unk4 = getVariableWidthValue(navdata,offset)
        offset += pos
        print "unknown 4 is ", unk4
    if type == 0x3f:
        # handle next three variable width unknowns
        pos, unk1 = getVariableWidthValue(navdata,offset)
        offset += pos
        print "unknown 1 is ", unk1
        pos, unk2 = getVariableWidthValue(navdata,offset)
        offset += pos
        print "unknown 2 is ", unk2
        pos, unk3 = getVariableWidthValue(navdata,offset)
        offset += pos
        print "unknown 3 is ", unk3

and then there is no need to look for or skip 0x80 values.

Also, the count is not the same as the number of entries in the CTOC. From my set of ebooks, the CTOC data always ends with '\0\0' double null bytes and it has variable length.

So I have attached a mobiunpack_test.py program that modifies things to work with a real amazon mobi ebook (as opposed to calibre-generated ones). Perhaps this might help others trying to track things down. I am going to try and figure out what each of these unknowns actually means.

Hope this helps,
KevinH

Last edited by KevinH; 09-15-2011 at 06:55 PM. |
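The per-type handling above can also be written as a table lookup. The sketch below is in Python 3 rather than the Python 2 of the post, and it rests on two assumptions that the thread does not spell out: that a variable width value carries 7 payload bits per byte with the high bit set on its last byte, and that the helper returns a (bytes consumed, value) pair as the calls in the post suggest. The names `get_variable_width_value`, `UNKNOWNS_PER_TYPE`, and `read_unknowns` are made up for this example.

```python
def get_variable_width_value(data, offset):
    """Decode one variable width value from a bytes object.

    Assumed encoding: 7 payload bits per byte, most significant group first,
    high bit (0x80) set on the final byte. Returns (bytes_consumed, value).
    Raises IndexError if the data ends before a terminating byte is seen.
    """
    value = 0
    consumed = 0
    while True:
        byte = data[offset + consumed]
        consumed += 1
        value = (value << 7) | (byte & 0x7F)
        if byte & 0x80:
            return consumed, value


# Number of variable width unknowns that follow each entry type,
# taken from the three cases listed in the post.
UNKNOWNS_PER_TYPE = {0x1F: 2, 0x3F: 3, 0xDF: 4}


def read_unknowns(entry_type, navdata, offset):
    """Read the unknowns for one entry; return (values, new_offset)."""
    values = []
    for _ in range(UNKNOWNS_PER_TYPE.get(entry_type, 0)):
        consumed, value = get_variable_width_value(navdata, offset)
        offset += consumed
        values.append(value)
    return values, offset
```

Keeping the counts in one dictionary makes it easy to add a new type once its layout is understood, and unknown types simply read nothing instead of desynchronising the offset silently in three separate branches.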
09-12-2011, 05:36 PM | #165 | |
Developer
Posts: 155
Karma: 280
Join Date: Nov 2010
Device: Kindle 3 (Keyboard) 3G / iPad 9 WiFi / Google Pixel 6a (Android)
|
Quote:
INDX0 is the meta index and the TAGX section can be parsed with readTagSection(). INDX1 is the actual index data, and the CTOC data is like the inflNameData. Ciao, Steffen |
|
|