12-12-2012, 12:55 PM | #451 | |
Sigil Developer
Posts: 8,097
Karma: 5450184
Join Date: Nov 2009
Device: many
|
Hi,
Code:
- name = txtdata[offset:offset+ilen]
+ name = unicode(txtdata[offset:offset+ilen], 'windows-1252').encode('utf-8')

I think the mobi header gives the proper encoding. If so, we will need to pass the encoding from mobi_unpack into mobi_ncx and mobi_opf and convert from a bytestring in the specified encoding to a utf-8 bytestring.

Kevin
Last edited by KevinH; 12-13-2012 at 09:12 AM. Reason: fix for auto smileys |
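A minimal sketch of the conversion described in this post, written for Python 3 (the thread's own code is Python 2). The function name `to_utf8` and the `CODEPAGE_TO_CODEC` table are illustrative assumptions, not Mobi_Unpack's actual code; 1252 and 65001 are the codepage numbers usually found in the MOBI header's text-encoding field.

```python
# Hypothetical helper: re-encode a bytestring from the book's declared
# codepage into a utf-8 bytestring.
CODEPAGE_TO_CODEC = {
    1252: "windows-1252",
    65001: "utf-8",
}


def to_utf8(raw, codepage):
    """Return `raw` (bytes in the given codepage) as utf-8 bytes."""
    # Falling back to windows-1252 for unknown codepages is an assumption.
    codec = CODEPAGE_TO_CODEC.get(codepage, "windows-1252")
    return raw.decode(codec).encode("utf-8")


if __name__ == "__main__":
    print(to_utf8(b"Caf\xe9", 1252))
```

Bytes that are undefined in windows-1252 would raise `UnicodeDecodeError` here; whether to pass `errors="replace"` instead is a design choice this sketch leaves open.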
|
12-12-2012, 03:59 PM | #452 |
Wannabe Connoisseur
Posts: 426
Karma: 2516674
Join Date: Apr 2011
Location: Geelong, Australia
Device: Kobo Libra 2, Kobo Aura 2, Sony PRS-T1, Sony PRS-350, Palm TX
|
|
12-13-2012, 09:14 AM | #453 |
Sigil Developer
Posts: 8,097
Karma: 5450184
Join Date: Nov 2009
Device: many
|
Hi,
I never noticed that before. I put the diff snippet in a code block and hopefully the auto smileys are gone! Thanks, KevinH |
12-13-2012, 04:16 PM | #454 |
Grand Sorcerer
Posts: 11,470
Karma: 13095790
Join Date: Aug 2007
Location: Grass Valley, CA
Device: EB 1150, EZ Reader, Literati, iPad 2 & Air 2, iPhone 7
|
|
12-13-2012, 07:29 PM | #455 |
Sigil Developer
Posts: 8,097
Karma: 5450184
Join Date: Nov 2009
Device: many
|
|
12-27-2012, 03:56 PM | #456 |
Junior Member
Posts: 6
Karma: 10
Join Date: Dec 2012
Device: Kindle
|
A few (possible) bugs I noticed reading the code.
1. PalmdocReader misses the case where c == 0
2. MobiHeader.__init__() assigns self.othidx = 0xfffffff instead of 0xffffffff
3. getLanguage 26 has two entries.
4. mobi_unpack.py:727 "# bytes 19 - 23: start of xor string" => "# bytes 20 - 23: start of xor string"
5. mobi_k8proc.__init__() adds 0xfffffff (instead of 0xffffffff) to the end of the self.fdsttbl list
6. getVariableWidthValue() and readTagSection() are defined twice: in mobi_utils.py and mobi_index.py
7. countSetBits() is defined twice: in mobi_index.py and mobi_dict.py. It doesn't need to be a member function.
8. The same with getTagMap().
9. This is not the optimal way to write countSetBits(). See below (sorry for C#).
10. num += 1 at the end of parseNCX() is redundant
11. re: '''<[^<>]+filepos=['"]{0,1}(\d+)[^<>]*>'''. What about the closing quote or apos?
12. re: join removing empty anchors in a single substitution. The existing re doesn't handle all possible WS.
13. mobi_opf.py:127. print format parameters are missing.
14. mobi_opf.py:51 escape() function is for escaping HTML text nodes, not attribute values. It doesn't escape " to &quot;
15. mobi_opf.py:222 tries to find 'StartOffset' in the metadata. This is hopeless because all keys (including 'StartOffset') were deleted at line 154.

public static int countSetBits(int value)
{
    int count = 0;
    while (value != 0)
    {
        count++;
        value &= value - 1; // "eats" lowest 1 bit in the value
    }
    return count;
}
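For comparison, here is a Python sketch of the bit-counting idea from item 9 next to a plain mask-and-shift loop. Both function names and the `bits` parameter are illustrative assumptions; neither is taken from the Mobi_Unpack source.

```python
def count_set_bits(value):
    """Count 1 bits by clearing the lowest set bit on each pass."""
    if value < 0:
        # Python ints are unbounded, so a negative value would never reach 0.
        raise ValueError("value must be non-negative")
    count = 0
    while value:
        count += 1
        value &= value - 1  # clears the lowest set bit
    return count


def count_set_bits_shift(value, bits=8):
    """Count 1 bits in the low `bits` bits with a mask-and-shift loop."""
    count = 0
    for _ in range(bits):
        if value & 1:
            count += 1
        value >>= 1
    return count


if __name__ == "__main__":
    print(count_set_bits(0b1011), count_set_bits_shift(0b1011))
```

The first version loops once per set bit, the second once per bit position; for the small masks involved in index tag tables the difference is unlikely to matter.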
12-28-2012, 01:03 AM | #457 |
Sigil Developer
Posts: 8,097
Karma: 5450184
Join Date: Nov 2009
Device: many
|
your comments
Hi,
Thanks for catching all of these! Nice job! I will incorporate the appropriate fixes into my most recent tree which has many other bug fixes and make a new release hopefully in a week or two. Take care, KevinH |
12-28-2012, 12:37 PM | #458 |
Sigil Developer
Posts: 8,097
Karma: 5450184
Join Date: Nov 2009
Device: many
|
Thanks for Your Bug Report
Hi Sergey,
Your version is a bit older than mine, as the line numbers do not match up. Did you use Mobi_Unpack v59 or an earlier version?

> 1. PalmdocReader misses the case where c == 0
Doesn't the case c < 128 handle this? What am I missing?

> 2. MobiHeader.__init__() assigns self.othidx = 0xfffffff instead of 0xffffffff
fixed

> 3. getLanguage 26 has two entries.
fixed: merged into a single table entry

> 4. mobi_unpack.py:727 "# bytes 19 - 23: start of xor string" => "# bytes 20 - 23: start of xor string"
typo fixed

> 5. mobi_k8proc.__init__() adds 0xfffffff (instead of 0xffffffff) to the end of the self.fdsttbl list
this was already fixed in my version

> 6. getVariableWidthValue() and readTagSection() are defined twice: in mobi_utils.py and mobi_index.py
moved to mobi_index and removed from mobi_utils

> 7. countSetBits() is defined twice: in mobi_index.py and mobi_dict.py. It doesn't need to be a member function.
changed to a non-member function in mobi_index and removed from mobi_dict

> 8. The same with getTagMap().
changed to a non-member function in mobi_index and removed from mobi_dict

> 9. This is not the optimal way to write countSetBits(). See below (sorry for C#).
the mask-and-shift version is the normal, easily understood way to do this and works well (it is not the bottleneck in execution speed)

> 10. num += 1 at the end of parseNCX() is redundant
fixed: removed last line

> 11. re: '''<[^<>]+filepos=['"]{0,1}(\d+)[^<>]*>'''. What about the closing quote or apos?
this is fine: we are just capturing the digits, and any closing ['"] is consumed by [^<>]*

> 12. re: join removing empty anchors in a single substitution. The existing re doesn't handle all possible WS.
I am not sure about this one. Exactly which file and which line are you talking about here? Can you give me more specifics?

> 13. mobi_opf.py:127. print format parameters are missing.
> 14. mobi_opf.py:51 escape() function is for escaping HTML text nodes, not attribute values. It doesn't escape " to &quot;
mobi_opf has recently been rewritten to properly escape things, so I think this has been taken care of.

> 15. mobi_opf.py:222 tries to find 'StartOffset' in the metadata. This is hopeless because all keys (including 'StartOffset') were deleted at line 154.
yes, as was noted in the code. This one is fixed by adding self.starting_offset, which is initialized as None and set when processing the first time, so it is available later.

So I think all of these changes have been made except for your number 12 and possibly your number 1. Can you add some more detail there?

Thanks,
KevinH
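The answer to item 11 can be checked with a short Python sketch. The pattern is the one quoted in the thread; the helper name `find_filepos` and the sample markup are made up for illustration.

```python
import re

# Pattern as quoted in the thread: an optional opening quote, the digits,
# then any remaining non-bracket characters (including a closing quote).
FILEPOS_RE = re.compile(r'''<[^<>]+filepos=['"]{0,1}(\d+)[^<>]*>''')


def find_filepos(markup):
    """Return the captured filepos digit strings, in document order."""
    return FILEPOS_RE.findall(markup)


if __name__ == "__main__":
    sample = '<a filepos="0000123456">A</a> <a filepos=789 >B</a>'
    print(find_filepos(sample))
```

Only the digits are captured, so the closing quote never reaches the result whether the attribute is double-quoted, single-quoted, or unquoted.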
12-30-2012, 03:05 AM | #459 |
Junior Member
Posts: 6
Karma: 10
Join Date: Dec 2012
Device: Kindle
|
I am at v.0.58 I believe.
1. & 11. agree. Sorry for the false alarm.

12.
srctext = re.sub(r"<a/>",r"", srctext)
srctext = re.sub(r"<a ?></a>",r"", srctext)

"<a />", "<a> </a>" would not be removed but are as empty as "<a ></a>". It's not a perf bottleneck for sure, but you may consider matching both empty tags in a single expression, like "(<a\s*/>)|(<a\s*>\s*</a>)".

I'll upgrade to 0.59 now. Thanks.
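A small Python sketch of the single-expression suggestion, using the pattern proposed in this post without the capturing groups. The function name `strip_empty_anchors` is an illustrative assumption.

```python
import re

# One pattern for both forms of empty anchor, tolerating any whitespace.
EMPTY_ANCHOR_RE = re.compile(r"<a\s*/>|<a\s*>\s*</a>")


def strip_empty_anchors(srctext):
    """Remove attribute-less empty <a> tags from the text."""
    return EMPTY_ANCHOR_RE.sub("", srctext)


if __name__ == "__main__":
    print(strip_empty_anchors("x<a/>y<a />z<a></a>w<a > </a>v"))
```

Anchors that carry attributes or text, such as `<a href="x">t</a>`, do not match and are left alone.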
12-30-2012, 11:54 AM | #460 |
Sigil Developer
Posts: 8,097
Karma: 5450184
Join Date: Nov 2009
Device: many
|
Hi,
If you can wait a bit, I will post a Mobi_Unpack_experimental with those fixes and many more, plus using multiprocessing in place of subprocess calls in the gui wrapper, that should allow for better unicode support in filenames and paths. I would love to get feedback on it before releasing a new v0.60 version. Thanks, Kevin |
12-30-2012, 01:48 PM | #461 |
Sigil Developer
Posts: 8,097
Karma: 5450184
Join Date: Nov 2009
Device: many
|
Hi,
Okay, here is an experimental version of Mobi_Unpack (call it v0.61beta). It should have all of the outstanding bug fixes plus some robustness improvements for official Kindle ebooks that are not quite correctly generated (thank you Kovid), mobi_opf improvements (thanks DiapDealer), fixes for Sergey's bugs (thanks), a more correct fix for the bug from nleblanc88 (thanks), changes to allow internal use of utf-8 so that files and paths that require full unicode to be properly specified should now hopefully work, changes to remove the need for unbuffered output via a shift to the multiprocessing module, fixes for occasional hangs in debug mode, support for CTOC sections being properly labeled in debug mode, etc. It still really needs to be refactored and cleaned up, but this should have everything I know about.

Please give it a try. If it does not fix your bug or you run into problems of any sort, please let us know here asap. If it passes muster, it will become version 0.61.

Thanks,
KevinH
Last edited by KevinH; 12-30-2012 at 01:49 PM. Reason: fix typo |
12-31-2012, 04:04 AM | #462 |
Junior Member
Posts: 6
Karma: 10
Join Date: Dec 2012
Device: Kindle
|
v0.61beta works well.
Here are some comments so far:

1. mobi_ncx.py:9 we don't need to import readTagSection, getVariableWidthValue into this module.
2. The program can print nice diagnostics. The problem is that it prints UTF-8 strings to the console. This works only for English text (at least on Windows). When I debug Russian books I see less readable debug output.
3. escape/unescape in OPF. You recently added HTMLParser.unescape(). Are you sure that the original values are escaped? Unescaping values that are not escaped would be a bug.
Using saxutils.escape() is correct for text nodes:
data.append('<%s>%s</%s>\n' % (tag, xmlescape(self.h.unescape(value)), closingTag))
And is not sufficient for attribute values:
data.append('<meta name="%s" content="%s" />\n' % (name, xmlescape(self.h.unescape(value))))
In the latter case you also need to escape " as &quot; and ' as &apos;
I suggest you use quoteattr() for attributes instead of escape()
4. mobi_unpack.py:621 Why don't you use the setsectiondescription() method? The same with 6 other locations in the same file.
5. mobi_unpack.py:704 Redundant call. The same at 696, 697, 698
6. mobi_unpack.py:905 method is never used
7. mobi_unpack.py:608 duplicate map entry
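Item 3 can be illustrated with a short Python 3 sketch that builds a `<meta>` element both ways and checks whether the result parses. The helper names and the sample value are assumptions made up for this example; only `escape()` and `quoteattr()` come from `xml.sax.saxutils`.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape, quoteattr


def meta_with_escape(name, value):
    """Build the element with escape(), which leaves quotes untouched."""
    return '<meta name="%s" content="%s" />' % (name, escape(value))


def meta_with_quoteattr(name, value):
    """Build the element with quoteattr(), which adds its own quotes."""
    return "<meta name=%s content=%s />" % (quoteattr(name), quoteattr(value))


def is_well_formed(xml_text):
    """Return True if the fragment parses as XML."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False


if __name__ == "__main__":
    value = 'The "Best" Book'
    print(is_well_formed(meta_with_escape("description", value)))
    print(is_well_formed(meta_with_quoteattr("description", value)))
```

A double quote inside the value ends the attribute early when only `escape()` is used, which is the failure mode this item describes.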
12-31-2012, 09:12 AM | #463 | |
Sigil Developer
Posts: 8,097
Karma: 5450184
Join Date: Nov 2009
Device: many
|
Hi,
Thanks for your testing. I will look at all of the issues you pointed out. But I am most interested in issues with encodings. This version should work better since utf-8 can encode all possible characters.

Did you run from the command line or via the gui? The gui log window should show all characters correctly. Does it? If running from the command line on Windows, the best way to run the program is to change your codepage to cp65001 first. If you do that, does it work?

Thanks,
Kevin
|
|
12-31-2012, 10:49 AM | #464 |
Sigil Developer
Posts: 8,097
Karma: 5450184
Join Date: Nov 2009
Device: many
|
Hi Sergey,
> 1. mobi_ncx.py:9 we don't need to import readTagSection, getVariableWidthValue into this module.
Yes, since these were refactored earlier, they are no longer needed.

> 2. The program can print nice diagnostics. The problem is that it prints UTF-8 strings to the console. This works only for English text (at least on Windows). When I debug Russian books I see less readable debug output.
No, actually utf-8 should be able to represent any character in any language. The problem is that Windows does not use cp65001 (utf-8) for its console but some other codepage that cannot represent all possible chars. Then Windows allows filenames and paths to have full unicode names that cannot be represented by the current limited 8-bit encoding. This is a serious bug, as you can be sent files that you cannot access in any way in python or the console. Using utf-8 (cp65001) should allow python code to access any file or path on your system even if written in Japanese or Chinese, let alone Russian.

I was hoping that since the Tk widgets in the Mobi_Unpack GUI use utf-8 internally, the GUI front-end to Mobi_Unpack would show characters properly in the Log window no matter what (unless you have non-unicode-capable fonts installed). If you use the command line/console, you should be able to change the codepage to 65001 (utf-8) and have things work for any file or path.

I might be able to wrap stdout so it converts back to the console encoding, but the better solution is to use a suitable encoding for the console that can represent all characters (cp65001 = utf-8). So if you get a chance, please try it both ways and see what it takes to get both the console and gui mode to work properly.

The real problem is that Windows allows full unicode file and path names but then uses a console encoding (and possibly fonts) that will not properly show the full range of characters. This is silly in the extreme (imho).

> 3. escape/unescape in OPF. You recently added HTMLParser.unescape().
> Are you sure that the original values are escaped? Unescaping values that are not escaped would be a bug.
> Using saxutils.escape() is correct for text nodes:
> data.append('<%s>%s</%s>\n' % (tag, xmlescape(self.h.unescape(value)), closingTag))
> And is not sufficient for attribute values:
> data.append('<meta name="%s" content="%s" />\n' % (name, xmlescape(self.h.unescape(value))))
> In the latter case you also need to escape " as &quot; and ' as &apos;
> I suggest you use quoteattr() for attributes instead of escape()
DiapDealer is working on the problem of some Mobi ebooks including html in the metadata when they technically should not. Since the opf is an xml document, we cannot allow any html into the metadata values that we then convert into the proper xml opf entries. I am not up to speed on what he wants to do here, so I will ask DiapDealer to look at this again to make sure your concerns are dealt with.

> 4. mobi_unpack.py:621 Why don't you use the setsectiondescription() method?
> The same with 6 other locations in the same file.
fixed

> 5. mobi_unpack.py:704 Redundant call. The same at 696, 697, 698
removed since duplicated in init, ditto for the others

> 6. mobi_unpack.py:905 method is never used
it is used when debugging the rawml; it is just not used in this version of the file. Keeping it causes no harm.

> 7. mobi_unpack.py:608 duplicate map entry
fixed by removing the duplicate.

Thanks!
KevinH
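The stdout wrapper idea mentioned in the reply to item 2 might look roughly like this in Python 3. The function name `make_console_writer` is hypothetical and this is not what Mobi_Unpack does; the sketch only shows re-encoding text to a limited console codepage with unrepresentable characters replaced instead of raising.

```python
import io


def make_console_writer(byte_stream, console_encoding):
    """Wrap a binary stream so text is written in the console's encoding.

    Characters the codepage cannot represent become '?' rather than
    raising UnicodeEncodeError.
    """
    return io.TextIOWrapper(byte_stream, encoding=console_encoding,
                            errors="replace")


if __name__ == "__main__":
    # An in-memory buffer stands in for sys.stdout.buffer here.
    buf = io.BytesIO()
    writer = make_console_writer(buf, "cp1252")
    writer.write("Title: Привет")
    writer.flush()
    print(buf.getvalue())
```

This trades readable output for robustness, which is why switching the console itself to cp65001 is described above as the better solution.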
12-31-2012, 02:11 PM | #465 | ||||
Grand Sorcerer
Posts: 27,903
Karma: 198500000
Join Date: Jan 2010
Device: Nexus 7, Kindle Fire HD
|
Many Kindle books are starting to come down the pike with html and/or entities in the MOBI/KF8 EXTH metadata. While that may be acceptable in a MOBI/KF8 file, it's unacceptable according to XML/OPF specs (other than the standard 5 entities for XML). I see no point in creating a non-compliant OPF file, so...

If there are no named/numbered entities in the contents of the metadata, then HTMLParser.unescape() will simply have no effect on it. Nothing. No bug.

If there ARE any named/numbered entities, however... HTMLParser.unescape() will first convert them all to their unicode/utf-8 counterpart character representations. Saxutils.escape() then takes care of xml-escaping the mandatory (< > &) characters to complete all XML/OPF compliance.

Descriptions often contain html paragraph formatting, and the current method ensures that all html tags will be properly xml-escaped while at the same time not completely destroying the intention of any unsupported (unsupported in XML/OPF) entities that may have been present in the MOBI/KF8 EXTH metadata.

I agree it may be overkill (some things could conceivably go from entity to character and back to entity, for instance). But I see no other method (meaning other standard python library method) to ensure that every potentially non-compliant hodge-podge of text, html, and entities becomes docile, XML/OPF-compliant entries.
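A Python 3 sketch of the unescape-then-escape pipeline described above. It uses `html.unescape()` as the Python 3 counterpart of the Python 2 `HTMLParser.unescape()` method discussed in the thread; the function name `opf_text` and the sample strings are illustrative assumptions.

```python
import html
from xml.sax.saxutils import escape


def opf_text(value):
    """Turn html entities into characters, then xml-escape & < >."""
    return escape(html.unescape(value))


if __name__ == "__main__":
    print(opf_text("Caf&eacute; &amp; Cr&#232;me"))
    print(opf_text("<p>A description</p>"))
    print(opf_text("Plain title"))
```

Text without entities passes through the first step unchanged, an `&amp;` goes from entity to character and back to entity, and html tags end up xml-escaped as plain text.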
But I'm not certain quoteattr() is the right approach, though -- as it can potentially change double-quotes to single quotes and vice-versa, depending on the situation. In such a case, I think it would make more sense to extend the escape() method by passing it the optional "entities" dictionary parameter, so that " and ' are xml-escaped as well as the three mandatory < > and &, rather than potentially changing double quotes to single quotes. Code:
ENTITIES = {'"': '&quot;', "'": '&apos;'}
data.append('<meta name="%s" content="%s" />\n' % (name, xmlescape(self.h.unescape(value), ENTITIES)))

Last edited by DiapDealer; 12-31-2012 at 04:35 PM. |
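To make the trade-off concrete, here is a Python 3 sketch comparing `escape()` with an extra entities dictionary against `quoteattr()`. The constant `QUOTE_ENTITIES` and the function name `attr_escape` are assumptions chosen for this example.

```python
from xml.sax.saxutils import escape, quoteattr

# Extra replacements applied after the mandatory & < > escaping.
QUOTE_ENTITIES = {'"': "&quot;", "'": "&apos;"}


def attr_escape(value):
    """Escape a value for use inside an attribute quoted by the caller."""
    return escape(value, QUOTE_ENTITIES)


if __name__ == "__main__":
    value = 'say "hi"'
    # The caller keeps control of the surrounding double quotes.
    print('<meta name="note" content="%s" />' % attr_escape(value))
    # quoteattr() picks the delimiters itself and may use single quotes.
    print("<meta name=\"note\" content=%s />" % quoteattr(value))
```

With the entities dictionary the template's double-quote delimiters stay fixed, while `quoteattr()` switches to single-quote delimiters whenever the value contains a double quote but no single quote.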