KindleUnpack (MobiUnpack): Extracts text, images and metadata from Kindle/Mobi files - Page 31

KevinH · 12-12-2012, 12:55 PM

Hi,

Code:

-            name = txtdata[offset:offset+ilen]
+            name = unicode(txtdata[offset:offset+ilen], 'windows-1252').encode('utf-8')

I do not think the CTOC strings in index sections are always encoded as windows-1252.
I think the mobi header gives the proper encoding. If so, we will need to pass the encoding from mobi_unpack into mobi_ncx and mobi_opf, and convert from a bytestring in the specified encoding to a utf-8 bytestring.
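A rough sketch of that conversion, written in Python 3 syntax rather than the Python 2 of the patch. CODEC_MAP and to_utf8 are made-up names for illustration; 1252 and 65001 are the codepage values usually reported for the mobi header, and falling back to windows-1252 is only an assumption.

```python
# Hypothetical helper: re-encode a CTOC name using the codepage read from the mobi header.
CODEC_MAP = {1252: 'windows-1252', 65001: 'utf-8'}

def to_utf8(raw, codepage):
    """Decode raw bytes in the header's codepage and return utf-8 bytes."""
    codec = CODEC_MAP.get(codepage, 'windows-1252')  # assumed fallback, not confirmed
    return raw.decode(codec).encode('utf-8')

# b'\xe9' is an e-acute in windows-1252; it becomes the two bytes b'\xc3\xa9' in utf-8.
print(to_utf8(b'Caf\xe9', 1252))
# Text that is already utf-8 passes through unchanged.
print(to_utf8(b'Caf\xc3\xa9', 65001))
```

With something like this, mobi_ncx and mobi_opf would only need the codepage value passed in from mobi_unpack.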

Kevin

Quote:

Originally Posted by nleblanc88

I'd like to contribute v060 if I could. What this version fixes:

--

Encoding chapter names in UTF-8. This keeps the NCX and OPF files from being written in non-UTF-8 encodings.

--

From my test, chapter names with UTF-8 characters were not being written properly to the resulting .NCX file. This causes the file charset to be "unknown-8bit", and trying to parse these files would result in errors.

This patch fixes this issue. I've attached the source.

--

I'd also like to bring up the idea of setting up a git repository for this project (bitbucket.com or github.com). I'd love to keep contributing to this project, and I think this would not only make it easier for me and others to do so, but also help the author keep track of all versions. I'd be willing to set this up if anybody would like.

snarkophilus · 12-12-2012, 03:59 PM

Quote:

Originally Posted by KevinH

- name = txtdata[offset[smiley]ffset+ilen]
+ name = unicode(txtdata[offset[smiley]ffset+ilen], 'windows-1252').encode('utf-8')

Gotta love those automagic smileys.

Cheers,
Simon.

KevinH · 12-13-2012, 09:14 AM

Hi,

I never noticed that before. I put the diff snippet in a code block and hopefully the auto smileys are gone!

Thanks,

KevinH

DaleDe · 12-13-2012, 04:16 PM

Quote:

Originally Posted by KevinH

Hi,

I never noticed that before. I put the diff snippet in a code block and hopefully the auto smileys are gone!

Thanks,

KevinH

Just look under Additional Options and turn off smilies in text.

KevinH · 12-13-2012, 07:29 PM

Hi,

Quote:

Originally Posted by DaleDe

Just look under Additional Options and turn off smilies in text.

Will do. Thanks for the tip.

KevinH

Sergey Dubinets · 12-27-2012, 03:56 PM

A few (possible) bugs I noticed reading the code.

1. PalmdocReader misses the case where c == 0
2. MobiHeader.__init__() assigns self.othidx = 0xfffffff instead of 0xffffffff
3. getLanguage 26 has two entries.
4. mobi_unpack.py:727 "# bytes 19 - 23: start of xor string" => "# bytes 20 - 23: start of xor string"
5. mobi_k8proc.__init__() adds 0xfffffff (instead of 0xffffffff) to the end of self.fdsttbl list
6. getVariableWidthValue() and readTagSection() are defined twice: in mobi_utils.py and mobi_index.py
7. countSetBits() is defined twice: in mobi_index.py and mobi_dict.py. It doesn't need to be a member function.
8. the same with getTagMap().
9. This is not the optimal way to write countSetBits(). See below (sorry for C#).
10. num += 1 at the end of parseNCX() is redundant
11. re: '''<[^<>]+filepos=['"]{0,1}(\d+)[^<>]*>'''. What about the closing quote or apostrophe?
12. re: join the removal of empty anchors into a single substitution. The existing re doesn't handle all possible whitespace.
13. mobi_opf.py:127. print format parameters are missing.
14. mobi_opf.py:51 escape() function is for escaping HTML text nodes, not attribute values. It doesn't escape " to &quot;
15. mobi_opf.py:222 tries to find 'StartOffset' in the metadata. This is hopeless because all keys (including 'StartOffset') were deleted at line 154.

public static int countSetBits(int value) {
    int count = 0;
    while (value != 0) {
        count++;
        value &= value - 1; // "eats" lowest 1 bit in the value
    }
    return count;
}
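For reference, the same trick in Python might look like the sketch below. The name count_set_bits is only illustrative (it is not the function in mobi_index.py), and it assumes a non-negative integer, since Python integers are unbounded and a negative value would never reach zero.

```python
def count_set_bits(value):
    """Count the 1 bits in a non-negative integer.

    Each pass clears the lowest set bit, so the loop runs once per set bit
    rather than once per bit position.
    """
    if value < 0:
        raise ValueError("value must be non-negative")
    count = 0
    while value:
        value &= value - 1  # clears the lowest 1 bit
        count += 1
    return count

print(count_set_bits(0b1011))
print(count_set_bits(0xffffffff))
```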

KevinH · 12-28-2012, 01:03 AM

Hi,

Thanks for catching all of these! Nice job!

I will incorporate the appropriate fixes into my most recent tree which has many other bug fixes and make a new release hopefully in a week or two.

Take care,

KevinH

KevinH · 12-28-2012, 12:37 PM

Hi Sergey,

Your version is a bit older than my version as line numbers do not match up.
Did you use Mobi_Unpack v59 or an earlier version?

> 1. PalmdocReader misses the case where c == 0

Doesn't the case c < 128 handle this? What am I missing?

> 2. MobiHeader.__init__() assigns self.othidx = 0xfffffff instead of 0xffffffff

fixed

> 3. getLanguage 26 has two entries.

fixed: merged into single table entry

> 4. mobi_unpack.py:727 "# bytes 19 - 23: start of xor string" => "# bytes 20 - 23: start of xor string"

typo fixed

> 5. mobi_k8proc.__init__() adds 0xfffffff (instead of 0xffffffff) to the end of self.fdsttbl list

this was already fixed in my version

> 6. getVariableWidthValue() and readTagSection() are defined twice: in mobi_utils.py and mobi_index.py

moved to mobi_index and removed from mobi_utils

> 7. countSetBits() is defined twice: in mobi_index.py and mobi_dict.py. It doesn't need to be a member function.

changed to non-member function in mobi_index and removed from mobi_dict

> 8. the same with getTagMap().

changed to non-member function in mobi_index and removed from mobi_dict

> 9. This is not the optimal way to write countSetBits(). See below (sorry for C#).

The mask-and-shift version is the normal, easily understood way to do this, and it works well (it is not the bottleneck in execution speed)

> 10. num += 1 at the end of parseNCX() is redundant

fixed: removed last line

> 11. re: '''<[^<>]+filepos=['"]{0,1}(\d+)[^<>]*>'''. What about the closing quote or apostrophe?

this is fine: we are only capturing the digits, and any closing ['"] is matched by [^<>]*
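A quick way to check this, using the pattern exactly as quoted. The filepos_of helper and the sample tags are made up for the test; they are not taken from any real book.

```python
import re

# Pattern as quoted above; only the digits are in a capture group.
FILEPOS = re.compile(r'''<[^<>]+filepos=['"]{0,1}(\d+)[^<>]*>''')

def filepos_of(tag):
    """Return the captured filepos digits, or None if the tag does not match."""
    match = FILEPOS.search(tag)
    return match.group(1) if match else None

# Double-quoted, single-quoted and unquoted values all yield the same digits,
# because the trailing [^<>]* swallows whatever follows them.
for tag in ('<a filepos="0000123">', "<a filepos='0000123'>", '<a filepos=0000123 >'):
    print(filepos_of(tag))
```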

> 12. re: join the removal of empty anchors into a single substitution. The existing re doesn't handle all possible whitespace.

I am not sure about this one. Exactly which file and which line are you talking about here? Can you give me more specifics?

> 13. mobi_opf.py:127. print format parameters are missing.
> 14. mobi_opf.py:51 escape() function is for escaping HTML text nodes, not attribute values. It doesn't escape " to &quot;

mobi_opf has recently been rewritten to properly escape things so I think this has been taken care of.

> 15. mobi_opf.py:222 tries to find 'StartOffset' in the metadata. This is hopeless because all keys (including 'StartOffset') were deleted at line 154.

yes, as was noted in the code. This one is fixed by adding self.starting_offset, which is initialized as None and set the first time it is processed so that it is available later

So I think all of these changes have been made except for your number 12 and possibly for your number 1. Can you add some more detail there?

Thanks,

KevinH

Sergey Dubinets · 12-30-2012, 03:05 AM

I am at v.0.58 I believe.

1. & 11. agree. Sorry for false alarm.
12.
srctext = re.sub(r"<a/>",r"", srctext)
srctext = re.sub(r"<a ?></a>",r"", srctext)

"<a />" and "<a> </a>" would not be removed, but they are just as empty as "<a ></a>".
It's not a perf bottleneck for sure, but you may consider matching both kinds of empty tag in a single expression, like "(<a\s*/>)|(<a\s*>\s*</a>)".
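As a sketch of what I mean (the strip_empty_anchors name is mine, not something in the Mobi_Unpack source, and the capture groups are dropped because nothing uses them):

```python
import re

# One pattern for both forms of empty anchor, tolerant of any whitespace.
EMPTY_ANCHOR = re.compile(r'<a\s*/>|<a\s*>\s*</a>')

def strip_empty_anchors(srctext):
    """Remove attribute-less, content-less <a> tags in a single pass."""
    return EMPTY_ANCHOR.sub('', srctext)

print(strip_empty_anchors('x<a/>y<a />z<a></a>w<a > </a>v'))
# Anchors that carry attributes or text are left alone.
print(strip_empty_anchors('<a href="#n1">note</a>'))
```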

I'll upgrade to 0.59 now.

Thanks.

KevinH · 12-30-2012, 11:54 AM

Hi,

If you can wait a bit, I will post a Mobi_Unpack_experimental with those fixes and many more, plus using multiprocessing in place of subprocess calls in the gui wrapper, that should allow for better unicode support in filenames and paths.

I would love to get feedback on it before releasing a new v0.60 version.

Thanks,

Kevin

KevinH · 12-30-2012, 01:48 PM

Hi,

Okay, here is an experimental version of Mobi_Unpack (call it v0.61beta). It should have all of the outstanding bug fixes plus some robustness improvements for official Kindle ebooks that are not quite correctly generated (thank you Kovid), mobi_opf improvements (thanks DiapDealer), fixes for Sergey's bugs (thanks), and a more correct fix for the bug from nleblanc88 (thanks). It also switches to utf-8 internally, so files and paths that need full unicode to be properly specified should now hopefully work, and it uses the multiprocessing module to remove the need for unbuffered output. Finally, it fixes occasional hangs in debug mode and properly labels CTOC sections in debug mode.

It still really needs to be refactored and cleaned up but this should have everything I know about. Please give it a try. If it does not fix your bug or you run into problems of any sort, please let us know here asap.

If it passes muster, it will become version 0.61.

Thanks,

KevinH

Sergey Dubinets · 12-31-2012, 04:04 AM

v0.61beta works well.

Here are some comments so far:

1. mobi_ncx.py:9 we don't need to import readTagSection, getVariableWidthValue into this module.

2. The program can print nice diagnostics. The problem is that it prints UTF-8 strings to the console. This works only for English text (at least on Windows). When I debug Russian books I see less readable debug output.

3. escape/unescape in OPF. You recently added HTMLParser.unescape(). Are you sure that original values are escaped? Unescaping on not escaped values would be a bug.
Using saxutils.escape() is correct for text nodes:
data.append('<%s>%s</%s>\n' % (tag, xmlescape(self.h.unescape(value)), closingTag))
But it is not sufficient for attribute values:
data.append('<meta name="%s" content="%s" />\n' % (name, xmlescape(self.h.unescape(value))))

In the latter case you also need to escape " as &quot; and ' as &apos;
I suggest you use quoteattr() for attributes instead of escape()

4. mobi_unpack.py:621 Why don't you use the setsectiondescription() method? The same applies to 6 other locations in the same file.

5. mobi_unpack.py:704 Redundant call. The same at 696, 697, 698

6. mobi_unpack.py:905 method is never used

7. mobi_unpack.py:608 duplicate map entry

KevinH · 12-31-2012, 09:12 AM

Hi,
Thanks for your testing. I will look at all of the issues you pointed out. But I am most interested in issues with encodings. This version should work better since utf-8 can encode all possible characters. Did you run from the command line or via the gui? The gui log window should show all characters correctly. Does it?

If running from the command line on Windows, the best way to run the program is to change your codepage to cp65001 first. If you do that, does it work?

Thanks,

Kevin


KevinH · 12-31-2012, 10:49 AM

Hi Sergey,

> 1. mobi_ncx.py:9 we don't need to import readTagSection, getVariableWidthValue to this module.

Yes, since refactored earlier, these are no longer needed

> 2. The program can print nice diagnostics. The problem is that it prints UTF-8 strings to the console. This works only for English text (at least on Windows). When I debug Russian books I see less readable debug output.

No, actually utf-8 should be able to represent any character in any language. The problem is that Windows does not use cp65001 (utf-8) for its console but some other codepage that cannot represent all possible characters. Yet Windows allows filenames and paths to have full unicode names that cannot be represented in that limited 8-bit encoding. This is a serious bug, as you can be sent files that you cannot access in any way from python or the console.

Using utf-8 (cp65001) should allow python code to access any file or path on your system even if written in Japanese or Chinese, let alone Russian. I was hoping that since the Tk widgets in the Mobi_Unpack GUI use utf-8 internally, the GUI front-end to Mobi_Unpack would show characters properly in the Log window no matter what (unless you have non-unicode capable fonts installed).

If you use the command line/console, you should be able to change the codepage to 65001 (utf-8) and have things work for any file or path in command line/console mode. I might be able to wrap stdout so that it converts back to the console encoding, but the better solution is to use a suitable encoding for the console that can represent all characters (cp65001 = utf-8).
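The stdout wrapping idea could look roughly like the sketch below. This is Python 3 syntax and not what Mobi_Unpack currently does; console_safe is a made-up name, and cp1251 and cp437 are just example codepages.

```python
import sys

def console_safe(text, encoding=None):
    """Round-trip text through the console encoding.

    Characters the codepage cannot represent come back as '?', so printing
    the result never raises UnicodeEncodeError.
    """
    encoding = encoding or getattr(sys.stdout, 'encoding', None) or 'ascii'
    return text.encode(encoding, errors='replace').decode(encoding)

title = 'Война и мир'
# A Cyrillic codepage such as cp1251 keeps the title intact, while the
# US OEM codepage cp437 has no Cyrillic letters, so each one is replaced.
print(console_safe(title, 'cp437'))
```

The output stays readable only when the console codepage actually covers the text, which is why switching the console to cp65001 is still the better fix.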

So if you get a chance, please try it both ways and see what it takes to get both the console and gui mode to work properly.

The real problem is Windows allows full unicode file and path names but then uses a console encoding (and possibly fonts) that will not properly show the full range of characters. This is silly in the extreme (imho).

> 3. escape/unescape in OPF. You recently added HTMLParser.unescape(). Are you sure that original values are escaped? Unescaping on not escaped values would be a bug.
> Using saxutils.escape() is correct for text nodes:
> data.append('<%s>%s</%s>\n' % (tag, xmlescape(self.h.unescape(value)), closingTag))
> But it is not sufficient for attribute values:
> data.append('<meta name="%s" content="%s" />\n' % (name, xmlescape(self.h.unescape(value))))
>
> In the latter case you also need to escape " as &quot; and ' as &apos;
> I suggest you use quoteattr() for attributes instead of escape()

DiapDealer is working on trying to fix the problem in the opf of some Mobi ebooks including html in the metadata when they technically should not. Since the opf is an xml document, we can not allow any html into the metadata values we will then convert into the proper xml opf entries.

I am not up-to-speed on what he wants to do here so I will ask DiapDealer to look at this again to make sure your concerns are dealt with.

> 4. mobi_unpack.py:621 Why don't you use the setsectiondescription() method?
> The same applies to 6 other locations in the same file.

fixed

> 5. mobi_unpack.py:704 Redundant call. the same 696, 697, 698

removed since duplicated in init, ditto for the others

> 6. mobi_unpack.py:905 method is never used

it is used when debugging the rawml, it is just not used in this version of the file. keeping it causes no harm.

> 7. mobi_unpack.py:608 duplicate map entry

fixed by removing duplicate.

Thanks!

KevinH

DiapDealer · 12-31-2012, 02:11 PM

Quote:

Originally Posted by Sergey Dubinets

3. escape/unescape in OPF. You recently added HTMLParser.unescape(). Are you sure that original values are escaped? Unescaping on not escaped values would be a bug.
Using saxutils.escape() is correct for text nodes:
data.append('<%s>%s</%s>\n' % (tag, xmlescape(self.h.unescape(value)), closingTag))

Quote:

Originally Posted by KevinH

I am not up-to-speed on what he wants to do here so I will ask DiapDealer to look at this again to make sure your concerns are dealt with.

I'm not certain I understand the logic of this statement:

Quote:

"Are you sure that original values are escaped? Unescaping on not escaped values would be a bug."

There is no "bug" that I can discern. HTMLParser's mostly undocumented "unescape" method is perhaps titled a bit misleading-ly? It's essentially an un-entity routine. And it's perfectly capable of dealing with "not escaped values."

Many Kindle books are starting to come down the pike with html and/or entities in the MOBI/KF8 EXTH metadata. While that may be acceptable in a MOBI/KF8 file, it's unacceptable according to XML/OPF specs (other than the standard 5 entities for XML). I see no point in creating a non-compliant OPF file, so...

If there are no named/numbered entities in the contents of the metadata, then HTMLParser.unescape() will simply have no effect on it. Nothing. No bug. If there ARE any named/numbered entities, however... HTMLParser.unescape() will first convert them all to their unicode/utf-8 counterpart character representations. Saxutils.escape() then takes care of xml-escaping the mandatory (< > &) characters to complete all XML/OPF compliance.
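To illustrate the two-step process, here is a standalone sketch. It uses html.unescape, which is where Python 3 keeps the routine that HTMLParser.unescape() provided, and opf_text is just a name for this example, not the method in mobi_opf.

```python
from html import unescape              # Python 3 home of the un-entity routine
from xml.sax.saxutils import escape

def opf_text(value):
    """Resolve any HTML entities, then xml-escape the mandatory < > & characters."""
    return escape(unescape(value))

# No entities present: unescape changes nothing, escape still protects the '&'.
print(opf_text('Tom & Jerry'))
# An existing entity goes from entity to character and back to entity.
print(opf_text('Tom &amp; Jerry'))
# html tags in a description end up xml-escaped.
print(opf_text('<p>A fine book</p>'))
# Quotes are resolved but not re-escaped, which only matters inside attribute values.
print(opf_text('&quot;Hi&quot;'))
```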

Descriptions often contain html paragraph formatting and the current method ensures that all html tags will be properly xml-escaped while at the same time, not completely destroying the intention of any unsupported (unsupported in XML/OPF) entities that may have been present in the MOBI/KF8 EXTH metadata.

I agree it may be overkill (some things could conceivably go from entity to character and back to entity, for instance). But I see no other method (meaning other standard python library method) to ensure that every potentially non-compliant hodge-podge of text, html, and entities becomes docile, XML/OPF-compliant entries.

Quote:

But it is not sufficient for attribute values:
data.append('<meta name="%s" content="%s" />\n' % (name, xmlescape(self.h.unescape(value))))

In the latter case you also need to escape " as &quot; and ' as &apos;
I suggest you use quoteattr() for attributes instead of escape()

I take your point here. I've just not really run into any standard quotes (character or entity) bound for OPF meta attribute values before. I've only ever encountered them in stuff bound for OPF dc:metadata tags where they're not part of any quoted attribute values. That certainly doesn't mean they can't show up and blow things up, though.

I'm not certain quoteattr() is the right approach, though, as it can potentially change the double quotes around the attribute value to single quotes, depending on the situation. In such a case, I think it would make more sense to extend the escape() method by passing it the optional "entities" dictionary parameter, so that " and ' are xml-escaped as well as the three mandatory < > and &, rather than potentially changing double quotes to single quotes.

Code:

ENTITIES = {'"':'&quot;', "'":"&apos;"}
data.append('<meta name="%s" content="%s" />\n' % (name, xmlescape(self.h.unescape(value), ENTITIES)))
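For comparison, this is how the two standard library calls behave on their own. It is a standalone sketch, and the sample value is made up.

```python
from xml.sax.saxutils import escape, quoteattr

ENTITIES = {'"': '&quot;', "'": '&apos;'}
value = 'He said "hi" & left'

# escape() with the extra entities keeps the value safe inside either quote style.
print(escape(value, ENTITIES))

# quoteattr() returns the value already wrapped in quotes, and it picks
# single quotes when the data contains a double quote.
print(quoteattr(value))
print(quoteattr("it's"))
# With both kinds present it falls back to double quotes plus &quot;.
print(quoteattr('both " and \''))
```

Note that quoteattr() supplies the surrounding quotes itself, so the format string would have to become content=%s instead of content="%s" if it were used.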
