Old 12-12-2012, 01:55 PM   #451
KevinH
Wizard
Posts: 1,077
Karma: 444444
Join Date: Nov 2009
Device: many
Hi,

Code:
-            name = txtdata[offset:offset+ilen]
+            name = unicode(txtdata[offset:offset+ilen], 'windows-1252').encode('utf-8')
I do not think the CTOC entries in index sections are always encoded as windows-1252.
I think the MOBI header gives the proper encoding. If so, we will need to pass the encoding from mobi_unpack into mobi_ncx and mobi_opf, and convert each bytestring from the specified encoding to a utf-8 bytestring.
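
A minimal sketch of that conversion, written in Python 3 rather than the Python 2 of the patch above. It assumes the MOBI header codepage field holds 1252 (windows-1252) or 65001 (utf-8); `CODEPAGE_MAP` and `to_utf8` are hypothetical names for illustration, not functions from Mobi_Unpack.

```python
# Hypothetical helper: convert a bytestring stored in the book's declared
# codepage into a utf-8 bytestring.
# Assumption: the header codepage is 1252 (windows-1252) or 65001 (utf-8).
CODEPAGE_MAP = {1252: 'windows-1252', 65001: 'utf-8'}


def to_utf8(raw, codepage):
    """Decode raw bytes using the header codepage, then re-encode as utf-8."""
    codec = CODEPAGE_MAP.get(codepage, 'windows-1252')  # fall back to cp1252
    return raw.decode(codec, errors='replace').encode('utf-8')


# The same chapter name stored under each codepage.
name_1252 = 'Caf\u00e9'.encode('windows-1252')
name_utf8 = 'Caf\u00e9'.encode('utf-8')

# Both come out as identical utf-8 bytes when the right codepage is used.
print(to_utf8(name_1252, 1252) == to_utf8(name_utf8, 65001))
```

Decoding utf-8 bytes as windows-1252 does not raise an error here; it silently produces mojibake, which is why the codepage would have to come from the header rather than be hard-coded.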

Kevin



Quote:
Originally Posted by nleblanc88 View Post
I'd like to contribute v060 if I could. What this version fixes:

--

Encoding chapter names in UTF-8. This prevents the NCX and OPF files from being written in non-UTF-8 encodings.

--

From my test, chapter names with UTF-8 characters were not being written properly to the resulting .NCX file. This causes the file charset to be "unknown-8bit", and trying to parse these files would result in errors.

This patch fixes this issue. I've attached the source.

--

I'd also like to bring up the idea of setting up a git repository for this project (bitbucket.com or github.com). I'd love to keep contributing to this project, and I think this would not only make it easier for me and others to do so, but also help the author keep track of all versions. I'd be willing to set this up if anybody would like.

Last edited by KevinH; 12-13-2012 at 10:12 AM. Reason: fix for auto smileys
Old 12-12-2012, 04:59 PM   #452
snarkophilus
Wannabe Connoisseur
Posts: 242
Karma: 1009530
Join Date: Apr 2011
Location: Geelong, Australia
Device: Sony PRS-T1, Sony PRS-350, Palm TX
Quote:
Originally Posted by KevinH View Post
- name = txtdata[offsetffset+ilen]
+ name = unicode(txtdata[offsetffset+ilen], 'windows-1252').encode('utf-8')
Gotta love those automagic smileys.

Cheers,
Simon.
Old 12-13-2012, 10:14 AM   #453
KevinH
Wizard
Posts: 1,077
Karma: 444444
Join Date: Nov 2009
Device: many
Hi,

I never noticed that before. I put the diff snippet in a code block and hopefully the auto smileys are gone!

Thanks,

KevinH
Old 12-13-2012, 05:16 PM   #454
DaleDe
Grand Sorcerer
Posts: 9,772
Karma: 5072196
Join Date: Aug 2007
Location: Grass Valley, CA
Device: EB 1150, EZ Reader, Literati, iPad 2
Quote:
Originally Posted by KevinH View Post
Hi,

I never noticed that before. I put the diff snippet in a code block and hopefully the auto smileys are gone!

Thanks,

KevinH

Just look under Additional Options and turn off smilies in text.
Old 12-13-2012, 08:29 PM   #455
KevinH
Wizard
Posts: 1,077
Karma: 444444
Join Date: Nov 2009
Device: many
Hi,

Quote:
Originally Posted by DaleDe View Post
Just look under Additional Options and turn off smilies in text.
Will do. Thanks for the tip.

KevinH
Old 12-27-2012, 04:56 PM   #456
Sergey Dubinets
Junior Member
Posts: 6
Karma: 10
Join Date: Dec 2012
Device: Kindle
A few (possible) bugs I noticed reading the code.

1. PalmdocReader misses the case where c == 0.
2. MobiHeader.__init__() assigns self.othidx = 0xfffffff instead of 0xffffffff.
3. getLanguage 26 has two entries.
4. mobi_unpack.py:727 "# bytes 19 - 23: start of xor string" => "# bytes 20 - 23: start of xor string"
5. mobi_k8proc.__init__() appends 0xfffffff (instead of 0xffffffff) to the end of the self.fdsttbl list.
6. getVariableWidthValue() and readTagSection() are defined twice: in mobi_utils.py and mobi_index.py.
7. countSetBits() is defined twice: in mobi_index.py and mobi_dict.py. It doesn't need to be a member function.
8. The same goes for getTagMap().
9. This is not the most efficient way to write countSetBits(). See below (sorry for the C#).
10. num += 1 at the end of parseNCX() is redundant.
11. re: '''<[^<>]+filepos=['"]{0,1}(\d+)[^<>]*>'''. What about the closing quote or apostrophe?
12. re: combine the removal of empty anchors into a single substitution. The existing regex doesn't handle all possible whitespace.
13. mobi_opf.py:127: the print format parameters are missing.
14. mobi_opf.py:51: the escape() function is for escaping HTML text nodes, not attribute values. It doesn't escape " to &quot;.
15. mobi_opf.py:222 tries to find 'StartOffset' in the metadata. This is hopeless because all keys (including 'StartOffset') were deleted at line 154.




public static int countSetBits(int value) {
    int count = 0;
    while (value != 0) {
        count++;
        value &= value - 1; // "eats" the lowest 1 bit in the value
    }
    return count;
}
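
For reference, the same clear-the-lowest-bit loop might look like this in Python, the language Mobi_Unpack is written in. This is only a sketch: `count_set_bits` is a hypothetical name and does not mirror the signature of the existing countSetBits(). Negative inputs are rejected because Python integers are unbounded, so the loop would never terminate for them.

```python
def count_set_bits(value):
    """Count the 1 bits in a non-negative integer.

    Each pass clears the lowest set bit, so the loop runs once per set bit
    rather than once per bit position.
    """
    if value < 0:
        raise ValueError("value must be non-negative")
    count = 0
    while value:
        count += 1
        value &= value - 1  # clears the lowest set bit
    return count


print(count_set_bits(0b1011))
```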
Old 12-28-2012, 02:03 AM   #457
KevinH
Wizard
Posts: 1,077
Karma: 444444
Join Date: Nov 2009
Device: many
your comments

Hi,

Thanks for catching all of these! Nice job!

I will incorporate the appropriate fixes into my most recent tree which has many other bug fixes and make a new release hopefully in a week or two.

Take care,

KevinH
Old 12-28-2012, 01:37 PM   #458
KevinH
Wizard
Posts: 1,077
Karma: 444444
Join Date: Nov 2009
Device: many
Thanks for Your Bug Report

Hi Sergey,

Your version is a bit older than mine, as the line numbers do not match up.
Did you use Mobi_Unpack v0.59 or an earlier version?

> 1. PalmdocReader misses the case where c == 0.

Doesn't the case c < 128 handle this? What am I missing?

> 2. MobiHeader.__init__() assigns self.othidx = 0xfffffff instead of 0xffffffff.

fixed

> 3. getLanguage 26 has two entries.

fixed: merged into a single table entry

> 4. mobi_unpack.py:727 "# bytes 19 - 23: start of xor string" => "# bytes 20 - 23: start of xor string"

typo fixed

> 5. mobi_k8proc.__init__() appends 0xfffffff (instead of 0xffffffff) to the end of the self.fdsttbl list.

this was already fixed in my version

> 6. getVariableWidthValue() and readTagSection() are defined twice: in mobi_utils.py and mobi_index.py.

moved to mobi_index and removed from mobi_utils


> 7. countSetBits() is defined twice: in mobi_index.py and mobi_dict.py. It doesn't need to be a member function.

changed to a non-member function in mobi_index and removed from mobi_dict

> 8. The same goes for getTagMap().

changed to a non-member function in mobi_index and removed from mobi_dict

> 9. This is not the most efficient way to write countSetBits(). See below (sorry for the C#).

the mask-and-shift version is the normal, easily understood way to do this, and it works well (it is not the bottleneck in execution speed)

> 10. num += 1 at the end of parseNCX() is redundant.

fixed: removed the last line

> 11. re: '''<[^<>]+filepos=['"]{0,1}(\d+)[^<>]*>'''. What about the closing quote or apostrophe?

this is fine: we are just capturing the digits, and any closing ['"] is matched by [^<>]*
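
A quick way to check the point about the closing quote is to run the regular expression from item 11 against a few tags. The sample tags below are made up for illustration; only the pattern itself comes from the thread.

```python
import re

# Pattern quoted in item 11: an optional opening quote, the digits we want,
# then "anything up to the closing >", which also swallows a closing quote.
FILEPOS = re.compile(r'''<[^<>]+filepos=['"]{0,1}(\d+)[^<>]*>''')

# Hypothetical sample tags: double-quoted, single-quoted, unquoted, and
# quoted with a trailing attribute.
samples = [
    '<a filepos="0000123">',
    "<a filepos='0000123'>",
    '<a filepos=0000123>',
    '<a filepos="0000123" class="x">',
]

positions = [FILEPOS.search(tag).group(1) for tag in samples]
print(positions)
```

In every case the captured group holds only the digits, because the closing quote (if any) falls into the trailing `[^<>]*`.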

> 12. re: combine the removal of empty anchors into a single substitution. The existing regex doesn't handle all possible whitespace.

I am not sure about this one. Exactly which file and which line are you talking about here? Can you give me more specifics?


> 13. mobi_opf.py:127: the print format parameters are missing.
> 14. mobi_opf.py:51: the escape() function is for escaping HTML text nodes, not attribute values. It doesn't escape " to &quot;.

mobi_opf has recently been rewritten to properly escape things, so I think this has been taken care of.

> 15. mobi_opf.py:222 tries to find 'StartOffset' in the metadata. This is hopeless because all keys (including 'StartOffset') were deleted at line 154.

Yes, as was noted in the code. This one is fixed by adding self.starting_offset, which is initialized to None and set during the first round of processing so that it is available later.


So I think all of these changes have been made except for your number 12 and possibly for your number 1. Can you add some more detail there?

Thanks,

KevinH
Old 12-30-2012, 04:05 AM   #459
Sergey Dubinets
Junior Member
Posts: 6
Karma: 10
Join Date: Dec 2012
Device: Kindle
I am at v0.58, I believe.

1 and 11: agreed. Sorry for the false alarm.
12.
srctext = re.sub(r"<a/>",r"", srctext)
srctext = re.sub(r"<a ?></a>",r"", srctext)

"<a />" and "<a> </a>" would not be removed, but they are just as empty as "<a ></a>".
It's certainly not a performance bottleneck, but you may consider matching both kinds of empty tag in a single expression, like "(<a\s*/>)|(<a\s*>\s*</a>)".
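
The single-expression idea can be sketched as follows. `strip_empty_anchors` is a hypothetical helper name, and the pattern is the one suggested above with the redundant groups dropped.

```python
import re

# One pattern for both kinds of empty anchor, tolerant of any whitespace:
#   <a/>, <a />            self-closing form
#   <a></a>, <a ></a>, <a> </a>   open/close pair with nothing but whitespace
EMPTY_ANCHOR = re.compile(r'<a\s*/>|<a\s*>\s*</a>')


def strip_empty_anchors(text):
    """Remove attribute-less, content-less anchor tags from text."""
    return EMPTY_ANCHOR.sub('', text)


sample = 'x<a/>y<a />z<a></a><a ></a><a> </a>w'
print(strip_empty_anchors(sample))
```

Anchors that carry attributes or content, such as `<a id="x"></a>` or `<a>text</a>`, are left alone, since `\s*` only matches whitespace between `<a` and the closing bracket.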

I'll upgrade to 0.59 now.

Thanks.
Old 12-30-2012, 12:54 PM   #460
KevinH
Wizard
Posts: 1,077
Karma: 444444
Join Date: Nov 2009
Device: many
Hi,

If you can wait a bit, I will post a Mobi_Unpack_experimental with those fixes and many more, plus multiprocessing in place of subprocess calls in the GUI wrapper, which should allow for better unicode support in filenames and paths.

I would love to get feedback on it before releasing a new v0.60 version.

Thanks,

Kevin
Old 12-30-2012, 02:48 PM   #461
KevinH
Wizard
Posts: 1,077
Karma: 444444
Join Date: Nov 2009
Device: many
Hi,

Okay, here is an experimental version of Mobi_Unpack (call it v0.61beta). It should have all of the outstanding bug fixes, plus some robustness improvements for official Kindle ebooks that are not quite correctly generated (thank you Kovid), mobi_opf improvements (thanks DiapDealer), fixes for Sergey's bugs (thanks), and a more correct fix for the bug from nleblanc88 (thanks). It also switches to utf-8 internally, so that files and paths that need full unicode to be properly specified should now hopefully work; removes the need for unbuffered output by moving to the multiprocessing module; fixes occasional hangs in debug mode; and properly labels CTOC sections in debug mode.

It still really needs to be refactored and cleaned up but this should have everything I know about. Please give it a try. If it does not fix your bug or you run into problems of any sort, please let us know here asap.

If it passes muster, it will become version 0.61.

Thanks,

KevinH
Attached Files
File Type: zip Mobi_Unpack_experimental.zip (56.2 KB, 56 views)
File Type: zip Mobi_Unpack_ReadMe.htm.zip (1.8 KB, 48 views)

Last edited by KevinH; 12-30-2012 at 02:49 PM. Reason: fix typo
Old 12-31-2012, 05:04 AM   #462
Sergey Dubinets
Junior Member
Posts: 6
Karma: 10
Join Date: Dec 2012
Device: Kindle
v0.61beta works well.

Here are some comments so far:

1. mobi_ncx.py:9: we don't need to import readTagSection and getVariableWidthValue into this module.

2. The program can print nice diagnostics. The problem is that it prints UTF-8 strings to the console. This works only for English text (at least on Windows). When I debug Russian books I see less readable debug output.

3. escape/unescape in OPF. You recently added HTMLParser.unescape(). Are you sure that the original values are escaped? Unescaping not escaped values would be a bug.
Using saxutils.escape() is correct for text nodes:
data.append('<%s>%s</%s>\n' % (tag, xmlescape(self.h.unescape(value)), closingTag))
But it is not sufficient for attribute values:
data.append('<meta name="%s" content="%s" />\n' % (name, xmlescape(self.h.unescape(value))))

In the latter case you also need to escape " as &quot; and ' as &apos;.
I suggest you use quoteattr() for attributes instead of escape().

4. mobi_unpack.py:621: why don't you use the setsectiondescription() method? The same applies to 6 other locations in the same file.

5. mobi_unpack.py:704: redundant call. The same goes for lines 696, 697, and 698.

6. mobi_unpack.py:905: the method is never used.

7. mobi_unpack.py:608: duplicate map entry.
Old 12-31-2012, 10:12 AM   #463
KevinH
Wizard
Posts: 1,077
Karma: 444444
Join Date: Nov 2009
Device: many
Hi,
Thanks for your testing. I will look at all of the issues you pointed out, but I am most interested in issues with encodings. This version should work better, since utf-8 can encode all possible characters. Did you run from the command line or via the GUI? The GUI log window should show all characters correctly. Does it?

If you are running from the command line on Windows, the best way to run the program is to change your codepage to cp65001 first. If you do that, does it work?

Thanks,

Kevin


Old 12-31-2012, 11:49 AM   #464
KevinH
Wizard
Posts: 1,077
Karma: 444444
Join Date: Nov 2009
Device: many
Hi Sergey,

> 1. mobi_ncx.py:9 we don't need to import readTagSection, getVariableWidthValue to this module.

Yes, after the earlier refactoring these are no longer needed.

> 2. The program can print nice diagnostics. The problem is that it prints UTF-8 strings to the console. This works only for English text (at least on Windows). When I debug Russian books I see less readable debug output.

No, actually utf-8 should be able to represent any character in any language. The problem is that Windows does not use cp65001 (utf-8) for its console but some other codepage that cannot represent all possible characters. Yet Windows allows filenames and paths to have full unicode names that cannot be represented in its current limited 8-bit encoding. This is a serious bug, as you can be sent files that you cannot access in any way from python or the console.

Using utf-8 (cp65001) should allow python code to access any file or path on your system, even if it is written in Japanese or Chinese, let alone Russian. I was hoping that, since the Tk widgets in the Mobi_Unpack GUI use utf-8 internally, the GUI front-end to Mobi_Unpack would show characters properly in the Log window no matter what (unless you only have non-unicode-capable fonts installed).

If you use the command line/console, you should be able to change the codepage to 65001 (utf-8) and have things work for any file or path. I might be able to wrap stdout so that it converts back to the console encoding, but the better solution is to use a console encoding that can represent all characters (cp65001 = utf-8).
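
The "convert back to the console encoding" fallback could look roughly like the sketch below, written in Python 3. `to_console_bytes` is a hypothetical helper, not part of Mobi_Unpack, and cp866 is used only as an assumed example of a Russian console codepage.

```python
import sys


def to_console_bytes(utf8_bytes, console_encoding=None):
    """Re-encode utf-8 bytes for a console with a limited 8-bit codepage.

    Characters the codepage cannot represent are replaced with '?' instead
    of raising, so debug output degrades rather than crashes.
    """
    encoding = console_encoding or sys.stdout.encoding or 'ascii'
    return utf8_bytes.decode('utf-8').encode(encoding, errors='replace')


# A Russian word, held internally as utf-8 bytes.
word = '\u041f\u0440\u0438\u0432\u0435\u0442'
title = word.encode('utf-8')

# Survives a codepage that covers Cyrillic (cp866 assumed here) ...
print(to_console_bytes(title, 'cp866').decode('cp866') == word)
# ... and degrades to question marks on one that does not.
print(to_console_bytes(title, 'cp437'))
```

This illustrates why the wrapper is the weaker option: any character outside the console codepage is lost, whereas a utf-8 console can show all of them.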

So if you get a chance, please try it both ways and see what it takes to get both the console and gui mode to work properly.

The real problem is Windows allows full unicode file and path names but then uses a console encoding (and possibly fonts) that will not properly show the full range of characters. This is silly in the extreme (imho).


> 3. escape/unescape in OPF. You recently added HTMLParser.unescape().
> Are you sure that the original values are escaped? Unescaping not escaped values would be a bug.
> Using saxutils.escape() is correct for text nodes:
> data.append('<%s>%s</%s>\n' % (tag, xmlescape(self.h.unescape(value)), closingTag))
> But it is not sufficient for attribute values:
> data.append('<meta name="%s" content="%s" />\n' % (name, xmlescape(self.h.unescape(value))))
>
> In the latter case you also need to escape " as &quot; and ' as &apos;.
> I suggest you use quoteattr() for attributes instead of escape().

DiapDealer is working on a fix for the problem of some Mobi ebooks including html in their metadata when they technically should not. Since the opf is an xml document, we cannot allow any html in the metadata values that we then convert into proper xml opf entries.

I am not up to speed on what he wants to do here, so I will ask DiapDealer to look at this again to make sure your concerns are dealt with.

> 4. mobi_unpack.py:621: why don't you use the setsectiondescription() method?
> The same applies to 6 other locations in the same file.

fixed

> 5. mobi_unpack.py:704: redundant call. The same goes for lines 696, 697, and 698.

removed since it is duplicated in init, ditto for the others

> 6. mobi_unpack.py:905: the method is never used.

It is used when debugging the rawml; it is just not used in this version of the file. Keeping it causes no harm.

> 7. mobi_unpack.py:608: duplicate map entry.

fixed by removing the duplicate.


Thanks!

KevinH
Old 12-31-2012, 03:11 PM   #465
DiapDealer
Grand Sorcerer
Posts: 9,516
Karma: 43764640
Join Date: Jan 2010
Device: Nexus 7, Kindle Fire HD
Quote:
Originally Posted by Sergey Dubinets
3. escape/unescape in OPF. You recently added HTMLParser.unescape(). Are you sure that the original values are escaped? Unescaping not escaped values would be a bug.
Using saxutils.escape() is correct for text nodes:
data.append('<%s>%s</%s>\n' % (tag, xmlescape(self.h.unescape(value)), closingTag))
Quote:
Originally Posted by KevinH
I am not up to speed on what he wants to do here, so I will ask DiapDealer to look at this again to make sure your concerns are dealt with.
I'm not certain I understand the logic of this statement:
Quote:
"Are you sure that the original values are escaped? Unescaping not escaped values would be a bug."
There is no "bug" that I can discern. HTMLParser's mostly undocumented "unescape" method is perhaps titled a bit misleading-ly? It's essentially an un-entity routine. And it's perfectly capable of dealing with "not escaped values."

Many Kindle books are starting to come down the pike with html and/or entities in the MOBI/KF8 EXTH metadata. While that may be acceptable in a MOBI/KF8 file, it's unacceptable according to XML/OPF specs (other than the standard 5 entities for XML). I see no point in creating a non-compliant OPF file, so...

If there are no named/numbered entities in the contents of the metadata, then HTMLParser.unescape() will simply have no effect on it. Nothing. No bug. If there ARE any named/numbered entities, however... HTMLParser.unescape() will first convert them all to their unicode/utf-8 counterpart character representations. Saxutils.escape() then takes care of xml-escaping the mandatory (< > &) characters to complete all XML/OPF compliance.
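
The two-step behaviour described above can be tried out with a short sketch. It is written in Python 3, where `html.unescape` plays the role of the HTMLParser unescape method discussed here; `clean_metadata` is a hypothetical name, and the sample values are made up.

```python
from html import unescape
from xml.sax.saxutils import escape


def clean_metadata(value):
    """Resolve HTML entities to characters, then xml-escape & < and >."""
    return escape(unescape(value))


# No entities: unescape has no effect and escape finds nothing to do.
print(clean_metadata('plain text'))
# Named and numbered entities become real characters.
print(clean_metadata('caf&eacute; &#8212; bar'))
# HTML markup is neutralised, and &amp; round-trips back to &amp;.
print(clean_metadata('<p>Tom &amp; Jerry</p>'))
```

The last call shows the "entity to character and back to entity" round trip: `&amp;` is first turned into `&` and then escaped again.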

Descriptions often contain html paragraph formatting and the current method ensures that all html tags will be properly xml-escaped while at the same time, not completely destroying the intention of any unsupported (unsupported in XML/OPF) entities that may have been present in the MOBI/KF8 EXTH metadata.

I agree it may be overkill (some things could conceivably go from entity to character and back to entity, for instance). But I see no other method (meaning other standard python library method) to ensure that every potentially non-compliant hodge-podge of text, html, and entities becomes docile, XML/OPF-compliant entries.

Quote:
But it is not sufficient for attribute values:
data.append('<meta name="%s" content="%s" />\n' % (name, xmlescape(self.h.unescape(value))))

In the latter case you also need to escape " as &quot; and ' as &apos;.
I suggest you use quoteattr() for attributes instead of escape().
I take your point here. I've just not really run into any standard quotes (character or entity) bound for OPF meta attribute values before. I've only ever encountered them in stuff bound for OPF dc:metadata tags where they're not part of any quoted attribute values. That certainly doesn't mean they can't show up and blow things up, though.

I'm not certain quoteattr() is the right approach, though, as it can potentially change double-quotes to single quotes and vice-versa, depending on the situation. In such a case, I think it would make more sense to extend the escape() method by passing it the optional "entities" dictionary parameter, so that " and ' are xml-escaped as well as the three mandatory < > and &, rather than potentially changing double quotes to single quotes.

Code:
ENTITIES = {'"':'&quot;', "'":"&apos;"}
data.append('<meta name="%s" content="%s" />\n' % (name, xmlescape(self.h.unescape(value), ENTITIES)))
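
To make the difference between the two approaches concrete, here is a small Python 3 sketch using only `xml.sax.saxutils`. The helper names `meta_with_escape` and `meta_with_quoteattr` and the sample value are hypothetical; the HTML-unescaping step is left out to keep the comparison focused on quoting.

```python
from xml.sax.saxutils import escape, quoteattr

ENTITIES = {'"': '&quot;', "'": '&apos;'}


def meta_with_escape(name, value):
    # escape() with the extra entities: the template supplies the double
    # quotes, and any quote characters inside the value become entities.
    return '<meta name="%s" content="%s" />\n' % (name, escape(value, ENTITIES))


def meta_with_quoteattr(name, value):
    # quoteattr() returns the value already wrapped in quotes, and it picks
    # single quotes when the value contains a double quote.
    return '<meta name="%s" content=%s />\n' % (name, quoteattr(value))


value = 'He said "hi"'
print(meta_with_escape('description', value))
print(meta_with_quoteattr('description', value))
```

Both outputs are well-formed XML. The difference is that quoteattr() decides on the delimiter itself, so the template must not add its own quotes, while escape() with an entities dictionary keeps the double-quoted template unchanged.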

Last edited by DiapDealer; 12-31-2012 at 05:35 PM.