11-15-2012, 04:01 PM   #445
DiapDealer
Grand Sorcerer
Posts: 8,298
Karma: 36122446
Join Date: Jan 2010
Device: Kindle Fire HD, Kindle 2
I'm starting to get the idea that I'm chasing my own tail with regard to ensuring compliant OPF files.

I thought the escape method from the standard xml.sax library was working quite well on metadata items, and it is, in fact, converting all instances of '&', '<' and '>' to XML-compliant entities as intended. But I'm discovering that a lot of metadata out there (especially KF8 subjects/descriptions) seems to contain HTML entities. This, by itself, wouldn't pose a problem. The problem is that my XML escape method is dutifully whacking all the ampersands in those poor defenseless entities and turning them into gibberish, basically.
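
To make the failure concrete, here is a minimal sketch of the double-escaping. The subject string is a made-up example, not taken from a real book, and I'm assuming the escape in question is xml.sax.saxutils.escape:

```python
from xml.sax.saxutils import escape

# Hypothetical metadata value that already contains an HTML entity.
subject = "Science Fiction &amp; Fantasy"

# escape() treats every '&' as a literal ampersand, including the one
# that opens the existing entity, so the entity gets escaped a second time.
double_escaped = escape(subject)
print(double_escaped)  # Science Fiction &amp;amp; Fantasy
```

A reader rendering that OPF value would show the literal text "&amp;" instead of a plain ampersand.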

So in one more attempt to overthink a process... enter the criminally underutilized (not to mention unsung) "unescape" method of Python's HTMLParser module. The unescape method first converts all entities that may be present in the data to their Unicode character representations (OPF files are UTF-8 or UTF-16 by spec, after all). Only then does the XML escape method fix up any stray ampersands and/or left/right angle brackets.
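
Roughly, the unescape-then-escape order looks like the sketch below. It is not the code from the attached script: clean_metadata is a hypothetical helper name, the sample strings are invented, and it uses html.unescape from Python 3 as a stand-in for the HTMLParser unescape method mentioned above (which was removed in Python 3.9).

```python
from html import unescape
from xml.sax.saxutils import escape


def clean_metadata(value):
    """Decode any HTML entities first, then escape what is left for XML."""
    # Step 1: '&amp;', '&eacute;' and friends become real Unicode characters.
    decoded = unescape(value)
    # Step 2: only genuine '&', '<' and '>' characters remain to be escaped.
    return escape(decoded)


print(clean_metadata("Science Fiction &amp; Fantasy"))
print(clean_metadata("Caf&eacute; <Stories> & Tales"))
```

The order matters: escaping first and unescaping afterwards would simply undo the escaping and leave raw '&' and '<' in the OPF.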

All this rambling means that I have an updated mobi_opf.py script for you to consider, pdurrant.
Attached Files
File Type: zip mobi_opf.zip (3.2 KB, 55 views)