12-05-2009, 12:21 PM   #66
KevinH
Sigil Developer
more on xpml2xhtml.py

Hi,

Yes, xpml2xhtml.py is in no way only my work. I have exchanged ideas and code with "user_none", borrowed ideas from WayneD's perl pml2html.pl conversion program, taken ideas and code posted on the Dark Blog by others, and of course started with the original code posted on the Dark Blog.

I just now borrowed the idea of cleaning up chars. I hated to touch the pml file produced, since that is the original. But I have now added the following to my latest version of xpml2xhtml.py, which cleans up the last issue that was forcing me to use tidy (handling those special win1252 chars).

Based on Jim and user_none comments above, I have added:

import re

def cleanupHighChars(src):
    # convert special win1252 chars 0x80 - 0xa0 to \aNNN so they are properly handled later
    src = re.sub('[\x80-\xa0]', lambda x: '\\a%03d' % ord(x.group()), src)
    # convert anything above 0xff to a \Uxxxx tag
    src = re.sub('[^\x00-\xff]', lambda x: '\\U%04x' % ord(x.group()), src)
    return src
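
To see what that recoding does on a small sample, here is a minimal sketch. It assumes Python 3 and that the pml bytes were decoded as latin-1, so that each win1252 byte arrives as one character with the same value. The name cleanup_high_chars and the sample string are only for illustration.

```python
import re

def cleanup_high_chars(src):
    # Recode 0x80-0xa0 as a pml \aNNN tag (three-digit decimal code).
    src = re.sub('[\x80-\xa0]', lambda m: '\\a%03d' % ord(m.group()), src)
    # Recode anything above 0xff as a pml \Uxxxx tag (four-digit hex code).
    src = re.sub('[^\x00-\xff]', lambda m: '\\U%04x' % ord(m.group()), src)
    return src

# Assumption: latin-1 decoding keeps bytes 0x93 and 0x94 as chars 0x93 and 0x94.
sample = b'\x93Hi\x94 caf\xe9'.decode('latin-1')
cleaned = cleanup_high_chars(sample)
print(cleaned)  # \a147Hi\a148 café
```

The é (0xe9) is left alone because it sits above 0xa0 but still inside the single-byte range, so neither pattern matches it.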

cleanupHighChars recodes any of these special win1252 chars it finds into more proper pml using the \a and \U tags. I have also expanded the pml_chars array, based on the following win1252 page:

http://www.microsoft.com/globaldev/r.../sbcs/1252.htm

which gives me

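pml_chars = {
    # 129, 141, 143, 144 and 157 are unassigned in win1252
    128: '€', 129: '',  130: '‚', 131: 'ƒ', 132: '„',   # 130 is the single low-9 quote
    133: '…', 134: '†', 135: '‡', 136: 'ˆ', 137: '‰',
    138: 'Š', 139: '‹', 140: 'Œ', 141: '',  142: 'Ž',
    143: '',  144: '',  145: '‘', 146: '’', 147: '“',
    148: '”', 149: '•', 150: '–', 151: '—', 152: '˜',   # 152 is the small tilde
    153: '™', 154: 'š', 155: '›', 156: 'œ', 157: '',
    158: 'ž', 159: 'Ÿ', 160: ' ',                       # 160 is the no-break space
}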
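
One way to cross-check a table like this is to ask Python's own cp1252 codec what each byte decodes to. This is only a sketch, assuming Python 3; build_cp1252_table is a made-up helper name, not part of xpml2xhtml.py. It stores the decoded characters rather than html codes.

```python
def build_cp1252_table():
    # Map each code 0x80-0xa0 to the character the cp1252 codec assigns it.
    table = {}
    for code in range(0x80, 0xa1):
        try:
            table[code] = bytes([code]).decode('cp1252')
        except UnicodeDecodeError:
            # The codec has no assignment for this byte; keep an empty entry.
            table[code] = ''
    return table

table = build_cp1252_table()
undefined = sorted(code for code, ch in table.items() if ch == '')
print(undefined)  # [129, 141, 143, 144, 157]
print(hex(ord(table[150])), hex(ord(table[151])))  # 0x2013 0x2014
```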


Then I handle all of the \a tag values by translating them through that table:

elif cmd == 'a':
    final += self.pml_chars.get(attr, '&#%d;' % attr)
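
Outside of the parser class, the same lookup-with-fallback can be sketched as a standalone function. This is not how xpml2xhtml.py is structured; expand_a_tags and the trimmed-down PML_CHARS table are hypothetical, and the sketch assumes Python 3 and that every \a tag carries exactly three decimal digits.

```python
import re

# Hypothetical, trimmed-down table; the real one covers 128-160.
PML_CHARS = {147: '\u201c', 148: '\u201d', 150: '\u2013'}

def expand_a_tags(text, chars=PML_CHARS):
    def repl(match):
        attr = int(match.group(1))
        # Same fallback as the snippet above: a numeric character reference.
        return chars.get(attr, '&#%d;' % attr)
    # Replace every \aNNN tag (backslash, 'a', three decimal digits).
    return re.sub(r'\\a(\d{3})', repl, text)

result = expand_a_tags('\\a147Hi\\a148 caf\\a233')
print(result)  # “Hi” caf&#233;
```

Codes that are in the table come out as the mapped character, and anything else (233 here) falls through to the plain '&#%d;' form.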


So I can now properly handle all of those special win1252 chars. They cannot be encoded in unicode just by their byte value (in unicode, 0x80 - 0x9f are control characters), so they need to be remapped to their proper html codes.

So now I can modify the program to take an optional --use-tidy flag that defaults to no, so that the code is usable even by people without tidy.
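
A flag like that could look roughly like the following. This is only a sketch using argparse, which is an assumption on my part and not necessarily what the script uses; build_parser and the infile argument are hypothetical names.

```python
import argparse

def build_parser():
    parser = argparse.ArgumentParser(description='Convert pml to xhtml (sketch).')
    # store_true means the flag is off unless it is given on the command line.
    parser.add_argument('--use-tidy', action='store_true', default=False,
                        help='post-process the output with tidy (off by default)')
    # Hypothetical positional argument for the input file.
    parser.add_argument('infile', help='path to the pml file')
    return parser

args_default = build_parser().parse_args(['book.pml'])
args_tidy = build_parser().parse_args(['--use-tidy', 'book.pml'])
print(args_default.use_tidy, args_tidy.use_tidy)  # False True
```

The conversion would then only call tidy when args.use_tidy is true, so nothing breaks on a machine where tidy is not installed.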

That said, I like to see the structure when I look at an html file, and tidy's nice indentation and wrapping make for easily understood code (i.e. they make it easy to see html breakpoints).

I will test my new code further and post a final version over the weekend.

Thanks for all of the code tips and ideas.

KevinH