Hi,
Yes, xpml2xhtml.py is in no way only my work. I have exchanged ideas and code with "user_none", borrowed ideas from "WayneD's" Perl pml2html.pl conversion program, taken ideas and code posted on the Dark blog by others, and of course started with the original code posted on the Dark blog.
I just now borrowed the idea of cleaning up chars. I hated to touch the pml file produced, since that is the original. But I have now added the following to my latest version of xpml2xhtml.py, and it cleans up the last issue that was forcing me to use tidy (handling those special win1252 chars).
Based on Jim's and user_none's comments above, I have added:
import re

def cleanupHighChars(src):
    # convert special win1252 chars 0x80 - 0xa0 to \a tags so they are properly handled later
    src = re.sub('[\x80-\xa0]', lambda x: '\\a%03d' % ord(x.group()), src)
    # convert anything above 0xff to \U tags
    src = re.sub('[^\x00-\xff]', lambda x: '\\U%04x' % ord(x.group()), src)
    return src
When it finds any of these special win1252 chars, it recodes them into more proper pml using the \a and \U tags. I have then expanded the pml_chars array as follows, based on the following win1252 page:
http://www.microsoft.com/globaldev/r.../sbcs/1252.htm
which gives me
pml_chars = {
128:'€', 129:'', 130:'‚', 131:'ƒ', 132:'„',
133:'…', 134:'†', 135:'‡', 136:'ˆ', 137:'‰',
138:'Š', 139:'‹', 140:'Œ', 141:'', 142:'Ž',
143:'', 144:'', 145:'‘', 146:'’', 147:'“',
148:'”', 149:'•', 150:'–', 151:'—', 152:'˜',
153:'™', 154:'š', 155:'›', 156:'œ', 157:'',
158:'ž', 159:'Ÿ', 160:' ',
}
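One way to double-check a hand-typed table like this is to compare it against Python's own cp1252 codec. The following is only a sketch under a few assumptions: it is written for Python 3 (the snippets in this post are Python 2 style), the helper name cp1252_entities is made up for illustration, and it emits numeric html references rather than whatever values the real pml_chars holds.

```python
# Sketch (Python 3): derive the win1252 mapping for bytes 0x80 - 0xa0 from
# Python's built-in cp1252 codec, for comparison with a hand-written table.
# cp1252_entities is a hypothetical helper, not part of xpml2xhtml.py.

def cp1252_entities(first=0x80, last=0xA0):
    table = {}
    for value in range(first, last + 1):
        try:
            char = bytes([value]).decode('cp1252')
        except UnicodeDecodeError:
            # Python's cp1252 codec assigns no character to this byte
            table[value] = ''
            continue
        # Use a numeric reference to the real Unicode code point
        table[value] = '&#%d;' % ord(char)
    return table

entities = cp1252_entities()
for value in sorted(entities):
    print(value, entities[value] or '(undefined)')
```

The five entries that come back empty (129, 141, 143, 144, 157) are the bytes the codec leaves undefined, so any other empty or duplicated entry in a hand-written table is worth a second look.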
Then I handle all of the \a tag values by translating them:
elif cmd == 'a':
    final += self.pml_chars.get(attr, '&#%d;' % attr)
So I can now properly handle all of those special win1252 chars, which cannot be encoded in html as numeric references to their byte values and instead need to be remapped to their own html codes.
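To make the two steps concrete, here is a rough end-to-end sketch: recode the high chars into \a and \U tags, then expand those tags into html. It rests on several assumptions. It targets Python 3, with the input decoded as latin-1 so each byte keeps its value. PML_CHARS is a shortened stand-in for the real table, with numeric references as values. expand_tags is a regex shortcut invented for this example, whereas the real program handles \a inside its own pml parser.

```python
import re

# Hypothetical, shortened stand-in for the pml_chars table
PML_CHARS = {128: '&#8364;', 133: '&#8230;', 147: '&#8220;', 148: '&#8221;'}

def cleanup_high_chars(src):
    # Recode win1252 specials (0x80 - 0xa0) as \aNNN: decimal, three digits
    src = re.sub('[\x80-\xa0]', lambda m: '\\a%03d' % ord(m.group()), src)
    # Recode anything above 0xff as \UXXXX: hexadecimal, four digits
    src = re.sub('[^\x00-\xff]', lambda m: '\\U%04x' % ord(m.group()), src)
    return src

def expand_a(match):
    # Look the value up, falling back to a plain numeric reference
    value = int(match.group(1))
    return PML_CHARS.get(value, '&#%d;' % value)

def expand_tags(src):
    # Regex shortcut for illustration only; it does not know about
    # escaped backslashes or any other pml tags.
    src = re.sub(r'\\a(\d{3})', expand_a, src)
    src = re.sub(r'\\U([0-9a-fA-F]{4})',
                 lambda m: '&#%d;' % int(m.group(1), 16), src)
    return src

# Bytes as they would sit in a win1252 pml file, kept one-to-one via latin-1
raw = b'\x93Hi\x94 \x85'.decode('latin-1')
recoded = cleanup_high_chars(raw)
print(recoded)               # \a147Hi\a148 \a133
print(expand_tags(recoded))  # &#8220;Hi&#8221; &#8230;
```

A value that is missing from the table, such as 160, falls through to '&#160;', which mirrors the .get(attr, '&#%d;' % attr) fallback.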
So now I can modify the program to take an optional --use-tidy flag that defaults to no, so that the code is usable even by people without tidy.
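A flag like that could be wired up roughly as follows. This is only a sketch: it assumes Python 3's argparse (the real program may well parse its options differently), and parse_args, tidy_available and want_tidy are names made up for this example.

```python
import argparse
import shutil

def parse_args(argv=None):
    parser = argparse.ArgumentParser(description='pml to xhtml conversion (sketch)')
    # Off by default, so the converter still works where tidy is not installed
    parser.add_argument('--use-tidy', action='store_true', default=False,
                        help='run the generated xhtml through tidy')
    return parser.parse_args(argv)

def tidy_available():
    # shutil.which returns None when no "tidy" executable is on the PATH
    return shutil.which('tidy') is not None

def want_tidy(args):
    # Only use tidy when it was asked for and can actually be found
    return args.use_tidy and tidy_available()

args = parse_args([])          # no flags given
print(args.use_tidy)           # False
args = parse_args(['--use-tidy'])
print(args.use_tidy)           # True
```

Checking for the executable as well as the flag means that asking for tidy on a machine without it can fall back to the untidied output instead of failing.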
That said, I like to see the structure when I look at an html file, and tidy's nice indentation and wrapping make for easily understood code (i.e. they make it easy to see html breakpoints).
I will test my new code further and post a final version over the weekend.
Thanks for all of the code tips and ideas.
KevinH