Hi All!

I'm trying to clean up the pages of a newspaper site with really messy HTML. The site relies heavily on tables.
In my recipe I was able to find the needed content and extract it with keep_only_tags and remove_tags.
Code:
# Keep only the table cell that holds the article body
keep_only_tags = [
    dict(name='td', attrs={'class': ['content']}),
]
# Drop ads, tag lists, video blocks and footers inside that cell
remove_tags = [
    dict(name='div', attrs={'class': ['ad-container-outer',
                                      'tags noborder',
                                      'video-container',
                                      'h']}),
    dict(name='div', attrs={'style': ['width:17px; height:17px; background-color:#8D0648; margin-bottom:25px; float:right;']}),
    dict(name='td', attrs={'class': ['foot']}),
    dict(name='tfoot'),
]
But the articles sit in an inner table (thead, tr/td), which doesn't look good when I convert the recipe's output to MOBI for my Kindle: only the first screen is filled with text, and the second page is empty.
So I tried to get rid of the unnecessary tags, but without luck.
I tried postprocess_html:
But it gave me a TypeError:
Then I tried preprocess_regexps, but that gave me empty article pages:
Code:
# Attempt: turn every table-related opening/closing tag into a div
preprocess_regexps = [
    (re.compile(r'<table.*?>', re.IGNORECASE), lambda match: '<div>'),
    (re.compile(r'</table.*?>', re.IGNORECASE), lambda match: '</div>'),
    (re.compile(r'<thead.*?>', re.IGNORECASE), lambda match: '<div>'),
    (re.compile(r'</thead.*?>', re.IGNORECASE), lambda match: '</div>'),
    (re.compile(r'<tfoot.*?>', re.IGNORECASE), lambda match: '<div>'),
    (re.compile(r'</tfoot.*?>', re.IGNORECASE), lambda match: '</div>'),
    (re.compile(r'<tr.*?>', re.IGNORECASE), lambda match: '<div>'),
    (re.compile(r'</tr.*?>', re.IGNORECASE), lambda match: '</div>'),
    (re.compile(r'<td.*?>', re.IGNORECASE), lambda match: '<div>'),
    (re.compile(r'</td.*?>', re.IGNORECASE), lambda match: '</div>'),
]
The recipe in its current state (which works fine if you create e.g. PDF output) is available here:
https://github.com/zsoltika/.hu-reci...0_1_nap.recipe
So my question is: after cleaning up the article HTML via keep_only_tags and remove_tags, how does one replace some tags (in my case table, thead, tfoot, tr and td) with another tag name (e.g. </?span>), changing ONLY the tag names and not their contents?
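To make clear what kind of transformation I mean, here is a minimal standalone sketch in plain Python, outside of calibre. The helper name rename_table_tags and the TABLE_TAGS list are made up for this illustration and are not part of the recipe API. It drops the tags' attributes and assumes no attribute value contains a '>' character, so it is only a rough illustration of the desired result, not a robust HTML rewriter.

```python
import re

# Hypothetical list of tag names to rename (not a calibre setting)
TABLE_TAGS = ("table", "thead", "tbody", "tfoot", "tr", "td", "th")

# Matches an opening or closing tag for any name above; \b keeps
# '<tr' from matching longer names such as '<track>'
_TAG_RE = re.compile(r"<(/?)(?:%s)\b[^>]*>" % "|".join(TABLE_TAGS),
                     re.IGNORECASE)


def rename_table_tags(html, new_name="div"):
    """Replace table-related tags with new_name, leaving contents alone."""
    # group(1) is '/' for a closing tag and '' for an opening tag
    return _TAG_RE.sub(lambda m: "<%s%s>" % (m.group(1), new_name), html)


sample = '<table class="x"><tr><td>Hi <b>there</b></td></tr></table>'
print(rename_table_tags(sample))
```

The text and the inner <b> element stay as they are; only the table, tr and td wrappers change their names.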
And one more thing popped into my mind: wouldn't it be nicer if the various API callables/overrides etc. at
http://calibre-ebook.com/user_manual/news_recipe.html were numbered? I mean, I can't tell which of ['preprocess_html', 'preprocess_regexps', 'keep_only_tags', 'remove_tags'] is applied earlier in the process.
Thanks for any help!