Quote:
Originally Posted by mufc
Hope I can find somewhere to learn that
Here's an example from a recipe:
Code:
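import re  # needed at the top of the recipe file

# Each entry is (compiled pattern, function that takes the match and returns the replacement).
preprocess_regexps = [
    # Collapse everything from <body ... up to the content div into a bare <body><div>.
    (re.compile(r'<body.*?<div class="pad_10L10R">', re.DOTALL|re.IGNORECASE), lambda match: '<body><div>'),
    # Collapse everything from the first </div> through </body>.
    (re.compile(r'</div>.*</body>', re.DOTALL|re.IGNORECASE), lambda match: '</div></body>'),
    # The rest replace their match with nothing.
    (re.compile('\r'), lambda match: ''),
    (re.compile(r'<!-- .+? -->', re.DOTALL|re.IGNORECASE), lambda match: ''),
    (re.compile(r'<link .+?>', re.DOTALL|re.IGNORECASE), lambda match: ''),
    (re.compile(r'<script.*?</script>', re.DOTALL|re.IGNORECASE), lambda match: ''),
    (re.compile(r'<noscript.*?</noscript>', re.DOTALL|re.IGNORECASE), lambda match: ''),
    (re.compile(r'<meta .*?/>', re.DOTALL|re.IGNORECASE), lambda match: ''),
]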
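In the first one, he's replacing everything from the opening <body tag through <div class="pad_10L10R"> with a bare <body><div>, which drops the class and whatever came before that div. In the second, he's replacing everything from the first </div> through </body> with </div></body>, which drops whatever comes after that div closes. The others just delete what they match: carriage returns, HTML comments, and <link>, <script>, <noscript>, and self-closing <meta> tags. Brute-force regex with preprocess_regexps is a last resort, but it works great when you need it. Just be careful not to delete partial tags: if you delete an opening tag, delete its closing tag, too.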
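
If you want to try patterns like these before putting them in a recipe, here's a rough sketch using plain Python re on a made-up page. The sample HTML and the apply_regexps helper are my own inventions for illustration, not part of calibre; the loop only imitates applying the pairs in order, so treat it as a sandbox rather than what calibre actually does internally.

```python
import re

# Three of the (pattern, replacement-function) pairs from the recipe above.
preprocess_regexps = [
    (re.compile(r'<body.*?<div class="pad_10L10R">', re.DOTALL | re.IGNORECASE),
     lambda match: '<body><div>'),
    (re.compile(r'</div>.*</body>', re.DOTALL | re.IGNORECASE),
     lambda match: '</div></body>'),
    (re.compile(r'<script.*?</script>', re.DOTALL | re.IGNORECASE),
     lambda match: ''),
]


def apply_regexps(html, rules):
    """Hypothetical helper: run each rule over the page, in list order."""
    for pattern, func in rules:
        html = pattern.sub(func, html)
    return html


# Made-up page: a nav div, the content div, and a footer div.
sample = (
    '<html><body class="x"><div id="nav">menu</div>'
    '<div class="pad_10L10R"><p>Story</p><script>track()</script></div>'
    '<div id="footer">junk</div></body></html>'
)

cleaned = apply_regexps(sample, preprocess_regexps)
print(cleaned)
# <html><body><div><p>Story</p></div></body></html>

# The partial-tag trap: removing only the opening tag leaves an orphan </span>.
snippet = '<p><span class="ad">Buy!</span></p>'
bad = re.sub(r'<span class="ad">', '', snippet)
print(bad)
# <p>Buy!</span></p>

# Matching the whole element removes both halves together.
good = re.sub(r'<span class="ad">.*?</span>', '', snippet, flags=re.DOTALL)
print(good)
# <p></p>
```

Order matters here: the second pattern starts at the first </div> it finds, so it only lands on the right one because the first pattern has already removed the nav div.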