Old 12-21-2010, 03:11 PM   #4
Starson17
Wizard
Quote:
Originally Posted by mufc View Post
Hope I can find somewhere to learn that
Here's an example from a recipe:
Code:
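    # Needs "import re" at the top of the recipe file.
    preprocess_regexps = [
        # Collapse everything from <body through the opening pad_10L10R div into a bare <body><div>
        (re.compile(r'<body.*?<div class="pad_10L10R">', re.DOTALL|re.IGNORECASE), lambda match: '<body><div>'),
        # Drop everything from the first </div> through the last </body> (the .* is greedy)
        (re.compile(r'</div>.*</body>', re.DOTALL|re.IGNORECASE), lambda match: '</div></body>'),
        # Strip carriage returns
        (re.compile('\r'),lambda match: ''),
        # Strip <!-- ... --> comments, <link> tags, scripts, noscript blocks and self-closing <meta> tags
        (re.compile(r'<!-- .+? -->', re.DOTALL|re.IGNORECASE), lambda match: ''),
        (re.compile(r'<link .+?>', re.DOTALL|re.IGNORECASE), lambda match: ''),
        (re.compile(r'<script.*?</script>', re.DOTALL|re.IGNORECASE), lambda match: ''),
        (re.compile(r'<noscript.*?</noscript>', re.DOTALL|re.IGNORECASE), lambda match: ''),
        (re.compile(r'<meta .*?/>', re.DOTALL|re.IGNORECASE), lambda match: ''),
    ]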
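In the first one, he's replacing everything from the opening <body> tag through <div class="pad_10L10R"> with a plain <body><div>. That drops the class attribute and anything in the body before that div. In the second, he's deleting everything from the first </div> to the closing </body> tag, i.e. the stuff after the div closes (the .* is greedy, so it runs all the way to the last </body>). The others just delete whatever they match: carriage returns, comments, link tags, scripts, noscript blocks and meta tags. Brute-force regex with preprocess_regexps is a last resort, but it works great when you need it. Just be careful not to delete partial tags: if you delete an opening tag, delete its closing tag too.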
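If you want to see what the first two rules do without running a whole recipe, here's a minimal sketch you can paste into a Python prompt. The sample HTML and the apply_rules helper are made up for illustration (calibre runs preprocess_regexps for you on the downloaded page); only the patterns come from the recipe above.

```python
import re

# Same (compiled pattern, replacement callable) pairs as in the recipe.
preprocess_regexps = [
    (re.compile(r'<body.*?<div class="pad_10L10R">', re.DOTALL | re.IGNORECASE),
     lambda match: '<body><div>'),
    (re.compile(r'</div>.*</body>', re.DOTALL | re.IGNORECASE),
     lambda match: '</div></body>'),
    (re.compile(r'<script.*?</script>', re.DOTALL | re.IGNORECASE),
     lambda match: ''),
]


def apply_rules(html, rules):
    """Hypothetical stand-in for what calibre does: run each rule in order."""
    for pattern, func in rules:
        html = pattern.sub(func, html)
    return html


# Made-up page: junk before and after the one div we want to keep.
sample = (
    '<html><body onload="x()">\n'
    '<div id="nav">menu</div>\n'
    '<div class="pad_10L10R"><p>Story text</p></div>\n'
    '<div id="footer">junk</div>\n'
    '<script>track()</script>\n'
    '</body></html>'
)

cleaned = apply_rules(sample, preprocess_regexps)
print(cleaned)
# prints: <html><body><div><p>Story text</p></div></body></html>
```

The nav div, the footer div and the script are all gone, and every tag that's left still has its partner. Note that the second rule cuts at the first </div> it finds, so it only works cleanly like this when the content div has no nested divs inside it.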