Quote:
Originally Posted by scissors
Hi Chaps.
Can someone confirm that
preprocess_regexps = [
    (re.compile(r'<head>.*</head>', re.IGNORECASE | re.DOTALL), lambda match: '<head></head>')]
OR
preprocess_regexps = [
    (re.compile(r'<head>.*?</head>', re.IGNORECASE | re.DOTALL), lambda match: '<head></head>')]
should totally remove a downloaded page's <head> section?
Not necessarily. The head tag might have attributes, in which case a literal <head> will not match. I'd have to check to be sure, but this should work:
Code:
preprocess_regexps = [(re.compile(r'<head.*</head>', re.IGNORECASE | re.DOTALL), lambda match: '<head></head>')]
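If you want to see the difference outside calibre, here is a rough sketch in plain Python. The sample page and the strip_head helper are made up for illustration, and the third pattern is only my assumption of a tighter variant, not something from the recipe docs.

```python
import re

# Hypothetical sample page: the <head> tag carries an attribute,
# and the body contains a <header> element.
SAMPLE = (
    '<html><head profile="http://example.com/p"><title>T</title></head>'
    '<body><header>Top</header><p>Text</p></body></html>'
)

FLAGS = re.IGNORECASE | re.DOTALL


def strip_head(pattern, html):
    """Replace whatever `pattern` matches with an empty <head></head>."""
    return re.compile(pattern, FLAGS).sub(lambda match: '<head></head>', html)


# Pattern from the question: needs a bare <head>, so nothing matches here.
plain = strip_head(r'<head>.*?</head>', SAMPLE)

# Pattern from the reply: tolerates attributes on the head tag.
loose = strip_head(r'<head.*</head>', SAMPLE)

# Assumed tighter variant: word boundary after "head", non-greedy body.
strict = strip_head(r'<head\b[^>]*>.*?</head>', SAMPLE)

print(plain == SAMPLE)   # the question's pattern left the page untouched
print(loose)
print(strict == loose)
```

On this sample the question's pattern changes nothing, while the other two both empty the head. The tighter variant should only matter on odd pages, for example ones with more than one </head> in the source.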