Quote:
Originally Posted by howsey
Thanks for that. I've now got it working reasonably well. The next issue is that the article contains hyperlinks. The default processing seems to be to replace these with the element text and then include the URL in brackets afterwards. Is there a way to stop the URL coming out? My initial thought was to try the pre/post processing functions, but these appear to run far too early.
Code:
def preprocess_html(soup):
    # Blank the target of every link so only the link text is left
    for a in soup.findAll('a', href=True):
        a['href'] = ''
    return soup
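If you want to see what blanking the hrefs does to the markup without running a full fetch, here is a rough standalone sketch of the same idea using only the Python standard library. It is not calibre code: `HrefBlanker` and `blank_hrefs` are names made up for this illustration, and it uses `html.parser` in place of the soup object the recipe hook receives.

```python
from html import escape
from html.parser import HTMLParser


class HrefBlanker(HTMLParser):
    """Re-emit HTML as parsed, except every <a href=...> gets href=""."""

    def __init__(self):
        # Keep entities such as &amp; as-is instead of decoding them
        super().__init__(convert_charrefs=False)
        self.out = []

    def _render(self, tag, attrs, close):
        parts = [tag]
        for name, value in attrs:
            if tag == "a" and name == "href":
                value = ""  # same effect as a['href'] = '' in the snippet above
            if value is None:
                parts.append(name)  # attribute without a value
            else:
                parts.append('%s="%s"' % (name, escape(value, quote=True)))
        return "<" + " ".join(parts) + close

    def handle_starttag(self, tag, attrs):
        self.out.append(self._render(tag, attrs, ">"))

    def handle_startendtag(self, tag, attrs):
        self.out.append(self._render(tag, attrs, " />"))

    def handle_endtag(self, tag):
        self.out.append("</%s>" % tag)

    def handle_data(self, data):
        self.out.append(data)

    def handle_entityref(self, name):
        self.out.append("&%s;" % name)

    def handle_charref(self, name):
        self.out.append("&#%s;" % name)


def blank_hrefs(html):
    """Hypothetical helper: return html with all link targets emptied."""
    parser = HrefBlanker()
    parser.feed(html)
    parser.close()
    return "".join(parser.out)


sample = '<p>See <a href="http://example.com/x" class="ext">this page</a> &amp; more.</p>'
print(blank_hrefs(sample))
```

The link text and the other attributes survive; only the target is emptied, so a converter that appends the URL in brackets should have nothing left to append.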