Regarding the correction of incomplete (relative) references: the current version seems to correct links only in <a> tags, not in <area> or <link> tags.
The links in these tags could be corrected by adding
    # <area> references
    for area in soup.findAll('area', href=lambda x: x and x.startswith('/')):
        href = area['href']
        if href.startswith('//'):
            area['href'] = 'https:' + href
        elif url_prefix:
            area['href'] = url_prefix + area['href']
    # <link> references
    for link in soup.findAll('link', href=lambda x: x and x.startswith('/')):
        href = link['href']
        if href.startswith('//'):
            link['href'] = 'https:' + href
        elif url_prefix:
            link['href'] = url_prefix + link['href']
in wikipedia.recipe, after the existing code
    for a in soup.findAll('a', href=lambda x: x and x.startswith('/')):
        href = a['href']
        if href.startswith('//'):
            a['href'] = 'https:' + href
        elif url_prefix:
            a['href'] = url_prefix + a['href']
I guess the code could be simplified, but it is just a quick workaround for the time being.
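One possible simplification, as a rough sketch only: since the three loops differ only in the tag name, they could probably be folded into a single loop, relying on findAll accepting a list of tag names. The helper names absolutize and fix_relative_hrefs are made up for illustration and are not part of the recipe; soup and url_prefix are assumed to be the same objects the recipe already uses.

```python
def absolutize(href, url_prefix):
    """Turn a protocol-relative or root-relative href into an absolute URL.

    Hypothetical helper, not part of wikipedia.recipe.
    """
    if href.startswith('//'):
        # Protocol-relative link: only the scheme is missing.
        return 'https:' + href
    if href.startswith('/') and url_prefix:
        # Root-relative link: prepend the site prefix when one is known.
        return url_prefix + href
    # Anything else is left untouched.
    return href


def fix_relative_hrefs(soup, url_prefix, tag_names=('a', 'area', 'link')):
    """Rewrite relative hrefs on all listed tags in one pass.

    Hypothetical helper; assumes soup is a BeautifulSoup object.
    """
    # Passing a list of names makes findAll match any of those tags.
    matches = soup.findAll(list(tag_names),
                           href=lambda x: x and x.startswith('/'))
    for tag in matches:
        tag['href'] = absolutize(tag['href'], url_prefix)
```

With this, the three loops would be replaced by a single call such as fix_relative_hrefs(soup, url_prefix). I have not tested it against the actual recipe.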
McDummy