Old 06-12-2013, 01:15 PM   #1
dkfurrow
Member
 
Posts: 13
Karma: 10
Join Date: Jun 2013
Device: LG G-Pad 8.3
Parsing Chron.com with Beautiful Soup

Hello,
First-time poster, and relatively new to Python, so please bear with me.
I have a simple Python script that scrapes Chron.com for useful links, which I want to use to create an EPUB. The script is posted below.

I'm using BeautifulSoup v4 and parsing the pages with the "xml" option; leaving that option out gives poor results. BeautifulStoneSoup is deprecated in version 4, but it gives the same results as the "xml" option in BeautifulSoup, as I'd expect.

It appears that Calibre uses version 3 of BeautifulSoup, which doesn't give me the same results as my script. How can I address this? If the answer is "use lxml", what is the proper import statement for the recipe?

Thanks,
Dale
Code:
from bs4 import BeautifulSoup  
from bs4 import Tag
import re
import urllib2

print 'go to Chron sites, scrape them for useful links'
baseUrl = 'http://www.chron.com'

pages = {'news' : '/news/houston-texas/', 
         'business' : '/business/', 
         'opinion': '/opinion/', 
         'sports': '/sports/'}
page_links = dict()        
        
for page in pages.keys():
    url = urllib2.urlopen(baseUrl + pages[page])
    content = url.read()
    soup = BeautifulSoup(content, "xml")
    # grab the divs that hold the article lists
    divs = soup.findAll('div', attrs={'class': re.compile('simplelist|scp-feature')})
    links_dict = {}
    for div in divs:
        print 'Page:', page, ' div:', div['class'], ' Number of Children:', len(div.findChildren())
        for element in div.descendants:
            # keep only anchors with a non-trivial href and non-trivial link text;
            # .get() avoids a KeyError on anchors that have no href attribute
            if isinstance(element, Tag) and element.name == u'a':
                href = element.get('href', '')
                if len(href) > 10 and element.contents and len(element.contents[0]) > 10:
                    links_dict[baseUrl + href] = element.contents[0]
    page_links[page] = links_dict

print 'Here is the result of the web scrape'
for page in page_links.keys():
    links_dict = page_links[page]
    for link in links_dict:
        print page, " | ", links_dict[link], " | ", link
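In case it helps, here is a minimal sketch of the same link-harvesting logic using only the standard library's HTML parser (module `html.parser` in Python 3, `HTMLParser` in Python 2), which sidesteps the BeautifulSoup 3-vs-4 question entirely. It keeps the same heuristics as the script (href and link text both longer than 10 characters, anchors restricted to the `simplelist`/`scp-feature` divs); the sample HTML below is made up for illustration, not taken from Chron.com.

```python
# Stdlib-only sketch of the recipe's link harvesting, no BeautifulSoup needed.
from html.parser import HTMLParser
import re

# same div-class filter the recipe uses
LIST_CLASS = re.compile('simplelist|scp-feature')

class LinkHarvester(HTMLParser):
    """Collect href -> link-text pairs for anchors inside matching divs."""
    def __init__(self):
        HTMLParser.__init__(self)
        self.depth = 0      # nesting depth inside a matching div (0 = outside)
        self.links = {}     # href -> link text
        self._href = None   # href of the anchor currently open, if any
        self._text = []     # text chunks of the current anchor

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == 'div':
            # enter counting mode on a matching div; count nested divs too,
            # so the matching close tag balances correctly
            if self.depth or LIST_CLASS.search(attrs.get('class') or ''):
                self.depth += 1
        elif tag == 'a' and self.depth:
            self._href = attrs.get('href', '')
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == 'div' and self.depth:
            self.depth -= 1
        elif tag == 'a' and self._href is not None:
            text = ''.join(self._text).strip()
            # same heuristics as the recipe: skip short hrefs and short titles
            if len(self._href) > 10 and len(text) > 10:
                self.links[self._href] = text
            self._href = None

# made-up sample markup for illustration
sample = '''
<div class="simplelist">
  <a href="/news/houston-texas/article-12345">A headline long enough to keep</a>
  <a href="/x">too short</a>
</div>
<div class="sidebar"><a href="/news/houston-texas/skip-me-here">Outside the target divs</a></div>
'''

parser = LinkHarvester()
parser.feed(sample)
print(parser.links)  # only the long link inside the matching div survives
```

This keeps the recipe's behavior independent of whichever BeautifulSoup version Calibre bundles, at the cost of writing the div/anchor bookkeeping by hand.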