Old 06-12-2013, 01:15 PM   #1
dkfurrow
Member
 
Posts: 13
Karma: 10
Join Date: Jun 2013
Device: LG G-Pad 8.3
Parsing Chron.com with Beautiful Soup

Hello,
First-time poster, and relatively new to Python, so please bear with me.
I have a simple Python script that scrapes Chron.com for useful links, which I want to use to create an EPUB. The script is posted below.

I'm using BeautifulSoup v4 and parsing the pages with the "xml" option; leaving that option out gives poor results. BeautifulStoneSoup is deprecated in version 4, but it gives the same results as the "xml" option in BeautifulSoup, as I'd expect.

It appears that Calibre uses version 3 of BeautifulSoup, which doesn't give me the same results as my script. How can I address this? If the answer is "use lxml", what is the proper import statement for the recipe?

Thanks,
Dale
Code:
from bs4 import BeautifulSoup  
from bs4 import Tag
import re
import urllib2

print 'go to Chron sites, scrape them for useful links'
baseUrl = 'http://www.chron.com'

pages = {'news' : '/news/houston-texas/', 
         'business' : '/business/', 
         'opinion': '/opinion/', 
         'sports': '/sports/'}
page_links = dict()        
        
for page in pages.keys():
    url = urllib2.urlopen(baseUrl + pages[page])
    content = url.read()
    soup = BeautifulSoup(content, "xml")
    # grab the divs that hold the article lists
    divs = soup.findAll('div', attrs={'class': re.compile('simplelist|scp-feature')})
    links_dict = {}
    for div in divs:
        print 'Page:', page, ' div:', div['class'], ' Number of Children:', len(div.findChildren())
        for element in div.descendants:
            # keep only anchors with a non-trivial href and non-trivial link text;
            # .get() avoids a KeyError on anchors that have no href attribute
            if isinstance(element, Tag) and element.name == u'a':
                href = element.get('href', '')
                if len(href) > 10 and element.contents and len(element.contents[0]) > 10:
                    links_dict[baseUrl + href] = element.contents[0]
    page_links[page] = links_dict

print 'Here is the result of the web scrape'
for page in page_links.keys():
    links_dict = page_links[page]
    for link in links_dict:
        print page, " | ", links_dict[link], " | ", link
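In case it helps, here is a minimal sketch of the same link-harvesting logic using only the standard library's HTML parser (module `html.parser` in Python 3, `HTMLParser` in Python 2), which sidesteps the BeautifulSoup 3-vs-4 question entirely. It keeps the same heuristics as the script (href and link text both longer than 10 characters, anchors restricted to the `simplelist`/`scp-feature` divs); the sample HTML below is made up for illustration, not taken from Chron.com.

```python
# Stdlib-only sketch of the recipe's link harvesting, no BeautifulSoup needed.
from html.parser import HTMLParser
import re

# same div-class filter the recipe uses
LIST_CLASS = re.compile('simplelist|scp-feature')

class LinkHarvester(HTMLParser):
    """Collect href -> link-text pairs for anchors inside matching divs."""
    def __init__(self):
        HTMLParser.__init__(self)
        self.depth = 0      # nesting depth inside a matching div (0 = outside)
        self.links = {}     # href -> link text
        self._href = None   # href of the anchor currently open, if any
        self._text = []     # text chunks of the current anchor

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == 'div':
            # enter counting mode on a matching div; count nested divs too,
            # so the matching close tag balances correctly
            if self.depth or LIST_CLASS.search(attrs.get('class') or ''):
                self.depth += 1
        elif tag == 'a' and self.depth:
            self._href = attrs.get('href', '')
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == 'div' and self.depth:
            self.depth -= 1
        elif tag == 'a' and self._href is not None:
            text = ''.join(self._text).strip()
            # same heuristics as the recipe: skip short hrefs and short titles
            if len(self._href) > 10 and len(text) > 10:
                self.links[self._href] = text
            self._href = None

# made-up sample markup for illustration
sample = '''
<div class="simplelist">
  <a href="/news/houston-texas/article-12345">A headline long enough to keep</a>
  <a href="/x">too short</a>
</div>
<div class="sidebar"><a href="/news/houston-texas/skip-me-here">Outside the target divs</a></div>
'''

parser = LinkHarvester()
parser.feed(sample)
print(parser.links)  # only the long link inside the matching div survives
```

This keeps the recipe's behavior independent of whichever BeautifulSoup version Calibre bundles, at the cost of writing the div/anchor bookkeeping by hand.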