#1 | Member | Posts: 13 | Karma: 10 | Join Date: Jun 2013 | Device: LG G-Pad 8.3
Parsing Chron.com with Beautiful Soup
Hello,
First-time poster, and relatively new to Python, so please bear with me. I have a simple Python script that scrapes Chron.com for useful links, which I want to use to create an epub; the script is posted below. I'm using BeautifulSoup v4 and parsing the pages as XML, since leaving out the "xml" option gives poor results. BeautifulStoneSoup is deprecated in version 4, but it gives the same results as the "xml" option in BeautifulSoup, as I'd expect. It appears to me that calibre is using version 3 of BeautifulSoup, and it's not giving me the same results as the script. How can I address this? If the answer is "use lxml", what is the proper import statement for the recipe? Thanks, Dale
Code:
from bs4 import BeautifulSoup
from bs4 import Tag
import re
import urllib2

print 'go to Chron sites, scrape them for useful links'
baseUrl = 'http://www.chron.com'
pages = {'news': '/news/houston-texas/', 'business': '/business/',
         'opinion': '/opinion/', 'sports': '/sports/'}
page_links = dict()
for page in pages.keys():
    url = urllib2.urlopen(baseUrl + pages[page])
    content = url.read()
    soup = BeautifulSoup(content, "xml")
    divs = soup.findAll('div', attrs={'class': re.compile('simplelist|scp-feature')})
    links_dict = {}
    for div in divs:
        print 'Page: ', page, ' div: ', div['class'], ' Number of Children: ', len(div.findChildren())
        for element in div.descendants:
            if isinstance(element, Tag) and element.name == u'a' and len(element['href']) > 10:
                if len(element.contents[0]) > 10:
                    links_dict[baseUrl + element['href']] = element.contents[0]
    page_links[page] = links_dict
print 'Here is the result of the web scrape'
for page in page_links.keys():
    links_dict = page_links[page]
    for link in links_dict:
        print page, " | ", links_dict[link], " | ", link
#2 | creator of calibre | Posts: 45,231 | Karma: 27110894 | Join Date: Oct 2006 | Location: Mumbai, India | Device: Various
The import statement is the same as for using lxml in any Python script. The recipe system does not care what you use to parse HTML; all it cares about is that parse_index() returns the correct data structure, as documented.
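For illustration, a minimal sketch of a parse_index() built on lxml; the class name and the XPath are only carried over from the chron.com markup in the script above and may need adjusting:

```python
import urllib2
from lxml import html
from calibre.web.feeds.recipes import BasicNewsRecipe

class HoustonChronicleLxml(BasicNewsRecipe):
    title = u'The Houston Chronicle (lxml sketch)'

    def parse_index(self):
        baseUrl = 'http://www.chron.com'
        raw = urllib2.urlopen(baseUrl + '/news/houston-texas/').read()
        root = html.fromstring(raw)
        articles = []
        # the XPath stands in for the findAll(...'simplelist|scp-feature'...) calls above
        for a in root.xpath("//div[contains(@class, 'simplelist') or contains(@class, 'scp-feature')]//a[@href]"):
            title = a.text_content().strip()
            href = a.get('href')
            if len(title) > 10 and len(href) > 10:
                articles.append({'title': title,
                                 'url': href if href.startswith('http') else baseUrl + href,
                                 'description': '', 'date': ''})
        # the only contract: a list of (section title, list of article dicts)
        return [('news', articles)]
```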
#3 | Member | Posts: 13 | Karma: 10 | Join Date: Jun 2013 | Device: LG G-Pad 8.3
Thanks for the prompt reply. Okay, I think I've got it: the problem was the descendants attribute; when I used findChildren instead, it all worked. I've posted the recipe below, will test it for a while, and then submit it for inclusion in calibre.
There's no description on the sites from which I'm obtaining the feeds, but there is a description on the feed destination. Is there any established way to handle this, other than by grabbing the text from a call within parse_index? Code:
#!/usr/bin/env python
# vim:fileencoding=UTF-8:ts=4:sw=4:sta:et:sts=4:ai
__license__ = 'GPL v3'
__copyright__ = '2013, Dale Furrow dkfurrow@gmail.com'
'''
chron.com
'''
import re
import urllib2
from calibre.web.feeds.recipes import BasicNewsRecipe
from calibre.ebooks.BeautifulSoup import BeautifulSoup, Tag


class HoustonChronicle(BasicNewsRecipe):

    title = u'The Houston Chronicle'
    description = 'News from Houston, Texas'
    __author__ = 'Dale Furrow'
    language = 'en'
    no_stylesheets = True
    # use_embedded_content = False
    remove_attributes = ['style']
    auto_cleanup = True

    def parse_index(self):
        self.timefmt = ' [%a, %d %b, %Y]'
        baseUrl = 'http://www.chron.com'
        pages = [('news', '/news/houston-texas/'), ('business', '/business/'),
                 ('opinion', '/opinion/'), ('sports', '/sports/')]
        feeds = []
        totalLinks = 0
        for page in pages:
            articles = []
            section_links = set()
            url = urllib2.urlopen(baseUrl + page[1])
            content = url.read()
            soup = BeautifulSoup(content)
            divs = soup.findAll('div', attrs={'class': re.compile('scp-feature|simplelist')})
            for div in divs:
                self.log('Page: ', page[0], ' div: ', div['class'],
                         ' Number of Children: ', len(div.findChildren()))
                for child in div.findChildren():
                    if isinstance(child, Tag) and child.name == u'a' and len(child['href']) > 10:
                        if len(child.contents[0]) > 10 and child['href'] not in section_links:
                            section_links.add(child['href'])
                            if child['href'].find('http') == -1:
                                link = baseUrl + child['href']
                            else:
                                link = child['href']
                            title = child.contents[0]
                            totalLinks += 1
                            self.log('\tFound article ', totalLinks, ' at ', title, 'at', link)
                            articles.append({'title': title, 'url': link,
                                             'description': '', 'date': ''})
            if articles:
                feeds.append((page[0], articles))
        self.log('Found ', totalLinks, ' articles --returning feeds')
        return feeds
#4 | creator of calibre | Posts: 45,231 | Karma: 27110894 | Join Date: Oct 2006 | Location: Mumbai, India | Device: Various
populate_article_metadata()
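That is, roughly along these lines; a sketch only, where the 210-character cutoff and the single-paragraph grab are illustrative rather than the final recipe:

```python
from calibre.web.feeds.recipes import BasicNewsRecipe

class HoustonChronicle(BasicNewsRecipe):
    # ... parse_index() as in the post above ...

    def populate_article_metadata(self, article, soup, first):
        # soup here is the downloaded article page, so a missing description
        # can be filled in from the article body
        if not first:
            return
        if not article.text_summary or not article.text_summary.strip():
            p = soup.find('p')
            if p is not None:
                article.summary = article.text_summary = \
                    self.tag_to_string(p, use_alt=False).strip()[:210]
```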
#5 | Member | Posts: 13 | Karma: 10 | Join Date: Jun 2013 | Device: LG G-Pad 8.3
Yes, that works; I was able to populate the summary from the article body. I'd like to populate the date as well, but it appears that the soup passed to populate_article_metadata has already been stripped down to the basic article body, which removes the tags I'm interested in. I tried to use the keep_only_tags feature to add the appropriate tags to the article body, but it didn't work. I see another poster has the same issue with that feature, so I'll just watch that thread. Recipe posted below:
Code:
#!/usr/bin/env python
# -*- coding: utf-8 -*-
__license__ = 'GPL v3'
__copyright__ = '2013, Dale Furrow dkfurrow@gmail.com'
'''
chron.com
'''
import re, string
import urllib2
from calibre.web.feeds.recipes import BasicNewsRecipe
from calibre.ebooks.BeautifulSoup import BeautifulSoup, Tag


class HoustonChronicle(BasicNewsRecipe):

    title = u'The Houston Chronicle'
    description = 'News from Houston, Texas'
    __author__ = 'Dale Furrow'
    language = 'en'
    no_stylesheets = True
    # use_embedded_content = False
    remove_attributes = ['style']
    auto_cleanup = True

    def parse_index(self):
        self.timefmt = ' [%a, %d %b, %Y]'
        baseUrl = 'http://www.chron.com'
        pages = [('news', '/news/houston-texas/'), ('business', '/business/'),
                 ('opinion', '/opinion/'), ('sports', '/sports/')]
        feeds = []
        totalLinks = 0
        for page in pages:
            articles = []
            section_links = set()
            url = urllib2.urlopen(baseUrl + page[1])
            content = url.read()
            soup = BeautifulSoup(content)
            divs = soup.findAll('div', attrs={'class': re.compile('scp-feature|simplelist|scp-item')})
            for div in divs:
                self.log('Page: ', page[0], ' div: ', div['class'],
                         ' Number of Children: ', len(div.findChildren()))
                for child in div.findChildren():
                    if isinstance(child, Tag) and child.name == u'a' and len(child['href']) > 10:
                        if len(child.contents[0]) > 10 and child['href'] not in section_links:
                            section_links.add(child['href'])
                            if child['href'].find('http') == -1:
                                link = baseUrl + child['href']
                            else:
                                link = child['href']
                            title = child.contents[0]
                            totalLinks += 1
                            self.log('\tFound article ', totalLinks, ' at ', title, 'at', link)
                            articles.append({'title': title, 'url': link,
                                             'description': '', 'date': ''})
            if articles:
                feeds.append((page[0], articles))
        self.log('Found ', totalLinks, ' articles --returning feeds')
        return feeds

    def populate_article_metadata(self, article, soup, first):
        if not first:
            return
        outputParagraph = ""
        max_length = 210  # approximately three lines of text
        try:
            if len(article.text_summary.strip()) == 0:
                articlebody = soup.find('body')
                if articlebody:
                    paras = articlebody.findAll('p')
                    for p in paras:
                        refparagraph = self.tag_to_string(p, use_alt=False).strip()
                        # account for blank paragraphs and short paragraphs by appending them to longer ones
                        outputParagraph += (" " + refparagraph)
                        if len(outputParagraph) > max_length:
                            article.summary = article.text_summary = outputParagraph.strip()[0:max_length]
                            return
            else:
                article.summary = article.text_summary = article.text_summary
        except:
            self.log("Error creating article descriptions")
            return
#6 | creator of calibre | Posts: 45,231 | Karma: 27110894 | Join Date: Oct 2006 | Location: Mumbai, India | Device: Various
If you want to use keep_only_tags, you have to disable auto_cleanup first.
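In other words, something along these lines; the tag filters are just copied from the recipe above to show the interaction:

```python
import re
from calibre.web.feeds.recipes import BasicNewsRecipe

class HoustonChronicle(BasicNewsRecipe):
    auto_cleanup = False  # keep_only_tags has no effect while auto_cleanup is True
    keep_only_tags = [
        dict(name='div', attrs={'class': re.compile('hentry')}),
        dict(name='h5', attrs={'class': re.compile('timestamp')}),
        dict(name='span', attrs={'class': re.compile('post-date|entry-date')}),
    ]
```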
#7 | creator of calibre | Posts: 45,231 | Karma: 27110894 | Join Date: Oct 2006 | Location: Mumbai, India | Device: Various
Also note that when converting a URL to soup, use:
soup = self.index_to_soup(baseUrl + page[1])
#8 | Member | Posts: 13 | Karma: 10 | Join Date: Jun 2013 | Device: LG G-Pad 8.3
Okay, thanks very much for your help. I'm making progress here; code posted below. keep_only_tags is working, and the correct date is attached to each article.
I plan on removing old articles in populate_article_metadata, but haven't done that yet. I know there are some tags that still need to be removed to make things cleaner, but I have a question about a repeated error message:

    Traceback (most recent call last):
      File "site-packages\calibre\web\fetch\simple.py", line 431, in process_images
      File "site-packages\calibre\utils\magick\__init__.py", line 132, in load
    Exception: no decode delegate for this image format `' @ error/blob.c/BlobToImage/360

It doesn't halt execution, but it's difficult for me to tell where the error is coming from. Is there an attribute I can set to ignore images which can't be processed? Code:
#!/usr/bin/env python
# -*- coding: utf-8 -*-
__license__ = 'GPL v3'
__copyright__ = '2013, Dale Furrow dkfurrow@gmail.com'
'''
chron.com
'''
import re, string, time
import urllib2
from calibre.web.feeds.recipes import BasicNewsRecipe
from calibre.ebooks.BeautifulSoup import BeautifulSoup, Tag
from calibre.utils.date import dt_factory, utcnow, local_tz


def getRegularTimestamp(dateString):
    try:
        outDate = time.strptime(dateString, "%Y-%m-%dT%H:%M:%SZ")
        return outDate
    except:
        return None

regextest = '(?:Jan(?:uary)?|Feb(?:ruary)?|Mar(?:ch)?|Apr(?:il)?|May|Jun(?:e)?|\
Jul(?:y)?|Aug(?:ust)?|Sep(?:tember)?|Oct(?:ober)?|Nov(?:ember)?|Dec(?:ember)?) \
[0-9]{1,2}, 20[01][0-9]'


def GetDateFromString(inText):
    match = re.findall(regextest, inText)
    if match:
        try:
            outDate = time.strptime(match[0], "%B %d, %Y")
            return outDate
        except:
            return None
    else:
        return None


def getTimestampFromSoup(soup):
    timestampEle = soup.find('h5', attrs={'class': re.compile('timestamp')})
    if timestampEle is not None:
        try:
            timestampText = timestampEle['title']
            return getRegularTimestamp(timestampText)
        except:
            return None
    else:
        timestampEle = soup.find('span', attrs={'class': re.compile('post-date|entry-date')})
        if timestampEle is not None:
            try:
                timestampText = timestampEle.string
                return GetDateFromString(timestampText)
            except:
                return None
        else:
            return None


class HoustonChronicle(BasicNewsRecipe):

    title = u'The Houston Chronicle'
    description = 'News from Houston, Texas'
    __author__ = 'Dale Furrow'
    language = 'en'
    no_stylesheets = True
    # use_embedded_content = False
    remove_attributes = ['style']
    # auto_cleanup = True
    # dict(name='div', attrs={'class':re.compile('toolsList')})
    # keep_only_tags = [dict(id=['content', 'heading'])]
    # auto_cleanup_keep = '//*[@class="timestamp"]|//span[@class="entry-date"]|//span[@class="post-date"]'
    keep_only_tags = [dict(name='div', attrs={'class': re.compile('hentry')}),
                      dict(name='span', attrs={'class': re.compile('post-date|entry-date')}),
                      dict(name='h5', attrs={'class': re.compile('timestamp')}),
                      dict(name='div', attrs={'id': re.compile('post-')})]

    def parse_index(self):
        self.timefmt = ' [%a, %d %b, %Y]'
        baseUrl = 'http://www.chron.com'
        pages = [('business', '/business/')]
        feeds = []
        totalLinks = 0
        for page in pages:
            articles = []
            section_links = set()
            # url = urllib2.urlopen(baseUrl + page[1])
            # content = url.read()
            soup = self.index_to_soup(baseUrl + page[1])
            divs = soup.findAll('div', attrs={'class': re.compile('scp-feature|simplelist|scp-item')})
            for div in divs:
                # self.log('Page: ', page[0], ' div: ', div['class'], ' Number of Children: ', len(div.findChildren()))
                for child in div.findChildren():
                    if isinstance(child, Tag) and child.name == u'a' and len(child['href']) > 10:
                        if len(child.contents[0]) > 10 and child['href'] not in section_links:
                            section_links.add(child['href'])
                            if child['href'].find('http') == -1:
                                link = baseUrl + child['href']
                            else:
                                link = child['href']
                            title = child.contents[0]
                            totalLinks += 1
                            self.log('\tFound article ', totalLinks, ' at ', title, 'at', link)
                            articles.append({'title': title, 'url': link, 'description': '', 'date': ''})
            if articles:
                feeds.append((page[0], articles))
        self.log('Found ', totalLinks, ' articles --returning feeds')
        return feeds

    def populate_article_metadata(self, article, soup, first):
        if not first:
            return
        outputParagraph = ""
        max_length = 210  # approximately three lines of text
        # self.log('printing article: ', article.title)  # remove after debug
        # self.log(soup.prettify())  # remove after debug
        try:
            articleDate = getTimestampFromSoup(soup)  # remove after debug
        except Exception as inst:  # remove after debug
            self.log('Exception: ', article.title)  # remove after debug
            self.log(type(inst))  # remove after debug
            self.log(inst)  # remove after debug
        if articleDate is not None:
            dateText = time.strftime('%Y-%m-%d', articleDate)
            self.log(article.title, ' has timestamp of ', dateText)
            # self.log('Article Date is of type: ', type(article.date))  # remove after debug
            # self.log('Derived time is of type: ', type(articleDate))  # remove after debug
            try:
                article.date = articleDate
                article.utctime = dt_factory(articleDate, assume_utc=True, as_utc=True)
                article.localtime = article.utctime.astimezone(local_tz)
            except Exception as inst:  # remove after debug
                self.log('Exception: ', article.title)  # remove after debug
                self.log(type(inst))  # remove after debug
                self.log(inst)  # remove after debug
        else:
            dateText = time.strftime('%Y-%m-%d', time.gmtime())
            self.log(article.title, ' has no timestamp')
            # article.date = strftime('%a, %d %b')  # remove after debug
        try:
            if len(article.text_summary.strip()) == 0:
                articlebody = soup.find('body')
                if articlebody:
                    paras = articlebody.findAll('p')
                    for p in paras:
                        refparagraph = self.tag_to_string(p, use_alt=False).strip()
                        # account for blank paragraphs and short paragraphs by appending them to longer ones
                        outputParagraph += (" " + refparagraph)
                        if len(outputParagraph) > max_length:
                            article.summary = article.text_summary = outputParagraph.strip()[0:max_length]
                            return
            else:
                article.summary = article.text_summary = article.text_summary
        except:
            self.log("Error creating article descriptions")
            return
#9 | creator of calibre | Posts: 45,231 | Karma: 27110894 | Join Date: Oct 2006 | Location: Mumbai, India | Device: Various
Don't worry about that error; images that cannot be processed don't affect the download, they are simply ignored. If you really want to remove them, use preprocess_html().
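For reference, a minimal sketch of that approach; it strips every img tag, whereas a real recipe would probably filter on the src or file type instead:

```python
from calibre.web.feeds.recipes import BasicNewsRecipe

class HoustonChronicle(BasicNewsRecipe):

    def preprocess_html(self, soup):
        # drop <img> tags before the fetcher tries to decode them
        for img in soup.findAll('img'):
            img.extract()
        return soup
```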
#10 | Member | Posts: 13 | Karma: 10 | Join Date: Jun 2013 | Device: LG G-Pad 8.3
I've got the tags cleaned up tolerably well. The only thing I haven't been able to do is delete specific articles, after parsing, based on date (the isWithinDays check in populate_article_metadata: I'd like to delete articles more than 2 days old). I've attached the correct date to the article in populate_article_metadata; is it possible to delete the article there? If so, what's the correct syntax? I couldn't seem to make anything work to do that.
Code:
#!/usr/bin/env python
# -*- coding: utf-8 -*-
__license__ = 'GPL v3'
__copyright__ = '2013, Dale Furrow dkfurrow@gmail.com'
'''
chron.com
'''
import re, string, time
import urllib2
from calibre.web.feeds.recipes import BasicNewsRecipe
from calibre.ebooks.BeautifulSoup import BeautifulSoup, Tag
from calibre.utils.date import dt_factory, utcnow, local_tz
from datetime import datetime, timedelta


def getRegularTimestamp(dateString):
    try:
        outDate = time.strptime(dateString, "%Y-%m-%dT%H:%M:%SZ")
        return outDate
    except:
        return None

regextest = '(?:Jan(?:uary)?|Feb(?:ruary)?|Mar(?:ch)?|Apr(?:il)?|May|Jun(?:e)?|\
Jul(?:y)?|Aug(?:ust)?|Sep(?:tember)?|Oct(?:ober)?|Nov(?:ember)?|Dec(?:ember)?) \
[0-9]{1,2}, 20[01][0-9]'


def GetDateFromString(inText):
    match = re.findall(regextest, inText)
    if match:
        try:
            outDate = time.strptime(match[0], "%B %d, %Y")
            return outDate
        except:
            return None
    else:
        return None


def isWithinDays(inTT, daysAgo):
    daysAgoDateTime = datetime.now() - timedelta(days=daysAgo)
    DaysAgoDateTime = datetime(inTT[0], inTT[1], inTT[2], inTT[3], inTT[4], inTT[5])
    return DaysAgoDateTime > daysAgoDateTime


def getTimestampFromSoup(soup):
    timestampEle = soup.find('h5', attrs={'class': re.compile('timestamp')})
    if timestampEle is not None:
        try:
            timestampText = timestampEle['title']
            return getRegularTimestamp(timestampText)
        except:
            return None
    else:
        timestampEle = soup.find('span', attrs={'class': re.compile('post-date|entry-date')})
        if timestampEle is not None:
            try:
                timestampText = timestampEle.string
                return GetDateFromString(timestampText)
            except:
                return None
        else:
            return None


class HoustonChronicle(BasicNewsRecipe):

    title = u'The Houston Chronicle'
    description = 'News from Houston, Texas'
    __author__ = 'Dale Furrow'
    language = 'en'
    no_stylesheets = True
    # use_embedded_content = False
    remove_attributes = ['style']
    remove_empty_feeds = True
    keep_only_tags = [dict(name='div', attrs={'class': re.compile('hentry')}),
                      dict(name='span', attrs={'class': re.compile('post-date|entry-date')}),
                      dict(name='h5', attrs={'class': re.compile('timestamp')}),
                      dict(name='div', attrs={'id': re.compile('post-')})]
    remove_tags = [dict(name='div', attrs={'class': 'socialBar'}),
                   dict(name='div', attrs={'class': re.compile('post-commentmeta')}),
                   dict(name='div', attrs={'class': re.compile('slideshow_wrapper')})]

    def parse_index(self):
        self.timefmt = ' [%a, %d %b, %Y]'
        baseUrl = 'http://www.chron.com'
        pages = [('business', '/business/')]
        feeds = []
        totalLinks = 0
        for page in pages:
            articles = []
            section_links = set()
            # url = urllib2.urlopen(baseUrl + page[1])
            # content = url.read()
            soup = self.index_to_soup(baseUrl + page[1])
            divs = soup.findAll('div', attrs={'class': re.compile('scp-feature|simplelist|scp-item')})
            for div in divs:
                # self.log('Page: ', page[0], ' div: ', div['class'], ' Number of Children: ', len(div.findChildren()))
                for child in div.findChildren():
                    if isinstance(child, Tag) and child.name == u'a' and len(child['href']) > 10:
                        if len(child.contents[0]) > 10 and child['href'] not in section_links:
                            section_links.add(child['href'])
                            if child['href'].find('http') == -1:
                                link = baseUrl + child['href']
                            else:
                                link = child['href']
                            title = child.contents[0]
                            totalLinks += 1
                            self.log('\tFound article ', totalLinks, ' at ', title, 'at', link)
                            articles.append({'title': title, 'url': link, 'description': '', 'date': ''})
            if articles:
                feeds.append((page[0], articles))
        self.log('Found ', totalLinks, ' articles --returning feeds')
        return feeds

    def populate_article_metadata(self, article, soup, first):
        if not first:
            return
        outputParagraph = ""
        max_length = 210  # approximately three lines of text
        # self.log('printing article: ', article.title)  # remove after debug
        # self.log(soup.prettify())  # remove after debug
        try:
            articleDate = getTimestampFromSoup(soup)  # remove after debug
        except Exception as inst:  # remove after debug
            self.log('Exception: ', article.title)  # remove after debug
            self.log(type(inst))  # remove after debug
            self.log(inst)  # remove after debug
        if articleDate is not None:
            dateText = time.strftime('%Y-%m-%d', articleDate)
            # self.log(article.title, ' has timestamp of ', dateText)
            # self.log('Article Date is of type: ', type(article.date))  # remove after debug
            # self.log('Derived time is of type: ', type(articleDate))  # remove after debug
            try:
                article.date = articleDate
                article.utctime = dt_factory(articleDate, assume_utc=True, as_utc=True)
                article.localtime = article.utctime.astimezone(local_tz)
                if not isWithinDays(articleDate, 2):
                    print 'Article: ', article.title, ' is more than 2 days old'
            except Exception as inst:  # remove after debug
                self.log('Exception: ', article.title)  # remove after debug
                self.log(type(inst))  # remove after debug
                self.log(inst)  # remove after debug
        else:
            dateText = time.strftime('%Y-%m-%d', time.gmtime())
            self.log(article.title, ' has no timestamp')
            # article.date = strftime('%a, %d %b')  # remove after debug
        try:
            if len(article.text_summary.strip()) == 0:
                articlebody = soup.find('body')
                if articlebody:
                    paras = articlebody.findAll('p')
                    for p in paras:
                        refparagraph = self.tag_to_string(p, use_alt=False).strip()
                        # account for blank paragraphs and short paragraphs by appending them to longer ones
                        outputParagraph += (" " + refparagraph)
                        if len(outputParagraph) > max_length:
                            article.summary = article.text_summary = outputParagraph.strip()[0:max_length]
                            return
            else:
                article.summary = article.text_summary = article.text_summary
        except:
            self.log("Error creating article descriptions")
            return
#11 | Member | Posts: 13 | Karma: 10 | Join Date: Jun 2013 | Device: LG G-Pad 8.3
Latest Chronicle recipe attached.
So, I think the probable answer to my last post is "it can't be done": if you want to exclude an article, you have to make sure it doesn't get returned from parse_index in the first place.
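A sketch of how that filtering can sit inside parse_index(), reusing getTimestampFromSoup() and isWithinDays() from the recipe in post #10 above; collect_section_links() is a hypothetical helper standing in for the div/anchor walk already shown there:

```python
    # inside class HoustonChronicle(BasicNewsRecipe):
    def parse_index(self):
        baseUrl = 'http://www.chron.com'
        feeds = []
        for section, path in [('business', '/business/')]:
            articles = []
            section_soup = self.index_to_soup(baseUrl + path)
            for title, link in collect_section_links(section_soup, baseUrl):  # hypothetical helper
                # fetching each article page just to read its timestamp is what
                # makes this slow, but parse_index is the only place an article
                # can still be excluded from the download
                stamp = getTimestampFromSoup(self.index_to_soup(link))
                if stamp is not None and not isWithinDays(stamp, 2):
                    self.log('Skipping ', title, ' -- more than 2 days old')
                    continue
                articles.append({'title': title, 'url': link, 'description': '', 'date': ''})
            if articles:
                feeds.append((section, articles))
        return feeds
```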
The latest Houston Chronicle recipe is attached; I'm comfortable with this as a submission for the next build. It's somewhat slow (over 4 minutes on my machine), because it parses every article page (with lxml) in parse_index in order to populate the metadata and remove old articles. It does seem strange to me that the date argument to the Article constructor doesn't populate the finished date in the ebook; I had to revisit Article.date in populate_article_metadata. I see that the API allows saving content to a temporary file, and there's an example in LeMonde. If I have time I'll see if I can figure out how to apply that here; it might speed things up a bit, but it's unclear to me how embedded pictures would be handled. I would be happy to take any suggestions for improvement. Thanks, Dale
#12 | Zealot | Posts: 103 | Karma: 10 | Join Date: Sep 2013 | Device: Kindle Paperwhite (2012)
BeautifulSoup version
The original question asks which version of BeautifulSoup calibre is using; it appears to be version 3. Is there a way to use version 4? If not, when will calibre start using the new version?
#13 | Zealot | Posts: 103 | Karma: 10 | Join Date: Sep 2013 | Device: Kindle Paperwhite (2012)
BeautifulSoup is no longer the recommended way to develop recipes; see: https://bugs.launchpad.net/calibre/+bug/1247222
Tags: beautifulsoup, calibre, chron.com, parser, recipe
Thread | Thread Starter | Forum | Replies | Last Post |
Beautiful soup findAll doesn't seem to work | Steven630 | Recipes | 13 | 08-19-2012 02:44 AM |
HTML5 parsing | nickredding | Conversion | 8 | 08-09-2012 09:50 AM |
Parsing Index | Steven630 | Recipes | 0 | 07-06-2012 04:53 AM |
iPad PageList parsing using Javascript. | Oh.Danny.Boy | Apple Devices | 0 | 05-17-2012 05:24 PM |
Parsing Titles | cgraving | Calibre | 3 | 01-17-2011 02:52 AM |