#1

Member
Posts: 13
Karma: 10
Join Date: Jun 2013
Device: LG G-Pad 8.3

Parsing Chron.com with Beautiful Soup
Hello,
First-time poster, and relatively new to Python, so please bear with me. I have a simple Python script to scrape Chron.com for useful links, which I wish to use to create an epub. That script is posted below. I'm using BeautifulSoup v4 and parsing the pages as XML; leaving out the "xml" option gives poor results. BeautifulStoneSoup is deprecated in version 4, but it gives the same results as the "xml" option in BeautifulSoup, as I'd expect. It appears to me that calibre is using version 3 of BeautifulSoup, and it's not giving me the same results as the script. How can I address this? If the answer is "use lxml", what is the proper import statement for the recipe?

Thanks, Dale

Code:
from bs4 import BeautifulSoup
from bs4 import Tag
import re
import urllib2

print 'go to Chron sites, scrape them for useful links'
baseUrl = 'http://www.chron.com'
pages = {'news': '/news/houston-texas/',
         'business': '/business/',
         'opinion': '/opinion/',
         'sports': '/sports/'}
page_links = dict()
for page in pages.keys():
    url = urllib2.urlopen(baseUrl + pages[page])
    content = url.read()
    soup = BeautifulSoup(content, "xml")
    divs = soup.findAll('div', attrs={'class': re.compile('simplelist|scp-feature')})
    links_dict = {}
    for div in divs:
        print 'Page: ', page, ' div: ', div['class'], ' Number of Children: ', len(div.findChildren())
        for element in div.descendants:
            if isinstance(element, Tag) and element.name == u'a' and len(element['href']) > 10:
                if len(element.contents[0]) > 10:
                    links_dict[baseUrl + element['href']] = element.contents[0]
    page_links[page] = links_dict

print 'Here is the result of the web scrape'
for page in page_links.keys():
    links_dict = page_links[page]
    for link in links_dict:
        print page, " | ", links_dict[link], " | ", link
#2

creator of calibre
Posts: 45,628
Karma: 28549046
Join Date: Oct 2006
Location: Mumbai, India
Device: Various

The import statement is the same as for using lxml in any Python script. The recipe system does not care what you use to parse HTML; all it cares about is that parse_index() returns the correct data structure, as documented.
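For example, a parse_index() written around lxml could look roughly like the sketch below; the XPath and section URL are only placeholders, not the actual Chron.com markup.

Code:
# Rough sketch of a parse_index() that uses lxml.html for parsing.
# The XPath and URL below are placeholders, not the real Chron.com structure.
import lxml.html
from calibre.web.feeds.recipes import BasicNewsRecipe

class ExampleRecipe(BasicNewsRecipe):
    title = 'Example'

    def parse_index(self):
        # fetch the section page with the recipe's own browser
        raw = self.browser.open('http://www.chron.com/news/houston-texas/').read()
        root = lxml.html.fromstring(raw)
        articles = []
        for a in root.xpath('//div[contains(@class, "simplelist")]//a[@href]'):
            title = a.text_content().strip()
            url = a.get('href')
            if title and url:
                articles.append({'title': title, 'url': url,
                                 'description': '', 'date': ''})
        # parse_index() must return a list of (section title, list of article dicts)
        return [('News', articles)]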
#3

Member
Posts: 13
Karma: 10
Join Date: Jun 2013
Device: LG G-Pad 8.3

Thanks for the prompt reply. Okay, I think I got it: the problem was the descendants attribute; when I switched to findChildren(), it all worked (see the note after the recipe). I posted the recipe below; I'll test it for a while, then submit it for inclusion in calibre.

There's no description on the sites from which I'm obtaining the feeds, but there is a description on the feed destination. Is there any established way to handle this, other than by grabbing the text from a call within parse_index?

Code:
#!/usr/bin/env python
# vim:fileencoding=UTF-8:ts=4:sw=4:sta:et:sts=4:ai
__license__ = 'GPL v3'
__copyright__ = '2013, Dale Furrow dkfurrow@gmail.com'
'''
chron.com
'''
import re
import urllib2
from calibre.web.feeds.recipes import BasicNewsRecipe
from calibre.ebooks.BeautifulSoup import BeautifulSoup, Tag


class HoustonChronicle(BasicNewsRecipe):
    title = u'The Houston Chronicle'
    description = 'News from Houston, Texas'
    __author__ = 'Dale Furrow'
    language = 'en'
    no_stylesheets = True
    # use_embedded_content = False
    remove_attributes = ['style']
    auto_cleanup = True

    def parse_index(self):
        self.timefmt = ' [%a, %d %b, %Y]'
        baseUrl = 'http://www.chron.com'
        pages = [('news', '/news/houston-texas/'),
                 ('business', '/business/'),
                 ('opinion', '/opinion/'),
                 ('sports', '/sports/')]
        feeds = []
        totalLinks = 0
        for page in pages:
            articles = []
            section_links = set()
            url = urllib2.urlopen(baseUrl + page[1])
            content = url.read()
            soup = BeautifulSoup(content)
            divs = soup.findAll('div', attrs={'class': re.compile('scp-feature|simplelist')})
            for div in divs:
                self.log('Page: ', page[0], ' div: ', div['class'],
                         ' Number of Children: ', len(div.findChildren()))
                for child in div.findChildren():
                    if isinstance(child, Tag) and child.name == u'a' and len(child['href']) > 10:
                        if len(child.contents[0]) > 10 and child['href'] not in section_links:
                            section_links.add(child['href'])
                            if child['href'].find('http') == -1:
                                link = baseUrl + child['href']
                            else:
                                link = child['href']
                            title = child.contents[0]
                            totalLinks += 1
                            self.log('\tFound article ', totalLinks, ' at ', title, 'at', link)
                            articles.append({'title': title, 'url': link,
                                             'description': '', 'date': ''})
            if articles:
                feeds.append((page[0], articles))
        self.log('Found ', totalLinks, ' articles --returning feeds')
        return feeds
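A note on why switching to findChildren() helped, as best I can tell: calibre bundles BeautifulSoup 3, which doesn't have the .descendants property that version 4 added, so walking a tag's descendants there has to go through findChildren() (or recursiveChildGenerator()) instead. A made-up snippet to illustrate:

Code:
# Illustration only (made-up markup): iterating descendants with calibre's
# bundled BeautifulSoup 3, which has findChildren() but no .descendants.
from calibre.ebooks.BeautifulSoup import BeautifulSoup, Tag

soup = BeautifulSoup('<div class="simplelist"><p><a href="/business/item">Example headline</a></p></div>')
div = soup.find('div')
for child in div.findChildren():
    if isinstance(child, Tag) and child.name == 'a':
        print child['href'], child.contents[0]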
#4

creator of calibre
Posts: 45,628
Karma: 28549046
Join Date: Oct 2006
Location: Mumbai, India
Device: Various

populate_article_metadata()
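Its signature is populate_article_metadata(self, article, soup, first). A minimal sketch that fills the description from the article body might look like this; the 200-character cutoff is an arbitrary choice for illustration, not a calibre default.

Code:
# Minimal sketch of populate_article_metadata(): build article.summary from the
# first non-empty paragraph of the downloaded article.
def populate_article_metadata(self, article, soup, first):
    if not first:
        return  # only look at the first page of a multi-page article
    for p in soup.findAll('p'):
        text = self.tag_to_string(p, use_alt=False).strip()
        if text:
            article.summary = article.text_summary = text[:200]
            return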
#5

Member
Posts: 13
Karma: 10
Join Date: Jun 2013
Device: LG G-Pad 8.3

Yes, that works; I was able to populate the summary from the article body. I'd like to populate the date as well, but it appears that the soup passed to populate_article_metadata has already been stripped down to the basic article body, which removes the tags I'm interested in. I tried to use the keep_only_tags feature to add the appropriate tags to the article body, but that didn't work. I see another poster has the same issue with that feature, so I'll just watch that thread. Recipe posted below:
Code:
#!/usr/bin/env python
# -*- coding: utf-8 -*-
__license__ = 'GPL v3'
__copyright__ = '2013, Dale Furrow dkfurrow@gmail.com'
'''
chron.com
'''
import re, string
import urllib2
from calibre.web.feeds.recipes import BasicNewsRecipe
from calibre.ebooks.BeautifulSoup import BeautifulSoup, Tag


class HoustonChronicle(BasicNewsRecipe):
    title = u'The Houston Chronicle'
    description = 'News from Houston, Texas'
    __author__ = 'Dale Furrow'
    language = 'en'
    no_stylesheets = True
    # use_embedded_content = False
    remove_attributes = ['style']
    auto_cleanup = True

    def parse_index(self):
        self.timefmt = ' [%a, %d %b, %Y]'
        baseUrl = 'http://www.chron.com'
        pages = [('news', '/news/houston-texas/'),
                 ('business', '/business/'),
                 ('opinion', '/opinion/'),
                 ('sports', '/sports/')]
        feeds = []
        totalLinks = 0
        for page in pages:
            articles = []
            section_links = set()
            url = urllib2.urlopen(baseUrl + page[1])
            content = url.read()
            soup = BeautifulSoup(content)
            divs = soup.findAll('div', attrs={'class': re.compile('scp-feature|simplelist|scp-item')})
            for div in divs:
                self.log('Page: ', page[0], ' div: ', div['class'],
                         ' Number of Children: ', len(div.findChildren()))
                for child in div.findChildren():
                    if isinstance(child, Tag) and child.name == u'a' and len(child['href']) > 10:
                        if len(child.contents[0]) > 10 and child['href'] not in section_links:
                            section_links.add(child['href'])
                            if child['href'].find('http') == -1:
                                link = baseUrl + child['href']
                            else:
                                link = child['href']
                            title = child.contents[0]
                            totalLinks += 1
                            self.log('\tFound article ', totalLinks, ' at ', title, 'at', link)
                            articles.append({'title': title, 'url': link,
                                             'description': '', 'date': ''})
            if articles:
                feeds.append((page[0], articles))
        self.log('Found ', totalLinks, ' articles --returning feeds')
        return feeds

    def populate_article_metadata(self, article, soup, first):
        if not first:
            return
        outputParagraph = ""
        max_length = 210  # approximately three lines of text
        try:
            if len(article.text_summary.strip()) == 0:
                articlebody = soup.find('body')
                if articlebody:
                    paras = articlebody.findAll('p')
                    for p in paras:
                        refparagraph = self.tag_to_string(p, use_alt=False).strip()
                        # account for blank and short paragraphs by appending them to longer ones
                        outputParagraph += (" " + refparagraph)
                        if len(outputParagraph) > max_length:
                            article.summary = article.text_summary = outputParagraph.strip()[0:max_length]
                            return
            else:
                article.summary = article.text_summary = article.text_summary
        except:
            self.log("Error creating article descriptions")
            return
#6

creator of calibre
Posts: 45,628
Karma: 28549046
Join Date: Oct 2006
Location: Mumbai, India
Device: Various

If you want to use keep_only_tags you have to first disable auto_cleanup.
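In recipe terms that is simply the following; the div class here is a placeholder, not the actual Chron.com markup.

Code:
from calibre.web.feeds.recipes import BasicNewsRecipe

class ExampleRecipe(BasicNewsRecipe):
    title = 'Example'
    # auto_cleanup ignores keep_only_tags, so it must stay disabled (its default)
    # for manual tag selection to take effect.
    auto_cleanup = False
    # placeholder selector, not the actual Chron.com markup
    keep_only_tags = [dict(name='div', attrs={'class': 'article-body'})]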
#7

creator of calibre
Posts: 45,628
Karma: 28549046
Join Date: Oct 2006
Location: Mumbai, India
Device: Various

Also note that when converting a URL to soup, use:
soup = self.index_to_soup(baseUrl + page[1])
#8

Member
Posts: 13
Karma: 10
Join Date: Jun 2013
Device: LG G-Pad 8.3

Okay, thanks very much for your help. Making progress here; code posted below. keep_only_tags is working, and the correct date is attached to each article.

I plan on removing old articles in populate_article_metadata, but haven't done that yet. I know there are some tags that still need to be removed to make things cleaner, but I have a question about a repeated error message:

Traceback (most recent call last):
  File "site-packages\calibre\web\fetch\simple.py", line 431, in process_images
  File "site-packages\calibre\utils\magick\__init__.py", line 132, in load
Exception: no decode delegate for this image format `' @ error/blob.c/BlobToImage/360

It doesn't halt execution, but it's difficult for me to tell where the error is coming from. Is there an attribute I can set to ignore images which can't be processed?

Code:
#!/usr/bin/env python
# -*- coding: utf-8 -*-
__license__ = 'GPL v3'
__copyright__ = '2013, Dale Furrow dkfurrow@gmail.com'
'''
chron.com
'''
import re, string, time
import urllib2
from calibre.web.feeds.recipes import BasicNewsRecipe
from calibre.ebooks.BeautifulSoup import BeautifulSoup, Tag
from calibre.utils.date import dt_factory, utcnow, local_tz


def getRegularTimestamp(dateString):
    try:
        outDate = time.strptime(dateString, "%Y-%m-%dT%H:%M:%SZ")
        return outDate
    except:
        return None

regextest = '(?:Jan(?:uary)?|Feb(?:ruary)?|Mar(?:ch)?|Apr(?:il)?|May|Jun(?:e)?|\
Jul(?:y)?|Aug(?:ust)?|Sep(?:tember)?|Oct(?:ober)?|Nov(?:ember)?|Dec(?:ember)?) \
[0-9]{1,2}, 20[01][0-9]'

def GetDateFromString(inText):
    match = re.findall(regextest, inText)
    if match:
        try:
            outDate = time.strptime(match[0], "%B %d, %Y")
            return outDate
        except:
            return None
    else:
        return None

def getTimestampFromSoup(soup):
    timestampEle = soup.find('h5', attrs={'class': re.compile('timestamp')})
    if timestampEle is not None:
        try:
            timestampText = timestampEle['title']
            return getRegularTimestamp(timestampText)
        except:
            return None
    else:
        timestampEle = soup.find('span', attrs={'class': re.compile('post-date|entry-date')})
        if timestampEle is not None:
            try:
                timestampText = timestampEle.string
                return GetDateFromString(timestampText)
            except:
                return None
        else:
            return None


class HoustonChronicle(BasicNewsRecipe):
    title = u'The Houston Chronicle'
    description = 'News from Houston, Texas'
    __author__ = 'Dale Furrow'
    language = 'en'
    no_stylesheets = True
    # use_embedded_content = False
    remove_attributes = ['style']
    # auto_cleanup = True
    # dict(name='div', attrs={'class': re.compile('toolsList')})
    # keep_only_tags = [dict(id=['content', 'heading'])]
    # auto_cleanup_keep = '//*[@class="timestamp"]|//span[@class="entry-date"]|//span[@class="post-date"]'
    keep_only_tags = [dict(name='div', attrs={'class': re.compile('hentry')}),
                      dict(name='span', attrs={'class': re.compile('post-date|entry-date')}),
                      dict(name='h5', attrs={'class': re.compile('timestamp')}),
                      dict(name='div', attrs={'id': re.compile('post-')})]

    def parse_index(self):
        self.timefmt = ' [%a, %d %b, %Y]'
        baseUrl = 'http://www.chron.com'
        pages = [('business', '/business/')]
        feeds = []
        totalLinks = 0
        for page in pages:
            articles = []
            section_links = set()
            # url = urllib2.urlopen(baseUrl + page[1])
            # content = url.read()
            soup = self.index_to_soup(baseUrl + page[1])
            divs = soup.findAll('div', attrs={'class': re.compile('scp-feature|simplelist|scp-item')})
            for div in divs:
                # self.log('Page: ', page[0], ' div: ', div['class'],
                #          ' Number of Children: ', len(div.findChildren()))
                for child in div.findChildren():
                    if isinstance(child, Tag) and child.name == u'a' and len(child['href']) > 10:
                        if len(child.contents[0]) > 10 and child['href'] not in section_links:
                            section_links.add(child['href'])
                            if child['href'].find('http') == -1:
                                link = baseUrl + child['href']
                            else:
                                link = child['href']
                            title = child.contents[0]
                            totalLinks += 1
                            self.log('\tFound article ', totalLinks, ' at ', title, 'at', link)
                            articles.append({'title': title, 'url': link,
                                             'description': '', 'date': ''})
            if articles:
                feeds.append((page[0], articles))
        self.log('Found ', totalLinks, ' articles --returning feeds')
        return feeds

    def populate_article_metadata(self, article, soup, first):
        if not first:
            return
        outputParagraph = ""
        max_length = 210  # approximately three lines of text
        # self.log('printing article: ', article.title)  # remove after debug
        # self.log(soup.prettify())  # remove after debug
        articleDate = None  # ensure defined even if the lookup below raises
        try:
            articleDate = getTimestampFromSoup(soup)  # remove after debug
        except Exception as inst:  # remove after debug
            self.log('Exception: ', article.title)  # remove after debug
            self.log(type(inst))  # remove after debug
            self.log(inst)  # remove after debug
        if articleDate is not None:
            dateText = time.strftime('%Y-%m-%d', articleDate)
            self.log(article.title, ' has timestamp of ', dateText)
            # self.log('Article Date is of type: ', type(article.date))  # remove after debug
            # self.log('Derived time is of type: ', type(articleDate))  # remove after debug
            try:
                article.date = articleDate
                article.utctime = dt_factory(articleDate, assume_utc=True, as_utc=True)
                article.localtime = article.utctime.astimezone(local_tz)
            except Exception as inst:  # remove after debug
                self.log('Exception: ', article.title)  # remove after debug
                self.log(type(inst))  # remove after debug
                self.log(inst)  # remove after debug
        else:
            dateText = time.strftime('%Y-%m-%d', time.gmtime())
            self.log(article.title, ' has no timestamp')
            # article.date = strftime('%a, %d %b')  # remove after debug
        try:
            if len(article.text_summary.strip()) == 0:
                articlebody = soup.find('body')
                if articlebody:
                    paras = articlebody.findAll('p')
                    for p in paras:
                        refparagraph = self.tag_to_string(p, use_alt=False).strip()
                        # account for blank and short paragraphs by appending them to longer ones
                        outputParagraph += (" " + refparagraph)
                        if len(outputParagraph) > max_length:
                            article.summary = article.text_summary = outputParagraph.strip()[0:max_length]
                            return
            else:
                article.summary = article.text_summary = article.text_summary
        except:
            self.log("Error creating article descriptions")
            return
#9

creator of calibre
Posts: 45,628
Karma: 28549046
Join Date: Oct 2006
Location: Mumbai, India
Device: Various

Don't worry about that error; images that cannot be processed don't affect the download process, they are simply ignored. If you really want to remove them, use preprocess_html().
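A minimal sketch of that approach; note it removes every image, not just the ones that fail to decode, so check img['src'] if you need to be selective.

Code:
# Sketch: strip all <img> tags before conversion. This drops every image,
# not only the ones the image library cannot decode.
def preprocess_html(self, soup):
    for img in soup.findAll('img'):
        img.extract()
    return soup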
#10

Member
Posts: 13
Karma: 10
Join Date: Jun 2013
Device: LG G-Pad 8.3

Got the tags cleaned up tolerably well. The only thing I haven't been able to do is delete specific articles after parsing, based on date (around line 148 of the recipe; I'd like to delete articles more than 2 days old). I've attached the correct date to the article in populate_article_metadata; is it possible to delete the article there? If so, what's the correct syntax? I couldn't seem to make anything work to do that.
Code:
#!/usr/bin/env python
# -*- coding: utf-8 -*-
__license__ = 'GPL v3'
__copyright__ = '2013, Dale Furrow dkfurrow@gmail.com'
'''
chron.com
'''
import re, string, time
import urllib2
from calibre.web.feeds.recipes import BasicNewsRecipe
from calibre.ebooks.BeautifulSoup import BeautifulSoup, Tag
from calibre.utils.date import dt_factory, utcnow, local_tz
from datetime import datetime, timedelta


def getRegularTimestamp(dateString):
    try:
        outDate = time.strptime(dateString, "%Y-%m-%dT%H:%M:%SZ")
        return outDate
    except:
        return None

regextest = '(?:Jan(?:uary)?|Feb(?:ruary)?|Mar(?:ch)?|Apr(?:il)?|May|Jun(?:e)?|\
Jul(?:y)?|Aug(?:ust)?|Sep(?:tember)?|Oct(?:ober)?|Nov(?:ember)?|Dec(?:ember)?) \
[0-9]{1,2}, 20[01][0-9]'

def GetDateFromString(inText):
    match = re.findall(regextest, inText)
    if match:
        try:
            outDate = time.strptime(match[0], "%B %d, %Y")
            return outDate
        except:
            return None
    else:
        return None

def isWithinDays(inTT, daysAgo):
    # inTT is a time.struct_time; compare it against a cutoff daysAgo days back
    daysAgoDateTime = datetime.now() - timedelta(days=daysAgo)
    articleDateTime = datetime(inTT[0], inTT[1], inTT[2], inTT[3],
                               inTT[4], inTT[5])
    return articleDateTime > daysAgoDateTime

def getTimestampFromSoup(soup):
    timestampEle = soup.find('h5', attrs={'class': re.compile('timestamp')})
    if timestampEle is not None:
        try:
            timestampText = timestampEle['title']
            return getRegularTimestamp(timestampText)
        except:
            return None
    else:
        timestampEle = soup.find('span', attrs={'class': re.compile('post-date|entry-date')})
        if timestampEle is not None:
            try:
                timestampText = timestampEle.string
                return GetDateFromString(timestampText)
            except:
                return None
        else:
            return None


class HoustonChronicle(BasicNewsRecipe):
    title = u'The Houston Chronicle'
    description = 'News from Houston, Texas'
    __author__ = 'Dale Furrow'
    language = 'en'
    no_stylesheets = True
    # use_embedded_content = False
    remove_attributes = ['style']
    remove_empty_feeds = True
    keep_only_tags = [dict(name='div', attrs={'class': re.compile('hentry')}),
                      dict(name='span', attrs={'class': re.compile('post-date|entry-date')}),
                      dict(name='h5', attrs={'class': re.compile('timestamp')}),
                      dict(name='div', attrs={'id': re.compile('post-')})]
    remove_tags = [dict(name='div', attrs={'class': 'socialBar'}),
                   dict(name='div', attrs={'class': re.compile('post-commentmeta')}),
                   dict(name='div', attrs={'class': re.compile('slideshow_wrapper')})]

    def parse_index(self):
        self.timefmt = ' [%a, %d %b, %Y]'
        baseUrl = 'http://www.chron.com'
        pages = [('business', '/business/')]
        feeds = []
        totalLinks = 0
        for page in pages:
            articles = []
            section_links = set()
            # url = urllib2.urlopen(baseUrl + page[1])
            # content = url.read()
            soup = self.index_to_soup(baseUrl + page[1])
            divs = soup.findAll('div', attrs={'class': re.compile('scp-feature|simplelist|scp-item')})
            for div in divs:
                # self.log('Page: ', page[0], ' div: ', div['class'],
                #          ' Number of Children: ', len(div.findChildren()))
                for child in div.findChildren():
                    if isinstance(child, Tag) and child.name == u'a' and len(child['href']) > 10:
                        if len(child.contents[0]) > 10 and child['href'] not in section_links:
                            section_links.add(child['href'])
                            if child['href'].find('http') == -1:
                                link = baseUrl + child['href']
                            else:
                                link = child['href']
                            title = child.contents[0]
                            totalLinks += 1
                            self.log('\tFound article ', totalLinks, ' at ', title, 'at', link)
                            articles.append({'title': title, 'url': link,
                                             'description': '', 'date': ''})
            if articles:
                feeds.append((page[0], articles))
        self.log('Found ', totalLinks, ' articles --returning feeds')
        return feeds

    def populate_article_metadata(self, article, soup, first):
        if not first:
            return
        outputParagraph = ""
        max_length = 210  # approximately three lines of text
        # self.log('printing article: ', article.title)  # remove after debug
        # self.log(soup.prettify())  # remove after debug
        articleDate = None  # ensure defined even if the lookup below raises
        try:
            articleDate = getTimestampFromSoup(soup)  # remove after debug
        except Exception as inst:  # remove after debug
            self.log('Exception: ', article.title)  # remove after debug
            self.log(type(inst))  # remove after debug
            self.log(inst)  # remove after debug
        if articleDate is not None:
            dateText = time.strftime('%Y-%m-%d', articleDate)
            # self.log(article.title, ' has timestamp of ', dateText)
            # self.log('Article Date is of type: ', type(article.date))  # remove after debug
            # self.log('Derived time is of type: ', type(articleDate))  # remove after debug
            try:
                article.date = articleDate
                article.utctime = dt_factory(articleDate, assume_utc=True, as_utc=True)
                article.localtime = article.utctime.astimezone(local_tz)
                if not isWithinDays(articleDate, 2):
                    print 'Article: ', article.title, ' is more than 2 days old'
            except Exception as inst:  # remove after debug
                self.log('Exception: ', article.title)  # remove after debug
                self.log(type(inst))  # remove after debug
                self.log(inst)  # remove after debug
        else:
            dateText = time.strftime('%Y-%m-%d', time.gmtime())
            self.log(article.title, ' has no timestamp')
            # article.date = strftime('%a, %d %b')  # remove after debug
        try:
            if len(article.text_summary.strip()) == 0:
                articlebody = soup.find('body')
                if articlebody:
                    paras = articlebody.findAll('p')
                    for p in paras:
                        refparagraph = self.tag_to_string(p, use_alt=False).strip()
                        # account for blank and short paragraphs by appending them to longer ones
                        outputParagraph += (" " + refparagraph)
                        if len(outputParagraph) > max_length:
                            article.summary = article.text_summary = outputParagraph.strip()[0:max_length]
                            return
            else:
                article.summary = article.text_summary = article.text_summary
        except:
            self.log("Error creating article descriptions")
            return
#11

Member
Posts: 13
Karma: 10
Join Date: Jun 2013
Device: LG G-Pad 8.3

Latest Chronicle Recipe attached.
So, I think the probable answer to my last post is "can't be done", i.e. if you want to exclude an article, you have to make sure it doesn't get returned in parse_index.
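In recipe terms, that exclusion boils down to something like the sketch below. This is only an illustration of the idea: it reuses getTimestampFromSoup() and isWithinDays() from my earlier posts and self.index_to_soup() rather than the lxml parsing in the attached file, and the link selection is simplified.

Code:
# Sketch only: drop stale articles while building the feed list in parse_index(),
# using getTimestampFromSoup() and isWithinDays() from the earlier posts.
# Fetching every article page here is what makes the recipe slow.
def parse_index(self):
    baseUrl = 'http://www.chron.com'
    articles = []
    soup = self.index_to_soup(baseUrl + '/business/')
    for a in soup.findAll('a', href=True):
        title = self.tag_to_string(a).strip()
        link = a['href'] if a['href'].startswith('http') else baseUrl + a['href']
        if len(title) <= 10:
            continue
        articleSoup = self.index_to_soup(link)
        articleDate = getTimestampFromSoup(articleSoup)
        if articleDate is not None and not isWithinDays(articleDate, 2):
            continue  # skip anything more than two days old
        articles.append({'title': title, 'url': link, 'description': '', 'date': ''})
    return [('business', articles)]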
The latest Houston Chronicle recipe is attached; I'm comfortable with this as a submission for the next build. It's somewhat slow (over 4 minutes on my machine), because it parses all the article pages (with lxml) in parse_index in order to populate metadata and remove old articles. It does seem strange to me that the date argument to the Article constructor doesn't appear to populate the finished date in the ebook; I had to revisit Article.date in populate_article_metadata. I see that the API allows saving content to a temporary file, and there's an example in the LeMonde recipe. If I have time I'll see whether I can apply that here; it might speed things up a bit, but it's unclear to me how embedded pictures would be handled. I'd be happy to take any suggestions for improvement.

Thanks, Dale
#12

Zealot
Posts: 106
Karma: 10
Join Date: Sep 2013
Device: Kindle Paperwhite (2012)

BeautifulSoup version
The original question asks which version of BeautifulSoup calibre is using. It appears it is version 3. Is there a way to use version 4? If not, when will calibre start using the new version?
#13

Zealot
Posts: 106
Karma: 10
Join Date: Sep 2013
Device: Kindle Paperwhite (2012)

BeautifulSoup is no longer the recommended way to develop recipes; see https://bugs.launchpad.net/calibre/+bug/1247222
Tags: beautifulsoup, calibre, chron.com, parser, recipe