06-12-2013, 01:15 PM   #1
dkfurrow
Parsing Chron.com with Beautiful Soup

Hello,
First-time poster, and relatively new to Python, so please bear with me.
I have a simple Python script that scrapes Chron.com for useful links, which I want to use to create an EPUB. The script is posted below.

I'm using BeautifulSoup 4 and parsing the pages as XML. Leaving out the "xml" option gives poor results. BeautifulStoneSoup is deprecated in version 4, but it gives the same results as the "xml" option in BeautifulSoup, as I'd expect.

It appears to me that calibre is using version 3 of BeautifulSoup, and it's not giving me the same results as my script. How can I address this? If the answer is "use lxml", what is the proper import statement for the recipe?

Thanks,
Dale
Code:
from bs4 import BeautifulSoup  
from bs4 import Tag
import re
import urllib2

print 'go to Chron sites, scrape them for useful links'
baseUrl = 'http://www.chron.com'

pages = {'news' : '/news/houston-texas/', 
         'business' : '/business/', 
         'opinion': '/opinion/', 
         'sports': '/sports/'}
page_links = dict()        
        
for page in pages.keys():
    url = urllib2.urlopen(baseUrl + pages[page])
    content = url.read()
    soup = BeautifulSoup(content, "xml")
    divs = soup.findAll('div', attrs={'class': re.compile('simplelist|scp-feature')})
    links_dict = {}
    for div in divs:
        print 'Page: ', page, ' div: ', div['class'], ' Number of Children: ', len(div.findChildren()) 
        for element in div.descendants:
            if isinstance(element, Tag) and element.name == u'a' and len(element['href']) > 10:
                if len(element.contents[0]) > 10:
                    links_dict[baseUrl + element['href']] = element.contents[0]
    page_links[page] = links_dict                
            

print 'Here is the result of the web scrape'
for page in page_links.keys():
    links_dict = page_links[page]
    for link in links_dict:
        print page, " | ", links_dict[link], " | ", link

06-12-2013, 09:28 PM   #2
kovidgoyal (creator of calibre)

The import statement is the same as for using lxml in any python script. The recipe system does not care what you use to parse html, all it cares is that parse_index() returns the correct data structure, as documented.
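For illustration, a minimal sketch of a recipe that parses its index with lxml and returns the documented parse_index() structure (the URL and the div class are taken from the script above; everything else is just an example):
Code:
import urllib2
from lxml import html  # the import is the same as in any Python script
from calibre.web.feeds.recipes import BasicNewsRecipe

class ExampleRecipe(BasicNewsRecipe):
    title = 'Example'

    def parse_index(self):
        content = urllib2.urlopen('http://www.chron.com/news/houston-texas/').read()
        root = html.fromstring(content)
        articles = []
        # the XPath mirrors the 'simplelist' div class used in the script above
        for a in root.xpath('//div[contains(@class, "simplelist")]//a[@href]'):
            articles.append({'title': a.text_content().strip(),
                             'url': a.get('href'),
                             'description': '', 'date': ''})
        # parse_index() must return a list of (section title, list of article dicts) tuples
        return [('News', articles)]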

06-13-2013, 05:25 PM   #3
dkfurrow

Thanks for the prompt reply. Okay, I think I've got it: the problem was the descendants attribute; when I switched to findChildren, it all worked. I've posted the recipe below, will test it for a while, then submit it for inclusion in calibre...

There's no description on the sites from which I'm obtaining the feeds, but there is a description at the feed destination... Is there any established way to handle this, other than grabbing the text with another call within parse_index()?
Code:
#!/usr/bin/env python
# vim:fileencoding=UTF-8:ts=4:sw=4:sta:et:sts=4:ai
__license__   = 'GPL v3'
__copyright__ = '2013, Dale Furrow dkfurrow@gmail.com'
'''
chron.com
'''
import re
import urllib2
from calibre.web.feeds.recipes import BasicNewsRecipe
from calibre.ebooks.BeautifulSoup import BeautifulSoup, Tag


class HoustonChronicle(BasicNewsRecipe):

    title      =  u'The Houston Chronicle'
    description    = 'News from Houston, Texas'
    __author__ = 'Dale Furrow'
    language = 'en'
    no_stylesheets = True
    #use_embedded_content = False
    remove_attributes = ['style']
    auto_cleanup = True
    
    

    def parse_index(self):
        
        self.timefmt = ' [%a, %d %b, %Y]'
        baseUrl = 'http://www.chron.com'
        pages = [('news' , '/news/houston-texas/'), 
        ('business' , '/business/'), 
        ('opinion', '/opinion/'), 
        ('sports', '/sports/')]
        feeds = []
        totalLinks = 0
        for page in pages:
            articles = []
            section_links = set()
            url = urllib2.urlopen(baseUrl + page[1])
            content = url.read()
            soup = BeautifulSoup(content)
            divs = soup.findAll('div', attrs={'class': re.compile('scp-feature|simplelist')})
            for div in divs:
                self.log( 'Page: ', page[0], ' div: ', div['class'], ' Number of Children: ', len(div.findChildren()) )
                for child in div.findChildren():
                    if isinstance(child, Tag) and child.name == u'a' and len(child['href']) > 10:
                        if len(child.contents[0]) > 10 and child['href'] not in section_links:
                            section_links.add(child['href'])
                            if child['href'].find('http') == -1:
                                link = baseUrl + child['href']
                            else:
                                link = child['href']
                            title = child.contents[0]
                            totalLinks += 1
                            self.log('\tFound article ', totalLinks, " at " ,title, 'at', link)
                            articles.append({'title':title, 'url':link, 'description':'', 'date':''})
            if articles:
                feeds.append((page[0], articles))
        self.log('Found ', totalLinks, ' articles --returning feeds')
        return feeds

06-13-2013, 10:01 PM   #4
kovidgoyal (creator of calibre)

populate_article_metadata()
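That hook receives each downloaded article's soup, so the description can be filled in there. A minimal sketch of the signature (grabbing the first paragraph is only an example; the recipe below does this more carefully):
Code:
def populate_article_metadata(self, article, soup, first):
    # soup is the downloaded article page; first is True for the article's first page
    if not first:
        return
    p = soup.find('p')
    if p is not None:
        article.summary = article.text_summary = self.tag_to_string(p)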

06-19-2013, 05:50 PM   #5
dkfurrow

Yes, that works; I was able to populate the summary from the article body. I'd like to populate the date as well, but it appears that the soup passed to populate_article_metadata has already been stripped down to the basic article body, which removes the tags I'm interested in. I tried to use the keep_only_tags feature to add the appropriate tags to the article body, but that didn't work. I see another poster has the same issue with that feature, so I'll just watch that thread. Recipe posted below:
Code:
#!/usr/bin/env python
# -*- coding: utf-8 -*-
__license__   = 'GPL v3'
__copyright__ = '2013, Dale Furrow dkfurrow@gmail.com'
'''
chron.com
'''
import re, string
import urllib2
from calibre.web.feeds.recipes import BasicNewsRecipe
from calibre.ebooks.BeautifulSoup import BeautifulSoup, Tag

class HoustonChronicle(BasicNewsRecipe):

    title      =  u'The Houston Chronicle'
    description    = 'News from Houston, Texas'
    __author__ = 'Dale Furrow'
    language = 'en'
    no_stylesheets = True
    #use_embedded_content = False
    remove_attributes = ['style']
    auto_cleanup = True
    

    def parse_index(self):
        
        self.timefmt = ' [%a, %d %b, %Y]'
        baseUrl = 'http://www.chron.com'
        pages = [('news' , '/news/houston-texas/'), 
        ('business' , '/business/'), 
        ('opinion', '/opinion/'), 
        ('sports', '/sports/')]
        feeds = []
        totalLinks = 0
        for page in pages:
            articles = []
            section_links = set()
            url = urllib2.urlopen(baseUrl + page[1])
            content = url.read()
            soup = BeautifulSoup(content)
            divs = soup.findAll('div', attrs={'class': re.compile('scp-feature|simplelist|scp-item')})
            for div in divs:
                self.log( 'Page: ', page[0], ' div: ', div['class'], ' Number of Children: ', len(div.findChildren()) )
                for child in div.findChildren():
                    if isinstance(child, Tag) and child.name == u'a' and len(child['href']) > 10:
                        if len(child.contents[0]) > 10 and child['href'] not in section_links:
                            section_links.add(child['href'])
                            if child['href'].find('http') == -1:
                                link = baseUrl + child['href']
                            else:
                                link = child['href']
                            title = child.contents[0]
                            totalLinks += 1
                            self.log('\tFound article ', totalLinks, " at " ,title, 'at', link)
                            articles.append({'title':title, 'url':link, 'description':'', 'date':''})
            if articles:
                feeds.append((page[0], articles))
        self.log('Found ', totalLinks, ' articles --returning feeds')
        return feeds
        
    def populate_article_metadata(self, article, soup, first):
        if not first:
            return
        outputParagraph = ""
        max_length = 210 # approximately three lines of text
        try:
            if len(article.text_summary.strip()) == 0:
                articlebody = soup.find('body')
                if articlebody:
                    paras = articlebody.findAll('p')
                    for p in paras:
                            refparagraph = self.tag_to_string(p,use_alt=False).strip()
                            #account for blank paragraphs and short paragraphs by appending them to longer ones
                            outputParagraph += (" " + refparagraph)
                            if len(outputParagraph) > max_length: 
                                article.summary = article.text_summary = outputParagraph.strip()[0 : max_length]
                                return
            else:
                article.summary = article.text_summary = article.text_summary
        except:
            self.log("Error creating article descriptions")
            return

06-19-2013, 11:51 PM   #6
kovidgoyal (creator of calibre)

If you want to use keep_only_tags you have to first disable auto_cleanup.
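In other words, something like this (the class name is taken from the recipe later in this thread):
Code:
import re
from calibre.web.feeds.recipes import BasicNewsRecipe

class HoustonChronicle(BasicNewsRecipe):
    # auto_cleanup must be disabled (left at its default of False) for keep_only_tags to take effect
    auto_cleanup = False
    keep_only_tags = [dict(name='div', attrs={'class': re.compile('hentry')})]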

06-20-2013, 12:00 AM   #7
kovidgoyal (creator of calibre)

Also note that when converting a URL to soup, use:

soup = self.index_to_soup(baseUrl + page[1])

06-25-2013, 07:09 PM   #8
dkfurrow

Okay, thanks very much for your help. Making progress here; code is posted below. keep_only_tags is working, and the correct date is attached to each article.
I plan to remove old articles in populate_article_metadata, but haven't done that yet. I know there are some tags that still need to be removed to make things cleaner, but I have a question about a repeated error message:

Traceback (most recent call last):
File "site-packages\calibre\web\fetch\simple.py", line 431, in process_images
File "site-packages\calibre\utils\magick\__init__.py", line 132, in load
Exception: no decode delegate for this image format `' @ error/blob.c/BlobToImage/360

It doesn't halt execution, but it's difficult for me to tell where the error's coming from. Is there an attribute I can set to ignore images which can't be processed?



Code:
#!/usr/bin/env python
# -*- coding: utf-8 -*-
__license__   = 'GPL v3'
__copyright__ = '2013, Dale Furrow dkfurrow@gmail.com'
'''
chron.com
'''
import re, string, time
import urllib2
from calibre.web.feeds.recipes import BasicNewsRecipe
from calibre.ebooks.BeautifulSoup import BeautifulSoup, Tag
from calibre.utils.date import dt_factory, utcnow, local_tz


def getRegularTimestamp(dateString):
    try:
        outDate = time.strptime(dateString, "%Y-%m-%dT%H:%M:%SZ")
        return outDate
    except:
        return None    

regextest = '(?:Jan(?:uary)?|Feb(?:ruary)?|Mar(?:ch)?|Apr(?:il)?|May|Jun(?:e)?|\
Jul(?:y)?|Aug(?:ust)?|Sep(?:tember)?|Oct(?:ober)?|Nov(?:ember)?|Dec(?:ember)?) \
[0-9]{1,2}, 20[01][0-9]'

def GetDateFromString(inText):
    match = re.findall(regextest, inText)
    if match:
        try: 
            outDate = time.strptime(match[0], "%B %d, %Y")
            return outDate
        except:
            return None
    else:
        return None
    

def getTimestampFromSoup (soup):
    timestampEle = soup.find('h5', attrs={'class': re.compile('timestamp')})
    if timestampEle is not None:
        try:
            timestampText = timestampEle['title']
            return getRegularTimestamp(timestampText)
        except:
            return None    
    else:
        timestampEle = soup.find('span', attrs={'class': re.compile('post-date|entry-date')})
        if timestampEle is not None:
            try:
                timestampText = timestampEle.string
                return GetDateFromString(timestampText)
            except:
                return None
        else:
            return None


class HoustonChronicle(BasicNewsRecipe):

    title      =  u'The Houston Chronicle'
    description    = 'News from Houston, Texas'
    __author__ = 'Dale Furrow'
    language = 'en'
    no_stylesheets = True
    #use_embedded_content = False
    remove_attributes = ['style']
    #auto_cleanup = True
    # dict(name='div', attrs={'class':re.compile('toolsList')})
    #keep_only_tags = [dict(id=['content', 'heading'])]
    #auto_cleanup_keep = '//*[@class="timestamp"]|//span[@class="entry-date"]|//span[@class="post-date"]'
    
    keep_only_tags = [dict(name='div', attrs={'class':re.compile('hentry')}), dict(name='span', attrs={'class':re.compile('post-date|entry-date')}), dict(name='h5', attrs={'class':re.compile('timestamp')}), dict(name='div', attrs={'id':re.compile('post-')}) ]
    
        
    

    def parse_index(self):
        
        self.timefmt = ' [%a, %d %b, %Y]'
        baseUrl = 'http://www.chron.com'
        pages = [('business' , '/business/')]
        feeds = []
        totalLinks = 0
        for page in pages:
            articles = []
            section_links = set()
            #url = urllib2.urlopen(baseUrl + page[1])
            #content = url.read()
            soup = self.index_to_soup(baseUrl + page[1]) 
            divs = soup.findAll('div', attrs={'class': re.compile('scp-feature|simplelist|scp-item')})
            for div in divs:
                #self.log( 'Page: ', page[0], ' div: ', div['class'], ' Number of Children: ', len(div.findChildren()) )
                for child in div.findChildren():
                    if isinstance(child, Tag) and child.name == u'a' and len(child['href']) > 10:
                        if len(child.contents[0]) > 10 and child['href'] not in section_links:
                            section_links.add(child['href'])
                            if child['href'].find('http') == -1:
                                link = baseUrl + child['href']
                            else:
                                link = child['href']
                            title = child.contents[0]
                            totalLinks += 1
                            self.log('\tFound article ', totalLinks, " at " ,title, 'at', link)
                            articles.append({'title':title, 'url':link, 'description':'', 'date':''})
            if articles:
                feeds.append((page[0], articles))
        self.log('Found ', totalLinks, ' articles --returning feeds')
        return feeds
    
    
    
        
    def populate_article_metadata(self, article, soup, first):
        if not first:
            return
        outputParagraph = ""
        max_length = 210 # approximately three lines of text
        #self.log('printing article: ', article.title) # remove after debug
        #self.log(soup.prettify()) # remove after debug
        articleDate = None  # ensure the name is defined even if the lookup below fails
        try:
            articleDate = getTimestampFromSoup(soup)
        except Exception as inst: # remove after debug
            self.log('Exception: ', article.title) # remove after debug
            self.log(type(inst)) # remove after debug
            self.log(inst) # remove after debug
        if articleDate is not None:
            dateText = time.strftime('%Y-%m-%d', articleDate)
            self.log(article.title, ' has timestamp of ', dateText)
            #self.log('Article Date is of type: ', type(article.date)) # remove after debug
            #self.log('Derived time is of type: ', type(articleDate)) # remove after debug
            try:
                article.date = articleDate
                article.utctime = dt_factory(articleDate, assume_utc=True, as_utc=True)
                article.localtime = article.utctime.astimezone(local_tz)
            except Exception as inst: # remove after debug
                self.log('Exception: ', article.title) # remove after debug
                self.log(type(inst)) # remove after debug
                self.log(inst) # remove after debug
        else:
            dateText = time.strftime('%Y-%m-%d', time.gmtime())
            self.log(article.title, ' has no timestamp')
            #article.date = strftime('%a, %d %b') # remove after debug
        try:
            if len(article.text_summary.strip()) == 0:
                articlebody = soup.find('body')
                if articlebody:
                    paras = articlebody.findAll('p')
                    for p in paras:
                            refparagraph = self.tag_to_string(p,use_alt=False).strip()
                            #account for blank paragraphs and short paragraphs by appending them to longer ones
                            outputParagraph += (" " + refparagraph)
                            if len(outputParagraph) > max_length: 
                                article.summary = article.text_summary = outputParagraph.strip()[0 : max_length]
                                return
            else:
                article.summary = article.text_summary = article.text_summary
        except:
            self.log("Error creating article descriptions")
            return

06-26-2013, 12:11 AM   #9
kovidgoyal (creator of calibre)

Don't worry about that error; images that cannot be processed don't affect the download process, they are simply ignored. If you really want to remove them, use preprocess_html()
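If you do want to drop them, a minimal sketch of that approach (it removes every <img>, which may be more than you need):
Code:
def preprocess_html(self, soup):
    # runs on each article's soup before further processing; strip all images
    for img in soup.findAll('img'):
        img.extract()
    return soup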

06-27-2013, 11:13 PM   #10
dkfurrow

Got the tags cleaned up tolerably well. The only thing I haven't been able to do is delete specific articles, after parsing, based on date (line 148 of the recipe below; I'd like to delete articles more than 2 days old). I've attached the correct date to the article in populate_article_metadata. Is it possible to delete the article there? If so, what's the correct syntax? I couldn't make anything work to do that.


Code:
#!/usr/bin/env python
# -*- coding: utf-8 -*-
__license__   = 'GPL v3'
__copyright__ = '2013, Dale Furrow dkfurrow@gmail.com'
'''
chron.com
'''
import re, string, time
import urllib2
from calibre.web.feeds.recipes import BasicNewsRecipe
from calibre.ebooks.BeautifulSoup import BeautifulSoup, Tag
from calibre.utils.date import dt_factory, utcnow, local_tz
from datetime import datetime, timedelta


def getRegularTimestamp(dateString):
    try:
        outDate = time.strptime(dateString, "%Y-%m-%dT%H:%M:%SZ")
        return outDate
    except:
        return None    

regextest = '(?:Jan(?:uary)?|Feb(?:ruary)?|Mar(?:ch)?|Apr(?:il)?|May|Jun(?:e)?|\
Jul(?:y)?|Aug(?:ust)?|Sep(?:tember)?|Oct(?:ober)?|Nov(?:ember)?|Dec(?:ember)?) \
[0-9]{1,2}, 20[01][0-9]'

def GetDateFromString(inText):
    match = re.findall(regextest, inText)
    if match:
        try: 
            outDate = time.strptime(match[0], "%B %d, %Y")
            return outDate
        except:
            return None
    else:
        return None

def isWithinDays(inTT, daysAgo):
    # compare the article's time.struct_time against a cutoff daysAgo days in the past
    cutoff = datetime.now() - timedelta(days=daysAgo)
    articleDateTime = datetime(inTT[0], inTT[1], inTT[2], inTT[3],
                               inTT[4], inTT[5])
    return articleDateTime > cutoff
    

def getTimestampFromSoup (soup):
    timestampEle = soup.find('h5', attrs={'class': re.compile('timestamp')})
    if timestampEle is not None:
        try:
            timestampText = timestampEle['title']
            return getRegularTimestamp(timestampText)
        except:
            return None    
    else:
        timestampEle = soup.find('span', attrs={'class': re.compile('post-date|entry-date')})
        if timestampEle is not None:
            try:
                timestampText = timestampEle.string
                return GetDateFromString(timestampText)
            except:
                return None
        else:
            return None


class HoustonChronicle(BasicNewsRecipe):

    title      =  u'The Houston Chronicle'
    description    = 'News from Houston, Texas'
    __author__ = 'Dale Furrow'
    language = 'en'
    no_stylesheets = True
    #use_embedded_content = False
    remove_attributes = ['style']
    remove_empty_feeds = True
    
    
    keep_only_tags = [dict(name='div', attrs={'class':re.compile('hentry')}), 
                      dict(name='span', attrs={'class':re.compile('post-date|entry-date')}), 
                      dict(name='h5', attrs={'class':re.compile('timestamp')}), 
                      dict(name='div', attrs={'id':re.compile('post-')}) ]
    
    
    remove_tags = [dict(name='div', attrs={'class':'socialBar'}), 
                   dict(name='div', attrs={'class':re.compile('post-commentmeta')}),
                   dict(name='div', attrs={'class':re.compile('slideshow_wrapper')})]
        
    

    def parse_index(self):
        
        self.timefmt = ' [%a, %d %b, %Y]'
        baseUrl = 'http://www.chron.com'
        pages = [('business' , '/business/')]
        feeds = []
        totalLinks = 0
        for page in pages:
            articles = []
            section_links = set()
            #url = urllib2.urlopen(baseUrl + page[1])
            #content = url.read()
            soup = self.index_to_soup(baseUrl + page[1]) 
            divs = soup.findAll('div', attrs={'class': re.compile('scp-feature|simplelist|scp-item')})
            for div in divs:
                #self.log( 'Page: ', page[0], ' div: ', div['class'], ' Number of Children: ', len(div.findChildren()) )
                for child in div.findChildren():
                    if isinstance(child, Tag) and child.name == u'a' and len(child['href']) > 10:
                        if len(child.contents[0]) > 10 and child['href'] not in section_links:
                            section_links.add(child['href'])
                            if child['href'].find('http') == -1:
                                link = baseUrl + child['href']
                            else:
                                link = child['href']
                            title = child.contents[0]
                            totalLinks += 1
                            self.log('\tFound article ', totalLinks, " at " ,title, 'at', link)
                            articles.append({'title':title, 'url':link, 'description':'', 'date':''})
            if articles:
                feeds.append((page[0], articles))
        self.log('Found ', totalLinks, ' articles --returning feeds')
        return feeds
    
    
    
        
    def populate_article_metadata(self, article, soup, first):
        if not first:
            return
        outputParagraph = ""
        max_length = 210 # approximately three lines of text
        #self.log('printing article: ', article.title) # remove after debug
        #self.log(soup.prettify()) # remove after debug
        articleDate = None  # ensure the name is defined even if the lookup below fails
        try:
            articleDate = getTimestampFromSoup(soup)
        except Exception as inst: # remove after debug
            self.log('Exception: ', article.title) # remove after debug
            self.log(type(inst)) # remove after debug
            self.log(inst) # remove after debug
        if articleDate is not None:
            dateText = time.strftime('%Y-%m-%d', articleDate)
            #self.log(article.title, ' has timestamp of ', dateText)
            #self.log('Article Date is of type: ', type(article.date)) # remove after debug
            #self.log('Derived time is of type: ', type(articleDate)) # remove after debug
            try:
                article.date = articleDate
                article.utctime = dt_factory(articleDate, assume_utc=True, as_utc=True)
                article.localtime = article.utctime.astimezone(local_tz)
                if not isWithinDays(articleDate, 2):
                    print 'Article: ', article.title, ' is more than 2 days old'
            except Exception as inst: # remove after debug
                self.log('Exception: ', article.title) # remove after debug
                self.log(type(inst)) # remove after debug
                self.log(inst) # remove after debug
        else:
            dateText = time.strftime('%Y-%m-%d', time.gmtime())
            self.log(article.title, ' has no timestamp')
            #article.date = strftime('%a, %d %b') # remove after debug
        try:
            if len(article.text_summary.strip()) == 0:
                articlebody = soup.find('body')
                if articlebody:
                    paras = articlebody.findAll('p')
                    for p in paras:
                            refparagraph = self.tag_to_string(p,use_alt=False).strip()
                            #account for blank paragraphs and short paragraphs by appending them to longer ones
                            outputParagraph += (" " + refparagraph)
                            if len(outputParagraph) > max_length: 
                                article.summary = article.text_summary = outputParagraph.strip()[0 : max_length]
                                return
            else:
                article.summary = article.text_summary = article.text_summary
        except:
            self.log("Error creating article descriptions")
            return

07-09-2013, 11:42 AM   #11
dkfurrow

Latest Chronicle Recipe attached.

So, I think the probable answer to my last post is "can't be done", i.e. if you want to exclude an article, you have to make sure it doesn't get returned from parse_index.

The latest Houston Chronicle recipe is attached; I'm comfortable with this as a submission for the next build. It's somewhat slow (>4 mins on my machine), because it parses every article page (with lxml) in parse_index in order to populate metadata and remove old articles. It does seem strange to me that the date argument of the Article constructor doesn't appear to populate the finished date in the ebook; I had to revisit Article.date in populate_article_metadata.
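Roughly, the filtering inside parse_index looks like this (a sketch only; it reuses getTimestampFromSoup() and isWithinDays() from the recipe above, and fetching every article page is what makes it slow):
Code:
# inside parse_index(), once `link` and `title` are known for a candidate article
articleSoup = self.index_to_soup(link)
articleDate = getTimestampFromSoup(articleSoup)
if articleDate is not None and not isWithinDays(articleDate, 2):
    continue  # more than 2 days old: never return it from parse_index
articles.append({'title': title, 'url': link, 'description': '', 'date': ''})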

I see that the API allows saving content to a temporary file, and there's an example in the LeMonde recipe. If I have time I'll see if I can figure out how to apply that here; it might speed things up a bit, but it's unclear to me how embedded pictures would be handled.

Would be happy to take any suggestions for improvement.
Thanks,
Dale
Attached: houston_chronicle.zip (2.8 KB)

10-29-2013, 02:41 PM   #12
sup
BeautifulSoup version

The original question asks which version of BeautifulSoup calibre is using. It appears to be version 3. Is there a way to use version 4? If not, when will calibre start using the new version?

11-02-2013, 09:10 AM   #13
sup

Beautiful Soup is no longer the recommended way to develop recipes; see: https://bugs.launchpad.net/calibre/+bug/1247222

Tags: beautifulsoup, calibre, chron.com, parser, recipe