04-04-2015, 03:08 AM   #4
truth1ness
Thanks. Removing the classes got the for loop printing. I'm not really clear what they were for.

I'm having trouble in another section. No matter what I try, "a" keeps its value of None.

Here is the output of one of the tags:

Quote:
Found section: Review
PRINT TAG <a href="/graphiti/534431/pipe-dreams/">
<article>
<img src="http://www.technologyreview.com/sites/default/files/styles/magazine_toc_medium_image/public/images/graphitix392_0.jpg?itok=RNTIQmHG" alt="" />
<h2>Graphiti</h2>
<h1>Pipe Dreams</h1>
</article>
</a>
PRINT CURRENT_SECTION Review
PRINT A None
My understanding of a = tag.find('a', href=True) is that if there is a URL in the <a href> tag, then "a" should end up holding something other than None. In the printed tag above you can see there is a URL: <a href="/graphiti/534431/pipe-dreams/">. I don't get why "a" stays None.
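
One guess worth testing: the loop variable "tag" may already be the <a> element itself, and find() only looks at what is nested inside an element, not at the element it is called on. The sketch below is not the recipe code and does not use BeautifulSoup; it uses the standard library's xml.etree.ElementTree as a stand-in (an assumption on my part, chosen because its find() also searches only below the element), with a shortened copy of the printed tag.

```python
import xml.etree.ElementTree as ET

# Shortened copy of the tag printed above (the <img> is left out).
fragment = (
    '<a href="/graphiti/534431/pipe-dreams/">'
    '<article><h2>Graphiti</h2><h1>Pipe Dreams</h1></article>'
    '</a>'
)

# "tag" here is the <a> element itself, as in the loop over 'a' and 'h2'.
tag = ET.fromstring(fragment)

# Looking for another <a> nested below this <a> finds nothing.
nested = tag.find('.//a')

# The href lives on the element we already hold.
href = tag.get('href')
title = tag.findtext('.//h1')

print(nested)
print(href)
print(title)
```

If the same thing is happening in the recipe, reading the href from "tag" directly when tag.name is 'a' (instead of calling tag.find('a', href=True)) might be what is needed, though I have not confirmed that against the real page.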

Here's my latest attempt (not much progress):

Code:
#!/usr/bin/env  python2
from __future__ import unicode_literals
__license__   = 'GPL v3'
__copyright__ = '2008, Kovid Goyal <kovid at kovidgoyal.net>'
'''
technologyreview.com
'''
import re
from calibre.web.feeds.news import BasicNewsRecipe

class TheAtlantic(BasicNewsRecipe):

    title      = 'MIT Technology Review'
    __author__ = ''
    description = ''
    INDEX = 'http://www.technologyreview.com/magazine/'
    language = 'en'
    encoding = 'utf-8'

    """keep_only_tags = [
        {'attrs':{'class':['article-header', 'article-body', 'article-magazine']}},
    ]
    remove_tags        = [
        {'name': ['meta', 'link', 'noscript']},
        {'attrs':{'class':['offset-wrapper', 'ad-boxfeatures-wrapper']}},
        {'attrs':{'class':lambda x: x and 'article-tools' in x}},
        {'src':lambda x:x and 'spotxchange.com' in x},
    ]
    no_stylesheets = True
    preprocess_regexps = [
        (re.compile(r'<script\b.+?</script>', re.DOTALL), lambda m: ''),
        (re.compile(r'^.*<html', re.DOTALL|re.IGNORECASE), lambda m: '<html'),
    ]

    def print_version(self, url):
        return url + '?single_page=true'"""

    def parse_index(self):
        soup = self.index_to_soup(self.INDEX)
        col = soup.find(attrs={'class':'view-content'})
        current_section, current_articles = None, []
        feeds = []
        print col
        print "END OF COL"
        for tag in col.findAll(name=['a', 'h2'], attrs={}):
            print "PRINT TAG  ", tag
            print "PRINT TAG.NAME", tag.name
            print "PRINT CURRENT_SECTION  ", current_section
            if tag.name == 'h2':
                if current_section and current_articles:
                    feeds.append((current_section, current_articles))
                current_section = self.tag_to_string(tag).capitalize()
                current_articles = []
                self.log('Found section:', current_section)
            elif current_section:
                a = tag.find('a', href=True)
                print "PRINT A  ", a
                if a is not None:
                    title, url = self.tag_to_string(a), a['href']
                    if title and url:
                        p = tag.find('p', attrs={'class':'river-dek'})
                        desc = self.tag_to_string(p) if p is not None else ''
                        current_articles.append({'title':title, 'url':url, 'description':desc})
                        self.log('\tArticle:', title, '[%s]' % url)
                        self.log('\t\t', desc)
        if current_section and current_articles:
            feeds.append((current_section, current_articles))
        return feeds

Last edited by truth1ness; 04-04-2015 at 03:13 AM.