Old 04-02-2015, 01:35 PM   #1
truth1ness
Zealot
Posts: 126
Karma: 50000
Join Date: Mar 2015
Device: none
MIT Technology Review print/bimonthly

Could we make an alternate version of the MIT Technology Review recipe that grabs only the print edition articles? http://www.technologyreview.com/magazine/2015/03/

Last edited by truth1ness; 04-03-2015 at 12:40 AM.
Old 04-04-2015, 01:21 AM   #2
truth1ness
Stuck

I'm attempting to write my first recipe for this, but I'm stuck.

I've used The Atlantic recipe as a template; my code so far is below.

I think the problem is the "for tag in col.findAll" line: I can't get the for loop to run even once, no matter what I put in. I printed "col" and tried every combination of "class" names I saw in the output (by the end I had thrown in everything), but the print "START OF TAG" line never runs and I get the error "ValueError: No articles found, aborting".

Any help would be appreciated.

Code:
#!/usr/bin/env  python2
from __future__ import unicode_literals
__license__   = 'GPL v3'
__copyright__ = '2008, Kovid Goyal <kovid at kovidgoyal.net>'
'''
technologyreview.com
'''
import re
from calibre.web.feeds.news import BasicNewsRecipe

class TheAtlantic(BasicNewsRecipe):

    title      = 'MIT Technology Review'
    __author__ = ''
    description = ''
    INDEX = 'http://www.technologyreview.com/magazine/'
    language = 'en'
    encoding = 'utf-8'

    """keep_only_tags = [
        {'attrs':{'class':['article-header', 'article-body', 'article-magazine']}},
    ]
    remove_tags        = [
        {'name': ['meta', 'link', 'noscript']},
        {'attrs':{'class':['offset-wrapper', 'ad-boxfeatures-wrapper']}},
        {'attrs':{'class':lambda x: x and 'article-tools' in x}},
        {'src':lambda x:x and 'spotxchange.com' in x},
    ]
    no_stylesheets = True
    preprocess_regexps = [
        (re.compile(r'<script\b.+?</script>', re.DOTALL), lambda m: ''),
        (re.compile(r'^.*<html', re.DOTALL|re.IGNORECASE), lambda m: '<html'),
    ]

    def print_version(self, url):
        return url + '?single_page=true'"""

    def parse_index(self):
        soup = self.index_to_soup(self.INDEX)
        col = soup.find(attrs={'class':'view-content'})
        current_section, current_articles = None, []
        feeds = []
        print col
        print "END OF COL"
        for tag in col.findAll(name=['h2', 'li'], attrs={'class':['col', 'content-block', 'wrapper', ' ', 'image', 'content-block in-this-issue no-border']}):
            print "START OF TAG"
            print tag
            if tag.name == 'h2':
                if current_section and current_articles:
                    feeds.append((current_section, current_articles))
                current_section = self.tag_to_string(tag).capitalize()
                current_articles = []
                self.log('Found section:', current_section)
            elif current_section:
                a = tag.find('a', href=True)
                if a is not None:
                    title, url = self.tag_to_string(a), a['href']
                    if title and url:
                        p = tag.find('p', attrs={'class':'river-dek'})
                        desc = self.tag_to_string(p) if p is not None else ''
                        current_articles.append({'title':title, 'url':url, 'description':desc})
                        self.log('\tArticle:', title, '[%s]' % url)
                        self.log('\t\t', desc)
        if current_section and current_articles:
            feeds.append((current_section, current_articles))
        return feeds
Old 04-04-2015, 01:27 AM   #3
kovidgoyal
creator of calibre
Posts: 45,338
Karma: 27182818
Join Date: Oct 2006
Location: Mumbai, India
Device: Various
Remove the class constraint in the findAll. If you want to use class then you have to remember that it must match the entire value of the class attribute, not a single class.
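
To illustrate the difference between matching the whole class value and matching a single class, here is a minimal plain-Python sketch. It makes no BeautifulSoup calls; `has_class` is a hypothetical helper, and the class string is one of those tried in the first post. A callable of this kind could presumably be used as the class matcher, in the same way the template's remove_tags uses `lambda x: x and 'article-tools' in x`.

```python
def has_class(wanted):
    """Build a matcher for a raw class attribute value (hypothetical helper)."""
    def matcher(value):
        # value is the full attribute string, or None when the tag has no class
        return bool(value) and wanted in value.split()
    return matcher


# One of the class strings from the first post's findAll call
class_attr = 'content-block in-this-issue no-border'

# Comparing against the entire attribute value fails for a single class name
whole_value_match = (class_attr == 'content-block')

# Splitting on whitespace and testing membership succeeds
token_match = has_class('content-block')(class_attr)

# A tag without a class attribute is simply rejected
missing_class_match = has_class('content-block')(None)
```

Splitting on whitespace, rather than using a substring test, avoids accidental hits such as 'block' matching inside 'content-block'.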
Old 04-04-2015, 03:08 AM   #4
truth1ness
Thanks. Removing the classes got the for loop running. I'm not really clear what they were for.

Now I'm having trouble in another section: no matter what I try, "a" keeps its value of None.

Here is the output for one of the tags:

Quote:
Found section: Review
PRINT TAG <a href="/graphiti/534431/pipe-dreams/">
<article>
<img src="http://www.technologyreview.com/sites/default/files/styles/magazine_toc_medium_image/public/images/graphitix392_0.jpg?itok=RNTIQmHG" alt="" />
<h2>Graphiti</h2>
<h1>Pipe Dreams</h1>
</article>
</a>
PRINT CURRENT_SECTION Review
PRINT A None
My understanding of a = tag.find('a', href=True) is that if the <a> tag has an href, then "a" should be set to something other than None. In the printed tag above you can see there is a URL: <a href="/graphiti/534431/pipe-dreams/">. I don't get why "a" stays None.

Here's my latest attempt (not much progress):

Code:
#!/usr/bin/env  python2
from __future__ import unicode_literals
__license__   = 'GPL v3'
__copyright__ = '2008, Kovid Goyal <kovid at kovidgoyal.net>'
'''
technologyreview.com
'''
import re
from calibre.web.feeds.news import BasicNewsRecipe

class TheAtlantic(BasicNewsRecipe):

    title      = 'MIT Technology Review'
    __author__ = ''
    description = ''
    INDEX = 'http://www.technologyreview.com/magazine/'
    language = 'en'
    encoding = 'utf-8'

    """keep_only_tags = [
        {'attrs':{'class':['article-header', 'article-body', 'article-magazine']}},
    ]
    remove_tags        = [
        {'name': ['meta', 'link', 'noscript']},
        {'attrs':{'class':['offset-wrapper', 'ad-boxfeatures-wrapper']}},
        {'attrs':{'class':lambda x: x and 'article-tools' in x}},
        {'src':lambda x:x and 'spotxchange.com' in x},
    ]
    no_stylesheets = True
    preprocess_regexps = [
        (re.compile(r'<script\b.+?</script>', re.DOTALL), lambda m: ''),
        (re.compile(r'^.*<html', re.DOTALL|re.IGNORECASE), lambda m: '<html'),
    ]

    def print_version(self, url):
        return url + '?single_page=true'"""

    def parse_index(self):
        soup = self.index_to_soup(self.INDEX)
        col = soup.find(attrs={'class':'view-content'})
        current_section, current_articles = None, []
        feeds = []
        print col
        print "END OF COL"
        for tag in col.findAll(name=['a', 'h2'], attrs={}):
            print "PRINT TAG  ", tag
            print "PRINT TAG.NAME", tag.name
            print "PRINT CURRENT_SECTION  ", current_section
            if tag.name == 'h2':
                if current_section and current_articles:
                    feeds.append((current_section, current_articles))
                current_section = self.tag_to_string(tag).capitalize()
                current_articles = []
                self.log('Found section:', current_section)
            elif current_section:
                a = tag.find('a', href=True)
                print "PRINT A  ", a
                if a is not None:
                    title, url = self.tag_to_string(a), a['href']
                    if title and url:
                        p = tag.find('p', attrs={'class':'river-dek'})
                        desc = self.tag_to_string(p) if p is not None else ''
                        current_articles.append({'title':title, 'url':url, 'description':desc})
                        self.log('\tArticle:', title, '[%s]' % url)
                        self.log('\t\t', desc)
        if current_section and current_articles:
            feeds.append((current_section, current_articles))
        return feeds

Last edited by truth1ness; 04-04-2015 at 03:13 AM.
Old 04-04-2015, 03:17 AM   #5
truth1ness
The other issue is that they use <h2> headers both for the magazine section (e.g. "In This Issue: Analysis") and for the names of regular columns (e.g. "Letter from the Editor"). How would I differentiate these?

For example
Quote:
<h2>In This Issue: Analysis</h2>
<div class="view view-magazine-toc-section-stories-march-2013 view-id-magazine_toc_section_stories_march_2013 view-display-id-small view-dom-id-add3949f8e0d39e5897a5c6899e8caf5">
<div class="view-content">
<ul> <li class="">
<a href="/fromtheeditor/535051/on-10-breakthrough-technologies/">
<article>
<div class="image">
<img src="http://www.technologyreview.com/sites/default/files/styles/magazine_toc_small_image/public/images/editorx392_11_0.jpg?itok=h9a-G-7-" alt="" />
</div>
<h2>Letter from the Editor</h2>
<h1>On 10 Breakthrough Technologies</h1>
</article>
</a> </li>
<li class="">
<a href="/businessreport/the-future-of-money-2015/">
<article>
<div class="image">
<img src="http://www.technologyreview.com/sites/default/files/styles/magazine_toc_small_image/public/images/big.question.centeredx392_0.jpg?itok=pZbKHs9-" alt="" />
</div>
<h2>Business Report</h2>
<h1>The Future of Money: 2015</h1>

Last edited by truth1ness; 04-04-2015 at 04:47 AM.
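
One possible way to tell the two kinds of <h2> apart, judging only from the markup quoted above: a section heading sits outside any <a>, while a column name sits inside the article's <a>. The sketch below is an assumption-laden illustration, not the method used in the final recipe. It uses Python's standard html.parser instead of calibre's soup so that it runs on its own; `TocParser` and `html_sample` are hypothetical names, and the sample is a trimmed copy of the quoted HTML.

```python
from html.parser import HTMLParser


class TocParser(HTMLParser):
    """Collect (section, articles) pairs from a table-of-contents page."""

    def __init__(self):
        super().__init__()
        self.in_link = False      # True while inside an <a> element
        self.href = None
        self.heading = None       # 'h1' or 'h2' while inside that heading
        self.column = ''
        self.title = ''
        self.feeds = []           # list of (section title, list of articles)

    def handle_starttag(self, tag, attrs):
        if tag == 'a':
            self.in_link = True
            self.href = dict(attrs).get('href')
            self.column, self.title = '', ''
        elif tag in ('h1', 'h2'):
            self.heading = tag

    def handle_endtag(self, tag):
        if tag in ('h1', 'h2'):
            self.heading = None
        elif tag == 'a':
            # Store the article only if a section heading has been seen
            if self.feeds and self.title and self.href:
                self.feeds[-1][1].append({
                    'title': self.title,
                    'url': self.href,
                    'description': self.column,
                })
            self.in_link = False

    def handle_data(self, data):
        text = data.strip()
        if not text or self.heading is None:
            return
        if self.heading == 'h2' and not self.in_link:
            self.feeds.append((text, []))   # <h2> outside a link: section
        elif self.heading == 'h2':
            self.column = text              # <h2> inside a link: column name
        elif self.in_link:
            self.title = text               # <h1> inside a link: article title


html_sample = '''<h2>In This Issue: Analysis</h2>
<div class="view-content">
<ul> <li class="">
<a href="/fromtheeditor/535051/on-10-breakthrough-technologies/">
<article>
<h2>Letter from the Editor</h2>
<h1>On 10 Breakthrough Technologies</h1>
</article>
</a> </li>
<li class="">
<a href="/businessreport/the-future-of-money-2015/">
<article>
<h2>Business Report</h2>
<h1>The Future of Money: 2015</h1>
</article>
</a> </li>
</ul>
</div>'''

parser = TocParser()
parser.feed(html_sample)
parser.close()
feeds = parser.feeds
```

Inside a recipe, the same distinction could probably be made by checking whether an <h2> has an <a> ancestor before treating it as a section heading.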
Old 04-06-2015, 04:31 AM   #6
kovidgoyal
tag is already an <a> tag, so tag.find('a') will always return None since there is no nested <a> tag inside the <a> tag.
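
A small runnable sketch of this point, using the standard library's ElementTree as a stand-in for calibre's soup (an assumption made so the example runs on its own; the recipe itself uses BeautifulSoup). The snippet is a trimmed copy of the tag printed in the earlier post. In both libraries, a search started from an element only looks below it, so the link attributes have to be read from `tag` itself.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the tag printed in the earlier post
snippet = '''<a href="/graphiti/534431/pipe-dreams/">
<article>
<h2>Graphiti</h2>
<h1>Pipe Dreams</h1>
</article>
</a>'''

tag = ET.fromstring(snippet)        # tag is the <a> element itself

# Searching the descendants of <a> for another <a> finds nothing
nested = tag.find('.//a')

# The URL and title are available directly on, or below, tag
url = tag.get('href')
title = tag.findtext('.//h1')
column = tag.findtext('.//h2')
```

In the recipe's loop, the equivalent change would be to drop the inner find and use tag['href'] when tag.name is 'a'.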
Old 04-15-2015, 12:26 AM   #7
truth1ness
Made the recipe! This is my first one, so please look it over, let me know if anything could be improved, and see whether it's good enough to add to the repository.

This doesn't replace the other "Technology Review" recipe, which is the RSS/website version; this one fetches the bimonthly magazine. I tested it on the last three months and it seemed to work well.
Attached Files
File Type: zip MitTechnologyReview.recipe.zip (1.5 KB, 172 views)
Old 04-15-2015, 12:43 AM   #8
kovidgoyal
https://github.com/kovidgoyal/calibr...c769eff66caaa5