04-25-2013, 11:33 PM | #1 |
Enthusiast
Posts: 32
Karma: 10
Join Date: Apr 2011
Device: Kindle wifi; Dell 2in1
|
need help in trying to update .net recipe
I'm trying to rewrite the default .net recipe; it seems the way the feed URL is handled has changed, and the recipe no longer works correctly.
I've been playing with it. I'm comfortable with HTML and CSS, and know some PHP and Java, but I never really studied Python, though I'm getting to understand more of it by doing all this. This version gets the title of each article, and sometimes the description, from the newsfeed, but it goes no further: it doesn't pass the actual URL of the article on, so it can't pull the whole article. Spoiler:
(I have commented out the tag-removal area until I can get the download working; then I can modify it to keep what is needed and drop what is not.) To get it to pass the URL of the FeedBurner entry, I'm trying the following: Spoiler:
Can anyone help me adjust how the URL is passed, so that the recipe can convert each feed entry into an actual URL and download the articles? Unfortunately there are no print versions of these articles, so the originals must be used. Thanks. |
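One hedged sketch of the kind of fix being asked for here (not code from the thread): calibre's BasicNewsRecipe lets a recipe override get_article_url(article) to map a parsed feed entry to the URL that should be downloaded, and FeedBurner feeds often carry the original address in a feedburner:origLink element, which feedparser exposes on the entry as feedburner_origlink. The helper and the entry dict below are illustrative stand-ins, not the actual .net feed data.

```python
# Illustrative sketch: prefer FeedBurner's original link over the proxy <link>.
# A plain dict stands in for the feedparser entry object; the URLs are made up.

def resolve_feedburner_url(article):
    # feedburner_origlink is the un-proxied article URL when present;
    # fall back to the ordinary <link> element otherwise.
    return article.get('feedburner_origlink') or article.get('link')

entry = {
    'link': 'http://feedproxy.google.com/~r/net/topstories/~3/abc123/',
    'feedburner_origlink': 'http://www.netmagazine.com/news/example-article',
}
print(resolve_feedburner_url(entry))  # http://www.netmagazine.com/news/example-article
```

In a recipe this logic would live in a get_article_url(self, article) method, returning the resolved URL for each entry.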
04-29-2013, 07:58 PM | #2 |
Enthusiast
Posts: 32
Karma: 10
Join Date: Apr 2011
Device: Kindle wifi; Dell 2in1
|
Found part of the solution; at least now the documents are downloading. Now to clean it up before it creates an ebook version. It needed a complete rewrite of the original recipe, so since it's a rewrite, I'm putting my info into it.
So far the code is as follows: Code:
from calibre.web.feeds.news import BasicNewsRecipe

class dotnetMagazine(BasicNewsRecipe):
    __author__ = u'Bonni Salles'
    __version__ = '1.0'
    __license__ = 'GPL v3'
    __copyright__ = u'2013, Bonni Salles'

    title = '.net '
    oldest_article = 7
    no_stylesheets = True
    encoding = 'utf8'
    use_embedded_content = False
    language = 'en'
    remove_empty_feeds = True
    extra_css = ' body{font-family: Arial,Helvetica,sans-serif } img{margin-bottom: 0.4em} '

#    remove_tags_above = dict(id='header')
#    remove_tags_below = [dict(name='footer')]
#    keep_only_tags = [
#        dict(name='article', attrs={'class': re.compile('^node.*$', re.IGNORECASE)}),
#    ]
#    remove_tags = [
#        dict(name='span', attrs={'class': 'comment-count'}),
#        dict(name='div', attrs={'class': 'item-list share-links'}),
#        dict(name='footer'),
#    ]
#    remove_attributes = ['border', 'cellspacing', 'align', 'cellpadding', 'colspan', 'valign', 'vspace', 'hspace',
#                         'alt', 'width', 'height', 'style']
#    extra_css = 'img {max-width: 100%; display: block; margin: auto;} .captioned-image div {text-align: center;
#                 font-style: italic;}'

    feeds = [
        (u'net', u'http://feeds.feedburner.com/net/topstories')
    ]
05-10-2013, 12:01 AM | #3 |
Enthusiast
Posts: 32
Karma: 10
Join Date: Apr 2011
Device: Kindle wifi; Dell 2in1
|
Got it; at least it's almost a perfect recipe. Right now it still shows the comments when they are there, but most of the top and bottom matter has been eliminated. I'd love to figure out how to use remove_tags_before when there is no id or class to match on: the header is designated simply as <header>, and I couldn't get that to work right, so I ended up using remove_tags for almost everything.
Here is the recipe at this point in time. Code:
from calibre.web.feeds.news import BasicNewsRecipe
from calibre.ebooks.BeautifulSoup import BeautifulSoup, Tag, BeautifulStoneSoup
import re  # needed for the re.compile() patterns in remove_tags

class dotnetMagazine(BasicNewsRecipe):
    __author__ = u'Bonni Salles - post in forum if questions for me'
    __version__ = '1.0'
    __license__ = 'GPL v3'
    __copyright__ = u'2013, Bonni Salles'

    title = '.net '
    oldest_article = 7
    no_stylesheets = True
    encoding = 'utf8'
    use_embedded_content = False
    language = 'en'
    remove_empty_feeds = True
    extra_css = ' body{font-family: Arial,Helvetica,sans-serif } img{margin-bottom: 0.4em} '

    remove_tags_after = [
        dict(name='div', attrs={'class': 'footer-content'}),
    ]
#    remove_tags_before = [
#        dict(name='div', attrs={'id': 'main-content'}),
#    ]
    remove_tags = [
        dict(name='div', attrs={'class': 'item-list'}),
        dict(name=['header', 'footer']),
        dict(attrs={'class': re.compile('(^|| )menu($|| )', re.DOTALL)}),
        dict(name='h4', attrs={'class': 'std-hdr'}),
        dict(name=['script', 'noscript']),
        dict(name='div', attrs={'id': 'comments-form'}),
        dict(name='div', attrs={'id': re.compile('advertorial_block_($|| )')}),
        dict(name='div', attrs={'id': 'right-col'}),
    ]
#    remove_attributes = ['border', 'cellspacing', 'align', 'cellpadding', 'colspan', 'valign', 'vspace', 'hspace',
#                         'alt', 'width', 'height', 'style']
#    extra_css = 'img {max-width: 100%; display: block; margin: auto;} .captioned-image div {text-align: center;
#                 font-style: italic;}'

    feeds = [
        (u'net', u'http://feeds.feedburner.com/net/topstories')
    ]

Here is the area of the article that I'm trying to work with. Code:
</ul>
</nav>
</div>
<div id="main-content">
<div id="content" >
<article class="node node-news sticky" >
<header>
<h1 class="title"><span>GAAD 2013 wants accessibility on web devs' minds</span></h1>
<div class="submitted" >
By <span class="author-name">Craig Grannell</span> on <time datetime="2013-05-09T11:35:55+00:00" >May 09, 2013</time>
<div class="item-list share-links" ><h3>Share this article</h3>
<ul>
<li class="twitter-button first"><a href="http://twitter.com/share?url=http://www.netmagazine.com/news/gaad-2013-wants-accessibility-web-devs-minds-132742&text=GAAD%202013%20wants%20accessibility%20on%20web%20devs%27%20minds | .net" class="twitter-share-button" data-count="none">Tweet</a></li>
<li class="facebook-button"><iframe src="http://www.facebook.com/plugins/like.php?href=http://www.netmagazine.com/news/gaad-2013-wants-accessibility-web-devs-minds-132742&send=false&layout=button_count&width=47&show_faces=false&action=like&colorscheme=light&font=arial&height=21" scrolling="no" frameborder="0" style="border:none; overflow:hidden; width:47px; height:21px;" allowTransparency="true"></iframe></li>
<li class="googleplus-button"><g:plusone size="medium" count="false" ></g:plusone></li>
<li class="linkedin-button"><script type="in/share" ></script></li>
<li class="shorturl-button inactive last"><div class="shorturl-box"><input type="text" value="http://netm.ag/15MLtx1" /><div class="shorturl-close"></div></div><span class="shorturl-link">Short url</span></li>
</ul></div>
</div>
</header>
<div class="content">
05-10-2013, 02:04 AM | #5 |
creator of calibre
Posts: 43,776
Karma: 22666666
Join Date: Oct 2006
Location: Mumbai, India
Device: Various
|
remove_tags_before = dict(name='header', id=lambda x:not x)
will match a <header> tag with no id. |
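For anyone puzzled by the lambda: when an attribute matcher is a callable, BeautifulSoup calls it with the attribute's value (None when the attribute is absent), so lambda x: not x is true exactly when there is no id, or an empty one. A plain-Python sketch of that behaviour, with dicts standing in for tags (the matches() helper is illustrative, not a calibre or BeautifulSoup API):

```python
# Illustrative model of a callable attribute matcher like id=lambda x: not x.
# Plain dicts stand in for parsed tags; matches() mimics the matching rule.

def matches(tag_attrs, attr_name, matcher):
    value = tag_attrs.get(attr_name)  # None when the attribute is absent
    if callable(matcher):
        return bool(matcher(value))   # callable matchers see the raw value
    return value == matcher           # literal matchers compare for equality

no_id = lambda x: not x  # truthy for a missing, None, or empty id

print(matches({}, 'id', no_id))                      # True: <header> with no id
print(matches({'id': 'main-content'}, 'id', no_id))  # False: id is present
```

This is why the tip works for a bare <header> tag that carries no id or class at all.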
05-11-2013, 09:13 PM | #6 |
Enthusiast
Posts: 32
Karma: 10
Join Date: Apr 2011
Device: Kindle wifi; Dell 2in1
|
Success
Kovid, thank you, that's what was needed.
The recipe is now fixed and works. Here is the final version, if you want to include it in the program. Spoiler:
Last edited by Camper65; 05-11-2013 at 11:39 PM. |
05-26-2013, 08:50 AM | #7 |
Enthusiast
Posts: 32
Karma: 10
Join Date: Apr 2011
Device: Kindle wifi; Dell 2in1
|
New issue, and I'm not sure whether it's just a problem this week or not. I had to add recursion = 1 to force it to download the articles. The feed site now has an ad page that apparently comes up first, and you then have to "Click here to continue to article". How can I have the recipe automatically skip that first page, or in other words, get the right link from that ad page?
(here's the modified recipe to try) Spoiler:
|
05-26-2013, 09:04 AM | #8 |
creator of calibre
Posts: 43,776
Karma: 22666666
Join Date: Oct 2006
Location: Mumbai, India
Device: Various
|
You need to use the obfuscated-articles infrastructure in the recipe: set articles_are_obfuscated = True and then implement get_obfuscated_article() in your recipe.
|
05-26-2013, 02:15 PM | #9 |
Enthusiast
Posts: 32
Karma: 10
Join Date: Apr 2011
Device: Kindle wifi; Dell 2in1
|
Kovid,
I added the following to the test recipe and it's not working. At the top, with the other true/false settings, I added:

    articles_are_obfuscated = True

and just before feeds:

    def get_obfuscated_article(self, url):
        raise NotImplementedError

and I get the following in the recipe.txt printout: Spoiler:
What am I missing between these two lines?

    def get_obfuscated_article(self, url):
        raise NotImplementedError |
05-26-2013, 11:32 PM | #10 |
creator of calibre
Posts: 43,776
Karma: 22666666
Join Date: Oct 2006
Location: Mumbai, India
Device: Various
|
Look at a builtin recipe that uses get_obfuscated_article, or read the API documentation for that function.
|
05-27-2013, 11:16 AM | #11 |
Enthusiast
Posts: 32
Karma: 10
Join Date: Apr 2011
Device: Kindle wifi; Dell 2in1
|
Got it
Got the corrected one again. But give me a week or two to make sure the changes feedsportal made are permanent and not just a fluke, since I now only download this once a week (they only produce articles five days a week, and it's better to download on Saturday or Sunday so it gets that whole week's articles).
Hopefully I'll let you know next week, in between installing components in the new tower I'm building. Code:
from calibre.web.feeds.news import BasicNewsRecipe
from calibre.ebooks.BeautifulSoup import BeautifulSoup, Tag, BeautifulStoneSoup
from calibre.web.feeds import feed_from_xml, templates, feeds_from_index, Feed
from calibre.ptempfile import PersistentTemporaryFile
import re  # needed for the re.compile() pattern in remove_tags

class dotnetMagazine(BasicNewsRecipe):
    __author__ = u'Bonni Salles - post in forum if questions for me'
    __version__ = '1.1'
    __license__ = 'GPL v3'
    __copyright__ = u'2013, Bonni Salles'

    title = '.net '
    oldest_article = 7
    no_stylesheets = True
    encoding = 'utf8'
    use_embedded_content = False
    recursion = 1
    articles_are_obfuscated = True
    language = 'en'
    remove_empty_feeds = True
    extra_css = ' body{font-family: Arial,Helvetica,sans-serif } img{margin-bottom: 0.4em} '
    cover_url = u'http://media.netmagazine.futurecdn.net/sites/all/themes/netmag/logo.png'

    remove_tags_after = dict(name='footer', id=lambda x: not x)
    remove_tags_before = dict(name='header', id=lambda x: not x)

    remove_tags = [
        dict(name='div', attrs={'class': 'item-list'}),
        dict(name='h4', attrs={'class': 'std-hdr'}),
        dict(name='div', attrs={'class': 'item-list share-links'}),  # removes share links
        dict(name=['script', 'noscript']),
        dict(name='div', attrs={'id': 'comments-form'}),  # comment this out if you want the comments to show
        dict(name='div', attrs={'id': re.compile('advertorial_block_($|| )')}),
        dict(name='div', attrs={'id': 'right-col'}),
        dict(name='div', attrs={'id': 'comments'}),  # comment this out if you want the comments to show
        dict(name='div', attrs={'class': 'item-list related-content'}),
    ]

    feeds = [
        (u'net', u'http://feeds.feedburner.com/net/topstories?format=xml')
    ]

    temp_files = []

    def get_obfuscated_article(self, url):
        br = self.get_browser()
        print 'THE CURRENT URL IS: ', url
        response = br.open(url)  # the first draft opened the URL twice; once is enough
        html = response.read()
        self.temp_files.append(PersistentTemporaryFile('_fa.html'))
        self.temp_files[-1].write(html)
        self.temp_files[-1].close()
        return self.temp_files[-1].name

Camper65
05-27-2013, 11:38 AM | #12 |
creator of calibre
Posts: 43,776
Karma: 22666666
Join Date: Oct 2006
Location: Mumbai, India
Device: Various
|
Don't you need to actually parse the returned html to see if it contains an ad, and find the correct article URL in that case? Then you won't need recursion = 1 any more.
|
05-27-2013, 12:30 PM | #13 |
Enthusiast
Posts: 32
Karma: 10
Join Date: Apr 2011
Device: Kindle wifi; Dell 2in1
|
Perhaps I didn't explain the initial change properly. When you click on a feed entry (at least the first time I did it), it took me to a page with an ad in the middle; on the right-hand side it said, in multiple languages, "Click here to continue to article", and clicking that takes you directly to the article. All I saw when this week's .net downloaded was a batch of those multiple-language "Click here to continue to article" pages, with no text for the articles.
When I added recursion = 1, it gave me the first page with the right-hand "Click here to continue to article" text, and after that page it gave me the actual article as a new entry; but between each pair of articles was, again, the click-here page. The current recipe now gets the articles only. That's why I'm waiting a week, to see whether the feed changes are permanent, before I'm sure the recipe changes are needed. If you know of another way to get around this double-clicking to reach the article, let me know please. If you want me to send you an epub so you can see what it is doing, let me know. |
05-27-2013, 12:48 PM | #14 |
creator of calibre
Posts: 43,776
Karma: 22666666
Join Date: Oct 2006
Location: Mumbai, India
Device: Various
|
Ah, I see. Just add this to your recipe:
Code:
def skip_ad_pages(self, soup):
    text = soup.find(text='click here to continue to article')
    if text:
        a = text.parent
        url = a.get('href')
        if url:
            return self.index_to_soup(url, raw=True)
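The trick in that snippet is that BeautifulSoup's find(text=...) returns the matching text node, and .parent climbs to the enclosing <a>, whose href is the real article URL, which the recipe then fetches instead of the ad page. The same idea can be sketched with only the standard library; the page markup and URLs below are made up for illustration, and calibre's actual version uses BeautifulSoup, not html.parser.

```python
# Stdlib-only sketch of the skip_ad_pages idea: find the link whose text is
# the "continue to article" prompt and extract its href.
from html.parser import HTMLParser

class ContinueLinkFinder(HTMLParser):
    """Record the href of the <a> whose text contains the given prompt."""
    def __init__(self, prompt):
        super().__init__()
        self.prompt = prompt
        self._current_href = None  # href of the <a> we are currently inside
        self.url = None            # the resolved article URL, once found

    def handle_starttag(self, tag, attrs):
        if tag == 'a':
            self._current_href = dict(attrs).get('href')

    def handle_endtag(self, tag):
        if tag == 'a':
            self._current_href = None

    def handle_data(self, data):
        if self._current_href and self.prompt in data.lower():
            self.url = self._current_href

page = ('<html><body><p>Advertisement</p>'
        '<a href="http://www.netmagazine.com/news/example">'
        'Click here to continue to article</a></body></html>')
finder = ContinueLinkFinder('click here to continue to article')
finder.feed(page)
print(finder.url)  # http://www.netmagazine.com/news/example
```

One caveat worth checking against the live pages: find(text='click here to continue to article') matches the text node exactly, so if the site capitalizes the prompt differently, a case-insensitive match may be needed.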
05-27-2013, 04:09 PM | #15 |
Enthusiast
Posts: 32
Karma: 10
Join Date: Apr 2011
Device: Kindle wifi; Dell 2in1
|
Kovid,
Thank you!!!! That did it. The articles now download normally and no longer have that extra page in there. Here is the updated recipe again, so you can use it the next time you do updates to calibre. Also, thank you for creating such a great ebook organizer and news-download program. Code:
from calibre.web.feeds.news import BasicNewsRecipe
from calibre.ebooks.BeautifulSoup import BeautifulSoup, Tag, BeautifulStoneSoup
from calibre.web.feeds import feed_from_xml, templates, feeds_from_index, Feed
from calibre.ptempfile import PersistentTemporaryFile
import re  # needed for the re.compile() pattern in remove_tags

class dotnetMagazine(BasicNewsRecipe):
    __author__ = u'Bonni Salles - post in forum if questions for me'
    __version__ = '1.1'
    __license__ = 'GPL v3'
    __copyright__ = u'2013, Bonni Salles'

    title = '.net magazine'
    oldest_article = 7
    no_stylesheets = True
    encoding = 'utf8'
    use_embedded_content = False
#    recursion = 1
    language = 'en'
    remove_empty_feeds = True
    extra_css = ' body{font-family: Arial,Helvetica,sans-serif } img{margin-bottom: 0.4em} '
    cover_url = u'http://media.netmagazine.futurecdn.net/sites/all/themes/netmag/logo.png'

    remove_tags_after = dict(name='footer', id=lambda x: not x)
    remove_tags_before = dict(name='header', id=lambda x: not x)

    remove_tags = [
        dict(name='div', attrs={'class': 'item-list'}),
        dict(name='h4', attrs={'class': 'std-hdr'}),
        dict(name='div', attrs={'class': 'item-list share-links'}),  # removes share links
        dict(name=['script', 'noscript']),
        dict(name='div', attrs={'id': 'comments-form'}),  # comment this out if you want the comments to show
        dict(name='div', attrs={'id': re.compile('advertorial_block_($|| )')}),
        dict(name='div', attrs={'id': 'right-col'}),
        dict(name='div', attrs={'id': 'comments'}),  # comment this out if you want the comments to show
        dict(name='div', attrs={'class': 'item-list related-content'}),
    ]

    feeds = [
        (u'net', u'http://feeds.feedburner.com/net/topstories?format=xml')
    ]

    def skip_ad_pages(self, soup):
        text = soup.find(text='click here to continue to article')
        if text:
            a = text.parent
            url = a.get('href')
            if url:
                return self.index_to_soup(url, raw=True)