04-25-2013, 11:33 PM | #1 |
Enthusiast
Posts: 32
Karma: 10
Join Date: Apr 2011
Device: Kindle wifi; Dell 2in1
|
need help in trying to update .net recipe
I'm trying to rewrite the default .net recipe; it seems the way the feed URL is handled has changed, and the recipe no longer works correctly.
I've been playing with it. I'm comfortable with HTML and CSS, and know some PHP and Java, but I never really studied Python, though I'm getting to understand more of it by doing all this. This version gets the title of each article, and sometimes the description, from the newsfeed, but it goes no further: it doesn't pass the actual URL of the article on, so it can't pull the whole article. Spoiler:
(I have commented out the tag-removal area until I can get the download working; then I can modify it to keep what is needed and drop what is not.) To get it to pass the URL of the FeedBurner entry, I'm trying the following: Spoiler:
Can anyone help me adjust how the URL is passed, so that the recipe can convert each feed entry into an actual URL and download the articles? Unfortunately there are no print versions of these articles, so the originals must be used. Thanks. |
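One hedged sketch of the kind of fix being asked for here (not code from the thread): calibre's BasicNewsRecipe lets a recipe override get_article_url(article) to map a parsed feed entry to the URL that should be downloaded, and FeedBurner feeds often carry the original address in a feedburner:origLink element, which feedparser exposes on the entry as feedburner_origlink. The helper and the entry dict below are illustrative stand-ins, not the actual .net feed data.

```python
# Illustrative sketch: prefer FeedBurner's original link over the proxy <link>.
# A plain dict stands in for the feedparser entry object; the URLs are made up.

def resolve_feedburner_url(article):
    # feedburner_origlink is the un-proxied article URL when present;
    # fall back to the ordinary <link> element otherwise.
    return article.get('feedburner_origlink') or article.get('link')

entry = {
    'link': 'http://feedproxy.google.com/~r/net/topstories/~3/abc123/',
    'feedburner_origlink': 'http://www.netmagazine.com/news/example-article',
}
print(resolve_feedburner_url(entry))  # http://www.netmagazine.com/news/example-article
```

In a recipe this logic would live in a get_article_url(self, article) method, returning the resolved URL for each entry.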
04-29-2013, 07:58 PM | #2 |
Enthusiast
Posts: 32
Karma: 10
Join Date: Apr 2011
Device: Kindle wifi; Dell 2in1
|
Found part of the solution; at least now the documents are downloading. Now to clean it up before it creates an ebook version. It needed a complete rewrite of the original recipe, so since it's a rewrite, I'm putting my info into it.
So far the code is as follows: Code:
from calibre.web.feeds.news import BasicNewsRecipe

class dotnetMagazine(BasicNewsRecipe):
    __author__ = u'Bonni Salles'
    __version__ = '1.0'
    __license__ = 'GPL v3'
    __copyright__ = u'2013, Bonni Salles'

    title = '.net '
    oldest_article = 7
    no_stylesheets = True
    encoding = 'utf8'
    use_embedded_content = False
    language = 'en'
    remove_empty_feeds = True
    extra_css = ' body{font-family: Arial,Helvetica,sans-serif } img{margin-bottom: 0.4em} '

#    remove_tags_above = dict(id='header')
#    remove_tags_below = [dict(name='footer')]
#    keep_only_tags = [
#        dict(name='article', attrs={'class': re.compile('^node.*$', re.IGNORECASE)}),
#    ]
#    remove_tags = [
#        dict(name='span', attrs={'class': 'comment-count'}),
#        dict(name='div', attrs={'class': 'item-list share-links'}),
#        dict(name='footer'),
#    ]
#    remove_attributes = ['border', 'cellspacing', 'align', 'cellpadding', 'colspan', 'valign', 'vspace', 'hspace',
#                         'alt', 'width', 'height', 'style']
#    extra_css = 'img {max-width: 100%; display: block; margin: auto;} .captioned-image div {text-align: center;
#                 font-style: italic;}'

    feeds = [
        (u'net', u'http://feeds.feedburner.com/net/topstories')
    ]
05-10-2013, 12:01 AM | #3 |
Enthusiast
Posts: 32
Karma: 10
Join Date: Apr 2011
Device: Kindle wifi; Dell 2in1
|
Got it; at least it's almost a perfect recipe. Right now it still shows the comments when they are there, but most of the top and bottom matter has been eliminated. I'd love to figure out how to use remove_tags_before when there is no id or class to match on: the header is designated simply as <header>, and I couldn't get that to work right, so I ended up using remove_tags for almost everything.
Here is the recipe at this point in time. Code:
from calibre.web.feeds.news import BasicNewsRecipe
from calibre.ebooks.BeautifulSoup import BeautifulSoup, Tag, BeautifulStoneSoup
import re  # needed for the re.compile() patterns in remove_tags

class dotnetMagazine(BasicNewsRecipe):
    __author__ = u'Bonni Salles - post in forum if questions for me'
    __version__ = '1.0'
    __license__ = 'GPL v3'
    __copyright__ = u'2013, Bonni Salles'

    title = '.net '
    oldest_article = 7
    no_stylesheets = True
    encoding = 'utf8'
    use_embedded_content = False
    language = 'en'
    remove_empty_feeds = True
    extra_css = ' body{font-family: Arial,Helvetica,sans-serif } img{margin-bottom: 0.4em} '

    remove_tags_after = [
        dict(name='div', attrs={'class': 'footer-content'}),
    ]
#    remove_tags_before = [
#        dict(name='div', attrs={'id': 'main-content'}),
#    ]
    remove_tags = [
        dict(name='div', attrs={'class': 'item-list'}),
        dict(name=['header', 'footer']),
        dict(attrs={'class': re.compile('(^|| )menu($|| )', re.DOTALL)}),
        dict(name='h4', attrs={'class': 'std-hdr'}),
        dict(name=['script', 'noscript']),
        dict(name='div', attrs={'id': 'comments-form'}),
        dict(name='div', attrs={'id': re.compile('advertorial_block_($|| )')}),
        dict(name='div', attrs={'id': 'right-col'}),
    ]
#    remove_attributes = ['border', 'cellspacing', 'align', 'cellpadding', 'colspan', 'valign', 'vspace', 'hspace',
#                         'alt', 'width', 'height', 'style']
#    extra_css = 'img {max-width: 100%; display: block; margin: auto;} .captioned-image div {text-align: center;
#                 font-style: italic;}'

    feeds = [
        (u'net', u'http://feeds.feedburner.com/net/topstories')
    ]

Here is the area of the article that I'm trying to work with. Code:
</ul>
</nav>
</div>
<div id="main-content">
<div id="content" >
<article class="node node-news sticky" >
<header>
<h1 class="title"><span>GAAD 2013 wants accessibility on web devs' minds</span></h1>
<div class="submitted" >
By <span class="author-name">Craig Grannell</span> on <time datetime="2013-05-09T11:35:55+00:00" >May 09, 2013</time>
<div class="item-list share-links" ><h3>Share this article</h3>
<ul>
<li class="twitter-button first"><a href="http://twitter.com/share?url=http://www.netmagazine.com/news/gaad-2013-wants-accessibility-web-devs-minds-132742&text=GAAD%202013%20wants%20accessibility%20on%20web%20devs%27%20minds | .net" class="twitter-share-button" data-count="none">Tweet</a></li>
<li class="facebook-button"><iframe src="http://www.facebook.com/plugins/like.php?href=http://www.netmagazine.com/news/gaad-2013-wants-accessibility-web-devs-minds-132742&send=false&layout=button_count&width=47&show_faces=false&action=like&colorscheme=light&font=arial&height=21" scrolling="no" frameborder="0" style="border:none; overflow:hidden; width:47px; height:21px;" allowTransparency="true"></iframe></li>
<li class="googleplus-button"><g:plusone size="medium" count="false" ></g:plusone></li>
<li class="linkedin-button"><script type="in/share" ></script></li>
<li class="shorturl-button inactive last"><div class="shorturl-box"><input type="text" value="http://netm.ag/15MLtx1" /><div class="shorturl-close"></div></div><span class="shorturl-link">Short url</span></li>
</ul></div>
</div>
</header>
<div class="content">
05-10-2013, 02:04 AM | #5 |
creator of calibre
Posts: 43,776
Karma: 22666666
Join Date: Oct 2006
Location: Mumbai, India
Device: Various
|
remove_tags_before = dict(name='header', id=lambda x:not x)
will match a <header> tag with no id. |
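For anyone puzzled by the lambda: when an attribute matcher is a callable, BeautifulSoup calls it with the attribute's value (None when the attribute is absent), so lambda x: not x is true exactly when there is no id, or an empty one. A plain-Python sketch of that behaviour, with dicts standing in for tags (the matches() helper is illustrative, not a calibre or BeautifulSoup API):

```python
# Illustrative model of a callable attribute matcher like id=lambda x: not x.
# Plain dicts stand in for parsed tags; matches() mimics the matching rule.

def matches(tag_attrs, attr_name, matcher):
    value = tag_attrs.get(attr_name)  # None when the attribute is absent
    if callable(matcher):
        return bool(matcher(value))   # callable matchers see the raw value
    return value == matcher           # literal matchers compare for equality

no_id = lambda x: not x  # truthy for a missing, None, or empty id

print(matches({}, 'id', no_id))                      # True: <header> with no id
print(matches({'id': 'main-content'}, 'id', no_id))  # False: id is present
```

This is why the tip works for a bare <header> tag that carries no id or class at all.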
05-11-2013, 09:13 PM | #6 |
Enthusiast
Posts: 32
Karma: 10
Join Date: Apr 2011
Device: Kindle wifi; Dell 2in1
|
Success
Kovid, thank you, that's what was needed.
The recipe is now fixed and works. Here is the final version, if you want to include it in the program. Spoiler:
Last edited by Camper65; 05-11-2013 at 11:39 PM. |
05-26-2013, 08:50 AM | #7 |
Enthusiast
Posts: 32
Karma: 10
Join Date: Apr 2011
Device: Kindle wifi; Dell 2in1
|
New issue, and I'm not sure whether it's just a problem this week or not. I had to add recursion = 1 to force it to download the articles. The feed site now has an ad page that apparently comes up first, and you then have to "Click here to continue to article". How can I have the recipe automatically skip that first page, or in other words, get the right link from that ad page?
(here's the modified recipe to try) Spoiler:
|
05-26-2013, 09:04 AM | #8 |
creator of calibre
Posts: 43,776
Karma: 22666666
Join Date: Oct 2006
Location: Mumbai, India
Device: Various
|
You need to use the obfuscated-articles infrastructure in the recipe: set articles_are_obfuscated = True and then implement get_obfuscated_article() in your recipe.
|
05-26-2013, 02:15 PM | #9 |
Enthusiast
Posts: 32
Karma: 10
Join Date: Apr 2011
Device: Kindle wifi; Dell 2in1
|
Kovid,
I added the following to the test recipe and it's not working. At the top, with the other true/false settings, I added:

    articles_are_obfuscated = True

and just before feeds:

    def get_obfuscated_article(self, url):
        raise NotImplementedError

and I get the following in the recipe.txt printout: Spoiler:
What am I missing between these two lines?

    def get_obfuscated_article(self, url):
        raise NotImplementedError |
05-26-2013, 11:32 PM | #10 |
creator of calibre
Posts: 43,776
Karma: 22666666
Join Date: Oct 2006
Location: Mumbai, India
Device: Various
|
Look at a builtin recipe that uses get_obfuscated_article, or read the API documentation for that function.
|
05-27-2013, 11:16 AM | #11 |
Enthusiast
Posts: 32
Karma: 10
Join Date: Apr 2011
Device: Kindle wifi; Dell 2in1
|
Got it
Got the corrected one again. But give me a week or two to make sure the changes feedsportal made are permanent and not just a fluke, since I now only download this once a week (they only produce articles five days a week, and it's better to download on Saturday or Sunday so it gets that whole week's articles).
Hopefully I'll let you know next week, in between installing components in the new tower I'm building. Code:
from calibre.web.feeds.news import BasicNewsRecipe
from calibre.ebooks.BeautifulSoup import BeautifulSoup, Tag, BeautifulStoneSoup
from calibre.web.feeds import feed_from_xml, templates, feeds_from_index, Feed
from calibre.ptempfile import PersistentTemporaryFile
import re  # needed for the re.compile() pattern in remove_tags

class dotnetMagazine(BasicNewsRecipe):
    __author__ = u'Bonni Salles - post in forum if questions for me'
    __version__ = '1.1'
    __license__ = 'GPL v3'
    __copyright__ = u'2013, Bonni Salles'

    title = '.net '
    oldest_article = 7
    no_stylesheets = True
    encoding = 'utf8'
    use_embedded_content = False
    recursion = 1
    articles_are_obfuscated = True
    language = 'en'
    remove_empty_feeds = True
    extra_css = ' body{font-family: Arial,Helvetica,sans-serif } img{margin-bottom: 0.4em} '
    cover_url = u'http://media.netmagazine.futurecdn.net/sites/all/themes/netmag/logo.png'

    remove_tags_after = dict(name='footer', id=lambda x: not x)
    remove_tags_before = dict(name='header', id=lambda x: not x)

    remove_tags = [
        dict(name='div', attrs={'class': 'item-list'}),
        dict(name='h4', attrs={'class': 'std-hdr'}),
        dict(name='div', attrs={'class': 'item-list share-links'}),  # removes share links
        dict(name=['script', 'noscript']),
        dict(name='div', attrs={'id': 'comments-form'}),  # comment this out if you want the comments to show
        dict(name='div', attrs={'id': re.compile('advertorial_block_($|| )')}),
        dict(name='div', attrs={'id': 'right-col'}),
        dict(name='div', attrs={'id': 'comments'}),  # comment this out if you want the comments to show
        dict(name='div', attrs={'class': 'item-list related-content'}),
    ]

    feeds = [
        (u'net', u'http://feeds.feedburner.com/net/topstories?format=xml')
    ]

    temp_files = []

    def get_obfuscated_article(self, url):
        br = self.get_browser()
        print 'THE CURRENT URL IS: ', url
        response = br.open(url)  # the first draft opened the URL twice; once is enough
        html = response.read()
        self.temp_files.append(PersistentTemporaryFile('_fa.html'))
        self.temp_files[-1].write(html)
        self.temp_files[-1].close()
        return self.temp_files[-1].name

Camper65
05-27-2013, 11:38 AM | #12 |
creator of calibre
Posts: 43,776
Karma: 22666666
Join Date: Oct 2006
Location: Mumbai, India
Device: Various
|
Don't you need to actually parse the returned html to see if it contains an ad, and find the correct article URL in that case? Then you won't need recursion = 1 any more.
|
05-27-2013, 12:30 PM | #13 |
Enthusiast
Posts: 32
Karma: 10
Join Date: Apr 2011
Device: Kindle wifi; Dell 2in1
|
Perhaps I didn't explain the initial change properly. When you click on a feed entry (at least the first time I did it), it took me to a page with an ad in the middle; on the right-hand side it said, in multiple languages, "Click here to continue to article", and clicking that takes you directly to the article. All I saw when this week's .net downloaded was a batch of those multiple-language "Click here to continue to article" pages, with no text for the articles.
When I added recursion = 1, it gave me the first page with the right-hand "Click here to continue to article" text, and after that page it gave me the actual article as a new entry; but between each pair of articles was, again, the click-here page. The current recipe now gets the articles only. That's why I'm waiting a week, to see whether the feed changes are permanent, before I'm sure the recipe changes are needed. If you know of another way to get around this double-clicking to reach the article, let me know please. If you want me to send you an epub so you can see what it is doing, let me know. |
05-27-2013, 12:48 PM | #14 |
creator of calibre
Posts: 43,776
Karma: 22666666
Join Date: Oct 2006
Location: Mumbai, India
Device: Various
|
Ah, I see. Just add this to your recipe:
Code:
def skip_ad_pages(self, soup):
    text = soup.find(text='click here to continue to article')
    if text:
        a = text.parent
        url = a.get('href')
        if url:
            return self.index_to_soup(url, raw=True)
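The trick in that snippet is that BeautifulSoup's find(text=...) returns the matching text node, and .parent climbs to the enclosing <a>, whose href is the real article URL, which the recipe then fetches instead of the ad page. The same idea can be sketched with only the standard library; the page markup and URLs below are made up for illustration, and calibre's actual version uses BeautifulSoup, not html.parser.

```python
# Stdlib-only sketch of the skip_ad_pages idea: find the link whose text is
# the "continue to article" prompt and extract its href.
from html.parser import HTMLParser

class ContinueLinkFinder(HTMLParser):
    """Record the href of the <a> whose text contains the given prompt."""
    def __init__(self, prompt):
        super().__init__()
        self.prompt = prompt
        self._current_href = None  # href of the <a> we are currently inside
        self.url = None            # the resolved article URL, once found

    def handle_starttag(self, tag, attrs):
        if tag == 'a':
            self._current_href = dict(attrs).get('href')

    def handle_endtag(self, tag):
        if tag == 'a':
            self._current_href = None

    def handle_data(self, data):
        if self._current_href and self.prompt in data.lower():
            self.url = self._current_href

page = ('<html><body><p>Advertisement</p>'
        '<a href="http://www.netmagazine.com/news/example">'
        'Click here to continue to article</a></body></html>')
finder = ContinueLinkFinder('click here to continue to article')
finder.feed(page)
print(finder.url)  # http://www.netmagazine.com/news/example
```

One caveat worth checking against the live pages: find(text='click here to continue to article') matches the text node exactly, so if the site capitalizes the prompt differently, a case-insensitive match may be needed.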
05-27-2013, 04:09 PM | #15 |
Enthusiast
Posts: 32
Karma: 10
Join Date: Apr 2011
Device: Kindle wifi; Dell 2in1
|
Kovid,
Thank you!!!! That did it. The articles now download normally and no longer have that extra page in there. Here is the updated recipe again, so you can use it the next time you do updates to calibre. Also, thank you for creating such a great ebook organizer and news-download program. Code:
from calibre.web.feeds.news import BasicNewsRecipe
from calibre.ebooks.BeautifulSoup import BeautifulSoup, Tag, BeautifulStoneSoup
from calibre.web.feeds import feed_from_xml, templates, feeds_from_index, Feed
from calibre.ptempfile import PersistentTemporaryFile
import re  # needed for the re.compile() pattern in remove_tags

class dotnetMagazine(BasicNewsRecipe):
    __author__ = u'Bonni Salles - post in forum if questions for me'
    __version__ = '1.1'
    __license__ = 'GPL v3'
    __copyright__ = u'2013, Bonni Salles'

    title = '.net magazine'
    oldest_article = 7
    no_stylesheets = True
    encoding = 'utf8'
    use_embedded_content = False
#    recursion = 1
    language = 'en'
    remove_empty_feeds = True
    extra_css = ' body{font-family: Arial,Helvetica,sans-serif } img{margin-bottom: 0.4em} '
    cover_url = u'http://media.netmagazine.futurecdn.net/sites/all/themes/netmag/logo.png'

    remove_tags_after = dict(name='footer', id=lambda x: not x)
    remove_tags_before = dict(name='header', id=lambda x: not x)

    remove_tags = [
        dict(name='div', attrs={'class': 'item-list'}),
        dict(name='h4', attrs={'class': 'std-hdr'}),
        dict(name='div', attrs={'class': 'item-list share-links'}),  # removes share links
        dict(name=['script', 'noscript']),
        dict(name='div', attrs={'id': 'comments-form'}),  # comment this out if you want the comments to show
        dict(name='div', attrs={'id': re.compile('advertorial_block_($|| )')}),
        dict(name='div', attrs={'id': 'right-col'}),
        dict(name='div', attrs={'id': 'comments'}),  # comment this out if you want the comments to show
        dict(name='div', attrs={'class': 'item-list related-content'}),
    ]

    feeds = [
        (u'net', u'http://feeds.feedburner.com/net/topstories?format=xml')
    ]

    def skip_ad_pages(self, soup):
        text = soup.find(text='click here to continue to article')
        if text:
            a = text.parent
            url = a.get('href')
            if url:
                return self.index_to_soup(url, raw=True)