05-27-2010, 10:36 PM | #1996 |
Member
Posts: 16
Karma: 10
Join Date: May 2010
Location: Southern California
Device: JetBook-Lite
|
Help with recipe - articles span more than 1 page
Hello everyone.
I need some help with a recipe for this feed: http://www.pcper.com/rss/articles.rss Most of the articles span several pages. I've cleaned it up a bit, but I'm not sure how to scrape the complete article from the "Click here for the Detailed Review" links. Thanks! Here's what I have so far. Code:
from calibre.web.feeds.news import BasicNewsRecipe

class AdvancedUserRecipe1274998412(BasicNewsRecipe):
    title = u'PC Perspective Articles'
    description = 'PC Perspective Articles'
    __author__ = 'KidTwisted'
    #use_embedded_content = False
    max_articles_per_feed = 25
    oldest_article = 7
    cover_url = 'http://www.pcper.com/site_gfx/pcpheader_02.gif'
    no_stylesheets = True
    language = 'en'
    remove_javascript = True
    conversion_options = {'linearize_tables': True}
    # reverse_article_order = True

    # Page furniture to strip from every downloaded page.
    remove_tags = [
        dict(name='table', attrs={'class':'topwrapper'}),
        dict(name='div', attrs={'class':'leftcatimg'}),
        dict(name='div', attrs={'class':'navcontainer1'}),
        dict(name='td', attrs={'class':'img3'}),
        dict(name='div', attrs={'class':'mtbg'}),
        dict(name='div', attrs={'class':'rightcatimg'}),
        dict(name='td', attrs={'class':'articlelinks'}),
        dict(id='navcontainer'),
    ]
    remove_tags_after = dict(name='div', attrs={'class':'rightcatimg'})

    feeds = [(u'PC Perspective Articles', u'http://www.pcper.com/rss/articles.rss')]

Last edited by kidtwisted; 05-28-2010 at 12:04 AM. |
05-28-2010, 02:47 AM | #1997 |
Connoisseur
Posts: 98
Karma: 22
Join Date: Mar 2010
Device: IRiver Story, Ipod Touch, Android SmartPhone
|
Minor recipe updates
leggo (it): the real first page is often the third page, because the first two pages are advertisements.
corriere_della_sera (Italian newspaper): now includes the first page of the newspaper. |
05-28-2010, 08:52 AM | #1998 |
Member
Posts: 11
Karma: 10
Join Date: Feb 2010
Device: Kindle
|
|
05-28-2010, 09:34 AM | #1999 |
Connoisseur
Posts: 98
Karma: 22
Join Date: Mar 2010
Device: IRiver Story, Ipod Touch, Android SmartPhone
|
With this feed
http://www3.lastampa.it/fotografia/feedrss.xml/ I get this output: Code:
{'summary_detail': {'base': '', 'type': 'text/html', 'value': u'Chi si occupa di fotografia oggi non pu\xf2 prescindere dalle straordinarie opportunit\xe0 messe a... <img width="1" height="1" src="http://rss.feedsportal.com/c/32418/f/478449/s/ac78abb/mf.gif" border="0" /> <div class="mf-viral"> <table border="0"> <tr> <td valign="middle"> <a href="http://res.feedsportal.com/viral/sendemail2_it.html?title=Caricare+video+fotografici+su+Photographers.it&link=http%3A%2F%2Fwww3.lastampa.it%2Ffotografia%2Fnotizie-brevi%2Farticolo%2Flstp%2F230412%2F" target="_blank"> <img src="http://rss.feedsportal.com/images/emailthis2_it.gif" border="0" /> </a> </td> <td valign="middle"> <a href="http://res.feedsportal.com/viral/bookmark_it.cfm?title=Caricare+video+fotografici+su+Photographers.it&link=http%3A%2F%2Fwww3.lastampa.it%2Ffotografia%2Fnotizie-brevi%2Farticolo%2Flstp%2F230412%2F" target="_blank"><img src="http://rss.feedsportal.com/images/bookmark_it.gif" border="0" /> </a> </td> </tr> </table></div><br /><br /><a href="http://da.feedsportal.com/r/72644213708/u/2/f/478449/c/32418/s/180849339/a2.htm"><img src="http://da.feedsportal.com/r/72644213708/u/2/f/478449/c/32418/s/180849339/a2.img" border="0" /></a>', 'language': None}, 'updated_parsed': time.struct_time(tm_year=2010, tm_mon=5, tm_mday=26, tm_hour=22, tm_min=0, tm_sec=0, tm_wday=2, tm_yday=146, tm_isdst=0), 'links': [{'href': u'http://rss.feedsportal.com/c/32418/f/478449/s/ac78abb/l/0Lwww30Blastampa0Bit0Cfotografia0Cnotizie0Ebrevi0Carticolo0Clstp0C230A4120C/story01.htm', 'type': 'text/html', 'rel': 'alternate'}], 'title': u'Caricare video fotografici su Photographers.it', 'tags': [{'term': u'Notizie Brevi', 'scheme': None, 'label': None}], 'updated': u'Wed, 26 May 2010 22:00:00 GMT', 'summary': u'Chi si occupa di fotografia oggi non pu\xf2 prescindere dalle straordinarie opportunit\xe0 messe a...<img width="1" height="1" src="http://rss.feedsportal.com/c/32418/f/478449/s/ac78abb/mf.gif" border="0" /><div 
class="mf-viral"><table border="0"><tr><td valign="middle"><a href="http://res.feedsportal.com/viral/sendemail2_it.html?title=Caricare+video+fotografici+su+Photographers.it&link=http%3A%2F%2Fwww3.lastampa.it%2Ffotografia%2Fnotizie-brevi%2Farticolo%2Flstp%2F230412%2F" target="_blank"><img src="http://rss.feedsportal.com/images/emailthis2_it.gif" border="0" /></a></td><td valign="middle"><a href="http://res.feedsportal.com/viral/bookmark_it.cfm?title=Caricare+video+fotografici+su+Photographers.it&link=http%3A%2F%2Fwww3.lastampa.it%2Ffotografia%2Fnotizie-brevi%2Farticolo%2Flstp%2F230412%2F" target="_blank"><img src="http://rss.feedsportal.com/images/bookmark_it.gif" border="0" /></a></td></tr></table></div><br /><br /><a href="http://da.feedsportal.com/r/72644213708/u/2/f/478449/c/32418/s/180849339/a2.htm"><img src="http://da.feedsportal.com/r/72644213708/u/2/f/478449/c/32418/s/180849339/a2.img" border="0" /></a>', 'content': [{'base': '', 'type': 'text/html', 'value': u'<br />Chi si occupa di fotografia oggi non pu\xf2 prescindere dalle straordinarie opportunit\xe0 messe a disposizione dai nuovi media.<br />Ecco perch\xe8 <a href="http://www.photographers.it%20">www.photographers.it</a> mette a disposizione dei suoi utenti anche un nuovo editor che consente il caricamento dei propri video contemporaneamente sul portale e sul suo <a href="http://www.youtube.com/user/photographersit">Canale YouTube</a>.<br /> <br />Se realizzate slideshow fotografici, multimedia artistici o tutorial, se volete promuovere la vostra professionalit\xe0 o mostrare il backstage di shooting e produzioni, cosa aspettate? 
Per effettuare l\'upload video \xe8 sufficiente essere registrati a Photographers.it.<br /><br /><br /><br /><br /><span><a target="_top" href="http://www.photographers.it/free/redazione">[ Redazione By photographers.it ]</a></span></b></span><img width="1" height="1" src="http://rss.feedsportal.com/c/32418/f/478449/s/ac78abb/mf.gif" border="0" /><div class="mf-viral"><table border="0"><tr><td valign="middle"><a href="http://res.feedsportal.com/viral/sendemail2_it.html?title=Caricare+video+fotografici+su+Photographers.it&link=http%3A%2F%2Fwww3.lastampa.it%2Ffotografia%2Fnotizie-brevi%2Farticolo%2Flstp%2F230412%2F" target="_blank"><img src="http://rss.feedsportal.com/images/emailthis2_it.gif" border="0" /></a></td><td valign="middle"><a href="http://res.feedsportal.com/viral/bookmark_it.cfm?title=Caricare+video+fotografici+su+Photographers.it&link=http%3A%2F%2Fwww3.lastampa.it%2Ffotografia%2Fnotizie-brevi%2Farticolo%2Flstp%2F230412%2F" target="_blank"><img src="http://rss.feedsportal.com/images/bookmark_it.gif" border="0" /></a></td></tr></table></div><br /><br /><a href="http://da.feedsportal.com/r/72644213708/u/2/f/478449/c/32418/s/180849339/a2.htm"><img src="http://da.feedsportal.com/r/72644213708/u/2/f/478449/c/32418/s/180849339/a2.img" border="0" /></a>', 'language': None}], 'guidislink': False, 'title_detail': {'base': '', 'type': 'text/plain', 'value': u'Caricare video fotografici su Photographers.it', 'language': None}, 'link': u'http://rss.feedsportal.com/c/32418/f/478449/s/ac78abb/l/0Lwww30Blastampa0Bit0Cfotografia0Cnotizie0Ebrevi0Carticolo0Clstp0C230A4120C/story01.htm', 'id': u'http://rss.feedsportal.com/c/32418/f/478449/s/ac78abb/l/0Lwww30Blastampa0Bit0Cfotografia0Cnotizie0Ebrevi0Carticolo0Clstp0C230A4120C/story01.htm'} with this feed: http://www.lastampa.it/redazione/cms...s_politica.xml Code:
{'summary_detail': {'base': '', 'type': 'text/html', 'value': u'MILANO<br />\xabLa manovra economica da 24 miliardi ci consente di tenere la nave in rotta, senza aver messo le mani nelle tasche degli italiani\xbb. Silvio Berlusconi, ospite della trasmissione "Mattino Cinque" nello spazio di Maurizio Belpietro, difende la manovra e assicura che governo e maggioranza la sostengono senza incrinature. <br /><br />\xabServiva una risposta immediata e il governo, che \xe8 coeso, l\'ha ...(continua)', 'language': None }, 'updated_parsed': time.struct_time(tm_year=2010, tm_mon=5, tm_mday=28, tm_hour=10, tm_min=25, tm_sec=40, tm_wday=4, tm_yday=148, tm_isdst=0), 'links': [{'href': u'http://www.lastampa.it/redazione/cmsSezioni/politica/201005articoli/55437girata.asp', 'type': 'text/html', 'rel': 'alternate'}, {'type': 'text/html', 'rel': 'alternate'} ], 'author': u'', 'image': { 'height': 0, 'width': 0, 'href': u'http://www.lastampa.it/redazione/cmssezioni/politica/201005images/berlusconi10g.jpg', 'link': u'', 'title': u''}, 'tags': [{'term': u'POLITICA', 'scheme': None, 'label': None}], 'updated': u'Fri, 28 May 2010 12:25:40 +0200', 'summary': u'MILANO<br />\xabLa manovra economica da 24 miliardi ci consente di tenere la nave in rotta, senza aver messo le mani nelle tasche degli italiani\xbb. Silvio Berlusconi, ospite della trasmissione "Mattino Cinque" nello spazio di Maurizio Belpietro, difende la manovra e assicura che governo e maggioranza la sostengono senza incrinature. 
<br /><br />\xabServiva una risposta immediata e il governo, che \xe8 coeso, l\'ha ...(continua)', 'title_detail': {'base': '', 'type': 'text/plain', 'value': u'', 'language': None}, 'href': u'http://www.lastampa.it/redazione/cmssezioni/politica/201005images/berlusconi10g.jpg', 'link': u'', 'title': u'', 'id': u'http://www.lastampa.it/redazione/cmssezioni/politica/201005images/berlusconi10g.jpg', 'enclosures': [{ 'href': u'http://www.lastampa.it/redazione/cmssezioni/politica/201005images/berlusconi10g.jpg', 'type': u'image/jpeg'} ] } Last edited by gambarini; 05-30-2010 at 11:20 AM. |
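One visible difference between the two dumps is that the first entry has a usable 'link', while the second has an empty 'link' and an 'id' that points at an image, with the article address only under 'links'. A minimal sketch of a fallback that copes with both, assuming plain dicts shaped like the output above; pick_article_url is a hypothetical helper name, not something calibre provides:

```python
def pick_article_url(entry):
    """Return the first non-empty article URL in a feedparser-style entry dict."""
    # Prefer the top-level link when the feed supplies one.
    link = entry.get('link')
    if link:
        return link
    # Otherwise fall back to the first item in 'links' that has an href.
    for candidate in entry.get('links', []):
        href = candidate.get('href')
        if href:
            return href
    return None


# Trimmed copies of the two entries shown above.
photo_entry = {
    'link': 'http://rss.feedsportal.com/c/32418/f/478449/s/ac78abb/l/0Lwww30Blastampa0Bit0Cfotografia0Cnotizie0Ebrevi0Carticolo0Clstp0C230A4120C/story01.htm',
}
politics_entry = {
    'link': '',
    'links': [
        {'href': 'http://www.lastampa.it/redazione/cmsSezioni/politica/201005articoli/55437girata.asp',
         'type': 'text/html', 'rel': 'alternate'},
        {'type': 'text/html', 'rel': 'alternate'},
    ],
}

print(pick_article_url(photo_entry))
print(pick_article_url(politics_entry))
```

In a recipe, logic like this would presumably go into an override of get_article_url, but whether that fixes the problem with this feed is untested here.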
05-28-2010, 03:19 PM | #2000 |
Junior Member
Posts: 7
Karma: 10
Join Date: Apr 2010
Device: sony
|
Thank you very much, worked like a charm
|
05-30-2010, 12:32 PM | #2001 |
Junior Member
Posts: 3
Karma: 10
Join Date: May 2010
Device: Amazon Kindle
|
I would like to request the recipe for a Catalan newspaper called Avui.
The RSS feed is this: http://www.avui.cat/cat/rss/totes_le...ui_cat_009.xml Thanks a lot in advance. See ya! |
05-30-2010, 01:35 PM | #2002 |
Addict
Posts: 264
Karma: 62
Join Date: May 2010
Device: kindle 2, kindle 3, Kindle fire
|
Fark.com Recipe request
Does anyone happen to have a good working recipe for fark.com? I love reading the bizarre stories they post there. Thanks.
|
05-31-2010, 01:34 AM | #2003 |
Junior Member
Posts: 3
Karma: 10
Join Date: May 2010
Location: Calgary, AB, Canada
Device: iPad
|
Two Recipe Requests
Hello,
I started to attempt this myself, but I know from experience that I suck at Python scripting (seems to be some sort of mental block). Anyway, I am moving from a Palm TX to an iPad and need to move two site scrapers to calibre. Both are HTML page scrapes (not RSS). Here they are:
http://www.macintouch.com/ I have been taking the main page and including the links to the reader reports.
http://www.theregister.co.uk/week.html I would like to have this indexed by the dates on the page, so that the table of contents has the dates with the articles as sub-titles (much like the way the one for the Calgary Herald works). It would also be great if it included the links that go to reghardware.com. This is definitely beyond my script-fu.
Both would be greatly appreciated and would save me (literally) days of futzing around trying to learn Python. Thank you in advance! |
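For the date-indexed table of contents, a calibre recipe's parse_index returns a list of (section title, list of article dicts) pairs, so the dates can serve as the section titles. A minimal sketch of just the grouping step, assuming the scraping has already produced (date, title, url) rows; group_by_date and the sample rows are hypothetical, and the example.com URLs are placeholders:

```python
from collections import OrderedDict


def group_by_date(rows):
    """Turn (date, title, url) rows into [(date, [article dicts])], keeping page order."""
    sections = OrderedDict()
    for date, title, url in rows:
        # Each date becomes one section; articles keep the order they had on the page.
        sections.setdefault(date, []).append(
            {'title': title, 'url': url, 'date': date, 'description': ''})
    return list(sections.items())


# Hypothetical rows, as a scraper of the weekly page might produce them.
rows = [
    ('Monday', 'Story A', 'http://example.com/a'),
    ('Monday', 'Story B', 'http://example.com/b'),
    ('Tuesday', 'Story C', 'http://example.com/c'),
]
index = group_by_date(rows)
for section, articles in index:
    print(section, [a['title'] for a in articles])
```

The part that still has to be written for the real site is the code that finds the date headings and article links on the page.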
05-31-2010, 12:40 PM | #2004 | |
Wizard
Posts: 4,004
Karma: 177841
Join Date: Dec 2009
Device: WinMo: IPAQ; Android: HTC HD2, Archos 7o; Java:Gravity T
|
Quote:
Code:
# Both methods go inside the recipe class.
def append_page(self, soup, appendtag, position):
    # Look for the "next page" link on the current page.
    pager = soup.find('div', attrs={'class':'toolbar_fat_next'})
    if pager:
        nexturl = self.INDEX + pager.a['href']
        soup2 = self.index_to_soup(nexturl)
        texttag = soup2.find('div', attrs={'class':'bodytext'})
        for it in texttag.findAll(style=True):
            del it['style']
        newpos = len(texttag.contents)
        # Recurse first, so any later pages are appended to this page's text.
        self.append_page(soup2, texttag, newpos)
        texttag.extract()
        appendtag.insert(position, texttag)

def preprocess_html(self, soup):
    mtag = '<meta http-equiv="Content-Language" content="en-US"/>\n<meta http-equiv="Content-Type" content="text/html; charset=utf-8"/>'
    soup.head.insert(0, mtag)
    for item in soup.findAll(style=True):
        del item['style']
    self.append_page(soup, soup.body, 3)
    # Remove the pager itself once the following pages have been merged in.
    pager = soup.find('div', attrs={'class':'toolbar_fat'})
    if pager:
        pager.extract()
    return soup

preprocess_html uses append_page to modify the HTML. You'll need to look for the next-page tag on your site and adjust the tag and class names accordingly; self.INDEX is the site's base URL and has to be defined in your recipe. This should get you started. Do your testing with -vv and --test, as in: ebook-convert pcper.recipe pcper --test -vv > pcper.txt |
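Stripped of the HTML handling, the idea is to keep following the next-page link until there is none and to collect each page's body in order. A minimal sketch of that control flow, written as a loop rather than the recursion used in append_page; fetch is a hypothetical callable standing in for index_to_soup plus the soup.find calls, and PAGES is invented test data, not the real site:

```python
def collect_pages(first_url, fetch, max_pages=20):
    """Follow 'next' links from first_url and return every page body in order."""
    bodies = []
    seen = set()
    url = first_url
    # Stop on a missing next link, a repeated URL, or the safety cap.
    while url and url not in seen and len(bodies) < max_pages:
        seen.add(url)
        page = fetch(url)  # expected shape: {'body': str, 'next': str or None}
        bodies.append(page['body'])
        url = page.get('next')
    return bodies


# Invented three-page article used to exercise the loop.
PAGES = {
    '/a?p=1': {'body': 'page one', 'next': '/a?p=2'},
    '/a?p=2': {'body': 'page two', 'next': '/a?p=3'},
    '/a?p=3': {'body': 'page three', 'next': None},
}

article = collect_pages('/a?p=1', PAGES.__getitem__)
print('\n'.join(article))
```

The seen set and max_pages cap are additions of this sketch; they guard against a pager that links back to an earlier page.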
|
05-31-2010, 12:43 PM | #2005 | |
Wizard
Posts: 4,004
Karma: 177841
Join Date: Dec 2009
Device: WinMo: IPAQ; Android: HTC HD2, Archos 7o; Java:Gravity T
|
Quote:
More info about whether there's a question and what it is might help me decide. |
|
05-31-2010, 12:50 PM | #2006 | |
Wizard
Posts: 4,004
Karma: 177841
Join Date: Dec 2009
Device: WinMo: IPAQ; Android: HTC HD2, Archos 7o; Java:Gravity T
|
Quote:
|
|
05-31-2010, 01:00 PM | #2007 | |
Wizard
Posts: 4,004
Karma: 177841
Join Date: Dec 2009
Device: WinMo: IPAQ; Android: HTC HD2, Archos 7o; Java:Gravity T
|
Quote:
|
|
05-31-2010, 01:23 PM | #2008 |
Wizard
Posts: 4,004
Karma: 177841
Join Date: Dec 2009
Device: WinMo: IPAQ; Android: HTC HD2, Archos 7o; Java:Gravity T
|
Fark has an RSS feed, and I looked at it. Each item seems to have a one-sentence description of an article on another site and a slew of comments. Do you just want the one sentence from Fark with the link, or do you want the comments? The content of the linked articles is probably too variable to add easily, as it comes from dozens of different sources, each with a different page structure. You'd get lots of junk with each one.
|
05-31-2010, 03:43 PM | #2009 |
Enthusiast
Posts: 33
Karma: 10
Join Date: May 2010
Device: Bookeen Cybook Gen3 Gold
|
Hello!
May I also ask for a recipe? http://www.sarajevo-x.com/rssfeeds It's a Bosnian news portal. Thanks! |
05-31-2010, 04:37 PM | #2010 |
Connoisseur
Posts: 98
Karma: 22
Join Date: Mar 2010
Device: IRiver Story, Ipod Touch, Android SmartPhone
|
|
|