Old 05-27-2010, 10:36 PM   #1996
kidtwisted
Member
kidtwisted began at the beginning.
 
Posts: 16
Karma: 10
Join Date: May 2010
Location: Southern California
Device: JetBook-Lite
Help with recipe - articles span more than 1 page

Hello everyone.

I need some help with a recipe for this feed:
http://www.pcper.com/rss/articles.rss

Most of the articles span several pages. I've cleaned the output up a bit, but I'm not sure how to scrape the complete article from the "Click here for the Detailed Review" links. Thanks!

Here's what I have so far.
Code:
from calibre.web.feeds.news import BasicNewsRecipe

class AdvancedUserRecipe1274998412(BasicNewsRecipe):
    title = u'PC Perspective Articles'
    description = 'PC Perspective Articles'
    __author__ = 'KidTwisted'
    # use_embedded_content = False
    max_articles_per_feed = 25
    oldest_article = 7
    cover_url = 'http://www.pcper.com/site_gfx/pcpheader_02.gif'

    no_stylesheets = True
    language = 'en'

    remove_javascript = True
    conversion_options = {'linearize_tables': True}
    # reverse_article_order = True

    # Strip page chrome (header table, category images, navigation, link boxes).
    remove_tags = [dict(name='table', attrs={'class': 'topwrapper'}),
                   dict(name='div', attrs={'class': 'leftcatimg'}),
                   dict(name='div', attrs={'class': 'navcontainer1'}),
                   dict(name='td', attrs={'class': 'img3'}),
                   dict(name='div', attrs={'class': 'mtbg'}),
                   dict(name='div', attrs={'class': 'rightcatimg'}),
                   dict(name='td', attrs={'class': 'articlelinks'}),
                   dict(id='navcontainer')]

    # Drop everything that follows the right-hand category image.
    remove_tags_after = dict(name='div', attrs={'class': 'rightcatimg'})

    feeds = [(u'PC Perspective Articles', u'http://www.pcper.com/rss/articles.rss')]

Last edited by kidtwisted; 05-28-2010 at 12:04 AM.
Old 05-28-2010, 02:47 AM   #1997
gambarini
Connoisseur
gambarini began at the beginning.
 
Posts: 98
Karma: 22
Join Date: Mar 2010
Device: IRiver Story, Ipod Touch, Android SmartPhone
Minor recipe updates

leggo (it)
The first page is often the third page, because the first two pages are advertisements.

corriere_della_sera
Italian newspaper: now with the front page of the newspaper.
Attached Files
File Type: zip corriere_della_sera_it.zip (1.4 KB, 191 views)
File Type: zip Leggo_it.zip (1.0 KB, 181 views)
Old 05-28-2010, 08:52 AM   #1998
cscannella
Member
cscannella began at the beginning.
 
Posts: 11
Karma: 10
Join Date: Feb 2010
Device: Kindle
Quote:
Originally Posted by kiklop74 View Post
I'll look into that this weekend if time permits
A thank you from me, too. Wired is one of my favorite reads...
Old 05-28-2010, 09:34 AM   #1999
gambarini
Connoisseur
gambarini began at the beginning.
 
Posts: 98
Karma: 22
Join Date: Mar 2010
Device: IRiver Story, Ipod Touch, Android SmartPhone
With this feed
http://www3.lastampa.it/fotografia/feedrss.xml/
I get this output:
Code:
{'summary_detail': {'base': '', 
                    'type': 'text/html', 
                    'value': u'Chi si occupa di fotografia oggi non pu\xf2 prescindere dalle straordinarie opportunit\xe0 messe a... <img width="1" height="1" src="http://rss.feedsportal.com/c/32418/f/478449/s/ac78abb/mf.gif" border="0" />
<div class="mf-viral">
<table border="0">
<tr>
    <td valign="middle">
    <a href="http://res.feedsportal.com/viral/sendemail2_it.html?title=Caricare+video+fotografici+su+Photographers.it&link=http%3A%2F%2Fwww3.lastampa.it%2Ffotografia%2Fnotizie-brevi%2Farticolo%2Flstp%2F230412%2F" target="_blank">

        <img src="http://rss.feedsportal.com/images/emailthis2_it.gif" border="0" />
    </a>
    </td>
    <td valign="middle">
      <a href="http://res.feedsportal.com/viral/bookmark_it.cfm?title=Caricare+video+fotografici+su+Photographers.it&link=http%3A%2F%2Fwww3.lastampa.it%2Ffotografia%2Fnotizie-brevi%2Farticolo%2Flstp%2F230412%2F" target="_blank"><img src="http://rss.feedsportal.com/images/bookmark_it.gif" border="0" />
      </a>
    </td>
</tr>
</table></div><br /><br /><a href="http://da.feedsportal.com/r/72644213708/u/2/f/478449/c/32418/s/180849339/a2.htm"><img src="http://da.feedsportal.com/r/72644213708/u/2/f/478449/c/32418/s/180849339/a2.img" border="0" /></a>', 
                    'language': None}, 
                    'updated_parsed': time.struct_time(tm_year=2010, tm_mon=5, tm_mday=26, tm_hour=22, tm_min=0, tm_sec=0, tm_wday=2, tm_yday=146, tm_isdst=0), 
                    'links': [{'href': u'http://rss.feedsportal.com/c/32418/f/478449/s/ac78abb/l/0Lwww30Blastampa0Bit0Cfotografia0Cnotizie0Ebrevi0Carticolo0Clstp0C230A4120C/story01.htm', 'type': 'text/html', 'rel': 'alternate'}], 'title': u'Caricare video fotografici su Photographers.it', 'tags': [{'term': u'Notizie Brevi', 'scheme': None, 'label': None}], 'updated': u'Wed, 26 May 2010 22:00:00 GMT', 'summary': u'Chi si occupa di fotografia oggi non pu\xf2 prescindere dalle straordinarie opportunit\xe0 messe a...<img width="1" height="1" src="http://rss.feedsportal.com/c/32418/f/478449/s/ac78abb/mf.gif" border="0" /><div class="mf-viral"><table border="0"><tr><td valign="middle"><a href="http://res.feedsportal.com/viral/sendemail2_it.html?title=Caricare+video+fotografici+su+Photographers.it&link=http%3A%2F%2Fwww3.lastampa.it%2Ffotografia%2Fnotizie-brevi%2Farticolo%2Flstp%2F230412%2F" target="_blank"><img src="http://rss.feedsportal.com/images/emailthis2_it.gif" border="0" /></a></td><td valign="middle"><a href="http://res.feedsportal.com/viral/bookmark_it.cfm?title=Caricare+video+fotografici+su+Photographers.it&link=http%3A%2F%2Fwww3.lastampa.it%2Ffotografia%2Fnotizie-brevi%2Farticolo%2Flstp%2F230412%2F" target="_blank"><img src="http://rss.feedsportal.com/images/bookmark_it.gif" border="0" /></a></td></tr></table></div><br /><br /><a href="http://da.feedsportal.com/r/72644213708/u/2/f/478449/c/32418/s/180849339/a2.htm"><img src="http://da.feedsportal.com/r/72644213708/u/2/f/478449/c/32418/s/180849339/a2.img" border="0" /></a>', 'content': [{'base': '', 'type': 'text/html', 'value': u'<br />Chi si occupa di fotografia oggi non pu\xf2 prescindere dalle straordinarie opportunit\xe0 messe a disposizione dai nuovi media.<br />Ecco perch\xe8 <a href="http://www.photographers.it%20">www.photographers.it</a> mette a disposizione dei suoi utenti anche un nuovo editor che consente il caricamento dei propri video contemporaneamente sul portale e sul suo <a 
href="http://www.youtube.com/user/photographersit">Canale YouTube</a>.<br /> <br />Se realizzate slideshow fotografici, multimedia artistici o tutorial, se volete promuovere la vostra professionalit\xe0 o mostrare il backstage di shooting e produzioni, cosa aspettate? Per effettuare l\'upload video \xe8 sufficiente essere registrati a Photographers.it.<br /><br /><br /><br /><br /><span><a target="_top" href="http://www.photographers.it/free/redazione">[ Redazione By photographers.it ]</a></span></b></span><img width="1" height="1" src="http://rss.feedsportal.com/c/32418/f/478449/s/ac78abb/mf.gif" border="0" /><div class="mf-viral"><table border="0"><tr><td valign="middle"><a href="http://res.feedsportal.com/viral/sendemail2_it.html?title=Caricare+video+fotografici+su+Photographers.it&link=http%3A%2F%2Fwww3.lastampa.it%2Ffotografia%2Fnotizie-brevi%2Farticolo%2Flstp%2F230412%2F" target="_blank"><img src="http://rss.feedsportal.com/images/emailthis2_it.gif" border="0" /></a></td><td valign="middle"><a href="http://res.feedsportal.com/viral/bookmark_it.cfm?title=Caricare+video+fotografici+su+Photographers.it&link=http%3A%2F%2Fwww3.lastampa.it%2Ffotografia%2Fnotizie-brevi%2Farticolo%2Flstp%2F230412%2F" target="_blank"><img src="http://rss.feedsportal.com/images/bookmark_it.gif" border="0" /></a></td></tr></table></div><br /><br /><a href="http://da.feedsportal.com/r/72644213708/u/2/f/478449/c/32418/s/180849339/a2.htm"><img src="http://da.feedsportal.com/r/72644213708/u/2/f/478449/c/32418/s/180849339/a2.img" border="0" /></a>', 'language': None}], 'guidislink': False, 'title_detail': {'base': '', 'type': 'text/plain', 'value': u'Caricare video fotografici su Photographers.it', 'language': None}, 'link': u'http://rss.feedsportal.com/c/32418/f/478449/s/ac78abb/l/0Lwww30Blastampa0Bit0Cfotografia0Cnotizie0Ebrevi0Carticolo0Clstp0C230A4120C/story01.htm', 'id': 
u'http://rss.feedsportal.com/c/32418/f/478449/s/ac78abb/l/0Lwww30Blastampa0Bit0Cfotografia0Cnotizie0Ebrevi0Carticolo0Clstp0C230A4120C/story01.htm'}
The link is in there (an href that only needs a replace('%2F', '/')), but I am not able to extract the correct link.
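A minimal sketch of one way to recover that link, assuming the real article URL is always carried percent-encoded in the `link=` parameter of the feedsportal share links inside the summary. The helper name `extract_real_link` is hypothetical; in a recipe it could be called from `get_article_url` with the entry's summary. It decodes every escape rather than only `%2F`, since the value also contains `%3A`.

```python
import re
from urllib.parse import unquote

# Matches the percent-encoded URL in the "link=" parameter of a share link.
# Assumption: the parameter ends at the next quote, ampersand or whitespace.
LINK_PARAM = re.compile(r'[?&;]link=([^"&\s]+)')

def extract_real_link(summary_html):
    """Return the decoded article URL, or None if no link= parameter exists."""
    match = LINK_PARAM.search(summary_html)
    if match is None:
        return None
    # unquote decodes all escapes (%3A -> ':', %2F -> '/'), not just %2F.
    return unquote(match.group(1))

# Fragment copied from the feed output above.
sample = ('<a href="http://res.feedsportal.com/viral/sendemail2_it.html'
          '?title=Caricare+video+fotografici+su+Photographers.it'
          '&link=http%3A%2F%2Fwww3.lastampa.it%2Ffotografia%2Fnotizie-brevi'
          '%2Farticolo%2Flstp%2F230412%2F" target="_blank">')
print(extract_real_link(sample))
```

If the summary has no share link, the helper returns None, so the caller can fall back to the entry's normal link.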

With this feed:
http://www.lastampa.it/redazione/cms...s_politica.xml

Code:
{'summary_detail': {'base': '', 
                    'type': 'text/html', 
                    'value': u'MILANO<br />\xabLa manovra economica da 24 miliardi ci consente di tenere la nave in rotta, senza aver messo le mani nelle tasche degli italiani\xbb. Silvio Berlusconi, ospite della trasmissione "Mattino Cinque" nello spazio di Maurizio Belpietro, difende la manovra e assicura che governo e maggioranza la sostengono senza incrinature. <br /><br />\xabServiva una risposta immediata e il governo, che \xe8 coeso, l\'ha ...(continua)', 
                    'language': None
                    }, 
 'updated_parsed': time.struct_time(tm_year=2010, tm_mon=5, tm_mday=28, tm_hour=10, tm_min=25, tm_sec=40, tm_wday=4, tm_yday=148, tm_isdst=0), 
 'links': [{'href': u'http://www.lastampa.it/redazione/cmsSezioni/politica/201005articoli/55437girata.asp',                     
            'type': 'text/html', 'rel': 'alternate'}, 
            {'type': 'text/html', 'rel': 'alternate'}
          ], 
 'author': u'', 
 'image': {  'height': 0, 
             'width': 0, 
             'href': u'http://www.lastampa.it/redazione/cmssezioni/politica/201005images/berlusconi10g.jpg', 'link': u'', 'title': u''}, 
 'tags': [{'term': u'POLITICA', 'scheme': None, 'label': None}], 
 'updated': u'Fri, 28 May 2010 12:25:40 +0200', 
 'summary': u'MILANO<br />\xabLa manovra economica da 24 miliardi ci consente di tenere la nave in rotta, senza aver messo le mani nelle tasche degli italiani\xbb. Silvio Berlusconi, ospite della trasmissione "Mattino Cinque" nello spazio di Maurizio Belpietro, difende la manovra e assicura che governo e maggioranza la sostengono senza incrinature. <br /><br />\xabServiva una risposta immediata e il governo, che \xe8 coeso, l\'ha ...(continua)', 
 'title_detail': {'base': '', 'type': 'text/plain', 'value': u'', 'language': None}, 
 'href': u'http://www.lastampa.it/redazione/cmssezioni/politica/201005images/berlusconi10g.jpg', 
 'link': u'', 
 'title': u'', 
 'id': u'http://www.lastampa.it/redazione/cmssezioni/politica/201005images/berlusconi10g.jpg', 
 'enclosures': [{ 'href': u'http://www.lastampa.it/redazione/cmssezioni/politica/201005images/berlusconi10g.jpg', 
                  'type': u'image/jpeg'}
              ]
}
In this output I can't find the title.

Last edited by gambarini; 05-30-2010 at 11:20 AM.
Old 05-28-2010, 03:19 PM   #2000
square4761
Junior Member
square4761 began at the beginning.
 
Posts: 7
Karma: 10
Join Date: Apr 2010
Device: sony
Thank you very much, worked like a charm
Old 05-30-2010, 12:32 PM   #2001
CeNoBiTa
Junior Member
CeNoBiTa began at the beginning.
 
Posts: 3
Karma: 10
Join Date: May 2010
Device: Amazon Kindle
I would like to request a recipe for a Catalan newspaper called Avui.

The RSS feed is this:

http://www.avui.cat/cat/rss/totes_le...ui_cat_009.xml

Thanks a lot in advance.

See ya!
Old 05-30-2010, 01:35 PM   #2002
TonytheBookworm
Addict
TonytheBookworm is on a distinguished road
 
Posts: 264
Karma: 62
Join Date: May 2010
Device: kindle 2, kindle 3, Kindle fire
Fark.com Recipe request

Does anyone happen to have a good working recipe for fark.com? I love reading the bizarre stories they post on there. Thanks.
Old 05-31-2010, 01:34 AM   #2003
23n
Junior Member
23n began at the beginning.
 
Posts: 3
Karma: 10
Join Date: May 2010
Location: Calgary, AB, Canada
Device: iPad
Two Recipe Requests

Hello,

I started to attempt this myself, but I know from experience that I suck at python scripting (seems to be some sort of mental block).

Anyway, I am moving from a Palm TX to an iPad and need to move two site scrapers to calibre. Both are html page scrapes (not rss). Here they are:

http://www.macintouch.com/
I have been taking the main page and including the links to the reader reports.

http://www.theregister.co.uk/week.html
I would like to have this indexed by the dates on the page, so the table of contents would have the dates with the articles as sub-titles (much like the way the one for the Calgary Herald works). It would also be great if it included the links that go to reghardware.com. This is definitely beyond my script-fu.

These both would be greatly appreciated and save me (literally) days of futzing around trying to learn python.

Thank you in advance!
Old 05-31-2010, 12:40 PM   #2004
Starson17
Wizard
Starson17 can program the VCR without an owner's manual.
 
Posts: 4,004
Karma: 177841
Join Date: Dec 2009
Device: WinMo: IPAQ; Android: HTC HD2, Archos 7o; Java:Gravity T
Quote:
Originally Posted by kidtwisted View Post
Hello everyone.

I need some help with a recipe for this feed:
http://www.pcper.com/rss/articles.rss

Most of the articles span several pages. I've cleaned the output up a bit, but I'm not sure how to scrape the complete article from the "Click here for the Detailed Review" links. Thanks!
You need to use multipage code. Here's an example from the adventuregamers.recipe builtin:

Code:
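    def append_page(self, soup, appendtag, position):
        # Look for the "next page" pager on the current page.
        pager = soup.find('div', attrs={'class': 'toolbar_fat_next'})
        if pager:
            # self.INDEX is assumed to be the site's base URL, defined elsewhere in the recipe.
            nexturl = self.INDEX + pager.a['href']
            soup2 = self.index_to_soup(nexturl)
            texttag = soup2.find('div', attrs={'class': 'bodytext'})
            for it in texttag.findAll(style=True):
                del it['style']
            # Recurse first, so any later pages are appended to the end of this page's text.
            newpos = len(texttag.contents)
            self.append_page(soup2, texttag, newpos)
            # Move the combined text out of the fetched page and into the article.
            texttag.extract()
            appendtag.insert(position, texttag)

    def preprocess_html(self, soup):
        mtag = '<meta http-equiv="Content-Language" content="en-US"/>\n<meta http-equiv="Content-Type" content="text/html; charset=utf-8"/>'
        soup.head.insert(0, mtag)
        # Strip inline styles from the first page.
        for item in soup.findAll(style=True):
            del item['style']
        # Pull in the following pages at position 3 of the body.
        self.append_page(soup, soup.body, 3)
        # Remove the pager itself from the merged article.
        pager = soup.find('div', attrs={'class': 'toolbar_fat'})
        if pager:
            pager.extract()
        return soup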
append_page recursively looks for the next-page tag ('div',attrs={'class':'toolbar_fat_next'}), fetches that page's text, and inserts it into the soup at the given position, repeating until all pages have been inserted.

preprocess_html uses append_page to modify the html. You'll need to look for the next page tag on your site and adjust accordingly. This should get you started.
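The recursion can be seen in isolation with a minimal, calibre-free sketch. Everything in it is hypothetical: the `PAGES` dict stands in for the site (in a recipe, `self.index_to_soup(url)` would fetch each page), and the `body` and `next` class names are placeholders for whatever markup pcper.com actually uses.

```python
import re

# Hypothetical three-page article; the dict stands in for HTTP fetches.
PAGES = {
    '/review?page=1': '<div class="body">Intro.</div>'
                      '<a class="next" href="/review?page=2">Next</a>',
    '/review?page=2': '<div class="body">Benchmarks.</div>'
                      '<a class="next" href="/review?page=3">Next</a>',
    '/review?page=3': '<div class="body">Conclusion.</div>',
}

BODY = re.compile(r'<div class="body">(.*?)</div>', re.S)
NEXT = re.compile(r'<a class="next" href="([^"]+)"')

def collect_pages(url, seen=None):
    """Return the body text of url, followed by the bodies of all later pages."""
    seen = set() if seen is None else seen
    if url in seen or url not in PAGES:
        return []  # guard against pager loops and dead links
    seen.add(url)
    html = PAGES[url]
    body = BODY.search(html)
    parts = [body.group(1)] if body else []
    nxt = NEXT.search(html)
    if nxt:
        # Same idea as append_page: recurse until no next-page link is left.
        parts.extend(collect_pages(nxt.group(1), seen))
    return parts

print(' '.join(collect_pages('/review?page=1')))
```

The `seen` set is an addition that the builtin example does not have; it only keeps the sketch from recursing forever if a pager links back to an earlier page.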

Do your testing with -vv and --test, as in:
ebook-convert pcper.recipe pcper --test -vv > pcper.txt
Old 05-31-2010, 12:43 PM   #2005
Starson17
Wizard
Starson17 can program the VCR without an owner's manual.
 
Posts: 4,004
Karma: 177841
Join Date: Dec 2009
Device: WinMo: IPAQ; Android: HTC HD2, Archos 7o; Java:Gravity T
Quote:
Originally Posted by gambarini View Post
with this feed
http://www3.lastampa.it/fotografia/feedrss.xml/
i have this output:...

with this feed: ..

in this output i can't find the title.
I suspect there might be some questions here that I can help with.... but perhaps not

More info about whether there's a question and what it is might help me decide.
Old 05-31-2010, 12:50 PM   #2006
Starson17
Wizard
Starson17 can program the VCR without an owner's manual.
 
Posts: 4,004
Karma: 177841
Join Date: Dec 2009
Device: WinMo: IPAQ; Android: HTC HD2, Archos 7o; Java:Gravity T
Quote:
Originally Posted by 23n View Post
http://www.macintouch.com/
I have been taking the main page and including the links to the reader reports.
If you want to try it yourself, this needs parse_index. Look here.
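For orientation, here is a minimal sketch of the shape parse_index is expected to return: a list of (section title, articles) tuples, where each article is a dict with at least a title and a url. The sample HTML, the `/readerreports/` path filter, the `LinkCollector` class and the `build_index` helper are all hypothetical; in a real recipe the page would come from `self.index_to_soup(...)` and the selection would have to match the site's actual markup.

```python
from html.parser import HTMLParser

# Hypothetical front page; only links under /readerreports/ count as articles.
SAMPLE = '''
<a href="/about.html">About</a>
<a href="/readerreports/snowleopard/index.html">Snow Leopard</a>
<a href="/readerreports/ipad/index.html">iPad</a>
'''

class LinkCollector(HTMLParser):
    """Collect (href, link text) pairs for every <a> tag that has an href."""
    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == 'a':
            self._href = dict(attrs).get('href')
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == 'a' and self._href is not None:
            self.links.append((self._href, ''.join(self._text).strip()))
            self._href = None

def build_index(html, base='http://www.macintouch.com'):
    parser = LinkCollector()
    parser.feed(html)
    articles = [
        {'title': text, 'url': base + href, 'description': '', 'date': ''}
        for href, text in parser.links
        if href.startswith('/readerreports/')
    ]
    # Shape returned by parse_index: [(section title, [article dict, ...]), ...]
    return [('Reader Reports', articles)]

index = build_index(SAMPLE)
print(index)
```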
Old 05-31-2010, 01:00 PM   #2007
Starson17
Wizard
Starson17 can program the VCR without an owner's manual.
 
Posts: 4,004
Karma: 177841
Join Date: Dec 2009
Device: WinMo: IPAQ; Android: HTC HD2, Archos 7o; Java:Gravity T
Quote:
Originally Posted by 23n View Post
http://www.theregister.co.uk/week.html
I would like to have this indexed by the dates on the page, so the table of contents would have the dates with the articles as sub-titles (much like the way the one for the Calgary Herald works). It would also be great if it included the links that go to reghardware.com. This is definitely beyond my script-fu.
This web page reproduces the RSS feed (at least for the first 3 feeds I checked). Calibre has a builtin recipe for The Register RSS feed. Why don't you look at that one first to see if it meets your needs?
Old 05-31-2010, 01:23 PM   #2008
Starson17
Wizard
Starson17 can program the VCR without an owner's manual.
 
Posts: 4,004
Karma: 177841
Join Date: Dec 2009
Device: WinMo: IPAQ; Android: HTC HD2, Archos 7o; Java:Gravity T
Quote:
Originally Posted by TonytheBookworm View Post
Does anyone happen to have a good working recipe for fark.com? I love reading the bizarre stories they post on there. Thanks.
Fark has an RSS feed, and I looked at it. It seems to have a one-sentence description of an article on another site and a slew of comments. Do you just want the one sentence from Fark with the link, or do you want the comments? The content of the linked articles is probably too variable to easily add, as it comes from dozens of different sources, each with a different page structure. You'd get lots of junk with each one.
Old 05-31-2010, 03:43 PM   #2009
Newby
Enthusiast
Newby began at the beginning.
 
Posts: 33
Karma: 10
Join Date: May 2010
Device: Bookeen Cybook Gen3 Gold
Hello!

May I also ask for a recipe?

http://www.sarajevo-x.com/rssfeeds

A Bosnian news portal

Thanks!
Old 05-31-2010, 04:37 PM   #2010
gambarini
Connoisseur
gambarini began at the beginning.
 
Posts: 98
Karma: 22
Join Date: Mar 2010
Device: IRiver Story, Ipod Touch, Android SmartPhone
Quote:
Originally Posted by Starson17 View Post
I suspect there might be some questions here that I can help with.... but perhaps not

More info about whether there's a question and what it is might help me decide.

thanks in advance