Old 06-02-2010, 07:59 AM   #2026
Starson17
Quote:
Originally Posted by kidtwisted
Just a side thought to my previous post: both of those sites use article-index drop-down boxes.
This means that links to all the pages you need are on the first page. You may have the option to grab them all there, or you can probably also build them recursively as the example code does. (I assume page 2 still has a link to page 3, etc., so recursive will still work.)

Now, you want to know how to do it - right? If I get some time, I'll think about it. I did something similar with some Olympics recipes where I used regex matching to find URLs embedded inside a script.
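The gist of that regex trick, roughly - the pattern here is illustrative, not lifted from those recipes:
Code:
    import re

    def find_script_urls(self, soup):
        # Scan every inline <script> for absolute URLs.
        urls = []
        for script in soup.findAll('script'):
            urls.extend(re.findall(r'http://[^\'"\s]+', str(script)))
        return urls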

I'd probably start the way I always do, and use preprocess_html and print the soup - then make sure that you are capturing the form and the multiple page links. Get the page links into a list. Then see if you can rewrite append_page to cycle through that list and build the new page, except you don't need to do it recursively, since you've already got all the links in the list you're processing. (That's just off the top of my head.)
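Something like this, very roughly - the select/option markup is a guess on my part, so print the soup first and confirm the real tag and class names:
Code:
    def append_page(self, soup, appendtag):
        # Collect every page URL from the article-index drop-down on page 1.
        # 'select' and its class are placeholders until verified.
        index = soup.find('select', attrs={'class':'article-index'})
        if index is None:
            return
        urls = [opt['value'] for opt in index.findAll('option')[1:]]
        for nexturl in urls:
            soup2 = self.index_to_soup(nexturl)
            texttag = soup2.find('div', attrs={'id':'article'})
            if texttag is None:
                continue
            # No recursion needed - the list already holds every page.
            appendtag.insert(len(appendtag.contents), texttag)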
Old 06-02-2010, 11:00 AM   #2027
Starson17
Quote:
Originally Posted by gambarini
Now the approach is completely clear: first I must process the feed and try to find the title, description, date and URL, then use these values to override the "calibre" automatic values.
It is not so simple (for me) to understand the correct way to do that and the correct sequence for every step of the process.
Yes, that's the basic idea I had. You might look at parse_feeds to help you build your article/feed list instead of building it entirely by hand from the soup of the RSS page. See here.
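A bare-bones sketch of that idea (the Feed/Article attribute names are from memory - verify them against calibre's source):
Code:
    def parse_feeds(self):
        # Let calibre parse the RSS normally, then fix up each article.
        feeds = BasicNewsRecipe.parse_feeds(self)
        for feed in feeds:
            for article in feed.articles:
                # Override the automatic values as needed; these two
                # fix-ups are only examples.
                article.title = article.title.strip()
                article.url = article.url.split('?')[0]
        return feeds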
Old 06-02-2010, 02:09 PM   #2028
kidtwisted
Quote:
Originally Posted by Starson17
This means that links to all the pages you need are on the first page. You may have the option to grab them all there, or you can probably also build them recursively as the example code does. (I assume page 2 still has a link to page 3, etc., so recursive will still work.)
Yes, tweaktown.com has all the links to the article's pages on the 1st page, within the article nav box. But pcper.com puts its nav box on the 2nd page of its articles, so the recipe would need to check for a 2nd page first and, if there is one, scrape the box for all the links.
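In pseudo-code terms I picture something like this - all the tag names are guesses on my part:
Code:
    def collect_page_urls(self, soup):
        # pcper.com: the article-index box only appears on page 2, so
        # look for a next-page link first. Selectors are placeholders.
        pager = soup.find('a', attrs={'class':'next'})
        if pager is None:
            return []                      # single-page article
        soup2 = self.index_to_soup(pager['href'])
        navbox = soup2.find('div', attrs={'class':'article-index'})
        if navbox is None:
            return [pager['href']]         # fall back to just page 2
        return [a['href'] for a in navbox.findAll('a', href=True)]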
Old 06-02-2010, 04:57 PM   #2029
Starson17
Quote:
Originally Posted by kidtwisted
Could you or someone in the know take a look at it to see what I'm doing wrong?
I started to take a look at this, but I didn't get a feed at this location:
http://feeds.feedburner.com/Tweaktow...s20?format=xml

Perhaps it's my security settings?

Is this the right feed?
Old 06-02-2010, 06:04 PM   #2030
kidtwisted
It's a good feed...
http://feeds.feedburner.com/Tweaktow...s20?format=xml
so is the non-xml.
http://feeds.feedburner.com/Tweaktow...AndGuidesRss20
Old 06-02-2010, 09:29 PM   #2031
square4761
Townhall recipe

http://townhall.com/
I copied dwanthny's custom recipe for the American Thinker and replaced its references with ones for Townhall. It downloads the titles of the articles but not the bodies. There is no username/password needed to access the webpages. Any help would be greatly appreciated.

recipe:

__license__ = 'GPL v3'
__copyright__ = '2010, Firstname Lastname <emailaddress at domain.com>'
'''
http://townhall.com
'''
from calibre.web.feeds.news import BasicNewsRecipe

class Townhall(BasicNewsRecipe):
    title = u'Townhall'
    description = "Townhall is a daily internet publication devoted to the thoughtful exploration of issues of importance to Americans."
    __author__ = 'Walt Anthony'
    publisher = 'Thomas Lifson'
    category = 'news, politics, USA'
    oldest_article = 4  # days
    max_articles_per_feed = 50
    summary_length = 150
    language = 'en'

    remove_javascript = True
    no_stylesheets = True

    conversion_options = {
                          'comment'   : description
                        , 'tags'      : category
                        , 'publisher' : publisher
                        , 'language'  : language
                        , 'linearize_tables' : True
                        }

    remove_tags = [
        dict(name=['table', 'iframe', 'embed', 'object'])
    ]

    remove_tags_after = dict(name='div', attrs={'class':'article_body'})

    feeds = [(u'http://rss.townhall.com/blogs/main'),
             (u'http://rss.townhall.com/columnists/all')
            ]

    def print_version(self, url):
        return url + '?page=full'

Last edited by zelda_pinwheel; 06-03-2010 at 08:57 AM. Reason: to remove personal information at request of member
Old 06-03-2010, 05:34 AM   #2032
DoctorOhh
Quote:
Originally Posted by square4761

    remove_tags = [
        dict(name=['table', 'iframe', 'embed', 'object'])
    ]

    remove_tags_after = dict(name='div', attrs={'class':'article_body'})

    feeds = [(u'http://rss.townhall.com/blogs/main'),
             (u'http://rss.townhall.com/columnists/all')
            ]

    def print_version(self, url):
        return url + '?page=full'
First, it is bad etiquette, not to mention just plain wrong, to publish someone else's name and email address on the web. Please take a minute to edit the above post and remove them.

Second, I looked in my working area and found a recipe that was just about complete for the columnists, but the blogs eluded me because they use JavaScript to print the blog entries. If you replace the above with the code below, you will be in the ballpark for the columnists feed.

I lost interest in it, so when you manage to get it working, take credit and submit it for others to use. I attached the favicon for the site; you can add it to the zip file when you upload the recipe here.

Good Luck.

Code:
    keep_only_tags = [
      dict(name='div', attrs={'class':'authorblock'}),
      dict(name='div', attrs={'id':'columnBody'})
    ]

    remove_tags_after   = dict(name='div', attrs={'id':'columnBody'})

    remove_tags  = [
       dict(name=['iframe', 'img', 'embed', 'object','center','script','form']),
       dict(name='div', attrs={'id':['ShareText', 'Externa', 'Toolbox', 'ctl00_cphMain_cbComments_dlComments_ctl01_ctl00_Content', 'ArticleContainer', 'shirttail', 'comments_container', 'ctl00_cphMain_cbComments_dvReadAll', 'footer']})

    ]


    feeds = [(u'TownHall Columnists', u'http://rss.townhall.com/columnists/all')]
    
    

    def print_version(self, url):
        return url + '&page=full'
Old 06-03-2010, 10:22 AM   #2033
Starson17
Quote:
Originally Posted by kidtwisted
It's a good feed...
It was my security settings.
Old 06-03-2010, 11:04 AM   #2034
Starson17
Can you point me to a feed article that is multipage on this site? I've wandered around but haven't seen one. Are you trying to get the photos under "See full gallery"?
Old 06-03-2010, 01:19 PM   #2035
kidtwisted
Quote:
Originally Posted by Starson17
Can you point me to a feed article that is multipage on this site? I've wandered around but haven't seen one. Are you trying to get the photos under "See full gallery"?
Basically, both of these sites do a good job on PC hardware reviews; my goal is to scrape their articles/reviews weekly as an EPUB. Most of them are multi-page, though sometimes just one page. The photos in the gallery come from the article itself, so the gallery is not needed.

Here is a multi-page article from the feed,
an 8-page PC case review:
http://www.tweaktown.com/reviews/332...ent=FeedBurner

Look at the layout - there's an arrow button for the next page (1st target) and a navigation box that contains the links for all 8 pages. I think scraping the nav box would be better, because that would also work for pcper.com.

thanks
Old 06-03-2010, 02:54 PM   #2036
Starson17
Quote:
Originally Posted by kidtwisted
Hey Starson17,
I'm kinda stuck: when I add the append_page code, the test HTML only contains the feed description and date; without it I get the 1st page, so I'm screwing it up somewhere.
You're right - you screwed it up somewhere. Don't worry, you're in good company.
Quote:
here's what I have for tweaktown.com:
Code:
class AdvancedUserRecipe1273795663(BasicNewsRecipe):
    title = u'TweakTown Latest Tech'
    description = 'TweakTown Latest Tech'
    __author__ = 'KidTwisted'
    publisher             = 'TweakTown'
    category              = 'PC Articles, Reviews and Guides'
    use_embedded_content   = False
    max_articles_per_feed = 1
    oldest_article = 7
    timefmt  = ' [%Y %b %d ]'
    no_stylesheets = True
    language = 'en'
    #recursion             = 10
    remove_javascript = True
    conversion_options = { 'linearize_tables' : True}
   # reverse_article_order = True
    #INDEX                 = u'http://www.tweaktown.com'

    html2lrf_options = [
                          '--comment', description
                        , '--category', category
                        , '--publisher', publisher
                        ]
    
    html2epub_options = 'publisher="' + publisher + '"\ncomments="' + description + '"\ntags="' + category + '"' 
	
    keep_only_tags = [dict(name='div', attrs={'id':['article']})]

    feeds =  [ (u'Articles Reviews', u'http://feeds.feedburner.com/TweaktownArticlesReviewsAndGuidesRss20?format=xml') ]

    def get_article_url(self, article):
        return article.get('guid',  None)
    
    def append_page(self, soup, appendtag, position):
        pager = soup.find('a',attrs={'class':'next'})
        if pager:
           nexturl = pager.a['href']
           soup2 = self.index_to_soup(nexturl)
           texttag = soup2.find('div', attrs={'id':'article'})
           for it in texttag.findAll(style=True):
               del it['style']
           newpos = len(texttag.contents)          
           self.append_page(soup2,texttag,newpos)
           texttag.extract()
           appendtag.insert(position,texttag)
        
    
    def preprocess_html(self, soup):
        mtag = '<meta http-equiv="Content-Language" content="en-US"/>\n<meta http-equiv="Content-Type" content="text/html; charset=utf-8"/>'
        soup.head.insert(0,mtag)    
        for item in soup.findAll(style=True):
            del item['style']
        self.append_page(soup, soup.body, 3)
        pager = soup.find('a',attrs={'class':'next'})
        if pager:
           pager.extract()        
        return soup

Could you or someone in the know take a look at it to see what I'm doing wrong? I commented out "INDEX" because the link for the next page is a complete link. Any help on this would be great.
The error is subtle. You did a good job of converting the sample code, but look at these lines from your code:
Code:
        pager = soup.find('a',attrs={'class':'next'})
        if pager:
           nexturl = pager.a['href']
Compare to the sample code:
Code:
        pager = soup.find('div',attrs={'class':'toolbar_fat_next'})
        if pager:
           nexturl = self.INDEX + pager.a['href']
In the sample, the next-page link was inside an <a> tag which was, in turn, inside a <div> tag. The sample code searched for the <div> tag, then grabbed the "href" of the <a> tag inside it. In your case, the <a> tag itself is marked with class='next', so you didn't search for its parent; you searched directly for the <a> tag. That's fine, but then you copied the code that looked for an <a> tag inside the tag you found, and there wasn't one.

You need to change nexturl = pager.a['href'] to:
Code:
nexturl = pager['href']
Hold on .... let me test it .....

Yep - that does it. There's still lots of junk in my output, but it's definitely pulling in multiple pages. My recipe may be slightly different from yours, but I think that should get you on your way.
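If the difference is still fuzzy, here's the same thing in isolation, using the BeautifulSoup bundled with calibre:
Code:
    from calibre.ebooks.BeautifulSoup import BeautifulSoup

    html = ('<div class="toolbar_fat_next"><a href="/page2">next</a></div>'
            '<a class="next" href="/page3">next</a>')
    soup = BeautifulSoup(html)

    div = soup.find('div', attrs={'class':'toolbar_fat_next'})
    print div.a['href']   # the <a> lives inside the div -> /page2

    a = soup.find('a', attrs={'class':'next'})
    print a['href']       # the found tag is itself the <a> -> /page3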

Last edited by Starson17; 06-03-2010 at 03:51 PM.
Old 06-03-2010, 04:40 PM   #2037
kidtwisted
Quote:
Originally Posted by Starson17

You need to change nexturl = pager.a['href'] to:
Code:
nexturl = pager['href']
Ah, OK! - lol, noob mistake.

A question about the preprocess_html part: what does the "3" represent in this line?
Code:
self.append_page(soup, soup.body, 3)
Thanks for your help; it just needs a little more cleanup.
I need to apply this to the pcper.com site now; it's a little trickier, so it might need a different approach.

Thanks again.
Old 06-03-2010, 04:54 PM   #2038
Starson17
Quote:
Originally Posted by kidtwisted
A question about the preprocess_html part: what does the "3" represent in this line?
Code:
self.append_page(soup, soup.body, 3)
It's "position" in the insert here: appendtag.insert(position,texttag)

It's saying to insert the text at the 3rd tag position. You can reference locations in Soup by labels (most common) or by tag position number (as above).
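Here's a toy example of numeric positions, using the BeautifulSoup bundled with calibre (BS3 API):
Code:
    from calibre.ebooks.BeautifulSoup import BeautifulSoup, Tag

    soup = BeautifulSoup('<body><h1>title</h1><p>one</p><p>two</p></body>')
    extra = Tag(soup, 'p')
    extra.insert(0, 'three')
    # Make it the body's 4th child - positions count from 0:
    soup.body.insert(3, extra)
    print soup.body   # h1, p, p, then the new p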

Quote:
Thanks for your help; it just needs a little more cleanup.
I need to apply this to the pcper.com site now; it's a little trickier, so it might need a different approach.

Thanks again.
You're welcome, and good luck. I prefer helping others figure out how to do it over just writing it myself. If you need help with pcper, let us know, and be sure to post your final results here so Kovid can add the recipe to calibre for others to use.
Old 06-03-2010, 10:23 PM   #2039
Semonski
Washington Times....

Once again I bow to the gurus! I could use some help on the Washington Times recipe. I cobbled the one below together and it worked for quite some time, but now the Washington Times has changed the format of their pages... any assistance would be greatly appreciated.


__license__ = 'GPL v3'

'''
washingtontimes.com
'''

from calibre.web.feeds.news import BasicNewsRecipe


class WashingtonTimes(BasicNewsRecipe):

    title = 'Washington Times'
    __author__ = 'Kos Semonski'
    description = 'Daily newspaper'
    publisher = 'News World Communications, Inc.'
    category = 'news, politics, USA'
    oldest_article = 2
    max_articles_per_feed = 15
    no_stylesheets = True
    encoding = 'utf8'
    use_embedded_content = False
    language = 'en'
    masthead_url = 'http://media.washingtontimes.com/media/img/TWTlogo.gif'
    extra_css = ' body{font-family: Arial,Helvetica,sans-serif } img{margin-bottom: 0.4em} '

    conversion_options = {
                          'comment'   : description
                        , 'tags'      : category
                        , 'publisher' : publisher
                        , 'language'  : language
                        }

    def get_feeds(self):
        return [(u'Headlines', u'http://www.washingtontimes.com/rss/headlines/news/headlines/'),
                (u'Editor Favs', u'http://www.washingtontimes.com/rss/headlines/news/editor-favorites/'),
                (u'Politics', u'http://www.washingtontimes.com/rss/headlines/news/politics/'),
                (u'National', u'http://www.washingtontimes.com/rss/headlines/news/national/'),
                (u'World', u'http://www.washingtontimes.com/rss/headlines/news/world/'),
                (u'Business', u'http://www.washingtontimes.com/rss/headlines/news/business/'),
                (u'Technology', u'http://www.washingtontimes.com/rss/headlines/news/technology/'),
                (u'Editorials', u'http://www.washingtontimes.com/rss/headlines/opinion/editorials/')
               ]

    def print_version(self, url):
        return url + '/print/'
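In case it helps anyone diagnose it, the first step I'd take (borrowing Starson17's advice from earlier in the thread) is to dump the soup of one article so the new markup can be inspected before picking new tags:
Code:
    def preprocess_html(self, soup):
        # Temporary debugging aid - remove once the new tags are chosen.
        print soup.prettify()
        return soup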
Old 06-03-2010, 11:05 PM   #2040
RLynker
Request for recipe help

My original post seems to have gotten caught in the fray so I will repost this. I apologize if I missed any responses. Thanks!


Quote:
Originally Posted by RLynker
Hello,

I apologize if I'm asking for something that has already been done, but I can't seem to find it no matter how I search through these 125 pages of postings. Nor is it in the list of recipes included with the latest version of Calibre.

I am trying to get a recipe for Maximum PC magazine's RSS feed. Their page layout is SO simple, but I've been unsuccessful in making a clean recipe for it. I just want the words on the page as they appear when you go to the link below. There's no need to go multiple levels into the hyperlinks. Does anyone have a recipe for this? The URL for the full RSS feed (which could be applied to their individual feeds as well, since the format is identical) is:

http://www.maximumpc.com/articles/all/feed

Thank you very much!

RLynker
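For anyone willing to take a stab at it, here's the bare skeleton I'd start from - the keep_only_tags line is only a placeholder until someone prints the soup and finds the real article tag:
Code:
    from calibre.web.feeds.news import BasicNewsRecipe

    class MaximumPC(BasicNewsRecipe):
        title                 = u'Maximum PC'
        oldest_article        = 7
        max_articles_per_feed = 25
        no_stylesheets        = True
        remove_javascript     = True
        use_embedded_content  = False
        language              = 'en'

        # Placeholder - find the real wrapper tag by printing the soup
        keep_only_tags = [dict(name='div', attrs={'class':'article'})]

        feeds = [(u'All Articles', u'http://www.maximumpc.com/articles/all/feed')]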