MobileRead Forums > E-Book Software > Calibre > Recipes
Old 06-02-2010, 07:59 AM   #2026
Starson17
Wizard
Starson17 can program the VCR without an owner's manual.
 
Posts: 4,004
Karma: 177841
Join Date: Dec 2009
Device: WinMo: IPAQ; Android: HTC HD2, Archos 7o; Java:Gravity T
Quote:
Originally Posted by kidtwisted View Post
Just a side thought to my previous post: both of those sites use Article index drop-down boxes
This means that links to all the pages you need are on the first page. You may have the option to grab them all there, or you can probably also build them recursively as the example code does. (I assume page 2 still has a link to page 3, etc., so recursion will still work.)

Now, you want to know how to do it - right? If I get some time, I'll think about it. I did something similar with some Olympics recipes where I used regex matching to find URLs embedded inside a script.

I'd probably start the way I always do, and use preprocess_html and print the soup - then make sure that you are capturing the form and the multiple page links. Get the page links into a list. Then see if you can rewrite append_page to cycle through that list and build the new page, except you don't need to do it recursively as you've got all the links already in the list you're processing. (That's just off the top of my head.)
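The link-collection step described above can be sketched in plain Python. This is only an illustration: the nav-box markup, the URLs, and the `page_links` helper are all made up here; in a real recipe you'd pull the links out of the soup inside preprocess_html (or with regex matching, as in the Olympics recipes mentioned).

```python
import re

# Hypothetical first-page HTML: an article-index drop-down whose
# <option> values hold the per-page URLs (markup and paths are invented).
first_page = '''
<select class="article-index">
  <option value="/reviews/3321/case_review/index.html">Intro</option>
  <option value="/reviews/3321/case_review/index2.html">Page 2</option>
  <option value="/reviews/3321/case_review/index3.html">Page 3</option>
</select>
'''

BASE = 'http://www.tweaktown.com'  # assumed site root

def page_links(html):
    # Grab every option value from the nav box. The first entry is the
    # page we already have, so only the remaining ones need fetching.
    return [BASE + m for m in re.findall(r'<option value="([^"]+)"', html)][1:]

links = page_links(first_page)
# In a recipe you'd then loop over this list - index_to_soup each URL,
# extract the article div, and append it to the first page. No recursion
# is needed, because the whole list came from page one.
```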
Starson17 is offline  
Old 06-02-2010, 11:00 AM   #2027
Starson17
Wizard
Starson17 can program the VCR without an owner's manual.
 
Posts: 4,004
Karma: 177841
Join Date: Dec 2009
Device: WinMo: IPAQ; Android: HTC HD2, Archos 7o; Java:Gravity T
Quote:
Originally Posted by gambarini View Post
Now the approach is completely clear:
first I must process the feed and try to find the title, description, date, and URL, and then use these values to override calibre's automatic values.
It is not so simple (for me) to understand the correct way to do that and the correct sequence for each step of the process.
Yes, that's the basic idea I had. You might look at parse_feeds to help you build your article/feed list instead of building it entirely by hand from the soup of the RSS page. See here.
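For what it's worth, the override pattern looks roughly like this. In calibre you'd subclass BasicNewsRecipe, call the base class's parse_feeds() to get the automatically-built feed list, then walk each feed's articles and overwrite fields. The `Article` dataclass and the cleanup applied below are stand-ins so the loop is concrete and runnable outside calibre:

```python
from dataclasses import dataclass

@dataclass
class Article:
    # stand-in for the article objects calibre builds from the RSS
    title: str
    url: str

def override_automatic_values(articles):
    # In a recipe: feeds = BasicNewsRecipe.parse_feeds(self), then loop
    # over each feed's .articles and overwrite the fields you parsed
    # yourself from the RSS page's soup.
    for a in articles:
        a.title = a.title.strip()  # illustrative cleanup only
    return articles

arts = override_automatic_values([Article('  Some headline  ', 'http://example.com/1')])
```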
Starson17 is offline  
Old 06-02-2010, 02:09 PM   #2028
kidtwisted
Member
kidtwisted began at the beginning.
 
 
Posts: 16
Karma: 10
Join Date: May 2010
Location: Southern California
Device: JetBook-Lite
Quote:
Originally Posted by Starson17 View Post
This means that links to all the pages you need are on the first page. You may have the option to grab them all there, or you can probably also build them recursively as the example code does. (I assume page 2 still has a links to page 3, etc. so recursive will still work).
Yes, tweaktown.com has all the links to the article on the 1st page, within the article nav box. But pcper.com puts their nav box on the 2nd page of their articles, so the recipe would need to check for a 2nd page first and, if there is one, scrape the box for all the links.
kidtwisted is offline  
Old 06-02-2010, 04:57 PM   #2029
Starson17
Wizard
Starson17 can program the VCR without an owner's manual.
 
Posts: 4,004
Karma: 177841
Join Date: Dec 2009
Device: WinMo: IPAQ; Android: HTC HD2, Archos 7o; Java:Gravity T
Quote:
Originally Posted by kidtwisted View Post
Could you or someone in the know take a look at it to see what I'm doing wrong.
I started to take a look at this, but I didn't get a feed at this location:
http://feeds.feedburner.com/Tweaktow...s20?format=xml

Perhaps it's my security settings?

Is this the right feed?
Starson17 is offline  
Old 06-02-2010, 06:04 PM   #2030
kidtwisted
Member
kidtwisted began at the beginning.
 
 
Posts: 16
Karma: 10
Join Date: May 2010
Location: Southern California
Device: JetBook-Lite
It's a good feed...
http://feeds.feedburner.com/Tweaktow...s20?format=xml
so is the non-xml.
http://feeds.feedburner.com/Tweaktow...AndGuidesRss20
kidtwisted is offline  
Old 06-02-2010, 09:29 PM   #2031
square4761
Junior Member
square4761 began at the beginning.
 
Posts: 7
Karma: 10
Join Date: Apr 2010
Device: sony
Townhall recipe

http://townhall.com/
I copied dwanthny's custom recipe for the American Thinker and replaced the sections with references for Townhall instead. It downloads the titles of articles but not the body of the article. No username/password is needed to access the webpages. Any help would be greatly appreciated.

recipe:

Code:
__license__ = 'GPL v3'
__copyright__ = '2010, Firstname Lastname <emailaddress at domain.com>'
'''
http://townhall.com
'''
from calibre.web.feeds.news import BasicNewsRecipe

class Townhall(BasicNewsRecipe):
    title = u'Townhall'
    description = "Townhall is a daily internet publication devoted to the thoughtful exploration of issues of importance to Americans."
    __author__ = 'Walt Anthony'
    publisher = 'Thomas Lifson'
    category = 'news, politics, USA'
    oldest_article = 4  # days
    max_articles_per_feed = 50
    summary_length = 150
    language = 'en'

    remove_javascript = True
    no_stylesheets = True

    conversion_options = {
                          'comment'   : description
                        , 'tags'      : category
                        , 'publisher' : publisher
                        , 'language'  : language
                        , 'linearize_tables' : True
                        }

    remove_tags = [
        dict(name=['table', 'iframe', 'embed', 'object'])
    ]

    remove_tags_after = dict(name='div', attrs={'class':'article_body'})

    feeds = [(u'http://rss.townhall.com/blogs/main'),
             (u'http://rss.townhall.com/columnists/all')
             ]

    def print_version(self, url):
        return url + '?page=full'

Last edited by zelda_pinwheel; 06-03-2010 at 08:57 AM. Reason: to remove personal information at request of member
square4761 is offline  
Old 06-03-2010, 05:34 AM   #2032
DoctorOhh
US Navy, Retired
DoctorOhh ought to be getting tired of karma fortunes by now.
 
 
Posts: 8,781
Karma: 12516053
Join Date: Feb 2009
Location: North Carolina
Device: Nexus 7
Quote:
Originally Posted by square4761 View Post

    remove_tags = [
        dict(name=['table', 'iframe', 'embed', 'object'])
    ]

    remove_tags_after = dict(name='div', attrs={'class':'article_body'})

    feeds = [(u'http://rss.townhall.com/blogs/main'),
             (u'http://rss.townhall.com/columnists/all')
             ]

    def print_version(self, url):
        return url + '?page=full'
First, it is bad etiquette, not to mention just plain wrong, to publish someone else's name and email to the web. Please take a minute to edit the above post and remove same.

Second, I looked in my working area and found I had a recipe just about complete for the columnists, but the blogs eluded me because they use JavaScript to print the blog entries. If you replace the above with the code below, you will be in the ballpark for the columnists feed.

I lost interest in it, so when you manage to get it working, take credit and submit it for others to use. I attached the favicon for the site, which you can add to the zip file when you upload it here.

Good Luck.

Code:
    keep_only_tags = [
      dict(name='div', attrs={'class':'authorblock'}),
      dict(name='div', attrs={'id':'columnBody'})
    ]

    remove_tags_after   = dict(name='div', attrs={'id':'columnBody'})

    remove_tags  = [
       dict(name=['iframe', 'img', 'embed', 'object','center','script','form']),
       dict(name='div', attrs={'id':['ShareText', 'Externa', 'Toolbox', 'ctl00_cphMain_cbComments_dlComments_ctl01_ctl00_Content', 'ArticleContainer', 'shirttail', 'comments_container', 'ctl00_cphMain_cbComments_dvReadAll', 'footer']})

    ]


    feeds = [(u'TownHall Columnists', u'http://rss.townhall.com/columnists/all')]
    
    

    def print_version(self, url):
        return url + '&page=full'
Attached Images
 
DoctorOhh is offline  
Old 06-03-2010, 10:22 AM   #2033
Starson17
Wizard
Starson17 can program the VCR without an owner's manual.
 
Posts: 4,004
Karma: 177841
Join Date: Dec 2009
Device: WinMo: IPAQ; Android: HTC HD2, Archos 7o; Java:Gravity T
Quote:
Originally Posted by kidtwisted View Post
It's a good feed...
It was my security settings.
Starson17 is offline  
Old 06-03-2010, 11:04 AM   #2034
Starson17
Wizard
Starson17 can program the VCR without an owner's manual.
 
Posts: 4,004
Karma: 177841
Join Date: Dec 2009
Device: WinMo: IPAQ; Android: HTC HD2, Archos 7o; Java:Gravity T
Quote:
Originally Posted by kidtwisted View Post
Can you point me to a feed article that is multipage on this site? I've wandered around, but haven't seen one. Are you trying to get the photos under "See full gallery?"
Starson17 is offline  
Old 06-03-2010, 01:19 PM   #2035
kidtwisted
Member
kidtwisted began at the beginning.
 
 
Posts: 16
Karma: 10
Join Date: May 2010
Location: Southern California
Device: JetBook-Lite
Quote:
Originally Posted by Starson17 View Post
Can you point me to a feed article that is multipage on this site? I've wandered around, but haven't seen one. Are you trying to get the photos under "See full gallery?"
Basically, both these sites do a good job on PC hardware reviews; my goal is to scrape the articles/reviews weekly as an EPUB. Most of them are multi-page, though sometimes just one page. The photos you see in the gallery are from the article, so the gallery is not needed; they are already in the article.

here is a multi-page article from the feed:
an 8 page PC case review
http://www.tweaktown.com/reviews/332...ent=FeedBurner

Look at the layout: an arrow button for the next page (1st target),
or the navigation box that contains all the links for the 8 pages.
I think scraping the nav box would be better, because that would also work for pcper.com.

thanks
kidtwisted is offline  
Old 06-03-2010, 02:54 PM   #2036
Starson17
Wizard
Starson17 can program the VCR without an owner's manual.
 
Posts: 4,004
Karma: 177841
Join Date: Dec 2009
Device: WinMo: IPAQ; Android: HTC HD2, Archos 7o; Java:Gravity T
Quote:
Originally Posted by kidtwisted View Post
Hey Starson17,
I'm kinda stuck: when I add the append_page code, the test HTML only contains the feed description and date; without it I get the 1st page, so I'm screwing it up somewhere.
You're right - you screwed it up somewhere.
Don't worry, you're in good company.
Quote:
Spoiler:

here's what I have for tweaktown.com:
Code:
class AdvancedUserRecipe1273795663(BasicNewsRecipe):
    title = u'TweakTown Latest Tech'
    description = 'TweakTown Latest Tech'
    __author__ = 'KidTwisted'
    publisher             = 'TweakTown'
    category              = 'PC Articles, Reviews and Guides'
    use_embedded_content   = False
    max_articles_per_feed = 1
    oldest_article = 7
    timefmt  = ' [%Y %b %d ]'
    no_stylesheets = True
    language = 'en'
    #recursion             = 10
    remove_javascript = True
    conversion_options = { 'linearize_tables' : True}
   # reverse_article_order = True
    #INDEX                 = u'http://www.tweaktown.com'

    html2lrf_options = [
                          '--comment', description
                        , '--category', category
                        , '--publisher', publisher
                        ]
    
    html2epub_options = 'publisher="' + publisher + '"\ncomments="' + description + '"\ntags="' + category + '"' 
	
    keep_only_tags = [dict(name='div', attrs={'id':['article']})]

    feeds =  [ (u'Articles Reviews', u'http://feeds.feedburner.com/TweaktownArticlesReviewsAndGuidesRss20?format=xml') ]

    def get_article_url(self, article):
        return article.get('guid',  None)
    
    def append_page(self, soup, appendtag, position):
        pager = soup.find('a',attrs={'class':'next'})
        if pager:
           nexturl = pager.a['href']
           soup2 = self.index_to_soup(nexturl)
           texttag = soup2.find('div', attrs={'id':'article'})
           for it in texttag.findAll(style=True):
               del it['style']
           newpos = len(texttag.contents)          
           self.append_page(soup2,texttag,newpos)
           texttag.extract()
           appendtag.insert(position,texttag)
        
    
    def preprocess_html(self, soup):
        mtag = '<meta http-equiv="Content-Language" content="en-US"/>\n<meta http-equiv="Content-Type" content="text/html; charset=utf-8"/>'
        soup.head.insert(0,mtag)    
        for item in soup.findAll(style=True):
            del item['style']
        self.append_page(soup, soup.body, 3)
        pager = soup.find('a',attrs={'class':'next'})
        if pager:
           pager.extract()        
        return soup

Could you or someone in the know take a look at it to see what I'm doing wrong? I commented out "INDEX" because the link for the next page is a complete link; any help on this would be great.
The error is subtle. You did a good job of converting the sample code, but look at these lines from your code:
Code:
        pager = soup.find('a',attrs={'class':'next'})
        if pager:
           nexturl = pager.a['href']
Compare to the sample code:
Code:
        pager = soup.find('div',attrs={'class':'toolbar_fat_next'})
        if pager:
           nexturl = self.INDEX + pager.a['href']
In the sample, the next page link was inside an <a> tag which was, in turn, inside a <div> tag. The sample code searched for the <div> tag, then grabbed the <a> tag's "href" inside it. In your case, the <a> is marked with the class='next' so you didn't search for its parent, you searched directly for the <a> tag. That's fine, but then you copied the code that looked for an <a> tag inside the tag you found, and there wasn't one.

You need to change nexturl = pager.a['href'] to:
Code:
nexturl = pager['href']
Hold on .... let me test it .....

Yep - That does it. There's still lots of junk in my output, but it's definitely pulling multipages. My recipe may be slightly different from yours, but I think that should get you on your way.

Last edited by Starson17; 06-03-2010 at 03:51 PM.
Starson17 is offline  
Old 06-03-2010, 04:40 PM   #2037
kidtwisted
Member
kidtwisted began at the beginning.
 
 
Posts: 16
Karma: 10
Join Date: May 2010
Location: Southern California
Device: JetBook-Lite
Quote:
Originally Posted by Starson17 View Post

You need to change nexturl = pager.a['href'] to:
Code:
nexturl = pager['href']
Ah OK! - lol noob mistake.

A question about preprocess_html part.
What does the "3" represent in this line?
Code:
self.append_page(soup, soup.body, 3)
Thanks for your help; it just needs a little more cleanup.
I need to apply this to the pcper.com site now; it's a little trickier, so it might need a different approach.

Thanks again.
kidtwisted is offline  
Old 06-03-2010, 04:54 PM   #2038
Starson17
Wizard
Starson17 can program the VCR without an owner's manual.
 
Posts: 4,004
Karma: 177841
Join Date: Dec 2009
Device: WinMo: IPAQ; Android: HTC HD2, Archos 7o; Java:Gravity T
Quote:
Originally Posted by kidtwisted View Post
A question about preprocess_html part.
What does the "3" represent in this line?
Code:
self.append_page(soup, soup.body, 3)
It's the "position" in the insert here: appendtag.insert(position,texttag)

It says to insert texttag at position 3 of the tag's contents. You can reference locations in Soup by labels (most common) or by tag position number (as above).
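Tag.insert behaves like Python's list.insert on the tag's contents, so the effect of position=3 can be seen with a plain list (the "tags" below are just illustrative strings):

```python
# A body tag's contents, modeled as a plain list of child "tags"
contents = ['<meta/>', '<h1>', '<p class="intro">', '<p class="body">']

# insert(3, x) slots x in before the element currently at index 3,
# which is what appendtag.insert(position, texttag) does in the recipe
contents.insert(3, '<div id="appended">')
```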

Quote:
Thanks for your help; it just needs a little more cleanup.
I need to apply this to the pcper.com site now; it's a little trickier, so it might need a different approach.

Thanks again.
You're welcome and good luck. I prefer to help others figure out how to do it than to just write it. If you need help with pcper, let us know, and be sure to post your final results here so Kovid can add it to the code for use by others.
Starson17 is offline  
Old 06-03-2010, 10:23 PM   #2039
Semonski
Junior Member
Semonski began at the beginning.
 
Posts: 5
Karma: 10
Join Date: Mar 2010
Device: Kindle DX
Washington Times....

Once again I bow to the gurus! I could use some help on the Washington Times recipe. I cobbled together the one below and it worked for quite some time, but now the Washington Times has changed the format of their pages. Any assistance would be greatly appreciated.


Code:
__license__ = 'GPL v3'

'''
washingtontimes.com
'''

from calibre.web.feeds.news import BasicNewsRecipe

class WashingtonTimes(BasicNewsRecipe):

    title = 'Washington Times'
    __author__ = 'Kos Semonski'
    description = 'Daily newspaper'
    publisher = 'News World Communications, Inc.'
    category = 'news, politics, USA'
    oldest_article = 2
    max_articles_per_feed = 15
    no_stylesheets = True
    encoding = 'utf8'
    use_embedded_content = False
    language = 'en'
    masthead_url = 'http://media.washingtontimes.com/media/img/TWTlogo.gif'
    extra_css = ' body{font-family: Arial,Helvetica,sans-serif } img{margin-bottom: 0.4em} '

    conversion_options = {
                          'comment'   : description
                        , 'tags'      : category
                        , 'publisher' : publisher
                        , 'language'  : language
                        }

    def get_feeds(self):
        return [(u'Headlines', u'http://www.washingtontimes.com/rss/headlines/news/headlines/'),
                (u'Editor Favs', u'http://www.washingtontimes.com/rss/headlines/news/editor-favorites/'),
                (u'Politics', u'http://www.washingtontimes.com/rss/headlines/news/politics/'),
                (u'National', u'http://www.washingtontimes.com/rss/headlines/news/national/'),
                (u'World', u'http://www.washingtontimes.com/rss/headlines/news/world/'),
                (u'Business', u'http://www.washingtontimes.com/rss/headlines/news/business/'),
                (u'Technology', u'http://www.washingtontimes.com/rss/headlines/news/technology/'),
                (u'Editorials', u'http://www.washingtontimes.com/rss/headlines/opinion/editorials/')
                ]

    def print_version(self, url):
        return url + '/print/'
Semonski is offline  
Old 06-03-2010, 11:05 PM   #2040
RLynker
Junior Member
RLynker began at the beginning.
 
Posts: 5
Karma: 10
Join Date: Apr 2010
Device: Kindle2 and Astak EZ Reader Pocket Pro
Request for recipe help

My original post seems to have gotten caught in the fray so I will repost this. I apologize if I missed any responses. Thanks!


Quote:
Originally Posted by RLynker View Post
Hello,

I apologize if I'm asking for something that has already been done, but I can't seem to find it no matter how I do a search through these 125 pages of postings. Nor is it in the list of included recipes in the latest version of Calibre.

I am trying to get a recipe for Maximum PC magazine's RSS feed. Their page is SO simple in the layout, but I've been unsuccessful trying to make a clean recipe for it. I just want the words on the page as they appear when you go to the below link. There's no need to go multiple levels into the hyperlinks. Does anyone have a recipe for this? The webpage for the full RSS (which could be applied to their individual ones as well since the format is identical) is:

http://www.maximumpc.com/articles/all/feed

Thank you very much!

RLynker
RLynker is offline  