Old 09-13-2010, 09:03 PM   #2716
TonytheBookworm
Addict
Posts: 264
Karma: 62
Join Date: May 2010
Device: kindle 2, kindle 3, Kindle fire
Quote:
Originally Posted by Starson17:
I'm getting a pretty clean version. I also run Adblock, but that only affects FireFox, not Calibre.
Interesting, because look:

Picture one shows what it looks like when I let Calibre fetch the feed.

The second picture shows what it looks like if I build the EPUB with
ebook-convert test.recipe myrecipe.epub --test
Attached Thumbnails: junk cap in kindle for pc.JPG; clean when using test.JPG
Old 09-13-2010, 09:18 PM   #2717
bhandarisaurabh
Enthusiast
Posts: 49
Karma: 10
Join Date: Aug 2009
Device: none
There is already a recipe for Foreign Policy, but it only covers the RSS feeds. Can anyone make a recipe for the print edition?
http://www.foreignpolicy.com/issues/current
Thanks in advance.
Old 09-13-2010, 09:26 PM   #2718
bhandarisaurabh
Enthusiast
Posts: 49
Karma: 10
Join Date: Aug 2009
Device: none
Quote:
Originally Posted by TonytheBookworm:
Here you go. I only did 2010. Each year appears to have different formatting, but a year's worth of content should be enough for now.
The recipe just gave the recent articles for 13 Sep, not the entire print magazine.
Old 09-13-2010, 11:28 PM   #2719
TonytheBookworm
Addict
Posts: 264
Karma: 62
Join Date: May 2010
Device: kindle 2, kindle 3, Kindle fire
Quote:
Originally Posted by bhandarisaurabh:
The recipe just gave the recent articles for 13 Sep, not the entire print magazine.
I'll look into it. Sorry about that.
Old 09-14-2010, 12:36 AM   #2720
sdow1
Connoisseur
Posts: 55
Karma: 10
Join Date: Apr 2010
Location: new york city
Device: nook, ipad
Slate has no content

Not sure if this is the right thread, but since it's where most of the discussion of the Slate recipe seems to be, I thought I'd try here first.

For the past few days, every time I download the Slate feed in Calibre, I just get the cover and the table of contents (such as it is, since it just says "all articles"), with no content whatsoever. I thought I'd wait through the weekend in case Slate itself is relatively "dead" on weekends, but today it was the same thing. I've also tried downloading at different times of day, in case that was the problem, but I get the same result: cover, but no content.

Help!!
Old 09-14-2010, 01:13 AM   #2721
TonytheBookworm
Addict
Posts: 264
Karma: 62
Join Date: May 2010
Device: kindle 2, kindle 3, Kindle fire
Quote:
Originally Posted by bhandarisaurabh:
The recipe just gave the recent articles for 13 Sep, not the entire print magazine.
Alright, here is the thing: that site has TONS of articles. The code below should work now. What it does is go through the links from the top down. I have the max articles set to 50, so you will get at most 50 articles and then it will stop. If you want 3000, then put in 3000 and hope for the best.

There may well be a more effective way of doing this; I personally do not know it. Secondly, someone with more knowledge than I have might know how to group the articles by their actual dates. I tested the current code on my end and received 50 unique articles, the earliest being from 9-15-2010.

I have pretty much done all I know how to do on this recipe at this point, and consider it "working but hobbling along" if anyone else cares to take a stab at it. If you get it working 100 percent, please share so I can learn from it.

Spoiler:

Code:
from calibre.web.feeds.news import BasicNewsRecipe
from calibre.ebooks.BeautifulSoup import BeautifulSoup, re

class FIELDSTREAM(BasicNewsRecipe):
    title       = 'Down To Earth Archive'
    __author__  = 'Tonythebookworm'
    description = ''
    language    = 'en'
    publisher   = 'Tonythebookworm'
    category    = ''
    use_embedded_content = False
    no_stylesheets       = True
    oldest_article       = 365
    remove_javascript    = True
    remove_empty_feeds   = True
    masthead_url         = 'http://downtoearth.org.in/themes/DTE/images/DownToEarth-Logo.gif'

    max_articles_per_feed = 50  # only gets the first 50 articles
    INDEX = 'http://downtoearth.org.in'

    # I HAVE LEFT THE PRINT STATEMENTS IN HERE FOR DEBUGGING PURPOSES.
    # Feel free to remove them.
    # This only parses the 2010 archives. The other years can be added and SHOULD work.

    def parse_index(self):
        feeds = []
        for title, url in [
                (u"2010 Archives", u"http://downtoearth.org.in/archives/2010"),
                ]:
            articles = self.make_links(url)
            if articles:
                feeds.append((title, articles))
        return feeds

    def make_links(self, url):
        current_articles = []
        soup = self.index_to_soup(url)
        # Each archive entry links to one issue; the issue date is the fourth
        # path segment of the link.
        for item in soup.findAll('div', attrs={'class': 'views-field-nothing-2'}):
            link = item.find('a')
            if link is None:
                continue
            date = link['href'].split('/')[3]
            print 'DATE IS :', date
            print 'the link is: ', link
            issue_soup = self.index_to_soup(self.INDEX + link['href'])
            # Collect every article (/node) link on the issue page, skipping
            # the Next Issue / Previous Issue navigation links.
            for items in issue_soup.findAll('div', attrs={'id': 'PageContent'}):
                for nodes in items.findAll('a', href=re.compile('/node')):
                    if not re.search('Next Issue', str(nodes)) and not re.search('Previous Issue', str(nodes)):
                        url = nodes['href']
                        title = self.tag_to_string(nodes)
                        print 'the title is: ', title
                        print 'the url is: ', url
                        current_articles.append({'title': date + '--' + title, 'url': url, 'description': '', 'date': ''})
        return current_articles

    def print_version(self, url):
        # Article links look like /node/<id>; the print version lives at /print/<id>.
        split1 = url.split('/')
        print 'THE SPLIT IS: ', split1
        print_url = 'http://downtoearth.org.in/print/' + split1[2]
        print 'THIS URL WILL PRINT: ', print_url
        return print_url

Last edited by TonytheBookworm; 09-14-2010 at 01:57 AM. Reason: typo and fixed indent in post and added date to title
Old 09-14-2010, 07:55 AM   #2722
Starson17
Wizard
Posts: 4,004
Karma: 177841
Join Date: Dec 2009
Device: WinMo: IPAQ; Android: HTC HD2, Archos 7o; Java:Gravity T
Quote:
Originally Posted by TonytheBookworm:
Interesting, because look:
The recipe I tested produces nothing like that. I went back to check your post, and either the recipe you posted has changed or I copied the recipe from one of your other posts before testing it (most likely). If I get a chance, I'll go back and test the recipe in your original post.
Old 09-14-2010, 03:58 PM   #2723
Starson17
Wizard
Posts: 4,004
Karma: 177841
Join Date: Dec 2009
Device: WinMo: IPAQ; Android: HTC HD2, Archos 7o; Java:Gravity T
Quote:
Originally Posted by TonytheBookworm:
When I run this recipe at the console with
ebook-convert test.recipe output_dir --test -vv > myrecipe.txt
I end up getting a nicely formatted article with no junk. Then, when I import it into Calibre to fully test it, I get junk.
I can (now) confirm I see it also. The weird thing is that you don't get the "junk" using ebook-convert, even when the recipe is stripped down to the absolute bare minimum of a feed and nothing more: the junk on the right side disappears, and the comments at the bottom disappear.
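A common workaround while that discrepancy gets sorted out is to strip the junk explicitly in the recipe. A minimal sketch, assuming hypothetical div names for the sidebar and comment blocks (check the page source for the real ones):

Code:
from calibre.web.feeds.news import BasicNewsRecipe

class StrippedFeed(BasicNewsRecipe):
    # Hypothetical minimal recipe; the feed URL is a placeholder.
    title = 'Stripped Feed'
    feeds = [('Feed', 'http://example.com/rss')]

    # Remove the right-side junk and the comment block even when the
    # fetch pipeline leaves them in. 'sidebar' and 'comments' are
    # hypothetical names, not taken from the site in question.
    remove_tags = [
        dict(name='div', attrs={'class': 'sidebar'}),
        dict(name='div', attrs={'id': 'comments'}),
    ]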
Old 09-14-2010, 04:51 PM   #2724
TonytheBookworm
Addict
Posts: 264
Karma: 62
Join Date: May 2010
Device: kindle 2, kindle 3, Kindle fire
Quote:
Originally Posted by Starson17:
I can (now) confirm I see it also. The weird thing is that you don't get the "junk" using ebook-convert, even when the recipe is stripped down to the absolute bare minimum of a feed and nothing more: the junk on the right side disappears, and the comments at the bottom disappear.
Yeah, I thought I was going crazy there for a second. I filed a bug report on this.

Last edited by TonytheBookworm; 09-14-2010 at 05:12 PM.
Old 09-14-2010, 07:53 PM   #2725
bhandarisaurabh
Enthusiast
Posts: 49
Karma: 10
Join Date: Aug 2009
Device: none

Quote:
Originally Posted by TonytheBookworm:
Alright, here is the thing: that site has TONS of articles. The code below should work now. [...]
[quoted recipe snipped; see post #2721 above]
Thanks, it worked like a charm. It just fetched 2 extra articles from the past issue; the rest was fine.
Old 09-16-2010, 06:35 AM   #2726
marbs
Zealot
Posts: 122
Karma: 10
Join Date: Jul 2010
Device: nook
So I need a little help

Some of my articles are not being downloaded. It tells me that it can't download them and to run with -vv to see why. How do you run with -vv? Or can anyone help me with my recipe?
Spoiler:

Code:
from calibre.web.feeds.news import BasicNewsRecipe

class AdvancedUserRecipe1283848012(BasicNewsRecipe):
    description = 'TheMarker'
    cover_url   = 'http://static.ispot.co.il/wp-content/upload/2009/09/themarker.jpg'
    title       = u'The Marker1'
    language    = 'he'
    simultaneous_downloads = 1
    delay                  = 4
    remove_javascript      = True
    timefmt        = '[%a, %d %b, %Y]'
    oldest_article = 1
    max_articles_per_feed = 1000
    extra_css = 'body{direction: rtl;} .article_description{direction: rtl; } a.article{direction: rtl; } .calibre_feed_description{direction: rtl; }'
    feeds = [
        (u'Head Lines', u'http://www.themarker.com/tmc/content/xml/rss/hpfeed.xml'),
        (u'TA Market', u'http://www.themarker.com/tmc/content/xml/rss/sections/marketfeed.xml'),
        (u'Real Estate', u'http://www.themarker.com/tmc/content/xml/rss/sections/realEstaterfeed.xml'),
        (u'Wall Street & Global', u'http://www.themarker.com/tmc/content/xml/rss/sections/wallsfeed.xml'),
        (u'Law', u'http://www.themarker.com/tmc/content/xml/rss/sections/lawfeed.xml'),
        (u'Media', u'http://www.themarker.com/tmc/content/xml/rss/sections/mediafeed.xml'),
        (u'Consumer', u'http://www.themarker.com/tmc/content/xml/rss/sections/consumerfeed.xml'),
        (u'Career', u'http://www.themarker.com/tmc/content/xml/rss/sections/careerfeed.xml'),
        (u'Car', u'http://www.themarker.com/tmc/content/xml/rss/sections/carfeed.xml'),
        (u'High Tech', u'http://www.themarker.com/tmc/content/xml/rss/sections/hightechfeed.xml'),
        (u'Investor Guide', u'http://www.themarker.com/tmc/content/xml/rss/sections/investorGuidefeed.xml'),
        ]

    def print_version(self, url):
        # Rewrite the article URL to its print-friendly form, then fetch the
        # .xml version of that page.
        baseURL = url.replace('tmc/article.jhtml?ElementId=',
                              'ibo/misc/printFriendly.jhtml?ElementId=%2Fibo%2Frepositories%2Fstories%2Fm1_2000%2F')
        return baseURL + '.xml'

What am I doing wrong?
Attached Files: my recipe.txt (1.8 KB)

Last edited by marbs; 09-16-2010 at 03:54 PM. Reason: I have no idea why it doesn't indent. The recipe is attached too.
Old 09-16-2010, 11:01 AM   #2727
Starson17
Wizard
Posts: 4,004
Karma: 177841
Join Date: Dec 2009
Device: WinMo: IPAQ; Android: HTC HD2, Archos 7o; Java:Gravity T
Quote:
Originally Posted by marbs:
Some of my articles are not being downloaded. It tells me that it can't download them and to run with -vv to see why. How do you run with -vv? Or can anyone help me with my recipe?
[quoted recipe snipped; it was re-pasted without code tags, so the indentation was lost]
Search this thread for "ebook-convert" to see how to use -vv. If you want help on your recipe, post it inside code tags to preserve indents, which are required. (Thanks for using spoiler tags, but they aren't enough to preserve indents. Just edit your post, add code tags inside the spoilers and repaste your indented recipe.)
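For example, the invocation used earlier in this thread writes the verbose log to a text file you can read:

Code:
ebook-convert test.recipe output_dir --test -vv > myrecipe.txt

The -vv flag turns on very verbose output, which should show why individual article downloads fail.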
Old 09-16-2010, 02:41 PM   #2728
dred
Junior Member
Posts: 1
Karma: 10
Join Date: Sep 2010
Device: Kindle
BMJ recipe??

Can anyone help me out with a recipe for the British Medical Journal?

The RSS page is at http://www.bmj.com/rss/

Unfortunately it's a fairly basic feed, and doesn't tell you much inside a dedicated news reader. Is it possible to download the linked articles, and not just the headlines, into Calibre?

Thanks
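A minimal starting point, sketched under assumptions: the URL below is just the RSS index page given in the post, not a specific feed, so substitute a concrete feed URL from that page; whether full articles come through depends on how the feed's links resolve.

Code:
from calibre.web.feeds.news import BasicNewsRecipe

class BMJ(BasicNewsRecipe):
    # Sketch only: the settings are generic defaults, not tuned to the site.
    title          = 'British Medical Journal'
    oldest_article = 7
    max_articles_per_feed = 50
    no_stylesheets    = True
    remove_javascript = True
    # False tells calibre to follow each item's link and download the
    # full article page instead of using only the feed summary.
    use_embedded_content = False

    # Placeholder: replace with a concrete feed chosen from
    # http://www.bmj.com/rss/
    feeds = [('BMJ', 'http://www.bmj.com/rss/')]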
Old 09-16-2010, 03:35 PM   #2729
TonytheBookworm
Addict
Posts: 264
Karma: 62
Join Date: May 2010
Device: kindle 2, kindle 3, Kindle fire
Starson17,
Hey, sorry to ask this question yet again. I'm simply not understanding it, even after reading the documentation and some of the code you have posted. Basically, I'm wondering why this will not work...
Spoiler:

Code:
def preprocess_html(self, soup):
    for credit_tag in soup.findAll('span', attrs={'class':['imageCredit rightFloat']}):
        p = Tag(soup, 'p')
        span.replaceWith(p)
        p.insert(0, span)

    return soup


What I'm trying to do is search for all the span tags whose class contains imageCredit... and then wrap each span in a <p> tag so it formats better.
As a result, though, I get no soup and the article is blank.
Here is the full code. I was just trying to clean up the AJC recipe a little bit.
Spoiler:

Code:
class AdvancedUserRecipe1282101454(BasicNewsRecipe):
    title       = 'The AJC'
    __author__  = 'TonytheBookworm'
    description = 'News from Atlanta and USA'
    publisher   = 'The Atlanta Journal'
    category    = 'news, politics, USA'
    oldest_article = 1
    max_articles_per_feed = 100
    no_stylesheets = True

    masthead_url = 'http://gawand.org/wp-content/uploads/2010/06/ajc-logo.gif'
    extra_css = '''
        h1{font-family:Arial,Helvetica,sans-serif; font-weight:bold;font-size:large;}
        h2{font-family:Arial,Helvetica,sans-serif; font-weight:normal;font-size:small;}
        p{font-family:Arial,Helvetica,sans-serif;font-size:small;}
        body{font-family:Helvetica,Arial,sans-serif;font-size:small;}
        '''

    keep_only_tags = [
        dict(name='div', attrs={'class':['cxArticleHeader']}),
        dict(attrs={'id':['cxArticleText']}),
        ]

    remove_tags = [
        dict(name='div', attrs={'class':'cxArticleList'}),
        dict(name='div', attrs={'class':'cxFeedTease'}),
        dict(name='div', attrs={'class':'cxElementEnlarge'}),
        dict(name='div', attrs={'id':'cxArticleTools'}),
        ]

    feeds = [
        ('Breaking News', 'http://www.ajc.com/genericList-rss.do?source=61499'),
        # -------------------------------------------------------------------
        # Here are the different area feeds. Choose whichever ones you wish
        # to read by simply removing the pound sign. I currently have it set
        # to only get the Cobb area.
        # -------------------------------------------------------------------
        #('Atlanta & Fulton', 'http://www.ajc.com/section-rss.do?source=atlanta'),
        #('Clayton', 'http://www.ajc.com/section-rss.do?source=clayton'),
        #('DeKalb', 'http://www.ajc.com/section-rss.do?source=dekalb'),
        #('Gwinnett', 'http://www.ajc.com/section-rss.do?source=gwinnett'),
        #('North Fulton', 'http://www.ajc.com/section-rss.do?source=north-fulton'),
        #('Metro', 'http://www.ajc.com/section-rss.do?source=news'),
        #('Cherokee', 'http://www.ajc.com/section-rss.do?source=cherokee'),
        ('Cobb', 'http://www.ajc.com/section-rss.do?source=cobb'),
        #('Fayette', 'http://www.ajc.com/section-rss.do?source=fayette'),
        #('Henry', 'http://www.ajc.com/section-rss.do?source=henry'),
        #('Q & A', 'http://www.ajc.com/genericList-rss.do?source=77197'),
        ('Opinions', 'http://www.ajc.com/section-rss.do?source=opinion'),
        ('Ga Politics', 'http://www.ajc.com/section-rss.do?source=georgia-politics-elections'),
        # -------------------------------------------------------------------
        # Here are the different sports feeds. I only follow the Falcons and
        # high school, but you can enable whichever team you like by removing
        # the pound sign.
        # -------------------------------------------------------------------
        #('Sports News', 'http://www.ajc.com/genericList-rss.do?source=61510'),
        #('Braves', 'http://www.ajc.com/genericList-rss.do?source=61457'),
        ('Falcons', 'http://www.ajc.com/genericList-rss.do?source=61458'),
        #('Hawks', 'http://www.ajc.com/genericList-rss.do?source=61522'),
        #('Dawgs', 'http://www.ajc.com/genericList-rss.do?source=61492'),
        #('Yellowjackets', 'http://www.ajc.com/genericList-rss.do?source=61523'),
        ('Highschool', 'http://www.ajc.com/section-rss.do?source=high-school'),
        ('Events', 'http://www.accessatlanta.com/section-rss.do?source=events'),
        ('Music', 'http://www.accessatlanta.com/section-rss.do?source=music'),
        ]

    def preprocess_html(self, soup):
        for credit_tag in soup.findAll('span', attrs={'class':['imageCredit rightFloat']}):
            p = Tag(soup, 'p')
            span.replaceWith(p)
            p.insert(0, span)

        return soup

    #def print_version(self, url):
    #    return url.partition('?')[0] + '?printArticle=y'
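For what it's worth, here is a minimal sketch of the span-to-p swap with the loop variable used consistently: the snippet above binds each match to credit_tag but then refers to an undefined name span, which raises a NameError as soon as a matching span is found, and Tag is never imported. The class wrapper below is hypothetical scaffolding; only preprocess_html is the point.

Code:
from calibre.web.feeds.news import BasicNewsRecipe
from calibre.ebooks.BeautifulSoup import Tag

class SpanToParagraph(BasicNewsRecipe):
    # Hypothetical minimal recipe, just to make the sketch self-contained.
    title = 'Span to Paragraph Sketch'

    def preprocess_html(self, soup):
        # Wrap each image-credit span in a new <p> so it sits on its own line.
        for credit_tag in soup.findAll('span', attrs={'class': ['imageCredit rightFloat']}):
            p = Tag(soup, 'p')
            credit_tag.replaceWith(p)  # put the new <p> where the span was
            p.insert(0, credit_tag)    # then move the span inside it
        return soup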
Old 09-16-2010, 04:06 PM   #2730
marbs
Zealot
Posts: 122
Karma: 10
Join Date: Jul 2010
Device: nook
Quote:
Originally Posted by Starson17:
Search this thread for "ebook-convert" to see how to use -vv. If you want help on your recipe, post it inside code tags to preserve indents, which are required. (Thanks for using spoiler tags, but they aren't enough to preserve indents. Just edit your post, add code tags inside the spoilers and repaste your indented recipe.)
Thanks for the quick reply. So I fixed the code (it is indented now). I was able to run the test and found the output folder. What do I look at now?
Thanks again.

Last edited by marbs; 09-16-2010 at 04:35 PM.