#241
Hyperreader
Posts: 130
Karma: 28678
Join Date: Feb 2009
Device: Current: Boox Leaf2 (broken); Past: H2O, Kindle PW1, DXG, Pocketbook 360
Physicstoday.org
Now for physicstoday.org. Pretty much the same deal: a login is needed for some articles. This is essentially the entire magazine, so I think it'll be quite useful. Could you help me with the login again?
Code:
from calibre.web.feeds.news import BasicNewsRecipe

class AdvancedUserRecipe1234950056(BasicNewsRecipe):
    title = u'Physicstoday'
    oldest_article = 30
    max_articles_per_feed = 100
    no_stylesheets = True
    use_embedded_content = False
    remove_javascript = True
    remove_tags_before = dict(name='h1')
    remove_tags_after = [dict(name='div', attrs={'id':'footer'})]
    feeds = [(u'All', u'http://www.physicstoday.org/feed.xml')]
#242
Guru
Posts: 800
Karma: 194644
Join Date: Dec 2007
Location: Argentina
Device: Kindle Voyage
New recipe for the Serbian newspaper Press:
#243
creator of calibre
Posts: 45,398
Karma: 27756918
Join Date: Oct 2006
Location: Mumbai, India
Device: Various
Code:
needs_subscription = True

def get_browser(self):
    br = BasicNewsRecipe.get_browser(self)
    if self.username is not None and self.password is not None:
        br.open('http://physicsworld.com/cws/sign-in')
        br.select_form(nr=1)
        br['username'] = self.username
        br['password'] = self.password
        br.submit()
    return br
#244
Connoisseur
Posts: 51
Karma: 10
Join Date: Dec 2008
Location: Germany
Device: SONY PRS-500
Ars Technica Now Fetching Entire Article! Super!
kiklop74,
Your latest revised Ars Technica recipe seems to be working fine. Thanks a million. I guess this segment of your code is what fetches articles continued across multiple pages:

Code:
def append_page(self, soup, appendtag, position):
    # Look for the pager div; if it has a 'Next' link, fetch that page,
    # recursively append its body text, and strip out the pager itself.
    pager = soup.find('div', attrs={'id':'pager'})
    if pager:
        for atag in pager.findAll('a', href=True):
            str = self.tag_to_string(atag)
            if str.startswith('Next'):
                soup2 = self.index_to_soup(atag['href'])
                texttag = soup2.find('div', attrs={'class':'news-item-text'})
                for it in texttag.findAll(style=True):
                    del it['style']
                # recurse in case the next page itself has a 'Next' link
                newpos = len(texttag.contents)
                self.append_page(soup2, texttag, newpos)
                texttag.extract()
                pager.extract()
                appendtag.insert(position, texttag)
Xanthan Gum
#245
Hyperreader
Posts: 130
Karma: 28678
Join Date: Feb 2009
Device: Current: Boox Leaf2 (broken); Past: H2O, Kindle PW1, DXG, Pocketbook 360
Physics Today magazine recipe
EDIT: I tried the recipe and it works. So I added some information (author, changed the class name, etc.), but for some reason that made the login fail. I'm investigating this.

EDIT 2: It seems I was just retrying too many times in succession. It works fine as long as you don't fetch, say, twice in five minutes, I think. (Better version of the recipe in the next post.)
#246
Hyperreader
Posts: 130
Karma: 28678
Join Date: Feb 2009
Device: Current: Boox Leaf2 (broken); Past: H2O, Kindle PW1, DXG, Pocketbook 360
I see Physicstoday in 0.4.138
I notice a problem, though. EPUB output gives me "Protected Page" on my PRS-505 for every page except the Table of Contents. I'm investigating this. My guess is it's the reader's fault.
#247
Guru
Posts: 800
Karma: 194644
Join Date: Dec 2007
Location: Argentina
Device: Kindle Voyage
Apparently the people at Harper's Magazine have decided to completely remove the text version of their printed-edition articles, leaving only PDF and image versions. The change applies as of the March 2009 edition. This means the recipe for the printed edition will stop working.

I will see if there is any chance of manipulating the PDF format, but since I know how tough that format is, I do not expect much. However, the recipe might be modified to at least enable downloading of older issues. Is there interest in such a thing?
#248
Hyperreader
Posts: 130
Karma: 28678
Join Date: Feb 2009
Device: Current: Boox Leaf2 (broken); Past: H2O, Kindle PW1, DXG, Pocketbook 360
Is there a way to make calibre go to the print edition when the link pointing there in the article is just "http://ptonline.aip.org/servlet/PrintPTJ"? My Physicstoday recipe has a problem with EPUB because, on the reader, it tries to render the highslide box (a box that pops up when you click a picture so you can see the bigger version and some explanation) in the main body of the article. The printer-friendly version does not have this, but I have no idea how to point calibre to it.
#249
creator of calibre
Posts: 45,398
Karma: 27756918
Join Date: Oct 2006
Location: Mumbai, India
Device: Various
You could just remove the highslide box using remove_tags.

In general, the print version needs a unique URL per article to work.
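For instance (a minimal sketch; I'm assuming the pop-up lives in a div with class highslide-html-content, so check the page source for the actual name):

Code:
# assumption: the pop-up box is <div class="highslide-html-content">
remove_tags = [dict(name='div', attrs={'class':'highslide-html-content'})]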
#250
Hyperreader
Posts: 130
Karma: 28678
Join Date: Feb 2009
Device: Current: Boox Leaf2 (broken); Past: H2O, Kindle PW1, DXG, Pocketbook 360
I did what you suggested, but the highslide box includes some explanations of the pictures, which are quite crucial. Is there a way to leave them at the end of the article or something? I see that the HTML actually has the highslide box contents at the end, but I'm not sure how to keep them. It goes like this:
Code:
<div class="highslide-html-content" id="highslide-html">
  <div class="highslide-header">
    <ul>
      <li class="highslide-move"><a href="#" onclick="return false">Move</a></li>
      <li class="highslide-close"><a href="#" onclick="return hs.close(this)">Close</a></li>
    </ul>
  </div>
  <div class="highslide-body">
    <body>
      <div id="figure">
        <div align="center">
          <table width="100%" border="0" cellspacing="5" cellpadding="1">
            <tr>
              <td><img src="/journals/doc/PHTOAD-ft/vol_62/iss_2/images/40_1fig1a.jpg" alt="Figure" width="630" height="420" /></td>
            </tr>
            <tr>
              <td><img src="/journals/doc/PHTOAD-ft/vol_62/iss_2/images/40_1fig1b.jpg" alt="Figure" width="511" height="408" /></td>
            </tr>
          </table>
        </div>
        <p><strong>Figure 1.</strong> Snapshots of high-school physics.<strong> (a)</strong>***Some long explanation here***</a>.)</p>
      </div>
    </body>
  </div>
  <div class="highslide-footer">
    <div>
      <span class="highslide-resize" title="Resize"><span></span></span>
    </div>
  </div>
</div>
#251
creator of calibre
Posts: 45,398
Karma: 27756918
Join Date: Oct 2006
Location: Mumbai, India
Device: Various
It should be doable using the postprocess_html method, which allows you to perform arbitrary manipulations on the downloaded HTML just before it is saved.

So what you will need to do is, for each such image, figure out the corresponding text and add it in a <p> after the image. The postprocess_html method is passed two parameters: a BeautifulSoup instance and a boolean indicating whether the HTML is the first page of the article. You can use the soup parameter to perform the manipulations. See the documentation of the BeautifulSoup package to understand how to use it.
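Something along these lines might work as a rough, untested sketch. It assumes the caption is the <p> inside each div with class highslide-html-content, as in your snippet, and it simply keeps the caption where the box was (at the end of the page) rather than matching it to its image, which would take more work:

Code:
def postprocess_html(self, soup, first_fetch):
    # Assumption: each pop-up is <div class="highslide-html-content">
    # and its caption is the <p> inside it.
    for box in soup.findAll('div', attrs={'class':'highslide-html-content'}):
        caption = box.find('p')
        if caption is not None:
            # detach the caption and re-insert it where the box sits
            caption.extract()
            box.parent.insert(box.parent.contents.index(box), caption)
        # drop the rest of the pop-up chrome
        box.extract()
    return soup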
#252
Junior Member
Posts: 6
Karma: 10
Join Date: Feb 2009
Location: Spain
Device: Sony PRS-505
Economist Feed
Is anyone else having problems with The Economist feed? I'm using the latest version of Calibre (0.4.138), but it just appears to get the titles and not the body.
I'm expecting user error, but I can't see what's up. If additional information is required, please let me know what you need.

Thanks,
Emmet
#253
Member
Posts: 13
Karma: 10
Join Date: Feb 2009
Device: PRS-505
A couple of days ago I posted about a problem I was having with a feed from my local paper, and kiklop74 was kind enough to provide assistance. The code is:
Code:
from calibre.web.feeds.news import BasicNewsRecipe

class AdvancedUserRecipe1234144423(BasicNewsRecipe):
    title = u'Cincinnati Enquirer'
    oldest_article = 7
    language = _('English')
    __author__ = 'Joseph Kitzmiller'
    max_articles_per_feed = 100
    no_stylesheets = True
    use_embedded_content = False
    remove_javascript = True
    encoding = 'cp1252'
    extra_css = ' p {font-size: medium; font-weight: normal;} '
    keep_only_tags = [dict(name='div', attrs={'class':'padding'})]
    remove_tags = [
        dict(name=['object','link','table','embed'])
        ,dict(name='div', attrs={'id':'pluckcomments'})
        ,dict(name='div', attrs={'class':'articleflex-container'})
    ]
    feeds = [(u'Cincinnati Enquirer', u'http://rss.cincinnati.com/apps/pbcs.dll/section?category=rssenq01&mime=xml')]

    def preprocess_html(self, soup):
        for item in soup.findAll(style=True):
            del item['style']
        for item in soup.findAll(face=True):
            del item['face']
        return soup

When the feed is opened on the device, each article begins with output like this:

Code:
Starting first parse
.Parsing macro pluck_InitializeArticles
..Build 3: 953 ms (Article)
...Build 3: 46 ms (Article)
..Build 9: 187 ms (Content)
.Completed macro pluck_InitializeArticles
.Build 0: 16 ms (Misc)
.Build 3: 2984 ms (Article)
.Parsing macro seo
..Build 0: 0 ms (Misc)
.Completed macro seo
.Parsing macro sitecatalyst
..Build 0: 0 ms (Misc)
.Completed macro sitecatalyst
..Build 3: 62 ms (Article)
.Parsing macro footer_local
--> Starting first parse
.Build 0: 16 ms (Misc)
.Build 3: 31 ms (Article)
.Build 9: 0 ms (Content)
Retrieve categories: 0ms
Read templates: 0ms
Read objects: 0ms
Scripts: 0ms

The message goes on for several lines. This happens regardless of whether I use the Sony library software or calibre to transfer the feed to the device. Is this a bug?
#254
creator of calibre
Posts: 45,398
Karma: 27756918
Join Date: Oct 2006
Location: Mumbai, India
Device: Various
The format of the Economist website changed; the fix will be in the next release. In the meantime, here's the updated recipe:
Code:
#!/usr/bin/env python
__license__   = 'GPL v3'
__copyright__ = '2008, Kovid Goyal <kovid at kovidgoyal.net>'
'''
economist.com
'''
from calibre.web.feeds.news import BasicNewsRecipe
from calibre.ebooks.BeautifulSoup import BeautifulSoup
import mechanize, string
from urllib2 import quote

class Economist(BasicNewsRecipe):
    title = 'The Economist'
    language = _('English')
    __author__ = "Kovid Goyal"
    description = 'Global news and current affairs from a European perspective'
    oldest_article = 7.0
    needs_subscription = False # Strange but true
    INDEX = 'http://www.economist.com/printedition'
    remove_tags = [dict(name=['script', 'noscript', 'title'])]
    remove_tags_before = dict(name=lambda tag: tag.name=='title' and tag.parent.name=='body')

    def get_browser(self):
        br = BasicNewsRecipe.get_browser()
        if self.username is not None and self.password is not None:
            req = mechanize.Request('http://www.economist.com/members/members.cfm?act=exec_login',
                                    headers={'Referer':'http://www.economist.com'})
            data = 'logging_in=Y&returnURL=http%253A%2F%2Fwww.economist.com%2Findex.cfm&email_address=username&pword=password&x=7&y=11'
            data = data.replace('username', quote(self.username)).replace('password', quote(self.password))
            req.add_data(data)
            br.open(req).read()
        return br

    def parse_index(self):
        soup = BeautifulSoup(self.browser.open(self.INDEX).read(),
                             convertEntities=BeautifulSoup.HTML_ENTITIES)
        index_started = False
        feeds = {}
        ans = []
        key = None
        for tag in soup.findAll(['h1', 'h2']):
            text = ''.join(tag.findAll(text=True))
            if tag.name == 'h1':
                if 'Classified ads' in text:
                    break
                if 'The world this week' in text:
                    index_started = True
                if not index_started:
                    continue
                text = string.capwords(text)
                if text not in feeds.keys():
                    feeds[text] = []
                if text not in ans:
                    ans.append(text)
                key = text
                continue
            if key is None:
                continue
            a = tag.find('a', href=True)
            if a is not None:
                url = a['href'].replace('displaystory', 'PrinterFriendly')
                if url.startswith('/'):
                    url = 'http://www.economist.com' + url
                article = dict(title=text, url=url, description='', content='', date='')
                feeds[key].append(article)
        ans = [(key, feeds[key]) for key in ans if feeds.has_key(key)]
        return ans
#255
creator of calibre
Posts: 45,398
Karma: 27756918
Join Date: Oct 2006
Location: Mumbai, India
Device: Various
Will be fixed in the next release