#241
Hyperreader
Posts: 130
Karma: 28678
Join Date: Feb 2009
Device: Current: Boox Leaf2 (broken); Past: H2O, Kindle PW1, DXG, Pocketbook 360
Physicstoday.org
Now for physicstoday.org. Pretty much the same deal: a login is needed for some articles. This is essentially the entire magazine, so I think it'll be quite useful. Could you help me with the login again?
Code:
from calibre.web.feeds.news import BasicNewsRecipe

class AdvancedUserRecipe1234950056(BasicNewsRecipe):
    title = u'Physicstoday'
    oldest_article = 30
    max_articles_per_feed = 100
    no_stylesheets = True
    use_embedded_content = False
    remove_javascript = True
    remove_tags_before = dict(name='h1')
    remove_tags_after = [dict(name='div', attrs={'id':'footer'})]
    feeds = [(u'All', u'http://www.physicstoday.org/feed.xml')]
#242
Guru
Posts: 800
Karma: 194644
Join Date: Dec 2007
Location: Argentina
Device: Kindle Voyage
New recipe for the Serbian newspaper Press:
#243
creator of calibre
Posts: 45,398
Karma: 27756918
Join Date: Oct 2006
Location: Mumbai, India
Device: Various
Code:
needs_subscription = True

def get_browser(self):
    br = BasicNewsRecipe.get_browser(self)
    if self.username is not None and self.password is not None:
        br.open('http://physicsworld.com/cws/sign-in')
        br.select_form(nr=1)
        br['username'] = self.username
        br['password'] = self.password
        br.submit()
    return br
#244
Connoisseur
Posts: 51
Karma: 10
Join Date: Dec 2008
Location: Germany
Device: SONY PRS-500
Ars Technica Now Fetching Entire Article! Super!
kiklop74,
Your latest revised Ars Technica recipe seems to be working fine. Thanks a million. I guess this segment of your code is what fetches articles continued across multiple pages:

Code:
def append_page(self, soup, appendtag, position):
    # Look for the pager div; if it has a 'Next' link, fetch that page,
    # recursively append its body text, and strip out the pager itself.
    pager = soup.find('div', attrs={'id':'pager'})
    if pager:
        for atag in pager.findAll('a', href=True):
            str = self.tag_to_string(atag)
            if str.startswith('Next'):
                soup2 = self.index_to_soup(atag['href'])
                texttag = soup2.find('div', attrs={'class':'news-item-text'})
                for it in texttag.findAll(style=True):
                    del it['style']
                # recurse in case the next page itself has a 'Next' link
                newpos = len(texttag.contents)
                self.append_page(soup2, texttag, newpos)
                texttag.extract()
                pager.extract()
                appendtag.insert(position, texttag)
Xanthan Gum
#245
Hyperreader
Posts: 130
Karma: 28678
Join Date: Feb 2009
Device: Current: Boox Leaf2 (broken); Past: H2O, Kindle PW1, DXG, Pocketbook 360
Physics Today magazine recipe
EDIT: I tried the recipe and it works. So I added some information (author, changed the class name, etc.), but for some reason that made the login fail. I'm investigating this.

EDIT 2: It seems I was just retrying too many times in succession. It works fine as long as you don't fetch, say, twice in five minutes, I think. (Better version of the recipe in the next post.)
#246
Hyperreader
Posts: 130
Karma: 28678
Join Date: Feb 2009
Device: Current: Boox Leaf2 (broken); Past: H2O, Kindle PW1, DXG, Pocketbook 360
I see Physicstoday in 0.4.138
I notice a problem, though. EPUB output gives me "Protected Page" on my PRS-505 for every page except the Table of Contents. I'm investigating this. My guess is it's the reader's fault.
#247
Guru
Posts: 800
Karma: 194644
Join Date: Dec 2007
Location: Argentina
Device: Kindle Voyage
Apparently the people at Harper's Magazine have decided to completely remove the text version of their printed-edition articles, leaving only PDF and image versions. The change applies as of the March 2009 edition. This means the recipe for the printed edition will stop working.

I will see if there is any chance of manipulating the PDF format, but since I know how tough that format is, I do not expect much. However, the recipe might be modified to at least enable downloading of older issues. Is there interest in such a thing?
#248
Hyperreader
Posts: 130
Karma: 28678
Join Date: Feb 2009
Device: Current: Boox Leaf2 (broken); Past: H2O, Kindle PW1, DXG, Pocketbook 360
Is there a way to make calibre go to the print edition when the link pointing there in the article is just "http://ptonline.aip.org/servlet/PrintPTJ"? My Physicstoday recipe has a problem with EPUB because, on the reader, it tries to render the highslide box (a box that pops up when you click a picture so you can see the bigger version and some explanation) in the main body of the article. The printer-friendly version does not have this, but I have no idea how to point calibre to it.
#249
creator of calibre
Posts: 45,398
Karma: 27756918
Join Date: Oct 2006
Location: Mumbai, India
Device: Various
You could just remove the highslide box using remove_tags.

In general, the print version needs a unique URL per article to work.
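For instance (a minimal sketch; I'm assuming the pop-up lives in a div with class highslide-html-content, so check the page source for the actual name):

Code:
# assumption: the pop-up box is <div class="highslide-html-content">
remove_tags = [dict(name='div', attrs={'class':'highslide-html-content'})]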
#250
Hyperreader
Posts: 130
Karma: 28678
Join Date: Feb 2009
Device: Current: Boox Leaf2 (broken); Past: H2O, Kindle PW1, DXG, Pocketbook 360
I did what you suggested, but the highslide box includes some explanations of the pictures, which are quite crucial. Is there a way to leave them at the end of the article or something? I see that the HTML actually has the highslide box contents at the end, but I'm not sure how to keep them. It goes like this:
Code:
<div class="highslide-html-content" id="highslide-html">
  <div class="highslide-header">
    <ul>
      <li class="highslide-move"><a href="#" onclick="return false">Move</a></li>
      <li class="highslide-close"><a href="#" onclick="return hs.close(this)">Close</a></li>
    </ul>
  </div>
  <div class="highslide-body">
    <body>
      <div id="figure">
        <div align="center">
          <table width="100%" border="0" cellspacing="5" cellpadding="1">
            <tr>
              <td><img src="/journals/doc/PHTOAD-ft/vol_62/iss_2/images/40_1fig1a.jpg" alt="Figure" width="630" height="420" /></td>
            </tr>
            <tr>
              <td><img src="/journals/doc/PHTOAD-ft/vol_62/iss_2/images/40_1fig1b.jpg" alt="Figure" width="511" height="408" /></td>
            </tr>
          </table>
        </div>
        <p><strong>Figure 1.</strong> Snapshots of high-school physics.<strong> (a)</strong>***Some long explanation here***</a>.)</p>
      </div>
    </body>
  </div>
  <div class="highslide-footer">
    <div>
      <span class="highslide-resize" title="Resize"><span></span></span>
    </div>
  </div>
</div>
#251
creator of calibre
Posts: 45,398
Karma: 27756918
Join Date: Oct 2006
Location: Mumbai, India
Device: Various
It should be doable using the postprocess_html method, which allows you to perform arbitrary manipulations on the downloaded HTML just before it is saved.

So what you will need to do is, for each such image, figure out the corresponding text and add it in a <p> after the image. The postprocess_html method is passed two parameters: a BeautifulSoup instance and a boolean indicating whether the HTML is the first page of the article. You can use the soup parameter to perform the manipulations. See the documentation of the BeautifulSoup package to understand how to use it.
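Something along these lines might work as a rough, untested sketch. It assumes the caption is the <p> inside each div with class highslide-html-content, as in your snippet, and it simply keeps the caption where the box was (at the end of the page) rather than matching it to its image, which would take more work:

Code:
def postprocess_html(self, soup, first_fetch):
    # Assumption: each pop-up is <div class="highslide-html-content">
    # and its caption is the <p> inside it.
    for box in soup.findAll('div', attrs={'class':'highslide-html-content'}):
        caption = box.find('p')
        if caption is not None:
            # detach the caption and re-insert it where the box sits
            caption.extract()
            box.parent.insert(box.parent.contents.index(box), caption)
        # drop the rest of the pop-up chrome
        box.extract()
    return soup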
#252
Junior Member
Posts: 6
Karma: 10
Join Date: Feb 2009
Location: Spain
Device: Sony PRS-505
Economist Feed
Is anyone else having problems with The Economist feed? I'm using the latest version of Calibre (0.4.138), but it just appears to get the titles and not the body.
I'm expecting user error, but I can't see what's up. If additional information is required, please let me know what you need.

Thanks,
Emmet
#253
Member
Posts: 13
Karma: 10
Join Date: Feb 2009
Device: PRS-505
A couple of days ago I posted about a problem I was having with a feed from my local paper, and kiklop74 was kind enough to provide assistance. The code is:
Code:
from calibre.web.feeds.news import BasicNewsRecipe

class AdvancedUserRecipe1234144423(BasicNewsRecipe):
    title = u'Cincinnati Enquirer'
    oldest_article = 7
    language = _('English')
    __author__ = 'Joseph Kitzmiller'
    max_articles_per_feed = 100
    no_stylesheets = True
    use_embedded_content = False
    remove_javascript = True
    encoding = 'cp1252'
    extra_css = ' p {font-size: medium; font-weight: normal;} '
    keep_only_tags = [dict(name='div', attrs={'class':'padding'})]
    remove_tags = [
        dict(name=['object','link','table','embed'])
        ,dict(name='div', attrs={'id':'pluckcomments'})
        ,dict(name='div', attrs={'class':'articleflex-container'})
    ]
    feeds = [(u'Cincinnati Enquirer', u'http://rss.cincinnati.com/apps/pbcs.dll/section?category=rssenq01&mime=xml')]

    def preprocess_html(self, soup):
        for item in soup.findAll(style=True):
            del item['style']
        for item in soup.findAll(face=True):
            del item['face']
        return soup

When the feed is opened on the device, each article begins with output like this:

Code:
Starting first parse
.Parsing macro pluck_InitializeArticles
..Build 3: 953 ms (Article)
...Build 3: 46 ms (Article)
..Build 9: 187 ms (Content)
.Completed macro pluck_InitializeArticles
.Build 0: 16 ms (Misc)
.Build 3: 2984 ms (Article)
.Parsing macro seo
..Build 0: 0 ms (Misc)
.Completed macro seo
.Parsing macro sitecatalyst
..Build 0: 0 ms (Misc)
.Completed macro sitecatalyst
..Build 3: 62 ms (Article)
.Parsing macro footer_local
--> Starting first parse
.Build 0: 16 ms (Misc)
.Build 3: 31 ms (Article)
.Build 9: 0 ms (Content)
Retrieve categories: 0ms
Read templates: 0ms
Read objects: 0ms
Scripts: 0ms

The message goes on for several lines. This happens regardless of whether I use the Sony library software or calibre to transfer the feed to the device. Is this a bug?
#254
creator of calibre
Posts: 45,398
Karma: 27756918
Join Date: Oct 2006
Location: Mumbai, India
Device: Various
The format of the Economist website changed; the fix will be in the next release. In the meantime, here's the updated recipe:
Code:
#!/usr/bin/env python
__license__   = 'GPL v3'
__copyright__ = '2008, Kovid Goyal <kovid at kovidgoyal.net>'
'''
economist.com
'''
from calibre.web.feeds.news import BasicNewsRecipe
from calibre.ebooks.BeautifulSoup import BeautifulSoup
import mechanize, string
from urllib2 import quote

class Economist(BasicNewsRecipe):
    title = 'The Economist'
    language = _('English')
    __author__ = "Kovid Goyal"
    description = 'Global news and current affairs from a European perspective'
    oldest_article = 7.0
    needs_subscription = False # Strange but true
    INDEX = 'http://www.economist.com/printedition'
    remove_tags = [dict(name=['script', 'noscript', 'title'])]
    remove_tags_before = dict(name=lambda tag: tag.name=='title' and tag.parent.name=='body')

    def get_browser(self):
        br = BasicNewsRecipe.get_browser()
        if self.username is not None and self.password is not None:
            req = mechanize.Request('http://www.economist.com/members/members.cfm?act=exec_login',
                                    headers={'Referer':'http://www.economist.com'})
            data = 'logging_in=Y&returnURL=http%253A%2F%2Fwww.economist.com%2Findex.cfm&email_address=username&pword=password&x=7&y=11'
            data = data.replace('username', quote(self.username)).replace('password', quote(self.password))
            req.add_data(data)
            br.open(req).read()
        return br

    def parse_index(self):
        soup = BeautifulSoup(self.browser.open(self.INDEX).read(),
                             convertEntities=BeautifulSoup.HTML_ENTITIES)
        index_started = False
        feeds = {}
        ans = []
        key = None
        for tag in soup.findAll(['h1', 'h2']):
            text = ''.join(tag.findAll(text=True))
            if tag.name == 'h1':
                if 'Classified ads' in text:
                    break
                if 'The world this week' in text:
                    index_started = True
                if not index_started:
                    continue
                text = string.capwords(text)
                if text not in feeds.keys():
                    feeds[text] = []
                if text not in ans:
                    ans.append(text)
                key = text
                continue
            if key is None:
                continue
            a = tag.find('a', href=True)
            if a is not None:
                url = a['href'].replace('displaystory', 'PrinterFriendly')
                if url.startswith('/'):
                    url = 'http://www.economist.com' + url
                article = dict(title=text, url=url, description='', content='', date='')
                feeds[key].append(article)
        ans = [(key, feeds[key]) for key in ans if feeds.has_key(key)]
        return ans
#255
creator of calibre
Posts: 45,398
Karma: 27756918
Join Date: Oct 2006
Location: Mumbai, India
Device: Various
Will be fixed in the next release