Struggling with one website

sorcer · 01-28-2011, 11:24 AM

Hello!

I have tried to fetch one Russian website - www.snob.ru with this code:

import re
from calibre.web.feeds.recipes import BasicNewsRecipe

class Snob(BasicNewsRecipe):
    title = 'Snob'
    __author__ = 'Me'
    description = 'Business news from Russian posh magazine'
    timefmt = ' [%a, %d %b, %Y]'
    needs_subscription = True
    oldest_article = 21
    max_articles_per_feed = 50
    no_stylesheets = True
    #delay = 1
    use_embedded_content = False
    encoding = 'utf8'
    publisher = 'Snob Media'
    category = 'news, Russia, world'
    language = 'ru_RU'
    publication_type = 'newsportal'
    extra_css = ' body{ font-family: Verdana,Helvetica,Arial,sans-serif } .introduction{font-weight: bold} .story-feature{display: block; padding: 0; border: 1px solid; width: 40%; font-size: small} .story-feature h2{text-align: center; text-transform: uppercase} '
    # Strip HTML comments before conversion
    preprocess_regexps = [(re.compile(r'<!--.*?-->', re.DOTALL), lambda m: '')]
    conversion_options = {
        'comments' : description
        ,'tags' : category
        ,'language' : language
        ,'publisher' : publisher
        ,'linearize_tables': True
    }

    # Log in before any feed or article is fetched
    def get_browser(self):
        br = BasicNewsRecipe.get_browser()
        if self.username is not None and self.password is not None:
            br.open('http://www.snob.ru/login')
            br.select_form(name='auth-wrapper')
            br['USERNAME'] = self.username
            br['PASSWORD'] = self.password
            br.submit()
        return br

    keep_only_tags = [
        dict(name='div', attrs={'class':['layout-block-a layout-block']})
        ,dict(attrs={'class':['story-body','storybody']})
    ]

    remove_tags = [
        dict(name='div', attrs={'class':['story-feature related narrow', 'share-help', 'embedded-hyper',
                                         'story-feature wide ', 'story-feature narrow']})
        ,dict(name=['img'])
    ]

    remove_attributes = ['width','height']

    feeds = [
        ('Politics', 'http://www.snob.ru/rss/blog/927'),
        ('Business', 'http://www.snob.ru/rss/blog/420'),
        ('Science', 'http://www.snob.ru/rss/blog/171'),
        ('Children', 'http://www.snob.ru/rss/blog/70'),
        ('Food and Alcohol', 'http://www.snob.ru/rss/blog/173'),
        ('Health', 'http://www.snob.ru/rss/blog/174'),
        ('Culture', 'http://www.snob.ru/rss/blog/683'),
        ('How to live', 'http://www.snob.ru/rss/blog/170'),
        ('Sex', 'http://www.snob.ru/rss/blog/69'),
        ('Interview', 'http://www.snob.ru/rss/blog/805'),
        ('XX century', 'http://www.snob.ru/rss/blog/416'),
        ('Editorial', 'http://www.snob.ru/rss/blog/894'),
        ('Chichvarkin', 'http://www.snob.ru/rss/pblog/8503'),
    ]

The error I get with this code points at the line br.select_form(name='auth-wrapper'): it says that the form 'auth-wrapper' is not found. Does anyone have any idea how I can log in at www.snob.ru/login before downloading?

Many thanks in advance.
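
A "form not found" error from select_form(name=...) usually means that no <form> element on the page has that name attribute. The thread does not say what the actual fix was, so the following is only a minimal sketch of how one might list the forms a login page really contains, using nothing but the Python standard library. The SAMPLE page and the helper names FormLister and list_forms are invented for illustration; the real markup of www.snob.ru/login is not known here.

```python
from html.parser import HTMLParser


class FormLister(HTMLParser):
    """Collect the attributes and field names of every <form> in a page."""

    def __init__(self):
        super().__init__()
        self.forms = []        # one dict per <form> found
        self._current = None   # the form currently being parsed, if any

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "form":
            self._current = {
                "name": attrs.get("name"),
                "id": attrs.get("id"),
                "action": attrs.get("action"),
                "fields": [],
            }
            self.forms.append(self._current)
        elif tag in ("input", "select", "textarea") and self._current is not None:
            # Only named controls can be filled in by a script
            if attrs.get("name"):
                self._current["fields"].append(attrs["name"])

    def handle_endtag(self, tag):
        if tag == "form":
            self._current = None


def list_forms(html):
    """Return a list of dicts describing each form in the HTML text."""
    parser = FormLister()
    parser.feed(html)
    return parser.forms


# Hypothetical login page: here 'auth-wrapper' is the id of a <div>,
# and the form itself has no name attribute at all.
SAMPLE = """
<div id="auth-wrapper">
  <form action="/login" method="post">
    <input type="text" name="USERNAME">
    <input type="password" name="PASSWORD">
    <input type="submit" value="Log in">
  </form>
</div>
"""

for index, form in enumerate(list_forms(SAMPLE)):
    print(index, form)
```

If the form turns out to have no name, mechanize can also select a form by position, for example br.select_form(nr=0) for the first form on the page.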

sorcer · 01-31-2011, 02:38 AM

OK, I found my problem and corrected it. Now there is another issue: calibre downloads all the links it can find on the RSS page, but it does not download the articles themselves, so I end up with just links to the articles. What should I enable in the code to download the articles? Embedded_content is enabled already.

Starson17 · 01-31-2011, 09:32 AM

Quote:

Originally Posted by sorcer

OK, I found my problem and corrected it. Now there is another issue: calibre downloads all the links it can find on the RSS page, but it does not download the articles themselves, so I end up with just links to the articles. What should I enable in the code to download the articles? Embedded_content is enabled already.

You want use_embedded_content disabled (False). The "embedded content" is what is on the RSS page. You want the article content, which is not embedded.
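
To make the distinction concrete, here is a small sketch that separates what a feed item embeds from what it merely links to. It uses only the standard library, not calibre itself, and the feed text, URL, and the helper name summarize_items are invented for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical RSS 2.0 feed: the <description> holds only a short teaser.
SAMPLE_RSS = """<rss version="2.0">
  <channel>
    <title>Example feed</title>
    <item>
      <title>Example article</title>
      <link>http://example.com/articles/1</link>
      <description>A one-line teaser, not the full text.</description>
    </item>
  </channel>
</rss>
"""


def summarize_items(rss_text):
    """Return (title, link, embedded_text) for each <item> in the feed."""
    root = ET.fromstring(rss_text)
    items = []
    for item in root.iter("item"):
        items.append((
            item.findtext("title", default=""),
            item.findtext("link", default=""),
            item.findtext("description", default=""),
        ))
    return items


for title, link, embedded in summarize_items(SAMPLE_RSS):
    print(title)
    print("  embedded in feed:", embedded)   # what use_embedded_content = True would use
    print("  full article at: ", link)       # what use_embedded_content = False fetches
```

With use_embedded_content = False, calibre follows each item's link and builds the article from the downloaded page, which is why a working login matters for a subscription site.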

sorcer · 01-31-2011, 09:35 AM

Quote:

Originally Posted by Starson17

You want use_embedded_content disabled (False). The "embedded content" is what is on the RSS page. You want the article content, which is not embedded.

You are probably right.

The idea is that I want the article itself, not just its title.

Starson17 · 01-31-2011, 01:53 PM

Quote:

Originally Posted by sorcer

You are probably right.

The idea is that I want the article itself, not just its title.

It sounds like your authorization isn't working.

EW1(SG) · 02-13-2011, 10:10 AM

Quote:

Originally Posted by Starson17

It sounds like your authorization isn't working.

I have a problem getting the RSS fed articles from a site where I think the problem is authorization.

Is there a way to see what mechanize and urllib2 are seeing? Following Kovid Goyal's advice to someone else on another thread, I've looked at the Google Reader builtin which appears to have the capability that I'm looking for: to parse an arbitrarily complex login page, but I am not familiar with Python or with the APIs for the methods used and I am having trouble discerning what some of the functions do.

If I could see what the results of each statement were, it would go a long ways to helping me understand what I'm trying to do.

Thanks,

Starson17 · 02-14-2011, 09:12 AM

Quote:

Originally Posted by EW1(SG)

I have a problem getting the RSS fed articles from a site where I think the problem is authorization.

Is there a way to see what mechanize and urllib2 are seeing? Following Kovid Goyal's advice to someone else on another thread, I've looked at the Google Reader builtin which appears to have the capability that I'm looking for: to parse an arbitrarily complex login page, but I am not familiar with Python or with the APIs for the methods used and I am having trouble discerning what some of the functions do.

If I could see what the results of each statement were, it would go a long ways to helping me understand what I'm trying to do.

Thanks,

I wrote the authorization portion of Google Reader, and you are right: you need to see the HTTP header handshaking to debug.
After

Code:

def get_browser(self):
    br = BasicNewsRecipe.get_browser()

you need to set the following debug options:

Code:

    # Print HTTP headers and other debugging messages
    br.set_debug_http(True)
    br.set_debug_redirects(True)
    br.set_debug_responses(True)
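
As far as I know, mechanize prints the raw HTTP traffic directly, but reports redirects and responses through the standard logging module under the logger name "mechanize", so that logger may need a handler before anything shows up. A minimal sketch under that assumption follows; the helper names and the FakeBrowser stand-in are invented so the example runs without calibre installed. In a recipe, br would be the browser returned by BasicNewsRecipe.get_browser().

```python
import logging
import sys


def enable_mechanize_logging(stream=sys.stdout):
    """Send records from the 'mechanize' logger to the given stream."""
    logger = logging.getLogger("mechanize")
    logger.addHandler(logging.StreamHandler(stream))
    logger.setLevel(logging.DEBUG)
    return logger


def enable_browser_debug(br):
    """Turn on the three debug switches from the post above on browser br."""
    br.set_debug_http(True)
    br.set_debug_redirects(True)
    br.set_debug_responses(True)
    return br


class FakeBrowser:
    """Hypothetical stand-in that only records which switches were set."""

    def __init__(self):
        self.calls = []

    def set_debug_http(self, flag):
        self.calls.append(("http", flag))

    def set_debug_redirects(self, flag):
        self.calls.append(("redirects", flag))

    def set_debug_responses(self, flag):
        self.calls.append(("responses", flag))


enable_mechanize_logging()
br = enable_browser_debug(FakeBrowser())
print(br.calls)
```

Running the recipe from a terminal, rather than from the GUI, makes this output easier to read and to save.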

EW1(SG) · 02-14-2011, 09:35 AM

Ah...excellent!! Thank you!
