01-28-2011, 11:24 AM | #1 |
Junior Member
Posts: 5
Karma: 10
Join Date: Jan 2011
Device: Kindle 3 WIFI
|
Struggling with one website
Hello!
I have tried to fetch one Russian website, www.snob.ru, with this code:
Code:
import re
from calibre.web.feeds.recipes import BasicNewsRecipe

class Snob(BasicNewsRecipe):
    title = 'Snob'
    __author__ = 'Me'
    description = 'Business news from Russian posh magazine'
    timefmt = ' [%a, %d %b, %Y]'
    needs_subscription = True
    oldest_article = 21
    max_articles_per_feed = 50
    no_stylesheets = True
    #delay = 1
    use_embedded_content = False
    encoding = 'utf8'
    publisher = 'Snob Media'
    category = 'news, Russia, world'
    language = 'ru_RU'
    publication_type = 'newsportal'
    extra_css = ' body{ font-family: Verdana,Helvetica,Arial,sans-serif } .introduction{font-weight: bold} .story-feature{display: block; padding: 0; border: 1px solid; width: 40%; font-size: small} .story-feature h2{text-align: center; text-transform: uppercase} '
    preprocess_regexps = [(re.compile(r'<!--.*?-->', re.DOTALL), lambda m: '')]
    conversion_options = {
        'comments' : description
        ,'tags' : category
        ,'language' : language
        ,'publisher' : publisher
        ,'linearize_tables': True
    }

    def get_browser(self):
        br = BasicNewsRecipe.get_browser()
        if self.username is not None and self.password is not None:
            br.open('http://www.snob.ru/login')
            br.select_form(name='auth-wrapper')
            br['USERNAME'] = self.username
            br['PASSWORD'] = self.password
            br.submit()
        return br

    keep_only_tags = [
        dict(name='div', attrs={'class':['layout-block-a layout-block']})
        ,dict(attrs={'class':['story-body','storybody']})
    ]
    remove_tags = [
        dict(name='div', attrs={'class':['story-feature related narrow', 'share-help', 'embedded-hyper',
                                         'story-feature wide ', 'story-feature narrow']})
        ,dict(name=['img'])
    ]
    remove_attributes = ['width','height']

    feeds = [
        ('Politics', 'http://www.snob.ru/rss/blog/927'),
        ('Business', 'http://www.snob.ru/rss/blog/420'),
        ('Science', 'http://www.snob.ru/rss/blog/171'),
        ('Children', 'http://www.snob.ru/rss/blog/70'),
        ('Food and Alcohol', 'http://www.snob.ru/rss/blog/173'),
        ('Health', 'http://www.snob.ru/rss/blog/174'),
        ('Culture', 'http://www.snob.ru/rss/blog/683'),
        ('How to live', 'http://www.snob.ru/rss/blog/170'),
        ('Sex', 'http://www.snob.ru/rss/blog/69'),
        ('Interview', 'http://www.snob.ru/rss/blog/805'),
        ('XX century', 'http://www.snob.ru/rss/blog/416'),
        ('Editorial', 'http://www.snob.ru/rss/blog/894'),
        ('Chichvarkin', 'http://www.snob.ru/rss/pblog/8503'),
    ]
The error I get points at the line br.select_form(name='auth-wrapper'): it says that the form 'auth-wrapper' is not found. Does anyone have any idea how I can log in at www.snob.ru/login before downloading? Many thanks in advance.
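A "form not found" error from select_form(name=...) usually means that no <form> element on the page carries that name attribute. One way to check is to list the forms the login page actually contains. The snippet below is only a rough sketch in standalone Python 3 using the standard library, not calibre or mechanize code. The FormLister class, the list_forms helper and the SAMPLE markup are invented for illustration; the real markup of www.snob.ru/login is not known here.

```python
from html.parser import HTMLParser


class FormLister(HTMLParser):
    """Collect every <form> with its name, id, action and named fields."""

    def __init__(self):
        super().__init__()
        self.forms = []
        self._in_form = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == 'form':
            self._in_form = True
            self.forms.append({
                'name': attrs.get('name'),
                'id': attrs.get('id'),
                'action': attrs.get('action'),
                'fields': [],
            })
        elif (tag in ('input', 'select', 'textarea')
              and self._in_form and attrs.get('name')):
            # Only controls that have a name can be set by name
            self.forms[-1]['fields'].append(attrs['name'])

    def handle_endtag(self, tag):
        if tag == 'form':
            self._in_form = False


def list_forms(html):
    """Return a list of dicts, one per <form> found in the HTML text."""
    parser = FormLister()
    parser.feed(html)
    return parser.forms


# Hypothetical login page: 'auth-wrapper' is the id of a wrapping <div>,
# while the <form> itself has no name attribute at all.
SAMPLE = '''
<div id="auth-wrapper">
  <form action="/login" method="post">
    <input type="text" name="USERNAME">
    <input type="password" name="PASSWORD">
    <input type="submit" value="Log in">
  </form>
</div>
'''

for index, form in enumerate(list_forms(SAMPLE)):
    print(index, form)
```

If the listing shows a form without a name, mechanize can also select a form by position, for example br.select_form(nr=0) for the first form on the page.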
01-31-2011, 02:38 AM | #2 |
Junior Member
Posts: 5
Karma: 10
Join Date: Jan 2011
Device: Kindle 3 WIFI
|
OK, I found where my problem was and corrected it. Now there is another issue. Calibre downloads all the links it can find on the RSS page, but it does not download the articles themselves, so I end up with just links to these articles. What should I enable in the code to download the articles? Embedded_content is enabled already.
Last edited by sorcer; 01-31-2011 at 09:09 AM. |
01-31-2011, 09:32 AM | #3 | |
Wizard
Posts: 4,004
Karma: 177841
Join Date: Dec 2009
Device: WinMo: IPAQ; Android: HTC HD2, Archos 7o; Java:Gravity T
|
Quote:
|
|
01-31-2011, 09:35 AM | #4 |
Junior Member
Posts: 5
Karma: 10
Join Date: Jan 2011
Device: Kindle 3 WIFI
|
|
01-31-2011, 01:53 PM | #5 |
Wizard
Posts: 4,004
Karma: 177841
Join Date: Dec 2009
Device: WinMo: IPAQ; Android: HTC HD2, Archos 7o; Java:Gravity T
|
|
02-13-2011, 10:10 AM | #6 |
Junior Member
Posts: 4
Karma: 10
Join Date: Feb 2011
Device: Kindle
|
I am having trouble getting the RSS-fed articles from a site, and I think the problem is authorization.
Is there a way to see what mechanize and urllib2 are seeing? Following Kovid Goyal's advice to someone else on another thread, I've looked at the Google Reader builtin, which appears to have the capability that I'm looking for: to parse an arbitrarily complex login page. However, I am not familiar with Python or with the APIs for the methods used, and I am having trouble discerning what some of the functions do. If I could see what the results of each statement were, it would go a long way toward helping me understand what I'm trying to do.
Thanks,
Last edited by EW1(SG); 02-13-2011 at 11:53 AM. |
02-14-2011, 09:12 AM | #7 | |
Wizard
Posts: 4,004
Karma: 177841
Join Date: Dec 2009
Device: WinMo: IPAQ; Android: HTC HD2, Archos 7o; Java:Gravity T
|
Quote:
After
Code:
def get_browser(self):
    br = BasicNewsRecipe.get_browser()
add
Code:
    # Print HTTP headers and other debugging messages
    br.set_debug_http(True)
    br.set_debug_redirects(True)
    br.set_debug_responses(True)
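As far as I know from the mechanize documentation, the redirect and response debug messages are sent through Python's logging module to a logger named "mechanize", so they only show up once that logger has a handler attached. The snippet below is a rough sketch of that setup using the standard library only. The enable_mechanize_logging helper is a made-up name, and whether calibre already configures this logger is an assumption I have not checked.

```python
import io
import logging
import sys


def enable_mechanize_logging(stream=sys.stdout, level=logging.DEBUG):
    """Attach a stream handler to the "mechanize" logger and return it."""
    logger = logging.getLogger('mechanize')
    handler = logging.StreamHandler(stream)
    handler.setFormatter(logging.Formatter('%(name)s: %(message)s'))
    logger.addHandler(handler)
    logger.setLevel(level)
    return logger


# Demonstration without mechanize installed: write a message to the same
# logger by hand and capture it in memory instead of on stdout.
buffer = io.StringIO()
log = enable_mechanize_logging(buffer)
log.debug('redirect to /login')
print(buffer.getvalue(), end='')
```

In a recipe this would be called once, before br.open(...), with the default stream so that the messages end up in the job log.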
|
02-14-2011, 09:35 AM | #8 |
Junior Member
Posts: 4
Karma: 10
Join Date: Feb 2011
Device: Kindle
|
Ah...excellent!! Thank you!
|
|