MobileRead Forums > E-Book Software > Calibre > Recipes
Old 01-28-2011, 11:24 AM   #1
sorcer
Junior Member
sorcer began at the beginning.
 
Posts: 5
Karma: 10
Join Date: Jan 2011
Device: Kindle 3 WIFI
Struggling with one website

Hello!

I have tried to fetch a Russian website, www.snob.ru, with this recipe:



Code:
import re
from calibre.web.feeds.recipes import BasicNewsRecipe


class Snob(BasicNewsRecipe):
    title = 'Snob'
    __author__ = 'Me'
    description = 'Business news from Russian posh magazine'
    timefmt = ' [%a, %d %b, %Y]'
    needs_subscription = True
    oldest_article = 21
    max_articles_per_feed = 50
    no_stylesheets = True
    #delay = 1
    use_embedded_content = False
    encoding = 'utf8'
    publisher = 'Snob Media'
    category = 'news, Russia, world'
    language = 'ru_RU'
    publication_type = 'newsportal'
    extra_css = ' body{ font-family: Verdana,Helvetica,Arial,sans-serif } .introduction{font-weight: bold} .story-feature{display: block; padding: 0; border: 1px solid; width: 40%; font-size: small} .story-feature h2{text-align: center; text-transform: uppercase} '
    # Strip HTML comments before the page is parsed
    preprocess_regexps = [(re.compile(r'<!--.*?-->', re.DOTALL), lambda m: '')]
    conversion_options = {
        'comments': description,
        'tags': category,
        'language': language,
        'publisher': publisher,
        'linearize_tables': True,
    }

    def get_browser(self):
        br = BasicNewsRecipe.get_browser()
        # Log in only when a username and password were supplied
        if self.username is not None and self.password is not None:
            br.open('http://www.snob.ru/login')
            br.select_form(name='auth-wrapper')  # the line the error points at
            br['USERNAME'] = self.username
            br['PASSWORD'] = self.password
            br.submit()
        return br

    # Keep only the article containers
    keep_only_tags = [
        dict(name='div', attrs={'class': ['layout-block-a layout-block']}),
        dict(attrs={'class': ['story-body', 'storybody']}),
    ]

    # Drop sidebars, sharing widgets and all images
    remove_tags = [
        dict(name='div', attrs={'class': ['story-feature related narrow', 'share-help', 'embedded-hyper',
                                          'story-feature wide ', 'story-feature narrow']}),
        dict(name=['img']),
    ]

    remove_attributes = ['width', 'height']

    feeds = [
        ('Politics', 'http://www.snob.ru/rss/blog/927'),
        ('Business', 'http://www.snob.ru/rss/blog/420'),
        ('Science', 'http://www.snob.ru/rss/blog/171'),
        ('Children', 'http://www.snob.ru/rss/blog/70'),
        ('Food and Alcohol', 'http://www.snob.ru/rss/blog/173'),
        ('Health', 'http://www.snob.ru/rss/blog/174'),
        ('Culture', 'http://www.snob.ru/rss/blog/683'),
        ('How to live', 'http://www.snob.ru/rss/blog/170'),
        ('Sex', 'http://www.snob.ru/rss/blog/69'),
        ('Interview', 'http://www.snob.ru/rss/blog/805'),
        ('XX century', 'http://www.snob.ru/rss/blog/416'),
        ('Editorial', 'http://www.snob.ru/rss/blog/894'),
        ('Chichvarkin', 'http://www.snob.ru/rss/pblog/8503'),
    ]


The error I get points at the line br.select_form(name='auth-wrapper'): it says that no form named 'auth-wrapper' was found. Does anyone have an idea how I can log in at www.snob.ru/login before downloading?

Many thanks in advance.
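
A "form not found" error from select_form(name=...) usually means that no form on the page has that value in its name attribute. One way to check is to list the forms that are really there. The following is only a minimal sketch using the Python standard library. The sample page is made up (it assumes, purely for illustration, that 'auth-wrapper' is the id of a surrounding div and that the form itself has no name); it is not the real snob.ru markup, and FormLister and list_forms are hypothetical helper names.

```python
from html.parser import HTMLParser


class FormLister(HTMLParser):
    """Collect every <form> and the names of the fields inside it."""

    def __init__(self):
        super().__init__()
        self.forms = []
        self._current = None  # the form currently being parsed, if any

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "form":
            self._current = {
                "name": attrs.get("name"),
                "id": attrs.get("id"),
                "action": attrs.get("action"),
                "fields": [],
            }
            self.forms.append(self._current)
        elif tag in ("input", "select", "textarea") and self._current is not None:
            # Only named fields can be filled in with br['FIELD'] = value
            if attrs.get("name"):
                self._current["fields"].append(attrs["name"])

    def handle_endtag(self, tag):
        if tag == "form":
            self._current = None


def list_forms(html):
    """Return one dict per <form> found in the given HTML text."""
    parser = FormLister()
    parser.feed(html)
    parser.close()
    return parser.forms


# Made-up login page: the form has an id but no name attribute
SAMPLE_LOGIN_PAGE = """
<div id="auth-wrapper">
  <form id="login" action="/login" method="post">
    <input type="text" name="USERNAME">
    <input type="password" name="PASSWORD">
    <input type="submit" value="Log in">
  </form>
</div>
"""

for position, form in enumerate(list_forms(SAMPLE_LOGIN_PAGE)):
    print(position, form)
```

In a recipe the page text would come from the browser instead of a sample string, for example by reading and decoding the response of br.open(...). If the login form turns out to have no name, mechanize can also select it by position with br.select_form(nr=0).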
Old 01-31-2011, 02:38 AM   #2
sorcer
OK, I found where my problem was and corrected it. Now there is another issue: calibre finds all the links on the RSS page but does not download the articles themselves, so I end up with just links to the articles. What should I enable in the code to download the articles? use_embedded_content is already enabled.

Last edited by sorcer; 01-31-2011 at 09:09 AM.
Old 01-31-2011, 09:32 AM   #3
Starson17
Wizard
Starson17 can program the VCR without an owner's manual.
 
Posts: 4,004
Karma: 177841
Join Date: Dec 2009
Device: WinMo: IPAQ; Android: HTC HD2, Archos 7o; Java:Gravity T
Quote:
Originally Posted by sorcer
OK, I found where my problem was and corrected it. Now there is another issue: calibre finds all the links on the RSS page but does not download the articles themselves, so I end up with just links to the articles. What should I enable in the code to download the articles? use_embedded_content is already enabled.
You want use_embedded_content disabled (False). The "embedded content" is what is on the RSS page itself. You want the article content that is not embedded, i.e. the page that each feed item links to.
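
To make the distinction concrete, here is a small sketch that parses a made-up RSS item with the standard library. The feed text, the example.com URL and the feed_items helper are all assumptions for illustration. With use_embedded_content = True calibre treats the text carried inside the feed as the article; with False it treats the feed as having no embedded articles and downloads the linked page.

```python
import xml.etree.ElementTree as ET

# Made-up feed with a single item; real feeds carry more fields
SAMPLE_FEED = """<rss version="2.0">
  <channel>
    <title>Example feed</title>
    <item>
      <title>First article</title>
      <link>http://example.com/articles/1</link>
      <description>A one-line teaser.</description>
    </item>
  </channel>
</rss>"""


def feed_items(xml_text):
    """Return title, link and embedded text for each <item> in an RSS 2.0 feed."""
    root = ET.fromstring(xml_text)
    return [
        {
            "title": item.findtext("title"),
            "link": item.findtext("link"),
            "embedded": item.findtext("description"),
        }
        for item in root.iter("item")
    ]


for item in feed_items(SAMPLE_FEED):
    print("embedded in the feed:", item["embedded"])
    print("full article lives at:", item["link"])
```

If the feed only carries a teaser like the one above, treating it as embedded content yields an e-book of titles and teasers, which matches the symptom described.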
Old 01-31-2011, 09:35 AM   #4
sorcer
Quote:
Originally Posted by Starson17
You want use_embedded_content disabled (False). The "embedded content" is what is on the RSS page itself. You want the article content that is not embedded.
You are probably right. The idea is that I want the article itself, not just its title.
Old 01-31-2011, 01:53 PM   #5
Starson17
Quote:
Originally Posted by sorcer
You are probably right. The idea is that I want the article itself, not just its title.
It sounds like your authorization isn't working.
Old 02-13-2011, 10:10 AM   #6
EW1(SG)
Junior Member
EW1(SG) began at the beginning.
 
Posts: 4
Karma: 10
Join Date: Feb 2011
Device: Kindle
Quote:
Originally Posted by Starson17
It sounds like your authorization isn't working.
I am having trouble getting the RSS-fed articles from a site, and I think the problem is authorization.

Is there a way to see what mechanize and urllib2 are seeing? Following Kovid Goyal's advice to someone else on another thread, I've looked at the Google Reader builtin recipe, which appears to have the capability I'm looking for: parsing an arbitrarily complex login page. However, I am not familiar with Python or with the APIs of the methods used, and I am having trouble working out what some of the functions do.

If I could see the result of each statement, it would go a long way toward helping me understand what I'm trying to do.

Thanks,

Last edited by EW1(SG); 02-13-2011 at 11:53 AM.
Old 02-14-2011, 09:12 AM   #7
Starson17
Quote:
Originally Posted by EW1(SG)
I am having trouble getting the RSS-fed articles from a site, and I think the problem is authorization.

Is there a way to see what mechanize and urllib2 are seeing? Following Kovid Goyal's advice to someone else on another thread, I've looked at the Google Reader builtin recipe, which appears to have the capability I'm looking for: parsing an arbitrarily complex login page. However, I am not familiar with Python or with the APIs of the methods used, and I am having trouble working out what some of the functions do.

If I could see the result of each statement, it would go a long way toward helping me understand what I'm trying to do.

Thanks,
I wrote the authorization portion of the Google Reader recipe, and you are right: you need to see the HTTP header handshaking to debug.
After
Code:
def get_browser(self):
    br = BasicNewsRecipe.get_browser()
You need to set the following debug options:
Code:
    # Print HTTP headers and other debugging messages
    br.set_debug_http(True)
    br.set_debug_redirects(True)
    br.set_debug_responses(True)
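
As far as I can tell from the mechanize documentation, set_debug_http prints the headers directly, while the redirect and response debugging goes through Python's logging module under logger names that start with "mechanize", so nothing shows up unless a handler is attached. The sketch below attaches one using only the standard library. The helper name enable_mechanize_logging is made up, and whether calibre's own logging setup changes the picture is something I have not verified.

```python
import logging
import sys


def enable_mechanize_logging(stream=sys.stdout, level=logging.INFO):
    """Attach a stream handler to the "mechanize" logger so its output is shown."""
    logger = logging.getLogger("mechanize")
    # Avoid stacking duplicate handlers if this is called more than once
    if not any(isinstance(h, logging.StreamHandler) for h in logger.handlers):
        logger.addHandler(logging.StreamHandler(stream))
    logger.setLevel(level)
    return logger


# Call this once, before the browser opens the login page
logger = enable_mechanize_logging()
```

Child loggers such as "mechanize.http_redirects" pass their records up to the "mechanize" logger, so a single handler on the parent should be enough to see both redirects and response bodies.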
Old 02-14-2011, 09:35 AM   #8
EW1(SG)
Ah...excellent!! Thank you!
All times are GMT -4.