Old 04-05-2011, 05:13 AM   #1
Selcal
Member
Selcal began at the beginning.
 
Posts: 16
Karma: 10
Join Date: Jul 2010
Device: PRS600 / Cybook Opus
[Updated] Recipe release for De Volkskrant

Hi all,

For some time I've been working on a recipe to download the Dutch newspaper de Volkskrant, for subscribers (password needed). I've managed to get it working now (of course, right in the middle of my work they overhauled the site :S), but I still stumble across some problems now and then.

The main problem is connection errors. These occur every now and then from home, but frequently when I access the site from abroad on a shabby connection. The error is a proxy error:
Code:
Python function terminated unexpectedly
  HTTP Error 502: Proxy Error (Error Code: 1)
Traceback (most recent call last):
  File "site.py", line 103, in main
  File "site.py", line 85, in run_entry_point
  File "site-packages\calibre\utils\ipc\worker.py", line 110, in main
  File "site-packages\calibre\gui2\convert\gui_conversion.py", line 25, in gui_convert
  File "site-packages\calibre\ebooks\conversion\plumber.py", line 904, in run
  File "site-packages\calibre\customize\conversion.py", line 204, in __call__
  File "site-packages\calibre\web\feeds\input.py", line 105, in convert
  File "site-packages\calibre\web\feeds\news.py", line 734, in download
  File "site-packages\calibre\web\feeds\news.py", line 871, in build_index
  File "c:\users\mediac~1\appdata\local\temp\calibre_0.7.45_tmp_2sa8fy\calibre_0.7.45_pxqlpu_recipes\recipe0.py", line 55, in parse_index
    soup = self.index_to_soup(_INDEX)
  File "site-packages\calibre\web\feeds\news.py", line 495, in index_to_soup
  File "site-packages\mechanize-0.2.4-py2.7.egg\mechanize\_mechanize.py", line 199, in open_novisit
  File "site-packages\mechanize-0.2.4-py2.7.egg\mechanize\_mechanize.py", line 255, in _mech_open
mechanize._response.httperror_seek_wrapper: HTTP Error 502: Proxy Error
I don't use a proxy at all, and usually restarting the recipe two or three times (sometimes many more) eventually leads to a successful download.

Is there a way to change a Calibre setting, or something in the recipe, to make it retry during the download? It seems a waste of time to simply give up after one error.

Next, I can't find a way to set the tags from within the recipe. Right now they default to "De Volkskrant, News", and I don't want both there (on my Sony reader the newspaper then shows up in two categories). Can the tags be set from the recipe?

And finally, can I change the title of the output file dynamically from the recipe? I.e. if I download a different date (yesterday's newspaper, for example), I would like the title to read "De Volkskrant" followed by the date of that edition... Today's date is of no value.

These problems are listed in order of priority. Thanks for any help!

Last edited by Selcal; 04-27-2011 at 02:18 PM.
Selcal is offline   Reply With Quote
Old 04-06-2011, 03:46 PM   #2
Selcal
Member
Selcal began at the beginning.
 
Posts: 16
Karma: 10
Join Date: Jul 2010
Device: PRS600 / Cybook Opus
OK, I understand I can use a try/except block or similar, but my knowledge of Python is too limited to work out how to do it. Here's the code that produces the error:
Code:
def parse_index(self):
        krant = []
        def strip_title(_title):
            # Return everything up to the first ':' in the title (or the whole title if there is none)
            i = 0
            while ((i < len(_title)) and (_title[i] != ":")):
               i = i + 1
            return(_title[0:i])
        soup = self.index_to_soup(self.INDEX_MAIN)
        mainsoup = soup.find('td', attrs={'id': 'select_page_top'})
        for option in mainsoup.findAll('option'):
           articles = []
           _INDEX = 'http://www.volkskrant.nl/vk-online/VK/' + self.RETRIEVEDATE + '___/' + option['value'] + '/#text'
           _INDEX_ARTICLE = 'http://www.volkskrant.nl/vk-online/VK/' + self.RETRIEVEDATE + '___/' + option['value'] + '/'
           print ''
           print '<-------    Processing section: ' + _INDEX + ' ------------------------->'
           soup = self.index_to_soup(_INDEX)
           for item in soup.findAll('area'):
              art_nr = item['class']
              attrname = art_nr[0:12] + '_section' + option['value'][0:5] + '_' + art_nr[26:len(art_nr)]
              print '==> Found: ' + attrname
              index_title = soup.find('div', attrs={'class': attrname})
              get_title = index_title['title']
              _ARTICLE   = _INDEX_ARTICLE + attrname + '.html#text'
              title = get_title
              print '--> Title: ' + title
              print '--> URL: ' + _ARTICLE
              souparticle = self.index_to_soup(_ARTICLE)
              headerurl = souparticle.findAll('frame')[0]['src']
              print '--> Read frame name for header: ' + headerurl
              url = _INDEX_ARTICLE + headerurl[0:len(headerurl)-12] + '_text.html'
              print '--> Corrected URL: ' + url
              if (get_title != ''):
                 title = strip_title(get_title)
                 date  = strftime(' %B %Y')
              if (title != ''):
                 articles.append({
                                         'title'      :title
                                        ,'date'       :date
                                        ,'url'        :url
                                        ,'description':''
                                        })
           krant.append( (option.string, articles))
        return krant
Selcal is offline   Reply With Quote
Old 04-08-2011, 02:34 PM   #3
Starson17
Wizard
Starson17 can program the VCR without an owner's manual.
 
Posts: 4,004
Karma: 177841
Join Date: Dec 2009
Device: WinMo: IPAQ; Android: HTC HD2, Archos 7o; Java:Gravity T
Quote:
Originally Posted by Selcal View Post
Is there a way to change a Calibre setting, or something in the recipe, to make it retry during the download? It seems a waste of time to simply give up after one error.
I had trouble with intermittent errors in some of the comic recipes. To get them to retry I used:
Code:
try:
except:
and counted the errors, then gave up if there had been too many. For some recipes, such as the comic recipes, I just move on to the next comic if I get an error.
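
For illustration, here is a minimal sketch of that retry-and-count idea (not taken from any builtin recipe; fetch_with_retries and max_errors are made-up names, and fetch stands in for whatever call keeps failing):
Code:
def fetch_with_retries(fetch, max_errors=3):
    # Call fetch() until it succeeds, or give up after max_errors failures.
    errors = 0
    while errors < max_errors:
        try:
            return fetch()      # success: hand the result straight back
        except Exception:
            errors += 1         # count the failure and try again
    return None                 # too many errors: let the caller skip or abort
Inside a recipe you could then do something like soup = fetch_with_retries(lambda: self.index_to_soup(url)) and simply skip that section (or comic) when it comes back None.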
Starson17 is offline   Reply With Quote
Old 04-09-2011, 06:44 AM   #4
Selcal
Member
Selcal began at the beginning.
 
Posts: 16
Karma: 10
Join Date: Jul 2010
Device: PRS600 / Cybook Opus
Thanks, that sounds like a good option. Can you give a quick example of how you use those statements? Online I can't find anything that I can easily adapt to what I understand. Maybe you can post a quick snippet of the comic code where you use this?

Thanks for your help!
Selcal is offline   Reply With Quote
Old 04-11-2011, 10:49 AM   #5
Starson17
Wizard
Starson17 can program the VCR without an owner's manual.
 
Posts: 4,004
Karma: 177841
Join Date: Dec 2009
Device: WinMo: IPAQ; Android: HTC HD2, Archos 7o; Java:Gravity T
Quote:
Originally Posted by Selcal View Post
Thanks, that sounds like a good option. Can you give a quick example of how you use those statements? Online I can't find anything that I can easily adapt to what I understand. Maybe you can post a quick snippet of the comic code where you use this?

Thanks for your help!
I went back and the code is not quite as I remembered it. You can look at it yourself (it's in the gocomics.com builtin), but here's a relevant piece:
Spoiler:
Code:
    def make_links(self, url):
        title = 'Temp'
        current_articles = []
        pages = range(1, self.num_comics_to_get+1)
        for page in pages:
            page_soup = self.index_to_soup(url)
            if page_soup:
                try:
                  strip_title = page_soup.find(name='div', attrs={'class':'top'}).h1.a.string
                except:
                  strip_title = 'Error - no Title found'
                try:
                  date_title = page_soup.find('ul', attrs={'class': 'feature-nav'}).li.string
                  if not date_title:
                      date_title = page_soup.find('ul', attrs={'class': 'feature-nav'}).li.string
                except:
                  date_title = 'Error - no Date found'
                title = strip_title + ' - ' + date_title
                for i in range(2):
                  try:
                    strip_url_date = page_soup.find(name='div', attrs={'class':'top'}).h1.a['href']
                    break #success - this is normal exit
                  except:
                    strip_url_date = None
                    continue #try to get strip_url_date again
                for i in range(2):
                  try:
                    prev_strip_url_date = page_soup.find('a', attrs={'class': 'prev'})['href']
                    break #success - this is normal exit
                  except:
                    prev_strip_url_date = None
                    continue #try to get prev_strip_url_date again
                if strip_url_date:
                  page_url = 'http://www.gocomics.com' + strip_url_date
                else:
                  continue
                if prev_strip_url_date:
                  prev_page_url = 'http://www.gocomics.com' + prev_strip_url_date
                else:
                  continue
            current_articles.append({'title': title, 'url': page_url, 'description':'', 'date':''})
            url = prev_page_url
        current_articles.reverse()
        return current_articles


I think you'll want to look carefully at your exact error. In my case, I had trouble understanding what was failing: I would get an error that an element on the page wasn't found, the recipe would bomb, then I'd print the soup, and I'd find that element. It seemed to be cured by the code above. In your case, you may need to do the page fetch itself multiple times. The code above (particularly the "for i in range(2):" parts) seems to have fetched only once; I vaguely recall puzzling over why I couldn't find content that seemed to be there, so I added some retries of the href find. In your case, it should be possible to add multiple fetches if that's needed.
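
To make that concrete for your case, a minimal sketch of retrying the page fetch itself inside parse_index could look something like this (the retry count of 3 is arbitrary, and _INDEX is the section URL from your own code):
Code:
soup = None
for attempt in range(3):                   # arbitrary number of attempts
    try:
        soup = self.index_to_soup(_INDEX)  # the fetch that raises the 502
        break                              # success: stop retrying
    except Exception:
        continue                           # transient error, try again
if soup is None:
    raise Exception('Giving up on ' + _INDEX + ' after repeated errors')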
Starson17 is offline   Reply With Quote
Old 04-12-2011, 07:41 AM   #6
Selcal
Member
Selcal began at the beginning.
 
Posts: 16
Karma: 10
Join Date: Jul 2010
Device: PRS600 / Cybook Opus
I think I can adapt that for the fetch -- it's definitely the fetch that fails in my case. I'm on a dodgy connection now so it's the perfect time to try.
Selcal is offline   Reply With Quote
Old 04-13-2011, 08:46 AM   #7
Selcal
Member
Selcal began at the beginning.
 
Posts: 16
Karma: 10
Join Date: Jul 2010
Device: PRS600 / Cybook Opus
Seems to work so far! I'll test it for a while; if it stays stable I'll put the recipe up!

Thanks for your help here Starson17!

Any chance on the tag/title setting from the recipe?
Selcal is offline   Reply With Quote
Old 04-13-2011, 10:00 AM   #8
Starson17
Wizard
Starson17 can program the VCR without an owner's manual.
 
Posts: 4,004
Karma: 177841
Join Date: Dec 2009
Device: WinMo: IPAQ; Android: HTC HD2, Archos 7o; Java:Gravity T
Quote:
Originally Posted by Selcal View Post
Any chance on the tag/title setting from the recipe?
I'm a bit embarrassed that I'm not sure about that; I was hoping Kovid would answer it for both of us. I have seen it discussed, and to the best of my recollection (but don't rely on this):

The tags can't be set from inside the recipe. You can turn off the recipe name tag and add other tags from the GUI.

The title - I know you can kill the date part of the title, and I'm pretty sure you can change it to any date you want. There are recipes that do remove the date so that new versions replace old versions on the reader. I just didn't have the time to search the code to tell you how.

Of course, recipes are so powerful that if you want to dig really deep, you can usually do anything you want, so even the tags may be possible.

Sorry I can't help more. I'm a bit jammed up now with work.
Starson17 is offline   Reply With Quote
Old 04-13-2011, 10:49 AM   #9
kovidgoyal
creator of calibre
kovidgoyal ought to be getting tired of karma fortunes by now.
 
kovidgoyal's Avatar
 
Posts: 43,858
Karma: 22666666
Join Date: Oct 2006
Location: Mumbai, India
Device: Various
You cannot affect tags from within the recipe.

Set
Code:
timefmt = ''
to remove the date from the title.
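
For reference, timefmt is just a class attribute on the recipe, so a minimal sketch (class name and title borrowed from the recipe in this thread) would be:
Code:
from calibre.web.feeds.news import BasicNewsRecipe

class Volkskrant_full(BasicNewsRecipe):
    title   = 'De Volkskrant'
    timefmt = ''   # empty string: nothing is appended to the title
As far as I know timefmt is an ordinary strftime-style format string, so it can also be changed to a different date format rather than emptied.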
kovidgoyal is offline   Reply With Quote
Old 04-18-2011, 05:17 AM   #10
Selcal
Member
Selcal began at the beginning.
 
Posts: 16
Karma: 10
Join Date: Jul 2010
Device: PRS600 / Cybook Opus
Thanks! I've set that as an option in case the recipe is used to download a specific date. I'll try to get the tags the way I want in the GUI.

So far the recipe seems to work reliably. I'll test it for this whole week.
Selcal is offline   Reply With Quote
Old 04-27-2011, 12:07 PM   #11
Selcal
Member
Selcal began at the beginning.
 
Posts: 16
Karma: 10
Join Date: Jul 2010
Device: PRS600 / Cybook Opus

A week with no trouble! I'm happy to release this now. Python is not my first language, so to speak, so any optimizations are welcome. I've tried to output relevant information so the job details show what is happening.

The file is attached (zipped), and is also visible here:
Spoiler:
Code:
from calibre import strftime
from calibre.ebooks.BeautifulSoup import BeautifulSoup
from BeautifulSoup import BeautifulStoneSoup
from calibre.web.feeds.news import BasicNewsRecipe

class Volkskrant_full(BasicNewsRecipe):
    # This recipe will download the Volkskrant newspaper,
    # from the subscribers site. It requires a password.
    # Known issues are: articles that are spread out over
    # multiple pages will appear multiple times. Pages
    # that contain only adverts will appear, but empty.
    # The supplement 'Volkskrant Magazine' on saturday
    # is currently not downloaded.
    # You can set a manual date to download an archived
    # newspaper; Volkskrant keeps over a month available at
    # the time of writing. To do so, set RETRIEVEDATE below
    # to the desired date and add that date to the title. Then
    # follow the instructions marked further below.

    title = 'De Volkskrant' # [za, 13 nov 2010]'
    __author__ = u'Selcal'
    description = u"Volkskrant"
    oldest_article = 30
    max_articles_per_feed = 100
    no_stylesheets = True
    use_embedded_content = False
    simultaneous_downloads = 1
    delay = 1
    needs_subscription = True
    # Set RETRIEVEDATE to 'yyyymmdd' to load an older
    # edition; otherwise keep '%Y%m%d'.
    # When setting a manual date, add that date to the
    # title above and uncomment the timefmt line to stop
    # calibre from adding today's date as well.

    # timefmt = ''
    RETRIEVEDATE = strftime('%Y%m%d')
    INDEX_MAIN = 'http://www.volkskrant.nl/vk-online/VK/' + RETRIEVEDATE + '___/VKN01_001/#text'
    INDEX_ARTICLE = 'http://www.volkskrant.nl/vk-online/VK/' + RETRIEVEDATE + '___/VKN01_001/'
    LOGIN = 'http://www.volkskrant.nl/vk/user/loggedIn.do'
    remove_tags = [dict(name='address')]
    cover_url = 'http://www.volkskrant.nl/vk-online/VK/' + RETRIEVEDATE + '___/VKN01_001/page.jpg'
	
    def get_browser(self):
        br = BasicNewsRecipe.get_browser()

        if self.username is not None and self.password is not None:
           br.open(self.LOGIN)
           br.select_form(nr = 0)
           br['username'] = self.username
           br['password'] = self.password
           br.submit()
        return br
        
    def parse_index(self):
        krant = []
        def strip_title(_title):
            # Return everything up to the first ':' in the title (or the whole title if there is none)
            i = 0
            while ((i < len(_title)) and (_title[i] != ":")):
               i = i + 1
            return(_title[0:i])
        for temp in range(5):
              try:
                soup = self.index_to_soup(self.INDEX_MAIN)
                break
              except:
                print '(Retrying main index load)'
                continue
        mainsoup = soup.find('td', attrs={'id': 'select_page_top'})
        for option in mainsoup.findAll('option'):
           articles = []
           _INDEX = 'http://www.volkskrant.nl/vk-online/VK/' + self.RETRIEVEDATE + '___/' + option['value'] + '/#text'
           _INDEX_ARTICLE = 'http://www.volkskrant.nl/vk-online/VK/' + self.RETRIEVEDATE + '___/' + option['value'] + '/'
           print ''
           print '<-------    Processing section: ' + _INDEX + ' ------------------------->'
           for temp in range(5):
              try:
                soup = self.index_to_soup(_INDEX)
                break
              except:
                print '(Retrying index load)'
                continue
           for item in soup.findAll('area'):
              art_nr = item['class']
              attrname = art_nr[0:12] + '_section' + option['value'][0:5] + '_' + art_nr[26:len(art_nr)]
              print '==> Found: ' + attrname
              index_title = soup.find('div', attrs={'class': attrname})
              get_title = index_title['title']
              _ARTICLE   = _INDEX_ARTICLE + attrname + '.html#text'
              title = get_title
              print '--> Title: ' + title
              print '--> URL: ' + _ARTICLE
              for temp in range(5):
                 try:
                   souparticle = self.index_to_soup(_ARTICLE)
                   break
                 except:
                   print '(Retrying URL load)'
                   continue
              headerurl = souparticle.findAll('frame')[0]['src']
              print '--> Read frame name for header: ' + headerurl
              url = _INDEX_ARTICLE + headerurl[0:len(headerurl)-12] + '_text.html'
              print '--> Corrected URL: ' + url
              if (get_title != ''):
                 title = strip_title(get_title)
                 date  = strftime(' %B %Y')
              if (title != ''):
                 articles.append({
                                         'title'      :title
                                        ,'date'       :date
                                        ,'url'        :url
                                        ,'description':''
                                        })
           krant.append( (option.string, articles))
        return krant
Attached Files
File Type: zip De Volkskrant.zip (1.8 KB, 140 views)
Selcal is offline   Reply With Quote
Old 05-05-2011, 02:09 AM   #12
ruudh
Member
ruudh began at the beginning.
 
Posts: 1
Karma: 10
Join Date: Mar 2011
Location: Amsterdam
Device: Kindle3
Thanks!

Just wanted to thank Selcal for a great piece of work!
Finally I'm able to read my favorite newspaper on my Kindle3.
Any chance you could do a subscription version of NRC Handelsblad as well?
ruudh is offline   Reply With Quote
Old 05-09-2011, 04:10 AM   #13
Selcal
Member
Selcal began at the beginning.
 
Posts: 16
Karma: 10
Join Date: Jul 2010
Device: PRS600 / Cybook Opus
Glad to have been of help.

As for the NRC, I have no idea. I don't have a subscription to it, and I don't know how the subscribers' site is set up, so without being a subscriber it's pretty much impossible, I'm afraid...
Selcal is offline   Reply With Quote