Old 04-05-2011, 05:13 AM   #1
Selcal
Member
Selcal began at the beginning.
 
Posts: 16
Karma: 10
Join Date: Jul 2010
Device: PRS600 / Cybook Opus
[Updated] Recipe release for De Volkskrant

Hi all,

For some time I've been working on a recipe to download the Dutch newspaper de Volkskrant, for subscribers (password needed). I've managed to get it working now (of course, right in the middle of my work they overhauled the site :S), but I still stumble across some problems now and then.

The main problem is connection errors. These occur every now and then from home, but frequently when I access the site from abroad on a shabby connection. The error is a proxy error:
Code:
Python function terminated unexpectedly
  HTTP Error 502: Proxy Error (Error Code: 1)
Traceback (most recent call last):
  File "site.py", line 103, in main
  File "site.py", line 85, in run_entry_point
  File "site-packages\calibre\utils\ipc\worker.py", line 110, in main
  File "site-packages\calibre\gui2\convert\gui_conversion.py", line 25, in gui_convert
  File "site-packages\calibre\ebooks\conversion\plumber.py", line 904, in run
  File "site-packages\calibre\customize\conversion.py", line 204, in __call__
  File "site-packages\calibre\web\feeds\input.py", line 105, in convert
  File "site-packages\calibre\web\feeds\news.py", line 734, in download
  File "site-packages\calibre\web\feeds\news.py", line 871, in build_index
  File "c:\users\mediac~1\appdata\local\temp\calibre_0.7.45_tmp_2sa8fy\calibre_0.7.45_pxqlpu_recipes\recipe0.py", line 55, in parse_index
    soup = self.index_to_soup(_INDEX)
  File "site-packages\calibre\web\feeds\news.py", line 495, in index_to_soup
  File "site-packages\mechanize-0.2.4-py2.7.egg\mechanize\_mechanize.py", line 199, in open_novisit
  File "site-packages\mechanize-0.2.4-py2.7.egg\mechanize\_mechanize.py", line 255, in _mech_open
mechanize._response.httperror_seek_wrapper: HTTP Error 502: Proxy Error
I don't use a proxy at all, and usually restarting the recipe two or three times (sometimes many more) eventually leads to a successful download.

Is there a way to change a Calibre setting, or something in the recipe, to make it retry during the download? It seems a waste of time to simply give up after one error.

Next, I can't find a way to set the tags from within the recipe. Right now they default to "De Volkskrant, News", and I don't want both there (on my Sony reader the newspaper then shows up in two categories). Can the tags be set from the recipe?

And finally, can I change the title of the output file dynamically from the recipe? I.e. if I download a different date (yesterday's newspaper, for example), I would like the title to read "De Volkskrant" followed by the date of that edition... Today's date is of no value.

These problems are listed in order of priority. Thanks for any help!

Last edited by Selcal; 04-27-2011 at 02:18 PM.
Selcal is offline   Reply With Quote
Old 04-06-2011, 03:46 PM   #2
Selcal
Member
Selcal began at the beginning.
 
Posts: 16
Karma: 10
Join Date: Jul 2010
Device: PRS600 / Cybook Opus
OK, I understand I can use a try/except block or similar, but my knowledge of Python is too limited to work out how to do it. Here's the code that produces the error:
Code:
def parse_index(self):
        krant = []
        def strip_title(_title):
            # Return everything up to the first ':' in the title (or the whole title if there is none)
            i = 0
            while ((i < len(_title)) and (_title[i] != ":")):
               i = i + 1
            return(_title[0:i])
        soup = self.index_to_soup(self.INDEX_MAIN)
        mainsoup = soup.find('td', attrs={'id': 'select_page_top'})
        for option in mainsoup.findAll('option'):
           articles = []
           _INDEX = 'http://www.volkskrant.nl/vk-online/VK/' + self.RETRIEVEDATE + '___/' + option['value'] + '/#text'
           _INDEX_ARTICLE = 'http://www.volkskrant.nl/vk-online/VK/' + self.RETRIEVEDATE + '___/' + option['value'] + '/'
           print ''
           print '<-------    Processing section: ' + _INDEX + ' ------------------------->'
           soup = self.index_to_soup(_INDEX)
           for item in soup.findAll('area'):
              art_nr = item['class']
              attrname = art_nr[0:12] + '_section' + option['value'][0:5] + '_' + art_nr[26:len(art_nr)]
              print '==> Found: ' + attrname
              index_title = soup.find('div', attrs={'class': attrname})
              get_title = index_title['title']
              _ARTICLE   = _INDEX_ARTICLE + attrname + '.html#text'
              title = get_title
              print '--> Title: ' + title
              print '--> URL: ' + _ARTICLE
              souparticle = self.index_to_soup(_ARTICLE)
              headerurl = souparticle.findAll('frame')[0]['src']
              print '--> Read frame name for header: ' + headerurl
              url = _INDEX_ARTICLE + headerurl[0:len(headerurl)-12] + '_text.html'
              print '--> Corrected URL: ' + url
              if (get_title != ''):
                 title = strip_title(get_title)
                 date  = strftime(' %B %Y')
              if (title != ''):
                 articles.append({
                                         'title'      :title
                                        ,'date'       :date
                                        ,'url'        :url
                                        ,'description':''
                                        })
           krant.append( (option.string, articles))
        return krant
Selcal is offline   Reply With Quote
Old 04-08-2011, 02:34 PM   #3
Starson17
Wizard
Starson17 can program the VCR without an owner's manual.
 
Posts: 4,004
Karma: 177841
Join Date: Dec 2009
Device: WinMo: IPAQ; Android: HTC HD2, Archos 7o; Java:Gravity T
Quote:
Originally Posted by Selcal View Post
Is there a way to change a Calibre setting, or something in the recipe, to make it retry during the download? It seems a waste of time to simply give up after one error.
I had trouble with intermittent errors in some of the comic recipes. To get them to retry I used:
Code:
try:
except:
and counted the errors, then gave up if there had been too many. For some recipes, such as the comic recipes, I just move on to the next comic if I get an error.
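
For illustration, here is a minimal sketch of that retry-and-count idea (not taken from any builtin recipe; fetch_with_retries and max_errors are made-up names, and fetch stands in for whatever call keeps failing):
Code:
def fetch_with_retries(fetch, max_errors=3):
    # Call fetch() until it succeeds, or give up after max_errors failures.
    errors = 0
    while errors < max_errors:
        try:
            return fetch()      # success: hand the result straight back
        except Exception:
            errors += 1         # count the failure and try again
    return None                 # too many errors: let the caller skip or abort
Inside a recipe you could then do something like soup = fetch_with_retries(lambda: self.index_to_soup(url)) and simply skip that section (or comic) when it comes back None.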
Starson17 is offline   Reply With Quote
Old 04-09-2011, 06:44 AM   #4
Selcal
Member
Selcal began at the beginning.
 
Posts: 16
Karma: 10
Join Date: Jul 2010
Device: PRS600 / Cybook Opus
Thanks, that sounds like a good option. Can you give a quick example of how you use those statements? Online I can't find anything that I can easily adapt to what I understand. Maybe you can post a quick snippet of the comic code where you use this?

Thanks for your help!
Selcal is offline   Reply With Quote
Old 04-11-2011, 10:49 AM   #5
Starson17
Wizard
Starson17 can program the VCR without an owner's manual.
 
Posts: 4,004
Karma: 177841
Join Date: Dec 2009
Device: WinMo: IPAQ; Android: HTC HD2, Archos 7o; Java:Gravity T
Quote:
Originally Posted by Selcal View Post
Thanks, that sounds like a good option. Can you give a quick example of how you use those statements? Online I can't find anything that I can easily adapt to what I understand. Maybe you can post a quick snippet of the comic code where you use this?

Thanks for your help!
I went back and the code is not quite as I remembered it. You can look at it yourself (it's in the gocomics.com builtin), but here's a relevant piece:
Spoiler:
Code:
    def make_links(self, url):
        title = 'Temp'
        current_articles = []
        pages = range(1, self.num_comics_to_get+1)
        for page in pages:
            page_soup = self.index_to_soup(url)
            if page_soup:
                try:
                  strip_title = page_soup.find(name='div', attrs={'class':'top'}).h1.a.string
                except:
                  strip_title = 'Error - no Title found'
                try:
                  date_title = page_soup.find('ul', attrs={'class': 'feature-nav'}).li.string
                  if not date_title:
                      date_title = page_soup.find('ul', attrs={'class': 'feature-nav'}).li.string
                except:
                  date_title = 'Error - no Date found'
                title = strip_title + ' - ' + date_title
                for i in range(2):
                  try:
                    strip_url_date = page_soup.find(name='div', attrs={'class':'top'}).h1.a['href']
                    break #success - this is normal exit
                  except:
                    strip_url_date = None
                    continue #try to get strip_url_date again
                for i in range(2):
                  try:
                    prev_strip_url_date = page_soup.find('a', attrs={'class': 'prev'})['href']
                    break #success - this is normal exit
                  except:
                    prev_strip_url_date = None
                    continue #try to get prev_strip_url_date again
                if strip_url_date:
                  page_url = 'http://www.gocomics.com' + strip_url_date
                else:
                  continue
                if prev_strip_url_date:
                  prev_page_url = 'http://www.gocomics.com' + prev_strip_url_date
                else:
                  continue
            current_articles.append({'title': title, 'url': page_url, 'description':'', 'date':''})
            url = prev_page_url
        current_articles.reverse()
        return current_articles


I think you'll want to look carefully at your exact error. In my case, I had trouble understanding what was failing: I would get an error that an element on the page wasn't found, the recipe would bomb, then I'd print the soup, and I'd find that element. It seemed to be cured by the code above. In your case, you may need to do the page fetch itself multiple times. The code above (particularly the "for i in range(2):" parts) seems to have fetched only once; I vaguely recall puzzling over why I couldn't find content that seemed to be there, so I added some retries of the href find. In your case, it should be possible to add multiple fetches if that's needed.
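
To make that concrete for your case, a minimal sketch of retrying the page fetch itself inside parse_index could look something like this (the retry count of 3 is arbitrary, and _INDEX is the section URL from your own code):
Code:
soup = None
for attempt in range(3):                   # arbitrary number of attempts
    try:
        soup = self.index_to_soup(_INDEX)  # the fetch that raises the 502
        break                              # success: stop retrying
    except Exception:
        continue                           # transient error, try again
if soup is None:
    raise Exception('Giving up on ' + _INDEX + ' after repeated errors')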
Starson17 is offline   Reply With Quote
Old 04-12-2011, 07:41 AM   #6
Selcal
Member
Selcal began at the beginning.
 
Posts: 16
Karma: 10
Join Date: Jul 2010
Device: PRS600 / Cybook Opus
I think I can adapt that for the fetch -- it's definitely the fetch that fails in my case. I'm on a dodgy connection now so it's the perfect time to try.
Selcal is offline   Reply With Quote
Old 04-13-2011, 08:46 AM   #7
Selcal
Member
Selcal began at the beginning.
 
Posts: 16
Karma: 10
Join Date: Jul 2010
Device: PRS600 / Cybook Opus
Seems to work so far! I'll test it for a while; if it stays stable I'll put the recipe up!

Thanks for your help here Starson17!

Any chance on the tag/title setting from the recipe?
Selcal is offline   Reply With Quote
Old 04-13-2011, 10:00 AM   #8
Starson17
Wizard
Starson17 can program the VCR without an owner's manual.
 
Posts: 4,004
Karma: 177841
Join Date: Dec 2009
Device: WinMo: IPAQ; Android: HTC HD2, Archos 7o; Java:Gravity T
Quote:
Originally Posted by Selcal View Post
Any chance on the tag/title setting from the recipe?
I'm a bit embarrassed that I'm not sure about that; I was hoping Kovid would answer it for both of us. I have seen it discussed, and to the best of my recollection (but don't rely on this):

The tags can't be set from inside the recipe. You can turn off the recipe name tag and add other tags from the GUI.

The title - I know you can kill the date part of the title, and I'm pretty sure you can change it to any date you want. There are recipes that do remove the date so that new versions replace old versions on the reader. I just didn't have the time to search the code to tell you how.

Of course, recipes are so powerful that if you want to dig really deep, you can usually do anything you want, so even the tags may be possible.

Sorry I can't help more. I'm a bit jammed up now with work.
Starson17 is offline   Reply With Quote
Old 04-13-2011, 10:49 AM   #9
kovidgoyal
creator of calibre
kovidgoyal ought to be getting tired of karma fortunes by now.
 
kovidgoyal's Avatar
 
Posts: 43,858
Karma: 22666666
Join Date: Oct 2006
Location: Mumbai, India
Device: Various
You cannot affect tags from within the recipe.

Set
Code:
timefmt = ''
to remove the date from the title.
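
For reference, timefmt is just a class attribute on the recipe, so a minimal sketch (class name and title borrowed from the recipe in this thread) would be:
Code:
from calibre.web.feeds.news import BasicNewsRecipe

class Volkskrant_full(BasicNewsRecipe):
    title   = 'De Volkskrant'
    timefmt = ''   # empty string: nothing is appended to the title
As far as I know timefmt is an ordinary strftime-style format string, so it can also be changed to a different date format rather than emptied.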
kovidgoyal is offline   Reply With Quote
Old 04-18-2011, 05:17 AM   #10
Selcal
Member
Selcal began at the beginning.
 
Posts: 16
Karma: 10
Join Date: Jul 2010
Device: PRS600 / Cybook Opus
Thanks! I've set that as an option in case the recipe is used to download a specific date. I'll try to get the tags the way I want in the GUI.

So far the recipe seems to work reliably. I'll test it for this whole week.
Selcal is offline   Reply With Quote
Old 04-27-2011, 12:07 PM   #11
Selcal
Member
Selcal began at the beginning.
 
Posts: 16
Karma: 10
Join Date: Jul 2010
Device: PRS600 / Cybook Opus

A week with no trouble! I'm happy to release this now. Python is not my first language, so to speak, so any optimizations are welcome. I've tried to output relevant information so the job details show what is happening.

The file is attached (zipped), and is also visible here:
Spoiler:
Code:
from calibre import strftime
from calibre.ebooks.BeautifulSoup import BeautifulSoup
from BeautifulSoup import BeautifulStoneSoup
from calibre.web.feeds.news import BasicNewsRecipe

class Volkskrant_full(BasicNewsRecipe):
    # This recipe will download the Volkskrant newspaper,
    # from the subscribers site. It requires a password.
    # Known issues are: articles that are spread out over
    # multiple pages will appear multiple times. Pages
    # that contain only adverts will appear, but empty.
    # The supplement 'Volkskrant Magazine' on saturday
    # is currently not downloaded.
    # You can set a manual date to download an archived
    # newspaper; Volkskrant keeps over a month available at
    # the time of writing. To do so, set RETRIEVEDATE below
    # to the desired date and add that date to the title. Then
    # follow the instructions marked further below.

    title = 'De Volkskrant' # [za, 13 nov 2010]'
    __author__ = u'Selcal'
    description = u"Volkskrant"
    oldest_article = 30
    max_articles_per_feed = 100
    no_stylesheets = True
    use_embedded_content = False
    simultaneous_downloads = 1
    delay = 1
    needs_subscription = True
    # Set RETRIEVEDATE to 'yyyymmdd' to load an older
    # edition; otherwise keep '%Y%m%d'.
    # When setting a manual date, add that date to the
    # title above and uncomment the timefmt line to stop
    # calibre from adding today's date as well.

    # timefmt = ''
    RETRIEVEDATE = strftime('%Y%m%d')
    INDEX_MAIN = 'http://www.volkskrant.nl/vk-online/VK/' + RETRIEVEDATE + '___/VKN01_001/#text'
    INDEX_ARTICLE = 'http://www.volkskrant.nl/vk-online/VK/' + RETRIEVEDATE + '___/VKN01_001/'
    LOGIN = 'http://www.volkskrant.nl/vk/user/loggedIn.do'
    remove_tags = [dict(name='address')]
    cover_url = 'http://www.volkskrant.nl/vk-online/VK/' + RETRIEVEDATE + '___/VKN01_001/page.jpg'
	
    def get_browser(self):
        br = BasicNewsRecipe.get_browser()

        if self.username is not None and self.password is not None:
           br.open(self.LOGIN)
           br.select_form(nr = 0)
           br['username'] = self.username
           br['password'] = self.password
           br.submit()
        return br
        
    def parse_index(self):
        krant = []
        def strip_title(_title):
            # Return everything up to the first ':' in the title (or the whole title if there is none)
            i = 0
            while ((i < len(_title)) and (_title[i] != ":")):
               i = i + 1
            return(_title[0:i])
        for temp in range(5):
              try:
                soup = self.index_to_soup(self.INDEX_MAIN)
                break
              except:
                print '(Retrying main index load)'
                continue
        mainsoup = soup.find('td', attrs={'id': 'select_page_top'})
        for option in mainsoup.findAll('option'):
           articles = []
           _INDEX = 'http://www.volkskrant.nl/vk-online/VK/' + self.RETRIEVEDATE + '___/' + option['value'] + '/#text'
           _INDEX_ARTICLE = 'http://www.volkskrant.nl/vk-online/VK/' + self.RETRIEVEDATE + '___/' + option['value'] + '/'
           print ''
           print '<-------    Processing section: ' + _INDEX + ' ------------------------->'
           for temp in range(5):
              try:
                soup = self.index_to_soup(_INDEX)
                break
              except:
                print '(Retrying index load)'
                continue
           for item in soup.findAll('area'):
              art_nr = item['class']
              attrname = art_nr[0:12] + '_section' + option['value'][0:5] + '_' + art_nr[26:len(art_nr)]
              print '==> Found: ' + attrname
              index_title = soup.find('div', attrs={'class': attrname})
              get_title = index_title['title']
              _ARTICLE   = _INDEX_ARTICLE + attrname + '.html#text'
              title = get_title
              print '--> Title: ' + title
              print '--> URL: ' + _ARTICLE
              for temp in range(5):
                 try:
                   souparticle = self.index_to_soup(_ARTICLE)
                   break
                 except:
                   print '(Retrying URL load)'
                   continue
              headerurl = souparticle.findAll('frame')[0]['src']
              print '--> Read frame name for header: ' + headerurl
              url = _INDEX_ARTICLE + headerurl[0:len(headerurl)-12] + '_text.html'
              print '--> Corrected URL: ' + url
              if (get_title != ''):
                 title = strip_title(get_title)
                 date  = strftime(' %B %Y')
              if (title != ''):
                 articles.append({
                                         'title'      :title
                                        ,'date'       :date
                                        ,'url'        :url
                                        ,'description':''
                                        })
           krant.append( (option.string, articles))
        return krant
Attached Files
File Type: zip De Volkskrant.zip (1.8 KB, 140 views)
Selcal is offline   Reply With Quote
Old 05-05-2011, 02:09 AM   #12
ruudh
Member
ruudh began at the beginning.
 
Posts: 1
Karma: 10
Join Date: Mar 2011
Location: Amsterdam
Device: Kindle3
Thanks!

Just wanted to thank Selcal for a great piece of work!
Finally I'm able to read my favorite newspaper on my Kindle3.
Any chance you could do a subscription version of NRC Handelsblad as well?
ruudh is offline   Reply With Quote
Old 05-09-2011, 04:10 AM   #13
Selcal
Member
Selcal began at the beginning.
 
Posts: 16
Karma: 10
Join Date: Jul 2010
Device: PRS600 / Cybook Opus
Glad to have been of help.

As for the NRC, I have no idea. I don't have a subscription to it, and I don't know how the subscribers' site is set up, so without being a subscriber it's pretty much impossible, I'm afraid...
Selcal is offline   Reply With Quote