04-05-2011, 05:13 AM | #1 |
Member
Posts: 16
Karma: 10
Join Date: Jul 2010
Device: PRS600 / Cybook Opus
|
[Updated] Recipe release for De Volkskrant
Hi all,
For some time I've been working on a recipe to download the Dutch newspaper de Volkskrant, for subscribers (password needed). I've managed to get it working now (of course, right in the middle of my work they overhauled the site :S), but I still stumble across some problems now and then. The main problem is connection errors. These occur every now and then from home, but frequently when accessing from abroad on a shabby connection. It's a proxy error:
Code:
Python function terminated unexpectedly
HTTP Error 502: Proxy Error (Error Code: 1)
Traceback (most recent call last):
  File "site.py", line 103, in main
  File "site.py", line 85, in run_entry_point
  File "site-packages\calibre\utils\ipc\worker.py", line 110, in main
  File "site-packages\calibre\gui2\convert\gui_conversion.py", line 25, in gui_convert
  File "site-packages\calibre\ebooks\conversion\plumber.py", line 904, in run
  File "site-packages\calibre\customize\conversion.py", line 204, in __call__
  File "site-packages\calibre\web\feeds\input.py", line 105, in convert
  File "site-packages\calibre\web\feeds\news.py", line 734, in download
  File "site-packages\calibre\web\feeds\news.py", line 871, in build_index
  File "c:\users\mediac~1\appdata\local\temp\calibre_0.7.45_tmp_2sa8fy\calibre_0.7.45_pxqlpu_recipes\recipe0.py", line 55, in parse_index
    soup = self.index_to_soup(_INDEX)
  File "site-packages\calibre\web\feeds\news.py", line 495, in index_to_soup
  File "site-packages\mechanize-0.2.4-py2.7.egg\mechanize\_mechanize.py", line 199, in open_novisit
  File "site-packages\mechanize-0.2.4-py2.7.egg\mechanize\_mechanize.py", line 255, in _mech_open
mechanize._response.httperror_seek_wrapper: HTTP Error 502: Proxy Error

Is there a way to change a Calibre setting, or something in the recipe, to make it retry during the recipe download? It seems a waste to simply give up after one error.
Next, I can't find a way to set the recipe tags from the recipe. Right now it defaults to "De Volkskrant, News" and I actually don't want both there (on my Sony reader it will then show up in two categories, which I don't want). Can it be set from the recipe?
And finally, can I change the title of the output file dynamically from the recipe? I.e., if I download a different date (yesterday's newspaper, for example), I would like the title to read "De Volkskrant" followed by the date of the newspaper. Today's date is of no value.
These problems are in order of priority.
Thanks for any help!
Last edited by Selcal; 04-27-2011 at 02:18 PM. |
04-06-2011, 03:46 PM | #2 |
Member
Posts: 16
Karma: 10
Join Date: Jul 2010
Device: PRS600 / Cybook Opus
|
OK, I understand I can use a try statement or similar, but my knowledge of Python is too limited to work out how to do it. Here's the code that produces the error:
Code:
def parse_index(self):
    krant = []
    def strip_title(_title):
        i = 0
        while ((_title[i] <> ":") and (i <= len(_title))):
            i = i + 1
        return(_title[0:i])
    soup = self.index_to_soup(self.INDEX_MAIN)
    mainsoup = soup.find('td', attrs={'id': 'select_page_top'})
    for option in mainsoup.findAll('option'):
        articles = []
        _INDEX = 'http://www.volkskrant.nl/vk-online/VK/' + self.RETRIEVEDATE + '___/' + option['value'] + '/#text'
        _INDEX_ARTICLE = 'http://www.volkskrant.nl/vk-online/VK/' + self.RETRIEVEDATE + '___/' + option['value'] + '/'
        print ''
        print '<------- Processing section: ' + _INDEX + ' ------------------------->'
        soup = self.index_to_soup(_INDEX)
        for item in soup.findAll('area'):
            art_nr = item['class']
            attrname = art_nr[0:12] + '_section' + option['value'][0:5] + '_' + art_nr[26:len(art_nr)]
            print '==> Found: ' + attrname;
            index_title = soup.find('div', attrs={'class': attrname})
            get_title = index_title['title'];
            _ARTICLE = _INDEX_ARTICLE + attrname + '.html#text'
            title = get_title;
            print '--> Title: ' + title;
            print '--> URL: ' + _ARTICLE;
            souparticle = self.index_to_soup(_ARTICLE);
            headerurl = souparticle.findAll('frame')[0]['src'];
            print '--> Read frame name for header: ' + headerurl;
            url = _INDEX_ARTICLE + headerurl[0:len(headerurl)-12] + '_text.html';
            print '--> Corrected URL: ' + url;
            if (get_title <> ''):
                title = strip_title(get_title)
                date = strftime(' %B %Y')
            if (title <> ''):
                articles.append({
                    'title'      :title
                   ,'date'       :date
                   ,'url'        :url
                   ,'description':''
                })
        krant.append( (option.string, articles))
    return krant
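A side note on strip_title in the code above: it reads _title[i] before checking the bound, so a title without a colon would walk off the end of the string and raise IndexError. A possible replacement, offered as a sketch only, uses str.partition, which returns the whole string when the separator is absent. The function name is kept from the post; the sample titles are invented.

```python
def strip_title(title):
    """Return the part of title before the first colon.

    If there is no colon, the whole title is returned unchanged.
    """
    head, _sep, _tail = title.partition(':')
    return head


if __name__ == '__main__':
    # Invented sample titles, one with and one without a colon.
    print(strip_title('Sport: uitslagen'))
    print(strip_title('Geen dubbele punt'))
```

The same call works on Python 2.7 (the interpreter in the traceback) and on Python 3.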
04-08-2011, 02:34 PM | #3 | |
Wizard
Posts: 4,004
Karma: 177841
Join Date: Dec 2009
Device: WinMo: IPAQ; Android: HTC HD2, Archos 7o; Java:Gravity T
|
Quote:
Code:
try:
except:
|
04-09-2011, 06:44 AM | #4 |
Member
Posts: 16
Karma: 10
Join Date: Jul 2010
Device: PRS600 / Cybook Opus
|
Thanks, that sounds like a good possibility. Can you give a quick example of how you use those statements? I can't find anything online that I can easily relate to what I already understand. Maybe you can give me a quick snippet of the comic code where you use this?
Thanks for your help! |
04-11-2011, 10:49 AM | #5 | |
Wizard
Posts: 4,004
Karma: 177841
Join Date: Dec 2009
Device: WinMo: IPAQ; Android: HTC HD2, Archos 7o; Java:Gravity T
|
Quote:
Spoiler:
I think you'll want to look carefully at your exact error. In my case, I had trouble understanding what was failing: I would get an error that an element on the page wasn't found, the recipe would bomb, then I'd print the soup and find that element after all. It seemed to be cured with the code above.
In your case, you may need to do the page fetch multiple times. The code above, particularly the "for i in range(2):" parts, seems to have fetched only once. I vaguely recall puzzling over why I couldn't find content that seemed to be there, so I added some retries of the href find. It should be possible to add multiple fetches if that's what you need. |
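For the multiple-fetch case, something along these lines might do. It is a sketch, not the code from the spoiler: fetch_with_retries, its parameters and the default delay are all made up for illustration, and it catches every exception because the exact error class to expect depends on what the connection throws.

```python
import time


def fetch_with_retries(fetch, url, attempts=3, delay=5.0):
    """Call fetch(url) up to `attempts` times and return its result.

    fetch    -- any callable taking a URL (hypothetical parameter; in a
                recipe this could be self.index_to_soup)
    attempts -- total number of tries, must be at least 1
    delay    -- seconds to wait between tries
    The last failure is re-raised so the download still stops with the
    original traceback when the connection never recovers.
    """
    if attempts < 1:
        raise ValueError('attempts must be at least 1')
    for attempt in range(1, attempts + 1):
        try:
            return fetch(url)
        except Exception as err:  # broad on purpose, see the note above
            print('Attempt %d of %d failed for %s: %s'
                  % (attempt, attempts, url, err))
            if attempt == attempts:
                raise
            time.sleep(delay)
```

Inside parse_index the failing line would then become something like soup = fetch_with_retries(self.index_to_soup, _INDEX). The print calls end up in the job details, so you can see how often a retry was needed.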
|
04-12-2011, 07:41 AM | #6 |
Member
Posts: 16
Karma: 10
Join Date: Jul 2010
Device: PRS600 / Cybook Opus
|
I think I can adapt that for the fetch -- it's definitely the fetch that fails in my case. I'm on a dodgy connection now so it's the perfect time to try.
|
04-13-2011, 08:46 AM | #7 |
Member
Posts: 16
Karma: 10
Join Date: Jul 2010
Device: PRS600 / Cybook Opus
|
Seems to work so far! I'll test it for a while; if it stays stable I'll put the recipe up!
Thanks for your help here, Starson17! Any chance of an answer on setting the tags/title from the recipe? |
04-13-2011, 10:00 AM | #8 |
Wizard
Posts: 4,004
Karma: 177841
Join Date: Dec 2009
Device: WinMo: IPAQ; Android: HTC HD2, Archos 7o; Java:Gravity T
|
I'm a bit embarrassed that I'm not sure about that, and hoped Kovid would answer it for both of us. I have seen it discussed, and to the best of my recollection (but don't rely on this):
The tags can't be set from inside the recipe. You can turn off the recipe name tag and add other tags from the GUI. As for the title: I know you can kill the date part of the title, and I'm pretty sure you can change it to any date you want. There are recipes that remove the date so that new versions replace old versions on the reader. I just didn't have the time to search the code to tell you how. Of course, the recipes are so powerful that if you want to get really deep, you can usually do anything you want, so even the tags may be possible. Sorry I can't help more. I'm a bit jammed up with work right now. |
04-13-2011, 10:49 AM | #9 |
creator of calibre
Posts: 43,858
Karma: 22666666
Join Date: Oct 2006
Location: Mumbai, India
Device: Various
|
You cannot affect tags from within the recipe.
Set timefmt = '' to remove the date from the title. |
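A minimal sketch of the timefmt setting, not taken from the released recipe. VolkskrantSketch and issue_timefmt are made-up names and the bracketed date format is arbitrary. The second half is an assumption to verify in your calibre version: that assigning a literal date string to self.timefmt from parse_index gets the issue date, rather than today's, into the title.

```python
from datetime import date

try:
    from calibre.web.feeds.news import BasicNewsRecipe
except ImportError:
    # Stand-in so the sketch can be read and run outside calibre.
    BasicNewsRecipe = object


def issue_timefmt(issue_date):
    """Format the issue date as a literal title suffix (hypothetical helper)."""
    return issue_date.strftime(' [%d-%m-%Y]')


class VolkskrantSketch(BasicNewsRecipe):
    title = 'De Volkskrant'
    # Empty format string: no date is appended to the title.
    timefmt = ''

    # Assumed usage, once the issue date is known in parse_index:
    #     self.timefmt = issue_timefmt(date(2011, 4, 4))


if __name__ == '__main__':
    # Shows the title this would give for the issue of 4 April 2011.
    print(VolkskrantSketch.title + issue_timefmt(date(2011, 4, 4)))
```

The formatted suffix contains no percent signs, so it should come through unchanged if it is passed through strftime again.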
04-18-2011, 05:17 AM | #10 |
Member
Posts: 16
Karma: 10
Join Date: Jul 2010
Device: PRS600 / Cybook Opus
|
Thanks! I've set that as an option in case the recipe is used to download a specific date. I'll try to get the tags the way I want in the GUI.
So far the recipe seems to work reliably. I'll test it for this whole week. |
04-27-2011, 12:07 PM | #11 |
Member
Posts: 16
Karma: 10
Join Date: Jul 2010
Device: PRS600 / Cybook Opus
|
A week with no trouble! I'm happy to release this now. Python is not my first language, so to speak, so any optimizations are welcome. I've tried to output relevant information so the job details will show what is happening.
File attached zipped, and visible here: Spoiler:
|
05-05-2011, 02:09 AM | #12 |
Member
Posts: 1
Karma: 10
Join Date: Mar 2011
Location: Amsterdam
Device: Kindle3
|
Thanks!
Just wanted to thank Selcal for a great piece of work!
Finally I'm able to read my favorite newspaper on my Kindle3. Any chance you could do a subscription version of NRC Handelsblad as well? |
05-09-2011, 04:10 AM | #13 |
Member
Posts: 16
Karma: 10
Join Date: Jul 2010
Device: PRS600 / Cybook Opus
|
Glad to have been of help.
As for the NRC, I have no idea. I haven't got a subscription to that, and I don't know how the subscribers' site is set up. So without being a subscriber it's pretty much impossible, I'm afraid... |
|