#1
Junior Member
Posts: 6
Karma: 10
Join Date: Jul 2011
Device: Kindle
Washington Post Comics Recipe
I've just started using Calibre and found that it is pretty freakin' awesome!
I'm trying to get the Washington Post Comics recipe to work. If the comic is from Uclick (e.g. http://www.uclick.com/client/wpc/dt/), the comic strip downloads properly. However, if it is a King Syndicate link (e.g. http://www.washingtonpost.com/wp-srv...html?name=Zits), the comic strip fails to download.

I've looked at the recipe code. Mind you, my knowledge of Python was zero before today, so I'm struggling a little. From what I can tell, the recipe uses Beautiful Soup to find the select tag for the date of the comic strip in the HTML of the comics page on the Washington Post website, and it looks at the name of the select tag to determine a course of action. For the UClick comics, the name of the select tag is "url"; the recipe handles this fine and the comic strip is downloaded.

It looks as though the Washington Post has changed the format of the links to non-Uclick comics since the recipe was written. Instead of something like "http://www.creators.com/featurepages/11_editorialcartoons_mike-luckovich.html?name=lk", it is now something like "http://www.washingtonpost.com/wp-srv/artsandliving/comics/king_zits.html?name=Zits". This new page has a lot of JavaScript going on, and doing a "view source" reveals nothing of the form elements. However, in Chrome, "Inspect Element" shows that the name of the select tag is "dest", which the recipe should be able to handle. In addition, the values for the options are in the form "July 1, 2011", and the cartoonCandidatesCreatorsCom() method looks like it should be able to handle dates in that format.

However, this reaches the limit of my Python skills: I don't know how to use the debug mode to step through the recipe. So, can anyone create a fix for this, or at least provide some guidance?

Thanks - Rob

BTW, the recipe code is:

Code:
```python
from calibre.web.feeds.news import BasicNewsRecipe
from calibre.ebooks.BeautifulSoup import BeautifulSoup
from datetime import date, timedelta

class WaPoCartoonsRecipe(BasicNewsRecipe):
    __license__ = 'GPL v3'
    __author__ = 'kwetal'
    language = 'en'
    version = 2

    title = u'Washington Post Cartoons'
    publisher = u'Washington Post'
    category = u'News, Cartoons'
    description = u'Cartoons from the Washington Post'

    oldest_article = 1
    max_articles_per_feed = 100
    use_embedded_content = False
    no_stylesheets = True

    feeds = []
    feeds.append((u'Dilbert', u'http://www.uclick.com/client/wpc/dt/'))
    feeds.append((u'Mutts', u'http://www.washingtonpost.com/wp-srv/artsandliving/comics/king_mutts.html?name=Mutts'))
    feeds.append((u'Sally Forth', u'http://www.washingtonpost.com/wp-srv/artsandliving/comics/king_sally_forth.html?name=Sally_Forth'))
    feeds.append((u'Shermans Lagoon', u'http://www.washingtonpost.com/wp-srv/artsandliving/comics/king_shermans_lagoon.html?name=Shermans_Lagoon'))
    feeds.append((u'Zits', u'http://www.washingtonpost.com/wp-srv/artsandliving/comics/king_zits.html?name=Zits'))
    feeds.append((u'Baby Blues', u'http://www.washingtonpost.com/wp-srv/artsandliving/comics/king_baby_blues.html?name=Baby_Blues'))
    feeds.append((u'Barney And Clyde', u'http://www.washingtonpost.com/wp-srv/artsandliving/comics/barney_clyde.html?name=Barney_Clyde'))

    extra_css = '''
        body {font-family: verdana, arial, helvetica, geneva, sans-serif;}
        h1 {font-size: medium; font-weight: bold; margin-bottom: -0.1em; padding: 0em; text-align: left;}
        #name {margin-bottom: 0.2em}
        #copyright {font-size: xx-small; color: #696969; text-align: right; margin-top: 0.2em;}
        '''

    def parse_index(self):
        index = []
        oldestDate = date.today() - timedelta(days = self.oldest_article)
        oldest = oldestDate.strftime('%Y%m%d')
        for feed in self.feeds:
            cartoons = []
            soup = self.index_to_soup(feed[1])

            cartoon = {'title': 'Current', 'date': None, 'url': feed[1], 'description': ''}
            cartoons.append(cartoon)

            select = soup.find('select', attrs = {'name': ['url', 'dest']})
            if select:
                cartoonCandidates = []
                if select['name'] == 'url':
                    cartoonCandidates = self.cartoonCandidatesWaPo(select, oldest)
                else:
                    cartoonCandidates = self.cartoonCandidatesCreatorsCom(select, oldest)

                for cartoon in cartoonCandidates:
                    cartoons.append(cartoon)

            index.append([feed[0], cartoons])

        return index

    def preprocess_html(self, soup):
        freshSoup = self.getFreshSoup(soup)
        div = soup.find('div', attrs = {'id': 'name'})
        if div:
            freshSoup.body.append(div)
            comic = soup.find('div', attrs = {'id': 'comic_full'})
            img = comic.find('img')
            if '&' in img['src']:
                img['src'], sep, bad = img['src'].rpartition('&')
            freshSoup.body.append(comic)
            freshSoup.body.append(soup.find('div', attrs = {'id': 'copyright'}))
        else:
            span = soup.find('span', attrs = {'class': 'title'})
            if span:
                del span['class']
                span['id'] = 'name'
                span.name = 'div'
                freshSoup.body.append(span)
            img = soup.find('img', attrs = {'class': 'pic_big'})
            if img:
                td = img.parent
                if td.has_key('style'):
                    del td['style']
                td.name = 'div'
                td['id'] = 'comic_full'
                freshSoup.body.append(td)
            td = soup.find('td', attrs = {'class': 'copy'})
            if td:
                for a in td.find('a'):
                    a.extract()
                del td['class']
                td['id'] = 'copyright'
                td.name = 'div'
                freshSoup.body.append(td)

        return freshSoup

    def getFreshSoup(self, oldSoup):
        freshSoup = BeautifulSoup('<html><head><title></title></head><body></body></html>')
        if oldSoup.head.title:
            freshSoup.head.title.append(self.tag_to_string(oldSoup.head.title))
        return freshSoup

    def cartoonCandidatesWaPo(self, select, oldest):
        opts = select.findAll('option')
        for i in range(1, len(opts)):
            url = opts[i]['value'].rstrip('/')
            dateparts = url.split('/')[-3:]
            datenum = str(dateparts[0]) + str(dateparts[1]) + str(dateparts[2])
            if datenum >= oldest:
                yield {'title': self.tag_to_string(opts[i]), 'date': None, 'url': url, 'description': ''}
            else:
                return

    def cartoonCandidatesCreatorsCom(self, select, oldest):
        monthNames = {'January': '01', 'February': '02', 'March': '03', 'April': '04',
                      'May': '05', 'June': '06', 'July': '07', 'August': '08',
                      'September': '09', 'October': '10', 'November': '11', 'December': '12'}
        opts = select.findAll('option')
        for i in range(1, len(opts)):
            if opts[i].has_key('selected'):
                continue
            dateString = self.tag_to_string(opts[i])
            rest, sep, year = dateString.rpartition(', ')
            parts = rest.split(' ')
            day = parts[2].rjust(2, '0')
            month = monthNames[parts[1]]
            datenum = str(year) + month + str(day)
            if datenum >= oldest:
                yield {'title': dateString, 'date': None, 'url': opts[i]['value'], 'description': ''}
            else:
                return
```
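Since the option labels mentioned above are plain "Month D, YYYY" strings, the hand-rolled monthNames table in cartoonCandidatesCreatorsCom() could also be done with strptime from the standard library. A minimal sketch (the function name is mine, not part of the recipe, and it assumes an English locale for month names):

```python
from datetime import datetime

def option_to_datenum(option_text):
    # Turn an option label like "July 1, 2011" into the YYYYMMDD
    # string that parse_index compares against `oldest`.
    # %B matches the full English month name, so no lookup table
    # is needed.
    return datetime.strptime(option_text.strip(), '%B %d, %Y').strftime('%Y%m%d')
```

For example, `option_to_datenum('July 1, 2011')` yields `'20110701'`, which compares correctly as a string against the recipe's `oldest` value.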
#2
Wizard
Posts: 4,004
Karma: 177841
Join Date: Dec 2009
Device: WinMo: IPAQ; Android: HTC HD2, Archos 7o; Java:Gravity T
Quote:
Stick some print statements like this into the recipe:

Code:
```python
print 'the variable I want to track is: ', variable
```

It will show in the job log as the recipe runs, or you can use:

Code:
```
calibre-debug -g
```

to start calibre and see the results there. Even better is to use the methods described here: http://calibre-ebook.com/user_manual/news.html#news
#3 |
Junior Member
Posts: 6
Karma: 10
Join Date: Jul 2011
Device: Kindle
Cool. Thanks for the tips. Looking at the debug output, it seems the error is in preprocess_html:
Code:
```
File "/tmp/calibre_0.7.44_tmp_KPT4D0/calibre_0.7.44_yeObDG_recipes/recipe0.py", line 75, in preprocess_html
    if '&' in img['src']:
TypeError: 'NoneType' object is not subscriptable
```

My hunch is that it has something to do with the comic strip image being buried in a million levels of divs. I'll hack some more at it...

Rob
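For what it's worth, the traceback points at `comic.find('img')` having returned None: when the image tag is missing from the raw HTML, subscripting the None result with `['src']` raises exactly this TypeError. A None-safe version of that cleanup step might look like the sketch below (`clean_img_src` is my name, not the recipe's; a plain dict stands in for a BeautifulSoup tag here, since both are subscripted the same way):

```python
def clean_img_src(img):
    # `img` is whatever comic.find('img') returned: a tag-like
    # mapping with a 'src' entry, or None when no <img> exists
    # in the raw HTML (e.g. it is written by JavaScript).
    if img is None:
        return None
    src = img['src']
    # Same cleanup the recipe does: drop a trailing &-parameter.
    if '&' in src:
        src, _sep, _rest = src.rpartition('&')
    return src
```

The caller can then skip the image (or fall back to something else) when the function returns None instead of crashing the whole feed.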
#4 |
Junior Member
Posts: 6
Karma: 10
Join Date: Jul 2011
Device: Kindle
Looking at this further, I don't know if it is possible to grab the King Features comics via WashingtonPost.com. The URL for each of the King comics is populated via JavaScript when the page is loaded in the browser. If you do a "View Source" you'll only see JavaScript where the img tag ought to be. Likewise with a "wget".
The problem is that Beautiful Soup can't find an IMG tag, and so the recipe fails. As above, I inspected the IMG URL in Chrome and did see the URL. So I tried just using that URL to get the comic. No dice: the web page at King Features requires that there be an allowed referring page. So at this point I'm going to cut my losses and look for another way outside of washingtonpost.com.
#5 |
Wizard
Posts: 4,004
Karma: 177841
Join Date: Dec 2009
Device: WinMo: IPAQ; Android: HTC HD2, Archos 7o; Java:Gravity T
You can control the referring page/header. I took a quick look, and it looked like the URL for each comic could either be calculated from the date, or extracted from the home page for the comic. I didn't go far enough to trigger the requirement for the referer header, but if you can use the same referring page for each comic, it should be possible to get past this issue.
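If the server's check really is just the referring page, the header can be supplied explicitly. A sketch of the idea, with the header assembly kept as plain data (the header names and the exact check the server applies are assumptions on my part, not something I've verified against the live site):

```python
def comic_request_headers(comic_page_url):
    # Headers for fetching a King Features strip image. The image
    # host reportedly rejects requests without an acceptable
    # referring page, so send the Washington Post comic page the
    # image URL was scraped from as the Referer.
    return [
        ('Referer', comic_page_url),
        ('User-agent', 'Mozilla/5.0'),
    ]
```

In a recipe this would go into `get_browser()`, roughly as `br.addheaders = comic_request_headers(url)` on the mechanize browser, so that every subsequent fetch carries the Referer.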
#6 |
Junior Member
Posts: 6
Karma: 10
Join Date: Jul 2011
Device: Kindle
Hmm.. OK. I'll have another stab at it...
#7 |
Junior Member
Posts: 6
Karma: 10
Join Date: Jul 2011
Device: Kindle
I'm still struggling a bit here. I still can't tease the actual URL of the comic image out of the HTML returned.
The URL I've been working with is for the Zits comic of the current day: http://www.washingtonpost.com/wp-srv...html?name=Zits The source HTML of this page includes the following where the image will go: Code:
```html
<div id="comic_full">
  <script>document.writeln(img)</script>
</div>
```

In Chrome's "Inspect Element", the same div also shows the image tag the script writes:

Code:
```html
<div id="comic_full">
  <script>document.writeln(img)</script>
  <img src="http://est.rbma.com/content/Zits">
</div>
```

To see what the recipe's browser actually fetches, I added this to the recipe and dumped the raw page:

Code:
```python
def get_browser(self):
    print "In get_browser"
    br = BasicNewsRecipe.get_browser()
    br.set_handle_refresh(False)
    url = ('http://www.washingtonpost.com/wp-srv/artsandliving/comics/king_zits.html?name=Zits')
    raw = br.open(url).read()
    print raw
    return br
```

Any help is much appreciated.
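Since the strip's img tag only exists after the page's JavaScript runs, one possible workaround is to pull the URL straight out of the raw source with a regular expression, assuming the script text contains an `<img ... src="...">` literal somewhere in the page (I haven't confirmed the exact JavaScript on the live page, so treat this as a sketch with a made-up function name):

```python
import re

def find_img_src(raw_html):
    # Look for the first src="..." inside an <img ...> literal
    # anywhere in the page source, including inside <script>
    # strings that BeautifulSoup would treat as opaque text.
    m = re.search(r'<img[^>]*src=["\']([^"\']+)', raw_html)
    return m.group(1) if m else None
```

This avoids needing a JavaScript engine at all: the recipe would fetch the raw page, run the regex over it, and download the extracted URL directly (with the Referer header set, per the earlier posts).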
#8
Wizard
Posts: 4,004
Karma: 177841
Join Date: Dec 2009
Device: WinMo: IPAQ; Android: HTC HD2, Archos 7o; Java:Gravity T
Quote:
Code:
```html
<img src="http://est.rbma.com/content/Zits">
```

That's never going to change. Why not use that? If it's because of the authorization failure, then you need to track down how to pass that test.

Edit: BTW, IIRC, Zits is already available from one of my other comics recipes, isn't it?

Last edited by Starson17; 07-14-2011 at 09:47 AM.
#9 |
Junior Member
Posts: 6
Karma: 10
Join Date: Jul 2011
Device: Kindle
No worries. I thought I might have been missing something blindingly obvious.
So yes, it is the authorization failure that is gumming up the works. I'll poke at it as time permits to see if I can solve it.

The only other comics recipe I could find was the GoComics recipe, which I have been using from the start. It has a lot of the comics I want to read, but not all. Since I like to read the Washington Post, the comics I like are all on the Washington Post web site, but the ones that are King Features have this referring page crap you have to wade through.

Cheers - Rob

EDIT: Haha, oops... just saw the recipe for Arcamax in the similar threads below. I had tried comics.com with limited success, although I can't remember what that was...

Last edited by joseelsegundo; 07-16-2011 at 03:06 PM. Reason: Problem between chair and keyboard
#10
Wizard
Posts: 4,004
Karma: 177841
Join Date: Dec 2009
Device: WinMo: IPAQ; Android: HTC HD2, Archos 7o; Java:Gravity T
Quote:
Similar Threads
Thread | Thread Starter | Forum | Replies | Last Post |
Washington Post recipe broken | ice-9 | Recipes | 5 | 03-20-2012 09:27 PM |
Comics.com Recipe | BRGriff | Recipes | 0 | 05-24-2011 10:41 AM |
New Recipe:Arcamax - Comics | Starson17 | Recipes | 17 | 05-16-2011 10:56 AM |
Washington Post Recipe problem | warshauer | Recipes | 9 | 11-21-2010 10:30 AM |
Recipe for Washington Post blog | oski24601 | Calibre | 1 | 11-25-2009 05:22 PM |