#1
Junior Member
Posts: 6
Karma: 10
Join Date: Jul 2011
Device: Kindle
Washington Post Comics Recipe
I've just started using Calibre and found that it is pretty freakin' awesome!
I'm trying to get the Washington Post Comics recipe to work. If the comic is from Uclick (e.g. http://www.uclick.com/client/wpc/dt/), the comic strip downloads properly. However, if it is a King Syndicate link (e.g. http://www.washingtonpost.com/wp-srv...html?name=Zits), the comic strip fails to download.

I've looked at the recipe code. Mind you, my knowledge of Python was zero before today, so I'm struggling a little. From what I can tell, the recipe uses Beautiful Soup to find the select tag for the date of the comic strip in the HTML of the comics page on the Washington Post website, and it looks at the name of the select tag to determine a course of action. For the UClick comics, the name of the select tag is "url"; the recipe handles this fine and the comic strip is downloaded.

It looks as though the Washington Post has changed the format of the links to non-Uclick comics since the recipe was written. Instead of something like "http://www.creators.com/featurepages/11_editorialcartoons_mike-luckovich.html?name=lk", it is now something like "http://www.washingtonpost.com/wp-srv/artsandliving/comics/king_zits.html?name=Zits". This new page has a lot of JavaScript going on, and doing a "view source" reveals nothing of the form elements. However, in Chrome, "Inspect Element" shows that the name of the select tag is "dest", which the recipe should be able to handle. In addition, the values for the options are in the form "July 1, 2011", and the cartoonCandidatesCreatorsCom() method looks like it should be able to handle dates in that format.

However, this reaches the limit of my Python skills: I don't know how to use the debug mode to step through the recipe. So, can anyone create a fix for this, or at least provide some guidance?

Thanks - Rob

BTW, the recipe code is:

Code:
```python
from calibre.web.feeds.news import BasicNewsRecipe
from calibre.ebooks.BeautifulSoup import BeautifulSoup
from datetime import date, timedelta

class WaPoCartoonsRecipe(BasicNewsRecipe):
    __license__ = 'GPL v3'
    __author__ = 'kwetal'
    language = 'en'
    version = 2

    title = u'Washington Post Cartoons'
    publisher = u'Washington Post'
    category = u'News, Cartoons'
    description = u'Cartoons from the Washington Post'

    oldest_article = 1
    max_articles_per_feed = 100
    use_embedded_content = False
    no_stylesheets = True

    feeds = []
    feeds.append((u'Dilbert', u'http://www.uclick.com/client/wpc/dt/'))
    feeds.append((u'Mutts', u'http://www.washingtonpost.com/wp-srv/artsandliving/comics/king_mutts.html?name=Mutts'))
    feeds.append((u'Sally Forth', u'http://www.washingtonpost.com/wp-srv/artsandliving/comics/king_sally_forth.html?name=Sally_Forth'))
    feeds.append((u'Shermans Lagoon', u'http://www.washingtonpost.com/wp-srv/artsandliving/comics/king_shermans_lagoon.html?name=Shermans_Lagoon'))
    feeds.append((u'Zits', u'http://www.washingtonpost.com/wp-srv/artsandliving/comics/king_zits.html?name=Zits'))
    feeds.append((u'Baby Blues', u'http://www.washingtonpost.com/wp-srv/artsandliving/comics/king_baby_blues.html?name=Baby_Blues'))
    feeds.append((u'Barney And Clyde', u'http://www.washingtonpost.com/wp-srv/artsandliving/comics/barney_clyde.html?name=Barney_Clyde'))

    extra_css = '''
        body {font-family: verdana, arial, helvetica, geneva, sans-serif;}
        h1 {font-size: medium; font-weight: bold; margin-bottom: -0.1em; padding: 0em; text-align: left;}
        #name {margin-bottom: 0.2em}
        #copyright {font-size: xx-small; color: #696969; text-align: right; margin-top: 0.2em;}
        '''

    def parse_index(self):
        index = []
        oldestDate = date.today() - timedelta(days = self.oldest_article)
        oldest = oldestDate.strftime('%Y%m%d')
        for feed in self.feeds:
            cartoons = []
            soup = self.index_to_soup(feed[1])

            cartoon = {'title': 'Current', 'date': None, 'url': feed[1], 'description': ''}
            cartoons.append(cartoon)

            select = soup.find('select', attrs = {'name': ['url', 'dest']})
            if select:
                cartoonCandidates = []
                if select['name'] == 'url':
                    cartoonCandidates = self.cartoonCandidatesWaPo(select, oldest)
                else:
                    cartoonCandidates = self.cartoonCandidatesCreatorsCom(select, oldest)

                for cartoon in cartoonCandidates:
                    cartoons.append(cartoon)

            index.append([feed[0], cartoons])

        return index

    def preprocess_html(self, soup):
        freshSoup = self.getFreshSoup(soup)
        div = soup.find('div', attrs = {'id': 'name'})
        if div:
            freshSoup.body.append(div)
            comic = soup.find('div', attrs = {'id': 'comic_full'})
            img = comic.find('img')
            if '&' in img['src']:
                img['src'], sep, bad = img['src'].rpartition('&')
            freshSoup.body.append(comic)
            freshSoup.body.append(soup.find('div', attrs = {'id': 'copyright'}))
        else:
            span = soup.find('span', attrs = {'class': 'title'})
            if span:
                del span['class']
                span['id'] = 'name'
                span.name = 'div'
                freshSoup.body.append(span)
            img = soup.find('img', attrs = {'class': 'pic_big'})
            if img:
                td = img.parent
                if td.has_key('style'):
                    del td['style']
                td.name = 'div'
                td['id'] = 'comic_full'
                freshSoup.body.append(td)
            td = soup.find('td', attrs = {'class': 'copy'})
            if td:
                for a in td.find('a'):
                    a.extract()
                del td['class']
                td['id'] = 'copyright'
                td.name = 'div'
                freshSoup.body.append(td)

        return freshSoup

    def getFreshSoup(self, oldSoup):
        freshSoup = BeautifulSoup('<html><head><title></title></head><body></body></html>')
        if oldSoup.head.title:
            freshSoup.head.title.append(self.tag_to_string(oldSoup.head.title))
        return freshSoup

    def cartoonCandidatesWaPo(self, select, oldest):
        opts = select.findAll('option')
        for i in range(1, len(opts)):
            url = opts[i]['value'].rstrip('/')
            dateparts = url.split('/')[-3:]
            datenum = str(dateparts[0]) + str(dateparts[1]) + str(dateparts[2])
            if datenum >= oldest:
                yield {'title': self.tag_to_string(opts[i]), 'date': None, 'url': url, 'description': ''}
            else:
                return

    def cartoonCandidatesCreatorsCom(self, select, oldest):
        monthNames = {'January': '01', 'February': '02', 'March': '03', 'April': '04',
                      'May': '05', 'June': '06', 'July': '07', 'August': '08',
                      'September': '09', 'October': '10', 'November': '11', 'December': '12'}
        opts = select.findAll('option')
        for i in range(1, len(opts)):
            if opts[i].has_key('selected'):
                continue
            dateString = self.tag_to_string(opts[i])
            rest, sep, year = dateString.rpartition(', ')
            parts = rest.split(' ')
            day = parts[2].rjust(2, '0')
            month = monthNames[parts[1]]
            datenum = str(year) + month + str(day)
            if datenum >= oldest:
                yield {'title': dateString, 'date': None, 'url': opts[i]['value'], 'description': ''}
            else:
                return
```
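Since the option labels mentioned above are plain "Month D, YYYY" strings, the hand-rolled monthNames table in cartoonCandidatesCreatorsCom() could also be done with strptime from the standard library. A minimal sketch (the function name is mine, not part of the recipe, and it assumes an English locale for month names):

```python
from datetime import datetime

def option_to_datenum(option_text):
    # Turn an option label like "July 1, 2011" into the YYYYMMDD
    # string that parse_index compares against `oldest`.
    # %B matches the full English month name, so no lookup table
    # is needed.
    return datetime.strptime(option_text.strip(), '%B %d, %Y').strftime('%Y%m%d')
```

For example, `option_to_datenum('July 1, 2011')` yields `'20110701'`, which compares correctly as a string against the recipe's `oldest` value.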
#2
Wizard
Posts: 4,004
Karma: 177841
Join Date: Dec 2009
Device: WinMo: IPAQ; Android: HTC HD2, Archos 7o; Java:Gravity T
Quote:
Stick some print statements like this into the recipe:

Code:
```python
print 'the variable I want to track is: ', variable
```

It will show in the job log as the recipe runs, or you can use:

Code:
```
calibre-debug -g
```

to start calibre and see the results there. Even better is to use the methods described here: http://calibre-ebook.com/user_manual/news.html#news
#3 |
Junior Member
Posts: 6
Karma: 10
Join Date: Jul 2011
Device: Kindle
Cool. Thanks for the tips. Looking at the debug output, it seems the error is in preprocess_html:
Code:
```
File "/tmp/calibre_0.7.44_tmp_KPT4D0/calibre_0.7.44_yeObDG_recipes/recipe0.py", line 75, in preprocess_html
    if '&' in img['src']:
TypeError: 'NoneType' object is not subscriptable
```

My hunch is that it has something to do with the comic strip image being buried in a million levels of divs. I'll hack some more at it...

Rob
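For what it's worth, the traceback points at `comic.find('img')` having returned None: when the image tag is missing from the raw HTML, subscripting the None result with `['src']` raises exactly this TypeError. A None-safe version of that cleanup step might look like the sketch below (`clean_img_src` is my name, not the recipe's; a plain dict stands in for a BeautifulSoup tag here, since both are subscripted the same way):

```python
def clean_img_src(img):
    # `img` is whatever comic.find('img') returned: a tag-like
    # mapping with a 'src' entry, or None when no <img> exists
    # in the raw HTML (e.g. it is written by JavaScript).
    if img is None:
        return None
    src = img['src']
    # Same cleanup the recipe does: drop a trailing &-parameter.
    if '&' in src:
        src, _sep, _rest = src.rpartition('&')
    return src
```

The caller can then skip the image (or fall back to something else) when the function returns None instead of crashing the whole feed.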
#4 |
Junior Member
Posts: 6
Karma: 10
Join Date: Jul 2011
Device: Kindle
Looking at this further, I don't know if it is possible to grab the King Features comics via WashingtonPost.com. The URL for each of the King comics is populated via JavaScript when the page is loaded in the browser. If you do a "View Source" you'll only see JavaScript where the img tag ought to be. Likewise with a "wget".
The problem is that Beautiful Soup can't find an IMG tag, and so the recipe fails. As above, I inspected the IMG URL in Chrome and did see the URL. So I tried just using that URL to get the comic. No dice: the web page at King Features requires that there be an allowed referring page. So at this point I'm going to cut my losses and look for another way outside of washingtonpost.com.
#5 |
Wizard
Posts: 4,004
Karma: 177841
Join Date: Dec 2009
Device: WinMo: IPAQ; Android: HTC HD2, Archos 7o; Java:Gravity T
You can control the referring page/header. I took a quick look, and it looked like the URL for each comic could either be calculated from the date, or extracted from the home page for the comic. I didn't go far enough to trigger the requirement for the referer header, but if you can use the same referring page for each comic, it should be possible to get past this issue.
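If the server's check really is just the referring page, the header can be supplied explicitly. A sketch of the idea, with the header assembly kept as plain data (the header names and the exact check the server applies are assumptions on my part, not something I've verified against the live site):

```python
def comic_request_headers(comic_page_url):
    # Headers for fetching a King Features strip image. The image
    # host reportedly rejects requests without an acceptable
    # referring page, so send the Washington Post comic page the
    # image URL was scraped from as the Referer.
    return [
        ('Referer', comic_page_url),
        ('User-agent', 'Mozilla/5.0'),
    ]
```

In a recipe this would go into `get_browser()`, roughly as `br.addheaders = comic_request_headers(url)` on the mechanize browser, so that every subsequent fetch carries the Referer.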
#6 |
Junior Member
Posts: 6
Karma: 10
Join Date: Jul 2011
Device: Kindle
Hmm.. OK. I'll have another stab at it...
#7 |
Junior Member
Posts: 6
Karma: 10
Join Date: Jul 2011
Device: Kindle
I'm still struggling a bit here. I still can't tease the actual URL of the comic image out of the HTML returned.
The URL I've been working with is for the Zits comic of the current day: http://www.washingtonpost.com/wp-srv...html?name=Zits The source HTML of this page includes the following where the image will go: Code:
```html
<div id="comic_full">
  <script>document.writeln(img)</script>
</div>
```

In Chrome's "Inspect Element", the same div also shows the image tag the script writes:

Code:
```html
<div id="comic_full">
  <script>document.writeln(img)</script>
  <img src="http://est.rbma.com/content/Zits">
</div>
```

To see what the recipe's browser actually fetches, I added this to the recipe and dumped the raw page:

Code:
```python
def get_browser(self):
    print "In get_browser"
    br = BasicNewsRecipe.get_browser()
    br.set_handle_refresh(False)
    url = ('http://www.washingtonpost.com/wp-srv/artsandliving/comics/king_zits.html?name=Zits')
    raw = br.open(url).read()
    print raw
    return br
```

Any help is much appreciated.
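Since the strip's img tag only exists after the page's JavaScript runs, one possible workaround is to pull the URL straight out of the raw source with a regular expression, assuming the script text contains an `<img ... src="...">` literal somewhere in the page (I haven't confirmed the exact JavaScript on the live page, so treat this as a sketch with a made-up function name):

```python
import re

def find_img_src(raw_html):
    # Look for the first src="..." inside an <img ...> literal
    # anywhere in the page source, including inside <script>
    # strings that BeautifulSoup would treat as opaque text.
    m = re.search(r'<img[^>]*src=["\']([^"\']+)', raw_html)
    return m.group(1) if m else None
```

This avoids needing a JavaScript engine at all: the recipe would fetch the raw page, run the regex over it, and download the extracted URL directly (with the Referer header set, per the earlier posts).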
#8
Wizard
Posts: 4,004
Karma: 177841
Join Date: Dec 2009
Device: WinMo: IPAQ; Android: HTC HD2, Archos 7o; Java:Gravity T
Quote:
Code:
```html
<img src="http://est.rbma.com/content/Zits">
```

That's never going to change. Why not use that? If it's because of the authorization failure, then you need to track down how to pass that test.

Edit: BTW, IIRC, Zits is already available from one of my other comics recipes, isn't it?

Last edited by Starson17; 07-14-2011 at 09:47 AM.
#9 |
Junior Member
Posts: 6
Karma: 10
Join Date: Jul 2011
Device: Kindle
No worries. I thought I might have been missing something blindingly obvious.
So yes, it is the authorization failure that is gumming up the works. I'll poke at it as time permits to see if I can solve it.

The only other comics recipe I could find was the GoComics recipe, which I have been using from the start. It has a lot of the comics I want to read, but not all. Since I like to read the Washington Post, the comics I like are all on the Washington Post web site, but the ones that are King Features have this referring page crap you have to wade through.

Cheers - Rob

EDIT: Haha, oops... just saw the recipe for Arcamax in the similar threads below. I had tried comics.com with limited success, although I can't remember what that was...

Last edited by joseelsegundo; 07-16-2011 at 03:06 PM. Reason: Problem between chair and keyboard
#10
Wizard
Posts: 4,004
Karma: 177841
Join Date: Dec 2009
Device: WinMo: IPAQ; Android: HTC HD2, Archos 7o; Java:Gravity T
Quote:
Similar Threads
Thread | Thread Starter | Forum | Replies | Last Post |
Washington Post recipe broken | ice-9 | Recipes | 5 | 03-20-2012 09:27 PM |
Comics.com Recipe | BRGriff | Recipes | 0 | 05-24-2011 10:41 AM |
New Recipe:Arcamax - Comics | Starson17 | Recipes | 17 | 05-16-2011 10:56 AM |
Washington Post Recipe problem | warshauer | Recipes | 9 | 11-21-2010 10:30 AM |
Recipe for Washington Post blog | oski24601 | Calibre | 1 | 11-25-2009 05:22 PM |