Old 07-01-2011, 04:34 PM   #1
joseelsegundo
Junior Member
joseelsegundo began at the beginning.
 
Posts: 6
Karma: 10
Join Date: Jul 2011
Device: Kindle
Washington Post Comics Recipe

I've just started using Calibre and found that it is pretty freakin' awesome!

I'm trying to get the Washington Post Comics recipe to work. If the comic is from Uclick (e.g. http://www.uclick.com/client/wpc/dt/) the comic strip will download properly. However if it is a King Syndicate link (e.g. http://www.washingtonpost.com/wp-srv...html?name=Zits) the comic strip fails to download.

I've looked at the recipe code. Mind you, my knowledge of Python was zero before today, so I'm struggling a little. From what I can tell, the recipe uses Beautiful Soup to find the select tag for the date of the comic strip in the HTML from the comics page on the Washington Post website. It looks at the name of the select tag to determine a course of action.

For the Uclick comics, the name of the select tag is "url"; the recipe handles this fine and the comic strip is downloaded. It looks as though the Washington Post has changed the format of the links to non-Uclick comics since the recipe was written. Instead of something like "http://www.creators.com/featurepages/11_editorialcartoons_mike-luckovich.html?name=lk" it is now something like "http://www.washingtonpost.com/wp-srv/artsandliving/comics/king_zits.html?name=Zits". This new page has a lot of JavaScript going on, and doing a "view source" reveals none of the form elements. However, in Chrome, "Inspect Element" shows that the name of the select tag is "dest", which the recipe should be able to handle. In addition, the values for the options are in the form "July 1, 2011". The cartoonCandidatesCreatorsCom() method looks like it should be able to handle the date in that format. However, this is where I reach the limit of my Python skills: I don't know how to use the debug mode to step through the recipe.

So, can anyone create a fix for this, or at least provide some guidance?

Thanks -
Rob

BTW, the recipe code is:
Code:
from calibre.web.feeds.news import BasicNewsRecipe
from calibre.ebooks.BeautifulSoup import BeautifulSoup
from datetime import date, timedelta

class WaPoCartoonsRecipe(BasicNewsRecipe):
    __license__   = 'GPL v3'
    __author__ = 'kwetal'
    language = 'en'
    version = 2

    title = u'Washington Post Cartoons'
    publisher = u'Washington Post'
    category = u'News, Cartoons'
    description = u'Cartoons from the Washington Post'

    oldest_article = 1
    max_articles_per_feed = 100
    use_embedded_content = False
    no_stylesheets = True

    feeds = []
    feeds.append((u'Dilbert', u'http://www.uclick.com/client/wpc/dt/'))
    feeds.append((u'Mutts', u'http://www.washingtonpost.com/wp-srv/artsandliving/comics/king_mutts.html?name=Mutts'))
    feeds.append((u'Sally Forth', u'http://www.washingtonpost.com/wp-srv/artsandliving/comics/king_sally_forth.html?name=Sally_Forth'))
    feeds.append((u'Shermans Lagoon', u'http://www.washingtonpost.com/wp-srv/artsandliving/comics/king_shermans_lagoon.html?name=Shermans_Lagoon'))
    feeds.append((u'Zits', u'http://www.washingtonpost.com/wp-srv/artsandliving/comics/king_zits.html?name=Zits'))
    feeds.append((u'Baby Blues', u'http://www.washingtonpost.com/wp-srv/artsandliving/comics/king_baby_blues.html?name=Baby_Blues'))
    feeds.append((u'Barney And Clyde', u'http://www.washingtonpost.com/wp-srv/artsandliving/comics/barney_clyde.html?name=Barney_Clyde'))

    extra_css = '''
                body {font-family: verdana, arial, helvetica, geneva, sans-serif;}
                h1 {font-size: medium; font-weight: bold; margin-bottom: -0.1em; padding: 0em; text-align: left;}
                #name {margin-bottom: 0.2em}
                #copyright {font-size: xx-small; color: #696969; text-align: right; margin-top: 0.2em;}
                '''

    def parse_index(self):
        index = []
        oldestDate = date.today() - timedelta(days = self.oldest_article)
        oldest = oldestDate.strftime('%Y%m%d')
        for feed in self.feeds:
            cartoons = []
            soup = self.index_to_soup(feed[1])

            cartoon = {'title': 'Current', 'date': None, 'url': feed[1], 'description' : ''}
            cartoons.append(cartoon)

            select = soup.find('select', attrs = {'name': ['url', 'dest']})
            if select:
                cartoonCandidates = []
                if select['name'] == 'url':
                    cartoonCandidates = self.cartoonCandidatesWaPo(select, oldest)
                else:
                    cartoonCandidates = self.cartoonCandidatesCreatorsCom(select, oldest)

                for cartoon in cartoonCandidates:
                    cartoons.append(cartoon)

            index.append([feed[0], cartoons])

        return index

    def preprocess_html(self, soup):
        freshSoup = self.getFreshSoup(soup)

        div = soup.find('div', attrs = {'id': 'name'})
        if div:
            freshSoup.body.append(div)
            comic = soup.find('div', attrs = {'id': 'comic_full'})

            img = comic.find('img')
            if '&' in img['src']:
                img['src'], sep, bad = img['src'].rpartition('&')

            freshSoup.body.append(comic)
            freshSoup.body.append(soup.find('div', attrs = {'id': 'copyright'}))
        else:
            span = soup.find('span', attrs = {'class': 'title'})
            if span:
                del span['class']
                span['id'] = 'name'
                span.name = 'div'
                freshSoup.body.append(span)

            img = soup.find('img', attrs = {'class': 'pic_big'})
            if img:
                td = img.parent
                if td.has_key('style'):
                    del td['style']
                td.name = 'div'
                td['id'] = 'comic_full'
                freshSoup.body.append(td)

            td = soup.find('td', attrs = {'class': 'copy'})
            if td:
                # findAll() returns every link; find() returns only the first tag (or None)
                for a in td.findAll('a'):
                    a.extract()
                del td['class']
                td['id'] = 'copyright'
                td.name = 'div'
                freshSoup.body.append(td)

        return freshSoup

    def getFreshSoup(self, oldSoup):
        freshSoup = BeautifulSoup('<html><head><title></title></head><body></body></html>')
        if oldSoup.head.title:
            freshSoup.head.title.append(self.tag_to_string(oldSoup.head.title))
        return freshSoup

    def cartoonCandidatesWaPo(self, select, oldest):
        opts = select.findAll('option')
        for i in range(1, len(opts)):
            url = opts[i]['value'].rstrip('/')
            dateparts = url.split('/')[-3:]
            datenum = str(dateparts[0]) + str(dateparts[1]) + str(dateparts[2])
            if datenum >= oldest:
                yield {'title': self.tag_to_string(opts[i]), 'date': None, 'url': url, 'description': ''}
            else:
                return

    def cartoonCandidatesCreatorsCom(self, select, oldest):
        monthNames = {'January': '01', 'February': '02', 'March': '03', 'April': '04', 'May': '05',
                      'June': '06', 'July': '07', 'August': '08', 'September': '09', 'October': '10',
                      'November': '11', 'December': '12'}

        opts = select.findAll('option')
        for i in range(1, len(opts)):
            if opts[i].has_key('selected'):
                continue

            dateString = self.tag_to_string(opts[i])
            rest, sep, year = dateString.rpartition(', ')
            parts = rest.split(' ')
            day = parts[2].rjust(2, '0')
            month = monthNames[parts[1]]
            datenum = str(year) + month + str(day)
            if datenum >= oldest:
                yield {'title': dateString, 'date': None, 'url': opts[i]['value'], 'description': ''}
            else:
                return
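
One detail worth checking against the code above: cartoonCandidatesCreatorsCom() reads the month from parts[1] and the day from parts[2], so it appears to expect one extra word (presumably a weekday) in front of the month. A plain "July 1, 2011" splits into only two parts, and parts[2] would raise an IndexError. Below is a minimal sketch of a more tolerant parser, in plain Python 3 and independent of calibre. The helper name option_date_key is hypothetical, and the "weekday first" format is an assumption.

```python
MONTHS = {
    'January': '01', 'February': '02', 'March': '03', 'April': '04',
    'May': '05', 'June': '06', 'July': '07', 'August': '08',
    'September': '09', 'October': '10', 'November': '11', 'December': '12',
}


def option_date_key(text):
    """Return a YYYYMMDD string for option text such as 'July 1, 2011'.

    A leading extra word is tolerated, e.g. 'Friday July 1, 2011'
    (assumed format). Returns None when no usable date is found.
    """
    rest, sep, year = text.strip().rpartition(', ')
    if not sep:
        return None  # no ', ' separating the year
    parts = rest.split()
    # Look for the month name wherever it sits; the day follows it.
    for i, word in enumerate(parts[:-1]):
        day = parts[i + 1]
        if word in MONTHS and day.isdigit():
            return year.strip() + MONTHS[word] + day.rjust(2, '0')
    return None


print(option_date_key('July 1, 2011'))
print(option_date_key('Friday July 1, 2011'))
```

The returned string compares correctly against the recipe's oldest value because both use the same zero-padded YYYYMMDD layout.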
Old 07-01-2011, 04:40 PM   #2
Starson17
Wizard
Starson17 can program the VCR without an owner's manual.
 
Posts: 4,004
Karma: 177841
Join Date: Dec 2009
Device: WinMo: IPAQ; Android: HTC HD2, Archos 7o; Java:Gravity T
Quote:
Originally Posted by joseelsegundo
I've just started using Calibre and found that it is pretty freakin' awesome!
Yes, it's awesome!

Stick some print statements like this into the recipe:

print 'the variable I want to track is: ', variable

The output will show in the job as the recipe runs, or you can use:
calibre-debug -g
to start calibre and see the results there.

Even better is to use the methods described here:
http://calibre-ebook.com/user_manual/news.html#news
Old 07-01-2011, 06:04 PM   #3
joseelsegundo
Cool. Thanks for the tips. Looking at the debug output, it seems the error is in preprocess_html:

File "/tmp/calibre_0.7.44_tmp_KPT4D0/calibre_0.7.44_yeObDG_recipes/recipe0.py", line 75, in preprocess_html
if '&' in img['src']:
TypeError: 'NoneType' object is not subscriptable
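
The traceback says img is None at that point: comic.find('img') found no img tag inside the comic_full div, and indexing None raises the TypeError. Below is a minimal sketch of a guard, in plain Python 3 with a dict standing in for the Beautiful Soup tag. The helper clean_img_src is hypothetical and not part of the recipe.

```python
def clean_img_src(img):
    """Return the image URL with any trailing '&...' parameter removed.

    `img` is anything that supports img['src'] (a dict here, a tag in
    the recipe), or None when no <img> was found.
    """
    if img is None:
        return None  # nothing to download; the caller should skip this comic
    src = img['src']
    if '&' in src:
        # Same trimming as the recipe: drop everything after the last '&'.
        src = src.rpartition('&')[0]
    return src


print(clean_img_src(None))
print(clean_img_src({'src': 'http://example.com/strip.gif?a=1&b=2'}))
```

The guard only avoids the crash; it does not explain why the tag is missing.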

My hunch is that it has something to do with the comic strip image being buried in a million levels of divs. I'll hack some more at it...

Rob
Old 07-08-2011, 09:15 AM   #4
joseelsegundo
Looking at this further, I don't know if it is possible to grab the King Features comics via WashingtonPost.com. The URL for each of the King comics is populated via JavaScript when the page is loaded in the browser. If you do a "View Source" you'll only see JavaScript where the img tag ought to be. Likewise with a "wget".

The problem is that Beautiful Soup can't find an IMG tag, and so the recipe fails.

As above I inspected the IMG url in Chrome, and did see the URL. So I tried just using that URL to get the comic. No dice. The web page at King Features requires that there be an allowed referring page.

So at this point I'm going to cut my losses and look for another way outside of washingtonpost.com.
Old 07-08-2011, 10:03 PM   #5
Starson17
Quote:
Originally Posted by joseelsegundo
The web page at King Features requires that there be an allowed referring page.
You can control the referring page/header. I took a quick look, and it looked like the URL for each comic could either be calculated from the date, or extracted from the home page for the comic. I didn't go far enough to trigger the requirement for the referer header, but if you can use the same referring page for each comic, it should be possible to get past this issue.
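
A minimal sketch of setting the referring page on a request, in plain Python 3 with the standard-library urllib rather than calibre's browser. It only builds the request and does not send it. Which referring page King Features accepts is an assumption (the Washington Post comic page is used here), and build_comic_request is a hypothetical helper.

```python
from urllib.request import Request


def build_comic_request(image_url, referring_page):
    """Build (but do not send) a request that names the referring page."""
    # The HTTP header name is spelled 'Referer'.
    return Request(image_url, headers={'Referer': referring_page})


page = ('http://www.washingtonpost.com/wp-srv/artsandliving/comics/'
        'king_zits.html?name=Zits')
req = build_comic_request('http://est.rbma.com/content/Zits', page)
print(req.get_header('Referer'))
```

Inside a recipe the same header would go on the browser returned by get_browser(); mechanize browsers keep their default headers in an addheaders list, though whether that alone satisfies the check here is untested.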
Old 07-11-2011, 05:38 PM   #6
joseelsegundo
Hmm.. OK. I'll have another stab at it...
Old 07-13-2011, 09:51 PM   #7
joseelsegundo
I'm still struggling a bit here. I still can't tease the actual URL of the comic image out of the HTML returned.

The URL I've been working with is for the Zits comic of the current day:
http://www.washingtonpost.com/wp-srv...html?name=Zits

The source HTML of this page includes the following where the image will go:
Code:
<div id="comic_full"> <script>document.writeln(img)</script> </div>
When I use "Inspect element" from my Chrome browser I see that this gets changed to:
Code:
<div id="comic_full">
<script>document.writeln(img)</script>
<img src="http://est.rbma.com/content/Zits">
</div>
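
The difference between the two snippets can be checked without a browser. Below is a minimal sketch in plain Python 3 using the standard-library html.parser instead of Beautiful Soup (a swapped-in parser, chosen only so the example is self-contained). The class name ImgCollector is hypothetical. It lists the img src values a client that does not run JavaScript would see in each snippet.

```python
from html.parser import HTMLParser


class ImgCollector(HTMLParser):
    """Collect the src attribute of every <img> tag in the markup."""

    def __init__(self):
        super().__init__()
        self.sources = []

    def handle_starttag(self, tag, attrs):
        if tag == 'img':
            self.sources.append(dict(attrs).get('src'))


def img_sources(html):
    parser = ImgCollector()
    parser.feed(html)
    parser.close()
    return parser.sources


# What the server sends: the script is present but never executed.
served = '<div id="comic_full"> <script>document.writeln(img)</script> </div>'
# What Chrome shows after running the script.
rendered = ('<div id="comic_full">'
            '<script>document.writeln(img)</script>'
            '<img src="http://est.rbma.com/content/Zits">'
            '</div>')

print(img_sources(served))
print(img_sources(rendered))
```

The first call returns an empty list, which is the situation the recipe is in: the tag only exists once a browser has executed the script.
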
With my recipe, I've tried getting the HTML via the index_to_soup() method and I've tried grabbing the HTML using the mechanize browser:
Code:
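def get_browser(self):
    print('In get_browser')
    # get_browser is an instance method, so pass self explicitly
    br = BasicNewsRecipe.get_browser(self)
    br.set_handle_refresh(False)
    url = 'http://www.washingtonpost.com/wp-srv/artsandliving/comics/king_zits.html?name=Zits'
    raw = br.open(url).read()
    print(raw)  # debugging only: dump the HTML the recipe actually receives
    return br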
In each case, I never get the actual URL of the image. So right now I have absolutely no idea where to go from here. I've pored through the documentation and APIs and I see no way to make this work.

Any help is much appreciated.
Old 07-14-2011, 09:45 AM   #8
Starson17
Quote:
Originally Posted by joseelsegundo
When I use "Inspect element" from my Chrome browser I see that this gets changed to:
Code:
<div id="comic_full">
<script>document.writeln(img)</script>
<img src="http://est.rbma.com/content/Zits">
</div>
In each case, I never get the actual URL of the image.
You have the URL right there:
<img src="http://est.rbma.com/content/Zits">
That's never going to change. Why not use that?
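
If the image URL really is fixed per comic, it could be derived from the name parameter already present in the feed URLs. Below is a minimal sketch in plain Python 3. The pattern http://est.rbma.com/content/<name> is confirmed in this thread only for Zits and is an assumption for every other comic, and king_image_url is a hypothetical helper.

```python
from urllib.parse import urlparse, parse_qs

# Seen in this thread for Zits only; assumed to hold for the other comics.
IMAGE_BASE = 'http://est.rbma.com/content/'


def king_image_url(feed_url):
    """Return the assumed image URL for a comic page, or None without a name."""
    names = parse_qs(urlparse(feed_url).query).get('name')
    if not names:
        return None  # e.g. the Uclick feed, which has no ?name= parameter
    return IMAGE_BASE + names[0]


feed = ('http://www.washingtonpost.com/wp-srv/artsandliving/comics/'
        'king_zits.html?name=Zits')
print(king_image_url(feed))
```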

If it's because of the authorization failure, then you need to track down how to pass that test.
Edit: BTW. IIRC, Zits is already available from one of my other comics recipes, isn't it?

Last edited by Starson17; 07-14-2011 at 09:47 AM.
Old 07-16-2011, 03:01 PM   #9
joseelsegundo
No worries. I thought I might have been missing something blindingly obvious.

So yes, it is the authorization failure that is gumming up the works. I'll poke at it as time permits to see if I can solve it.

The only other comics recipe I could find was the GoComics recipe, which I have been using from the start. It has a lot of the comics I want to read, but not all. Since I like to read the Washington Post, the comics I like are all on the Washington Post web site, but the ones that are King Features have this referring page crap you have to wade through.

Cheers -
Rob

EDIT: Haha, oops... just saw the recipe for arcamax in the similar threads below. I had tried comics.com with limited success, although I can't remember what that was...

Last edited by joseelsegundo; 07-16-2011 at 03:06 PM. Reason: Problem between chair and keyboard
Old 07-16-2011, 04:15 PM   #10
Starson17
Yes, GoComics, Arcamax and Comics.com are my comics trio (now a duo). If you're going to keep trying, Tamper Data is the Firefox plugin I'd recommend for working on the authorization crap. If you come to a dead end, let me know. I might have time to take a look at it in a while.
Similar Threads
Thread | Thread Starter | Forum | Replies | Last Post
Washington Post recipe broken | ice-9 | Recipes | 5 | 03-20-2012 09:27 PM
Comics.com Recipe | BRGriff | Recipes | 0 | 05-24-2011 10:41 AM
New Recipe:Arcamax - Comics | Starson17 | Recipes | 17 | 05-16-2011 10:56 AM
Washington Post Recipe problem | warshauer | Recipes | 9 | 11-21-2010 10:30 AM
Recipe for Washington Post blog | oski24601 | Calibre | 1 | 11-25-2009 05:22 PM


All times are GMT -4.