MobileRead Forums > E-Book Software > Calibre > Recipes
Old 09-19-2011, 08:33 PM   #1
macpablus
Enthusiast
Posts: 25
Karma: 1896
Join Date: Aug 2011
Device: Kindle 3
Adding a comic strip to a newspaper's recipe

I'd like to add a comic strip from the index page of a newspaper.

Until now, I've managed to replace the cover with the comic image, using:

def get_cover_url

...but it would be nicer to have the comic inserted as an article.

In my recipe, the articles are retrieved with parse_index.
Old 09-20-2011, 02:09 PM   #2
Starson17
Wizard
Posts: 4,004
Karma: 177841
Join Date: Dec 2009
Device: WinMo: IPAQ; Android: HTC HD2, Archos 7o; Java:Gravity T
Quote:
Originally Posted by macpablus View Post
...it would be nicer to have the comic inserted as an article. In my recipe, the articles are retrieved with parse_index.
parse_index builds a list of feeds, each composed of multiple articles. Each article has a title and a link from which the article content is fetched. Modify parse_index to add the article you want, with a title and a link pointing to the comic strip.
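In plain Python, the structure looks like this (a sketch; the titles and URLs are placeholders, not taken from any real recipe):

```python
# Sketch of the feed structure parse_index returns, with one extra
# single-article feed appended for the comic strip.
# All titles and URLs here are placeholders.
feeds = [
    ('News', [
        {'title': 'Some article', 'url': 'http://example.com/article.html',
         'description': '', 'date': ''},
    ]),
]
# One extra single-article feed pointing at the page holding the strip:
feeds.append(('Humor', [
    {'title': 'Daily strip', 'url': 'http://example.com/comic.html',
     'description': '', 'date': ''},
]))
```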
Old 09-20-2011, 11:53 PM   #3
macpablus
Enthusiast
Quote:
Originally Posted by Starson17 View Post
Modify parse_index to add the article you want with a title and link pointing to the comic strip.
I've just done something like that, but the article shows only strange characters. I guess that's because the link points to an image instead of an HTML file. How do I solve that?

Old 09-21-2011, 04:30 PM   #4
Starson17
Wizard
Quote:
Originally Posted by macpablus View Post
I've just done something like that, but the article shows only strange characters. I guess that's because the link points to an image instead of an HTML file. How do I solve that?
If you post your recipe it will be easier to see what the problem is. You might review some of my comic recipes, such as Arcamax or Gocomics/Comics.com, to see how articles and images interact. Basically, you want a link to an HTML page with an <img> tag on it that holds your strip. If the site doesn't have a page like that (it should; otherwise, how do you see it?), you can build it yourself in the recipe.
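One hedged sketch of "building it yourself" is to wrap the image URL in a minimal HTML string (the helper name and URL below are made up for illustration; a recipe would still have to hand this HTML to calibre somehow, e.g. by writing it to a temporary file whose file:// URL goes into the article dict — the thread ends up taking a simpler postprocess_html route instead):

```python
# Hypothetical helper: wrap a comic image URL in a bare-bones HTML page,
# so the article link can point at HTML instead of the raw image file.
def comic_page(img_url, title='Daily strip'):
    return ('<html><head><title>%s</title></head>'
            '<body><img src="%s"/></body></html>' % (title, img_url))

page = comic_page('http://example.com/strip.gif')
```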
Old 09-21-2011, 08:42 PM   #5
macpablus
Enthusiast
Quote:
Originally Posted by Starson17 View Post
If you post your recipe it would be easier to see what the problem is.
All right, here it is:

Spoiler:
Code:
#!/usr/bin/env  python

__license__   = 'GPL v3'
__copyright__ = '2008, Kovid Goyal <kovid at kovidgoyal.net>'
'''
pagina12.com.ar
'''
import re

from calibre.web.feeds.news import BasicNewsRecipe
from calibre.ebooks.BeautifulSoup import Tag, NavigableString

class Pagina12(BasicNewsRecipe):

    title      = 'Pagina/12 - Edicion Impresa'
    __author__ = 'Pablo Marfil'
    description = 'Diario argentino'
    INDEX = 'http://www.pagina12.com.ar/diario/secciones/index.html'
    language = 'es'
    encoding              = 'cp1252'
    remove_tags_before = dict(id='fecha')	
    remove_tags_after  = dict(id='fin')
    remove_tags        = [dict(id=['fecha', 'fin', 'pageControls','logo','logo_suple','fecha_suple','volver'])]
    masthead_url          = 'http://www.pagina12.com.ar/commons/imgs/logo-home.gif'	
    no_stylesheets = True

    preprocess_regexps= [(re.compile(r'<!DOCTYPE[^>]+>', re.I), lambda m:'')]

    def parse_index(self):
        feeds = []
        comic = []
        soup = self.index_to_soup('http://www.pagina12.com.ar/diario/ultimas/index.html')
        for image in soup.findAll('img',alt=True):
            if image['alt'].startswith('Daniel Paz'):
                comic.append({'title':'Rudy y Daniel Paz', 'url':image['src'], 'description':'',
                    'date':''})
            print image['src']    
        if comic:
            print 'TIRA HALLADA:',comic
            feeds.append(('Humor', comic))				
        return feeds



Quote:
Basically, you want a link to an HTML page with an <img> tag on it that holds your strip. If the site doesn't have a page like that (it should; otherwise, how do you see it?), you can build it yourself in the recipe.
The site has a page, of course, but it contains A LOT of "other things" that I'm not interested in (for my particular purpose, I mean). In fact, it's the page I'm using as the index for parsing the newspaper contents.

So it seems I should "build it myself" in the recipe...
Old 09-22-2011, 01:03 PM   #6
a.peter
Enthusiast
Posts: 28
Karma: 10
Join Date: Sep 2011
Device: Sony PRS-350, Kindle Touch
Okay. I see your problem.

In fact, the return value of parse_index(self) is:

Code:
[
 ('title', [
            {'title':..., 'url':..., 'description':..., 'date':...},
            More dictionaries as above ...
           ]
 ),
 More tuples with genres
]
The url has to be an HTML page.

On each of these pages, remove_tags and the related options are applied, resulting in a cleaned-up HTML page.

A working example would be:

Spoiler:
Code:
from calibre.web.feeds.news import BasicNewsRecipe
from calibre.ebooks.BeautifulSoup import Tag, NavigableString
import re

class Pagina12(BasicNewsRecipe):
    title      = 'Pagina/12 - Edicion Impresa'
    __author__ = 'Pablo Marfil'
    description = 'Diario argentino'
    INDEX = 'http://www.pagina12.com.ar/diario/secciones/index.html'
    language = 'es'
    encoding              = 'cp1252'
    keep_only_tags        = [dict(name='div', attrs={'id':'rudy_paz'})]
    masthead_url          = 'http://www.pagina12.com.ar/commons/imgs/logo-home.gif'	
    no_stylesheets = True

    #preprocess_regexps= [(re.compile(r'<!DOCTYPE[^>]+>', re.I), lambda m:'')]  

    def parse_index(self):
        feeds = [('Humor', [{'title':'Rudy y Daniel Paz', 'url':'http://www.pagina12.com.ar/diario/ultimas/index.html', 'description':'', 'date':''}])]
        print feeds
        raw_input('...')
        return feeds

Old 09-22-2011, 05:59 PM   #7
macpablus
Enthusiast
Quote:
Originally Posted by a.peter View Post
On each of these pages, remove_tags and the related options are applied, resulting in a cleaned-up HTML page.
Thanks, Peter.

The problem is that my complete recipe has other feeds (i.e., the content of the whole newspaper, with many different sections and articles), so keep_only_tags would affect every article. :-(
Old 09-23-2011, 02:16 AM   #8
a.peter
Enthusiast
Quote:
Originally Posted by macpablus View Post
The problem is that my complete recipe has other feeds (i.e., the content of the whole newspaper, with many different sections and articles), so keep_only_tags will affect each of the articles.
It's clear that my recipe isn't complete; it was meant to show you that calibre expects an HTML page as the URL. You passed the address of a GIF image to calibre, which was interpreted as an HTML page and produced the character garbage you've seen.

The good news is that keep_only_tags is a list of dictionaries, so you may add whatever other expressions you need to parse other pages. If I take a look at an article, e.g. http://www.pagina12.com.ar/diario/el...011-09-22.html, I see that the actual article is embedded in a <div class="nota top12"> tag.

A modified keep_only_tags may be:

Code:
keep_only_tags = [dict(name='div', attrs={'id':'rudy_paz'}), dict(name='div', attrs={'class':'nota top12'})]
With this code, calibre will keep
  • all <div> with id='rudy_paz' AND
  • all <div> with class='nota top12'

It doesn't matter if they don't appear on the same page. If you pass one page with the comic strip and a list of pages with articles, it will work on both of them.

By the way: for convenience, you may replace the second part of a keep_only_tags dictionary entry with a compiled regular expression, e.g. attrs={'class':re.compile('top.*')}

But don't forget to add a
Code:
import re
at the top of the recipe.
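As a quick sanity check of that pattern with plain re (no calibre involved): match anchors at the start of the string, so 'top.*' accepts class values that begin with 'top':

```python
import re

# The pattern suggested above for class values like 'top12':
pattern = re.compile('top.*')

print(pattern.match('top12') is not None)   # True: value starts with 'top'
print(pattern.match('nota') is not None)    # False: no match at the start
```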
Old 09-23-2011, 12:39 PM   #9
macpablus
Enthusiast
Quote:
Originally Posted by a.peter View Post
The good news is that keep_only_tags is a list of dictionaries, so you may add whatever other expressions you need to parse other pages. If I take a look at an article, e.g. http://www.pagina12.com.ar/diario/el...011-09-22.html, I see that the actual article is embedded in a <div class="nota top12"> tag.
In fact, I'm using the print version for the articles:

http://www.pagina12.com.ar/imprimir/...011-09-22.html

The actual article is contained in this tag: <div id="cuerpo">.

But before that, there's more content the articles need (title, subtitle, author), in tags like <h5>, <h1>, etc. These would be excluded by keep_only_tags, and if I try to include them as well, the page that has the comic strip would of course show those tags too.

I think the way to go would be, as Starson suggests:
Quote:
Basically, you want a link to an HTML page with an <img> tag on it that holds your strip. If the site doesn't have a page like that (it should; otherwise, how do you see it?), you can build it yourself in the recipe.
That would avoid the symptoms you describe:

Quote:
Calibre expects an HTML page as the URL. You passed the address of a GIF image to calibre, which was interpreted as an HTML page and produced the character garbage you've seen.
But I don't know how to "build the HTML myself". :-(

Maybe you know, pete? ;-)
Old 09-23-2011, 01:54 PM   #10
Starson17
Wizard
Quote:
Originally Posted by macpablus View Post
But I don't know how to "build the HTML myself". :-(
I'm following along. So far, a.peter's comments have been excellent, so I haven't posted anything. One comment I have is that you made it harder to help you by posting only a partial recipe. I suspect you were trying to simplify, but a solution to one part may complicate another part - as you know from the comments on keep_only.

It sort of sounds like you're worried about this interaction, so posting the entire recipe would be good. I'm also not sure exactly where your problem is. You've posted that you're worried using keep_only for the articles will keep the wrong stuff for the comic strip page. That sounds like you've got the recipe working for the page with links to the feed(s) and the pages with links to the articles, and your only remaining problem is controlling any excess junk that appears with the strip without affecting the articles. If that's where you are, then there are many options. If you aren't at that point yet, then we need to get you there.

You may want to review BeautifulSoup, extract() and insert(). Those tools will let you modify a page as needed. You can postprocess_html, identify the page that has the comic strip and process it with BS to do whatever you need, including building a page entirely from scratch if that's needed.
Old 09-23-2011, 03:18 PM   #11
macpablus
Enthusiast
Quote:
Originally Posted by Starson17 View Post
One comment I have is that you made it harder to help you by posting only a partial recipe. I suspect you were trying to simplify, but a solution to one part may complicate another part - as you know from the comments on keep_only.
That's right, I was trying to simplify because I didn't want to be too much of a bother.

Sorry for that. Here's the entire (original) recipe, which in fact is included in the latest version of Calibre:
Spoiler:

Code:
#!/usr/bin/env  python

__license__   = 'GPL v3'
__copyright__ = '2008, Kovid Goyal <kovid at kovidgoyal.net>'
'''
pagina12.com.ar
'''
import re

from calibre.web.feeds.news import BasicNewsRecipe
from calibre.ebooks.BeautifulSoup import Tag, NavigableString

class Pagina12(BasicNewsRecipe):

    title      = 'Pagina/12 - Edicion Impresa'
    __author__ = 'Pablo Marfil'
    description = 'Diario argentino'
    INDEX = 'http://www.pagina12.com.ar/diario/secciones/index.html'
    language = 'es'
    encoding              = 'cp1252'
    remove_tags_before = dict(id='fecha')
    remove_tags_after  = dict(id='fin')
    remove_tags        = [dict(id=['fecha', 'fin', 'pageControls','logo','logo_suple','fecha_suple','volver'])]
    masthead_url          = 'http://www.pagina12.com.ar/commons/imgs/logo-home.gif'
    no_stylesheets = True

    preprocess_regexps= [(re.compile(r'<!DOCTYPE[^>]+>', re.I), lambda m:'')]




    def get_cover_url(self):
        soup = self.index_to_soup('http://www.pagina12.com.ar/diario/principal/diario/index.html')
        for image in soup.findAll('img',alt=True):
           if image['alt'].startswith('Tapa de la fecha'):
              return image['src']
        return None


    def parse_index(self):
        articles = []
        numero = 1
        raw = self.index_to_soup('http://www.pagina12.com.ar/diario/secciones/index.html', raw=True)
        raw = re.sub(r'(?i)<!DOCTYPE[^>]+>', '', raw)
        soup = self.index_to_soup(raw)

        feeds = []

        seen_titles = set([])
        for section in soup.findAll('div','seccionx'):
            numero+=1
            print (numero)
            section_title = self.tag_to_string(section.find('div','desplegable_titulo on_principal right'))
            self.log('Found section:', section_title)
            articles = []
            for post in section.findAll('h2'):
                h = post.find('a', href=True)
                title = self.tag_to_string(h)
                if title in seen_titles:
                    continue
                seen_titles.add(title)
                a = post.find('a', href=True)
                url = a['href']
                if url.startswith('/'):
                    url = 'http://pagina12.com.ar/imprimir'+url
                p = post.find('div', attrs={'h2'})
                desc = None
                self.log('\tFound article:', title, 'at', url)
                if p is not None:
                    desc = self.tag_to_string(p)
                    self.log('\t\t', desc)
                articles.append({'title':title, 'url':url, 'description':desc,
                    'date':''})
            if articles:
                feeds.append((section_title, articles))
        return feeds


    def postprocess_html(self, soup, first):
        for table in soup.findAll('table', align='right'):
            img = table.find('img')
            if img is not None:
                img.extract()
                caption = self.tag_to_string(table).strip()
                div = Tag(soup, 'div')
                div['style'] = 'text-align:center'
                div.insert(0, img)
                div.insert(1, Tag(soup, 'br'))
                if caption:
                    div.insert(2, NavigableString(caption))
                table.replaceWith(div)

        return soup


My goal is to generate a new feed containing only the comic strip from

http://www.pagina12.com.ar/diario/ultimas/index.html

...which is contained in <div class="top12 center" id="rudy_paz">.

So, your description seems correct (again!):

Quote:
That sounds like you've got the recipe working for the page with links to the feed(s) and the page with links to the articles and your only problem left is controlling any excess junk that appears with the strip without affecting the articles. If that's where you are, then there are many options.
If my GPS is working as expected, I'm right there.

Old 09-25-2011, 07:48 AM   #12
a.peter
Enthusiast
Quote:
Originally Posted by macpablus View Post
My goal is to generate a new feed containing only the comic strip from...

http://www.pagina12.com.ar/diario/ultimas/index.html
Hi macpablus, I found time to have a look at your recipe.

First of all, I saw that the daily comic is located at http://www.pagina12.com.ar/diario/principal/index.html.

All I had to do was add this page as a single feed 'Humor' with a single article. Then I modified postprocess_html: I try to find the div with id='rudy_paz'; when it is present, I extract it from the soup, remove all remaining content from the soup's body, insert the image again, and return the soup.

Spoiler:
Code:
        # Try to find the div containing the image
        image = soup.find('div', attrs={'id':'rudy_paz'})
        if image:
            # if found, extract the div, clear the body and add the image again. Finished.
            image.extract()
            while len(soup.body) > 0:
                soup.body.next.extract()
            soup.body.insert(0, image)
            return soup


Also, remove_tags_before didn't seem to work as I expected, so I removed it.

The complete recipe is here:

Spoiler:
Code:
#!/usr/bin/env  python

__license__   = 'GPL v3'
__copyright__ = '2011, Pablo Marfill'
'''
Calibre recipe to convert the news site pagina12.com.ar to an ebook
'''
import re

from calibre.web.feeds.news import BasicNewsRecipe
from calibre.ebooks.BeautifulSoup import Tag, NavigableString

class Pagina12(BasicNewsRecipe):

    title      = 'Pagina/12 - Edicion Impresa'
    __author__ = 'Pablo Marfil, a.peter'
    description = 'Diario argentino'
    INDEX = 'http://www.pagina12.com.ar/diario/secciones/index.html'
    language = 'es'
    encoding              = 'cp1252'
    #remove_tags_before = [dict(id='fecha')] # Is a list of dictionaries. But did not work. Why? I don't know.
    remove_tags_after  = [dict(id='fin')]   # Is a list of dictionaries
    remove_tags        = [dict(id=['volver', 'logo', 'fecha', 'fin', 'pageControls', 'logo_suple', 'fecha_suple'])]
    masthead_url          = 'http://www.pagina12.com.ar/commons/imgs/logo-home.gif'
    no_stylesheets = True

    preprocess_regexps= [(re.compile(r'<!DOCTYPE[^>]+>', re.I), lambda m:'')]

    def get_cover_url(self):
        soup = self.index_to_soup('http://www.pagina12.com.ar/diario/principal/diario/index.html')
        for image in soup.findAll('img',alt=True):
           if image['alt'].startswith('Tapa de la fecha'):
              return image['src']
        return None


    def parse_index(self):
        articles = []
        numero = 1
        raw = self.index_to_soup('http://www.pagina12.com.ar/diario/secciones/index.html', raw=True)
        raw = re.sub(r'(?i)<!DOCTYPE[^>]+>', '', raw)
        soup = self.index_to_soup(raw)
        
        feeds = []
        feeds.append(('Humor', [{'title':'Rudy y Daniel Paz', 'url':'http://www.pagina12.com.ar/diario/principal/index.html', 'description':'Daily comic', 'date':''}]))

        seen_titles = set([])
        for section in soup.findAll('div','seccionx'):
            numero+=1
            print (numero)
            section_title = self.tag_to_string(section.find('div','desplegable_titulo on_principal right'))
            self.log('Found section:', section_title)
            articles = []
            for post in section.findAll('h2'):
                h = post.find('a', href=True)
                title = self.tag_to_string(h)
                if title in seen_titles:
                    continue
                seen_titles.add(title)
                a = post.find('a', href=True)
                url = a['href']
                if url.startswith('/'):
                    url = 'http://pagina12.com.ar/imprimir'+url
                p = post.find('div', attrs={'h2'})
                desc = None
                self.log('\tFound article:', title, 'at', url)
                if p is not None:
                    desc = self.tag_to_string(p)
                    self.log('\t\t', desc)
                articles.append({'title':title, 'url':url, 'description':desc,
                    'date':''})
            if articles:
                feeds.append((section_title, articles))
        return feeds


    def postprocess_html(self, soup, first):
        # Added by a.peter:
        # Try to find the div containing the image
        image = soup.find('div', attrs={'id':'rudy_paz'})
        if image:
            # if found, extract the div, clear the body and add the image again. Finished.
            image.extract()
            while len(soup.body) > 0:
                soup.body.next.extract()
            soup.body.insert(0, image)
            return soup

        for table in soup.findAll('table', align='right'):
            img = table.find('img')
            if img is not None:
                img.extract()
                caption = self.tag_to_string(table).strip()
                div = Tag(soup, 'div')
                div['style'] = 'text-align:center'
                div.insert(0, img)
                div.insert(1, Tag(soup, 'br'))
                if caption:
                    div.insert(2, NavigableString(caption))
                table.replaceWith(div)

        return soup


In debug mode (only two feeds with two articles each) it produced the following output:

Pagina12.epub

I put your name in the copyright and added myself as co-author.
Old 09-25-2011, 08:08 PM   #13
macpablus
Enthusiast
Great!

You definitely deserve the co-author credit. ;-)

But now, I'm going for more. I'll try to add a second comic strip, to see if I learned something from this. Stay tuned!
Old 09-26-2011, 10:59 AM   #14
macpablus
Enthusiast
And here it is: a modified postprocess_html that inserts a second comic strip (the one located at the end of the page):

Spoiler:
Code:
    def postprocess_html(self, soup, first):
        # Try to find the div containing the first image
        rudy = soup.find('div', attrs={'id':'rudy_paz'})
        # Try to find the div containing the second image
        rep = soup.find('div', attrs={'id':'rep'})

        if rep:
            # if found, extract both divs, clear the body and add the images again. Finished.
            rudy.extract()
            rep.extract()
            while len(soup.body) > 0:
                soup.body.next.extract()
            soup.body.insert(0, rudy)
            soup.body.insert(1, rep)
            return soup


Now, I'm trying to insert an <hr> tag between the two, but I can't find the way.
Old 09-26-2011, 12:07 PM   #15
a.peter
Enthusiast
Quote:
Originally Posted by macpablus View Post
Now, I'm trying to insert an <hr> tag between the two, but I can't find the way.
Well done!

And here are a few new things to learn.

First of all: programmers are lazy. Always try to do as much as possible inside loops.

To do this, we will use findAll instead of find to look for all the images on the page. The nice thing is that the second parameter (attrs) accepts lists of values.

Code:
images = soup.findAll('div', attrs={'id':['rudy_paz', 'rep']})
This code will find all divs whose id is either 'rudy_paz' or 'rep'. Cool. Now we have a list of images if len(images) > 0. (The len function returns the number of elements in a list.)

Now we have a list which we may iterate over, using

Code:
for image in images:
    <do something with variable image>
To add new elements the soup offers the method insert along with the class Tag.

To create a new Tag you call something like hr = Tag(soup, "hr"). This creates an empty <hr></hr>. To add it to the soup at a certain position you may call soup.body.insert(0, hr). But because programmers are lazy, they will call something like

Code:
soup.body.insert(0, Tag(soup, "hr"))
Now we have everything together to do what you wanted. Try to combine this with the earlier image.extract() and so on. In case of trouble, you may look at the spoiler:

Spoiler:
Code:
    def postprocess_html(self, soup, first):
        # Added by a.peter:
        # Try to find the divs containing images
        images = soup.findAll('div', attrs={'id':['rudy_paz', 'rep']})
        # if there are images
        if len(images) > 0:
            # extract them from the soup
            for image in images:
                image.extract()
            # clear the body tag by removing all unneeded elements
            while len(soup.body) > 0:
                soup.body.next.extract()
            # add each image with an <hr/> above it
            for image in images:
                soup.body.insert(0, image)
                soup.body.insert(0, Tag(soup, "hr"))
            # there is one <hr/> too many, so we remove it
            soup.find('hr').extract()
            return soup
        
        for table in soup.findAll('table', align='right'):
            img = table.find('img')
            if img is not None:
                img.extract()
                caption = self.tag_to_string(table).strip()
                div = Tag(soup, 'div')
                div['style'] = 'text-align:center'
                div.insert(0, img)
                div.insert(1, Tag(soup, 'br'))
                if caption:
                    div.insert(2, NavigableString(caption))
                table.replaceWith(div)

        return soup
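As a plain-Python sketch of the insert/remove dance above (a list standing in for soup.body and strings for the extracted divs; note that inserting everything at position 0 leaves the images in reverse document order):

```python
# Plain-list simulation of the postprocess_html logic above.
body = []
images = ['rudy_paz', 'rep']      # divs in document order, as findAll returns them
for image in images:
    body.insert(0, image)         # image at the top of the body
    body.insert(0, 'hr')          # an <hr/> above it
body.remove('hr')                 # one <hr/> too many: drop the first one
print(body)                       # ['rep', 'hr', 'rudy_paz']
```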