MobileRead Forums > E-Book Software > Calibre > Recipes
Old 09-19-2011, 08:33 PM   #1
macpablus
Enthusiast
Posts: 25
Karma: 1896
Join Date: Aug 2011
Device: Kindle 3
Adding a comic strip to a newspaper's recipe

I'd like to add a comic strip from the index page of a newspaper.

Until now, I've managed to replace the cover with the comic image, using:

def get_cover_url

...but it would be nicer to have the comic inserted as an article.

In my recipe, the articles are retrieved with parse_index.
Old 09-20-2011, 02:09 PM   #2
Starson17
Wizard
Posts: 4,004
Karma: 177841
Join Date: Dec 2009
Device: WinMo: IPAQ; Android: HTC HD2, Archos 7o; Java:Gravity T
Quote:
Originally Posted by macpablus View Post
...it would be nicer to have the comic inserted as an article. In my recipe, the articles are retrieved with parse_index.
parse_index builds a list of feeds, each composed of multiple articles. Each article has a title and a link from which the article content is fetched. Modify parse_index to add the article you want, with a title and a link pointing to the comic strip.
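In plain Python, the structure looks like this (a sketch; the titles and URLs are placeholders, not taken from any real recipe):

```python
# Sketch of the feed structure parse_index returns, with one extra
# single-article feed appended for the comic strip.
# All titles and URLs here are placeholders.
feeds = [
    ('News', [
        {'title': 'Some article', 'url': 'http://example.com/article.html',
         'description': '', 'date': ''},
    ]),
]
# One extra single-article feed pointing at the page holding the strip:
feeds.append(('Humor', [
    {'title': 'Daily strip', 'url': 'http://example.com/comic.html',
     'description': '', 'date': ''},
]))
```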
Old 09-20-2011, 11:53 PM   #3
macpablus
Enthusiast
Quote:
Originally Posted by Starson17 View Post
Modify parse_index to add the article you want with a title and link pointing to the comic strip.
I've just done something like that, but the article shows only strange characters. I guess that's because the link points to an image instead of an HTML file. How do I solve that?

Old 09-21-2011, 04:30 PM   #4
Starson17
Wizard
Quote:
Originally Posted by macpablus View Post
I've just done something like that, but the article shows only strange characters. I guess that's because the link points to an image instead of an HTML file. How do I solve that?
If you post your recipe it will be easier to see what the problem is. You might review some of my comic recipes, such as Arcamax or Gocomics/Comics.com, to see how articles and images interact. Basically, you want a link to an HTML page with an <img> tag on it that holds your strip. If the site doesn't have a page like that (it should; otherwise, how do you see it?), you can build it yourself in the recipe.
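One hedged sketch of "building it yourself" is to wrap the image URL in a minimal HTML string (the helper name and URL below are made up for illustration; a recipe would still have to hand this HTML to calibre somehow, e.g. by writing it to a temporary file whose file:// URL goes into the article dict — the thread ends up taking a simpler postprocess_html route instead):

```python
# Hypothetical helper: wrap a comic image URL in a bare-bones HTML page,
# so the article link can point at HTML instead of the raw image file.
def comic_page(img_url, title='Daily strip'):
    return ('<html><head><title>%s</title></head>'
            '<body><img src="%s"/></body></html>' % (title, img_url))

page = comic_page('http://example.com/strip.gif')
```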
Old 09-21-2011, 08:42 PM   #5
macpablus
Enthusiast
Quote:
Originally Posted by Starson17 View Post
If you post your recipe it would be easier to see what the problem is.
All right, here it is:

Spoiler:
Code:
#!/usr/bin/env  python

__license__   = 'GPL v3'
__copyright__ = '2008, Kovid Goyal <kovid at kovidgoyal.net>'
'''
pagina12.com.ar
'''
import re

from calibre.web.feeds.news import BasicNewsRecipe
from calibre.ebooks.BeautifulSoup import Tag, NavigableString

class Pagina12(BasicNewsRecipe):

    title      = 'Pagina/12 - Edicion Impresa'
    __author__ = 'Pablo Marfil'
    description = 'Diario argentino'
    INDEX = 'http://www.pagina12.com.ar/diario/secciones/index.html'
    language = 'es'
    encoding              = 'cp1252'
    remove_tags_before = dict(id='fecha')	
    remove_tags_after  = dict(id='fin')
    remove_tags        = [dict(id=['fecha', 'fin', 'pageControls','logo','logo_suple','fecha_suple','volver'])]
    masthead_url          = 'http://www.pagina12.com.ar/commons/imgs/logo-home.gif'	
    no_stylesheets = True

    preprocess_regexps= [(re.compile(r'<!DOCTYPE[^>]+>', re.I), lambda m:'')]

    def parse_index(self):
        feeds = []
        comic = []
        soup = self.index_to_soup('http://www.pagina12.com.ar/diario/ultimas/index.html')
        for image in soup.findAll('img',alt=True):
            if image['alt'].startswith('Daniel Paz'):
                comic.append({'title':'Rudy y Daniel Paz', 'url':image['src'], 'description':'',
                    'date':''})
            print image['src']    
        if comic:
            print 'TIRA HALLADA:',comic
            feeds.append(('Humor', comic))				
        return feeds



Quote:
Basically, you want a link to an HTML page with an <img> tag on it that holds your strip. If the site doesn't have a page like that (it should; otherwise, how do you see it?), you can build it yourself in the recipe.
The site has a page, of course, but it contains A LOT of "other things" that I'm not interested in (for my particular purpose, I mean). In fact, it's the page I'm using as the index for parsing the newspaper contents.

So it seems I should "build it myself" in the recipe...
Old 09-22-2011, 01:03 PM   #6
a.peter
Enthusiast
Posts: 28
Karma: 10
Join Date: Sep 2011
Device: Sony PRS-350, Kindle Touch
Okay. I see your problem.

In fact, the return value of parse_index(self) is:

Code:
[
 ('title', [
            {'title':..., 'url':..., 'description':..., 'date':...},
            More dictionaries as above ...
           ]
 ),
 More tuples with genres
]
The url has to be an HTML page.

On each of these pages, remove_tags and the related options are applied, resulting in a cleaned-up HTML page.

A working example would be:

Spoiler:
Code:
from calibre.web.feeds.news import BasicNewsRecipe
from calibre.ebooks.BeautifulSoup import Tag, NavigableString
import re

class Pagina12(BasicNewsRecipe):
    title      = 'Pagina/12 - Edicion Impresa'
    __author__ = 'Pablo Marfil'
    description = 'Diario argentino'
    INDEX = 'http://www.pagina12.com.ar/diario/secciones/index.html'
    language = 'es'
    encoding              = 'cp1252'
    keep_only_tags        = [dict(name='div', attrs={'id':'rudy_paz'})]
    masthead_url          = 'http://www.pagina12.com.ar/commons/imgs/logo-home.gif'	
    no_stylesheets = True

    #preprocess_regexps= [(re.compile(r'<!DOCTYPE[^>]+>', re.I), lambda m:'')]  

    def parse_index(self):
        feeds = [('Humor', [{'title':'Rudy y Daniel Paz', 'url':'http://www.pagina12.com.ar/diario/ultimas/index.html', 'description':'', 'date':''}])]
        print feeds
        raw_input('...')
        return feeds

Old 09-22-2011, 05:59 PM   #7
macpablus
Enthusiast
Quote:
Originally Posted by a.peter View Post
On each of these pages, remove_tags and the related options are applied, resulting in a cleaned-up HTML page.
Thanks, Peter.

The problem is that my complete recipe has other feeds (i.e., the content of the whole newspaper, with many different sections and articles), so keep_only_tags would affect every article. :-(
Old 09-23-2011, 02:16 AM   #8
a.peter
Enthusiast
Quote:
Originally Posted by macpablus View Post
The problem is that my complete recipe has other feeds (i.e., the content of the whole newspaper, with many different sections and articles), so keep_only_tags will affect each of the articles.
It's clear that my recipe isn't complete; it was meant to show you that calibre expects an HTML page as the URL. You passed the address of a GIF image to calibre, which was interpreted as an HTML page and produced the character garbage you've seen.

The good news is that keep_only_tags is a list of dictionaries, so you may add whatever other expressions you need to parse other pages. If I take a look at an article, e.g. http://www.pagina12.com.ar/diario/el...011-09-22.html, I see that the actual article is embedded in a <div class="nota top12"> tag.

A modified keep_only_tags may be:

Code:
keep_only_tags = [dict(name='div', attrs={'id':'rudy_paz'}), dict(name='div', attrs={'class':'nota top12'})]
With this code, calibre will keep
  • all <div> with id='rudy_paz' AND
  • all <div> with class='nota top12'

It doesn't matter if they don't appear on the same page. If you pass one page with the comic strip and a list of pages with articles, it will work on both of them.

By the way: for convenience, you may replace the second part of a keep_only_tags dictionary entry with a compiled regular expression, e.g. attrs={'class':re.compile('top.*')}

But don't forget to add a
Code:
import re
at the top of the recipe.
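As a quick sanity check of that pattern with plain re (no calibre involved): match anchors at the start of the string, so 'top.*' accepts class values that begin with 'top':

```python
import re

# The pattern suggested above for class values like 'top12':
pattern = re.compile('top.*')

print(pattern.match('top12') is not None)   # True: value starts with 'top'
print(pattern.match('nota') is not None)    # False: no match at the start
```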
Old 09-23-2011, 12:39 PM   #9
macpablus
Enthusiast
Quote:
Originally Posted by a.peter View Post
The good news is that keep_only_tags is a list of dictionaries, so you may add whatever other expressions you need to parse other pages. If I take a look at an article, e.g. http://www.pagina12.com.ar/diario/el...011-09-22.html, I see that the actual article is embedded in a <div class="nota top12"> tag.
In fact, I'm using the print version for the articles:

http://www.pagina12.com.ar/imprimir/...011-09-22.html

The actual article is contained in this tag: <div id="cuerpo">.

But before that, there's more content the articles need (title, subtitle, author), in tags like <h5>, <h1>, etc. These would be excluded by keep_only_tags, and if I try to include them as well, the page that has the comic strip would of course show those tags too.

I think the way to go would be, as Starson suggests:
Quote:
Basically, you want a link to an HTML page with an <img> tag on it that holds your strip. If the site doesn't have a page like that (it should; otherwise, how do you see it?), you can build it yourself in the recipe.
That would avoid the symptoms you describe:

Quote:
Calibre expects an HTML page as the URL. You passed the address of a GIF image to calibre, which was interpreted as an HTML page and produced the character garbage you've seen.
But I don't know how to "build the HTML myself". :-(

Maybe you know, pete? ;-)
Old 09-23-2011, 01:54 PM   #10
Starson17
Wizard
Quote:
Originally Posted by macpablus View Post
But I don't know how to "build the HTML myself". :-(
I'm following along. So far, a.peter's comments have been excellent, so I haven't posted anything. One comment I have is that you made it harder to help you by posting only a partial recipe. I suspect you were trying to simplify, but a solution to one part may complicate another part - as you know from the comments on keep_only.

It sort of sounds like you're worried about this interaction, so posting the entire recipe would be good. I'm also not sure exactly where your problem is. You've posted that you're worried using keep_only for the articles will keep the wrong stuff for the comic strip page. That sounds like you've got the recipe working for the page with links to the feed(s) and the pages with links to the articles, and your only remaining problem is controlling any excess junk that appears with the strip without affecting the articles. If that's where you are, then there are many options. If you aren't at that point yet, then we need to get you there.

You may want to review BeautifulSoup, extract() and insert(). Those tools will let you modify a page as needed. You can postprocess_html, identify the page that has the comic strip and process it with BS to do whatever you need, including building a page entirely from scratch if that's needed.
Old 09-23-2011, 03:18 PM   #11
macpablus
Enthusiast
Quote:
Originally Posted by Starson17 View Post
One comment I have is that you made it harder to help you by posting only a partial recipe. I suspect you were trying to simplify, but a solution to one part may complicate another part - as you know from the comments on keep_only.
That's right, I was trying to simplify because I didn't want to be too much of a bother.

Sorry for that. Here's the entire (original) recipe, which in fact is included in the latest version of Calibre:
Spoiler:

Code:
#!/usr/bin/env  python

__license__   = 'GPL v3'
__copyright__ = '2008, Kovid Goyal <kovid at kovidgoyal.net>'
'''
pagina12.com.ar
'''
import re

from calibre.web.feeds.news import BasicNewsRecipe
from calibre.ebooks.BeautifulSoup import Tag, NavigableString

class Pagina12(BasicNewsRecipe):

    title      = 'Pagina/12 - Edicion Impresa'
    __author__ = 'Pablo Marfil'
    description = 'Diario argentino'
    INDEX = 'http://www.pagina12.com.ar/diario/secciones/index.html'
    language = 'es'
    encoding              = 'cp1252'
    remove_tags_before = dict(id='fecha')
    remove_tags_after  = dict(id='fin')
    remove_tags        = [dict(id=['fecha', 'fin', 'pageControls','logo','logo_suple','fecha_suple','volver'])]
    masthead_url          = 'http://www.pagina12.com.ar/commons/imgs/logo-home.gif'
    no_stylesheets = True

    preprocess_regexps= [(re.compile(r'<!DOCTYPE[^>]+>', re.I), lambda m:'')]




    def get_cover_url(self):
        soup = self.index_to_soup('http://www.pagina12.com.ar/diario/principal/diario/index.html')
        for image in soup.findAll('img',alt=True):
           if image['alt'].startswith('Tapa de la fecha'):
              return image['src']
        return None


    def parse_index(self):
        articles = []
        numero = 1
        raw = self.index_to_soup('http://www.pagina12.com.ar/diario/secciones/index.html', raw=True)
        raw = re.sub(r'(?i)<!DOCTYPE[^>]+>', '', raw)
        soup = self.index_to_soup(raw)

        feeds = []

        seen_titles = set([])
        for section in soup.findAll('div','seccionx'):
            numero+=1
            print (numero)
            section_title = self.tag_to_string(section.find('div','desplegable_titulo on_principal right'))
            self.log('Found section:', section_title)
            articles = []
            for post in section.findAll('h2'):
                h = post.find('a', href=True)
                title = self.tag_to_string(h)
                if title in seen_titles:
                    continue
                seen_titles.add(title)
                a = post.find('a', href=True)
                url = a['href']
                if url.startswith('/'):
                    url = 'http://pagina12.com.ar/imprimir'+url
                p = post.find('div', attrs={'h2'})
                desc = None
                self.log('\tFound article:', title, 'at', url)
                if p is not None:
                    desc = self.tag_to_string(p)
                    self.log('\t\t', desc)
                articles.append({'title':title, 'url':url, 'description':desc,
                    'date':''})
            if articles:
                feeds.append((section_title, articles))
        return feeds


    def postprocess_html(self, soup, first):
        for table in soup.findAll('table', align='right'):
            img = table.find('img')
            if img is not None:
                img.extract()
                caption = self.tag_to_string(table).strip()
                div = Tag(soup, 'div')
                div['style'] = 'text-align:center'
                div.insert(0, img)
                div.insert(1, Tag(soup, 'br'))
                if caption:
                    div.insert(2, NavigableString(caption))
                table.replaceWith(div)

        return soup


My goal is to generate a new feed containing only the comic strip from

http://www.pagina12.com.ar/diario/ultimas/index.html

...which is contained in <div class="top12 center" id="rudy_paz">.

So, your description seems correct (again!):

Quote:
That sounds like you've got the recipe working for the page with links to the feed(s) and the page with links to the articles and your only problem left is controlling any excess junk that appears with the strip without affecting the articles. If that's where you are, then there are many options.
If my GPS is working as expected, I'm right there.

Old 09-25-2011, 07:48 AM   #12
a.peter
Enthusiast
Quote:
Originally Posted by macpablus View Post
My goal is to generate a new feed containing only the comic strip from...

http://www.pagina12.com.ar/diario/ultimas/index.html
Hi macpablus, I found time to have a look at your recipe.

First of all, I saw that the daily comic is located at http://www.pagina12.com.ar/diario/principal/index.html.

All I had to do was add this page as a single feed 'Humor' with a single article. Then I modified postprocess_html: I try to find the div with id='rudy_paz'; when it is present, I extract it from the soup, remove all remaining content from the soup's body, insert the image again, and return the soup.

Spoiler:
Code:
        # Try to find the div containing the image
        image = soup.find('div', attrs={'id':'rudy_paz'})
        if image:
            # if found, extract the div, clear the body and add the image again. Finished.
            image.extract()
            while len(soup.body) > 0:
                soup.body.next.extract()
            soup.body.insert(0, image)
            return soup


Also, remove_tags_before didn't seem to work as I expected, so I removed it.

The complete recipe is here:

Spoiler:
Code:
#!/usr/bin/env  python

__license__   = 'GPL v3'
__copyright__ = '2011, Pablo Marfill'
'''
Calibre recipe to convert the news site pagina12.com.ar to an ebook
'''
import re

from calibre.web.feeds.news import BasicNewsRecipe
from calibre.ebooks.BeautifulSoup import Tag, NavigableString

class Pagina12(BasicNewsRecipe):

    title      = 'Pagina/12 - Edicion Impresa'
    __author__ = 'Pablo Marfil, a.peter'
    description = 'Diario argentino'
    INDEX = 'http://www.pagina12.com.ar/diario/secciones/index.html'
    language = 'es'
    encoding              = 'cp1252'
    #remove_tags_before = [dict(id='fecha')] # Is a list of dictionaries. But did not work. Why? I don't know.
    remove_tags_after  = [dict(id='fin')]   # Is a list of dictionaries
    remove_tags        = [dict(id=['volver', 'logo', 'fecha', 'fin', 'pageControls', 'logo_suple', 'fecha_suple'])]
    masthead_url          = 'http://www.pagina12.com.ar/commons/imgs/logo-home.gif'
    no_stylesheets = True

    preprocess_regexps= [(re.compile(r'<!DOCTYPE[^>]+>', re.I), lambda m:'')]

    def get_cover_url(self):
        soup = self.index_to_soup('http://www.pagina12.com.ar/diario/principal/diario/index.html')
        for image in soup.findAll('img',alt=True):
           if image['alt'].startswith('Tapa de la fecha'):
              return image['src']
        return None


    def parse_index(self):
        articles = []
        numero = 1
        raw = self.index_to_soup('http://www.pagina12.com.ar/diario/secciones/index.html', raw=True)
        raw = re.sub(r'(?i)<!DOCTYPE[^>]+>', '', raw)
        soup = self.index_to_soup(raw)
        
        feeds = []
        feeds.append(('Humor', [{'title':'Rudy y Daniel Paz', 'url':'http://www.pagina12.com.ar/diario/principal/index.html', 'description':'Daily comic', 'date':''}]))

        seen_titles = set([])
        for section in soup.findAll('div','seccionx'):
            numero+=1
            print (numero)
            section_title = self.tag_to_string(section.find('div','desplegable_titulo on_principal right'))
            self.log('Found section:', section_title)
            articles = []
            for post in section.findAll('h2'):
                h = post.find('a', href=True)
                title = self.tag_to_string(h)
                if title in seen_titles:
                    continue
                seen_titles.add(title)
                a = post.find('a', href=True)
                url = a['href']
                if url.startswith('/'):
                    url = 'http://pagina12.com.ar/imprimir'+url
                p = post.find('div', attrs={'h2'})
                desc = None
                self.log('\tFound article:', title, 'at', url)
                if p is not None:
                    desc = self.tag_to_string(p)
                    self.log('\t\t', desc)
                articles.append({'title':title, 'url':url, 'description':desc,
                    'date':''})
            if articles:
                feeds.append((section_title, articles))
        return feeds


    def postprocess_html(self, soup, first):
        # Added by a.peter:
        # Try to find the div containing the image
        image = soup.find('div', attrs={'id':'rudy_paz'})
        if image:
            # if found, extract the div, clear the body and add the image again. Finished.
            image.extract()
            while len(soup.body) > 0:
                soup.body.next.extract()
            soup.body.insert(0, image)
            return soup

        for table in soup.findAll('table', align='right'):
            img = table.find('img')
            if img is not None:
                img.extract()
                caption = self.tag_to_string(table).strip()
                div = Tag(soup, 'div')
                div['style'] = 'text-align:center'
                div.insert(0, img)
                div.insert(1, Tag(soup, 'br'))
                if caption:
                    div.insert(2, NavigableString(caption))
                table.replaceWith(div)

        return soup


In debug mode (only two feeds with two articles each) it produced the following output:

Pagina12.epub

I put your name in the copyright and added myself as co-author.
Old 09-25-2011, 08:08 PM   #13
macpablus
Enthusiast
Great!

You definitely deserve the co-author credit. ;-)

But now, I'm going for more. I'll try to add a second comic strip, to see if I learned something from this. Stay tuned!
Old 09-26-2011, 10:59 AM   #14
macpablus
Enthusiast
And here it is: a modified postprocess_html that inserts a second comic strip (the one located at the end of the page):

Spoiler:
Code:
    def postprocess_html(self, soup, first):
        # Try to find the div containing the first image
        rudy = soup.find('div', attrs={'id':'rudy_paz'})
        # Try to find the div containing the second image
        rep = soup.find('div', attrs={'id':'rep'})

        if rep:
            # if found, extract both divs, clear the body and add the images again. Finished.
            rudy.extract()
            rep.extract()
            while len(soup.body) > 0:
                soup.body.next.extract()
            soup.body.insert(0, rudy)
            soup.body.insert(1, rep)
            return soup


Now, I'm trying to insert an <hr> tag between the two, but I can't find the way.
Old 09-26-2011, 12:07 PM   #15
a.peter
Enthusiast
Quote:
Originally Posted by macpablus View Post
Now, I'm trying to insert an <hr> tag between the two, but I can't find the way.
Well done!

And here are a few new things to learn.

First of all: programmers are lazy. Always try to do as much as possible inside loops.

To do this, we will use findAll instead of find to look for all the images on the page. The nice thing is that the second parameter (attrs) accepts lists of values.

Code:
images = soup.findAll('div', attrs={'id':['rudy_paz', 'rep']})
This code will find all divs whose id is either 'rudy_paz' or 'rep'. Cool. Now we have a list of images if len(images) > 0. (The len function returns the number of elements in a list.)

Now we have a list which we may iterate over, using

Code:
for image in images:
    <do something with variable image>
To add new elements the soup offers the method insert along with the class Tag.

To create a new Tag you call something like hr = Tag(soup, "hr"). This creates an empty <hr></hr>. To add it to the soup at a certain position you may call soup.body.insert(0, hr). But because programmers are lazy, they will call something like

Code:
soup.body.insert(0, Tag(soup, "hr"))
Now we have everything together to do what you wanted. Try to combine this with the earlier image.extract() and so on. In case of trouble, you may look at the spoiler:

Spoiler:
Code:
    def postprocess_html(self, soup, first):
        # Added by a.peter:
        # Try to find the divs containing images
        images = soup.findAll('div', attrs={'id':['rudy_paz', 'rep']})
        # if there are images
        if len(images) > 0:
            # extract them from the soup
            for image in images:
                image.extract()
            # clear the body tag by removing all unneeded elements
            while len(soup.body) > 0:
                soup.body.next.extract()
            # add each image with an <hr/> above it
            for image in images:
                soup.body.insert(0, image)
                soup.body.insert(0, Tag(soup, "hr"))
            # there is one <hr/> too many, so we remove it
            soup.find('hr').extract()
            return soup
        
        for table in soup.findAll('table', align='right'):
            img = table.find('img')
            if img is not None:
                img.extract()
                caption = self.tag_to_string(table).strip()
                div = Tag(soup, 'div')
                div['style'] = 'text-align:center'
                div.insert(0, img)
                div.insert(1, Tag(soup, 'br'))
                if caption:
                    div.insert(2, NavigableString(caption))
                table.replaceWith(div)

        return soup
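As a plain-Python sketch of the insert/remove dance above (a list standing in for soup.body and strings for the extracted divs; note that inserting everything at position 0 leaves the images in reverse document order):

```python
# Plain-list simulation of the postprocess_html logic above.
body = []
images = ['rudy_paz', 'rep']      # divs in document order, as findAll returns them
for image in images:
    body.insert(0, image)         # image at the top of the body
    body.insert(0, 'hr')          # an <hr/> above it
body.remove('hr')                 # one <hr/> too many: drop the first one
print(body)                       # ['rep', 'hr', 'rudy_paz']
```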