I need some help with a recipe

jefferson_frantz · 09-29-2010, 12:06 AM

Hello everyone. I'm new to calibre recipes and I need some help with a recipe for 'Muy Interesante' magazine (http://www.muyinteresante.es).
First, I want to change the style of the article titles, maybe just make them bold.
Second, I need to insert a <br> tag after the image in each article, so the text appears below the image and not next to it. The attached image may explain better what I want.

Thanks in advance!

Here is my recipe:

Code:

from calibre.web.feeds.recipes import BasicNewsRecipe
from BeautifulSoup import BeautifulSoup, Tag

class RevistaMuyInteresante(BasicNewsRecipe):

    title       = 'Revista Muy Interesante'
    __author__  = 'Jefferson Frantz'
    description = 'Revista de divulgacion'
    timefmt = ' [%d %b, %Y]'
    language = 'es_ES'

    keep_only_tags = [dict(name='div', attrs={'class':['article']}),dict(name='td', attrs={'class':['txt_articulo']})]

    remove_tags        = [
                             dict(name=['object','link','script','ul'])
                            ,dict(name='div', attrs={'id':['comment']})
                            ,dict(name='td', attrs={'class':['buttonheading']})
                            ,dict(name='div', attrs={'class':['tags_articles']})
                         ]

    remove_tags_after = dict(name='div', attrs={'class':'tags_articles'})


    def nz_parse_section(self, url):
            soup = self.index_to_soup(url)
            div = soup.find(attrs={'class':'contenido'})

            current_articles = []
            for x in div.findAllNext(attrs={'class':['headline']}):
                    a = x.find('a', href=True)
                    if a is None:
                        continue
                    title = self.tag_to_string(a)
                    url = a.get('href', False)
                    if not url or not title:
                        continue
                    if url.startswith('/'):
                         url = 'http://www.muyinteresante.es'+url
                    self.log('\t\tFound article:', title)
                    self.log('\t\t\t', url)
                    current_articles.append({'title': title, 'url':url,
                        'description':'', 'date':''})

            return current_articles


    def parse_index(self):
            feeds = []
            for title, url in [
                ('Historia',
                 'http://www.muyinteresante.es/historia-articulos'),
             ]:
               articles = self.nz_parse_section(url)
               if articles:
                   feeds.append((title, articles))
            return feeds

PS: Sorry about my English.
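
A side note on the `url.startswith('/')` check in `nz_parse_section`: a minimal sketch of the same idea using the standard library's `urljoin`, which also copes with hrefs that are already absolute or that lack a leading slash. The helper name `absolute_url` and the example paths are hypothetical. The sketch is written for Python 3; on the Python 2 that calibre shipped at the time, the import would be `from urlparse import urljoin`.

```python
from urllib.parse import urljoin

BASE_URL = 'http://www.muyinteresante.es'  # site root taken from the recipe


def absolute_url(href, base=BASE_URL):
    """Return an absolute URL for an href found on the index page."""
    # urljoin resolves relative hrefs against the base and leaves
    # absolute URLs untouched, so no startswith('/') test is needed.
    return urljoin(base, href)


# Hypothetical hrefs, for illustration only.
print(absolute_url('/historia-articulos/ejemplo'))
print(absolute_url('http://example.com/otro'))
```

Inside the recipe, the two-line `if url.startswith('/')` block could then become a single call to such a helper.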

TonytheBookworm · 09-29-2010, 12:32 AM

As for the image, do something like this:

Spoiler:

As for the bold title, add extra_css.
Say your title was in a <div class='title'>..... </div> tag and you wanted to style it differently.
You would do this:

Spoiler:

TonytheBookworm · 09-29-2010, 02:39 AM

All right, time to call in the pro. Starson17, if you have a second, can you look at this and tell me why the image doesn't show up? My thinking is:

1. get the image tag
2. assign it to a new variable
3. extract the image tag from the soup
4. make a new tag that contains a div and a p
5. put the image back into the soup under the p tag that was created

Not sure what the heck I'm doing wrong, because it looks like it should work.
When I used print statements, it showed my newtag as <p> </p>, but for whatever reason it never inserts the image into that tag.
Thanks in advance for the help.

Spoiler:

Code:

from calibre.web.feeds.recipes import BasicNewsRecipe
from BeautifulSoup import BeautifulSoup, Tag

class RevistaMuyInteresante(BasicNewsRecipe):

    title       = 'Revista Muy Interesante'
    __author__  = 'Jefferson Frantz'
    description = 'Revista de divulgacion'
    timefmt = ' [%d %b, %Y]'
    language = 'es_ES'
    #conversion_options = {'linearize_tables' : True}
    keep_only_tags = [dict(name='div', attrs={'class':['article']}),dict(name='td', attrs={'class':['txt_articulo']})]
    remove_tags        = [
                             dict(name=['object','link','script','ul'])
                            ,dict(name='div', attrs={'id':['comment']})
                            ,dict(name='td', attrs={'class':['buttonheading']})
                            ,dict(name='div', attrs={'class':['tags_articles']})
                         ]

    remove_tags_after = dict(name='div', attrs={'class':'tags_articles'})


    


    def nz_parse_section(self, url):
            soup = self.index_to_soup(url)
            div = soup.find(attrs={'class':'contenido'})

            current_articles = []
            for x in div.findAllNext(attrs={'class':['headline']}):
                    a = x.find('a', href=True)
                    if a is None:
                        continue
                    title = self.tag_to_string(a)
                    url = a.get('href', False)
                    if not url or not title:
                        continue
                    if url.startswith('/'):
                         url = 'http://www.muyinteresante.es'+url
                    self.log('\t\tFound article:', title)
                    self.log('\t\t\t', url)
                    current_articles.append({'title': title, 'url':url,
                        'description':'', 'date':''})

            return current_articles


    def parse_index(self):
            feeds = []
            for title, url in [
                ('Historia',
                 'http://www.muyinteresante.es/historia-articulos'),
             ]:
               articles = self.nz_parse_section(url)
               if articles:
                   feeds.append((title, articles))
            return feeds
    
    def preprocess_html(self, soup):
        
        for img_tag in soup.findAll('img'):
            parent_tag = img_tag.parent
            data = img_tag
            img_tag.extract()
            newdiv = Tag(soup,'div')
            newtag = Tag(soup,'p')
            newtag.insert(0,data)
            newdiv.insert(0,newtag)
            parent_tag.insert(0,newdiv)
            
            
            
            
            
        return soup

I keep getting this crap:
newdiv is: <div><p></p></div>
data is: <img style="float: left;" alt="ivision-marrojo" height="225" width="300" src="/images/stories/historia/ivision-marrojo.jpg" />
newtag is: <p></p>

which tells me it is obviously picking up the image tag and has it stored.
But for whatever reason it refuses to insert it into the newdiv.
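
The tree change those steps are aiming for can be sketched outside calibre. This is only an illustration of the intended result, using the standard library's `xml.etree.ElementTree` as a stand-in for Beautiful Soup (an assumption made so the snippet runs on its own); it does not reproduce or explain the Beautiful Soup behaviour reported above. The function name `wrap_images` and the sample markup are hypothetical.

```python
import xml.etree.ElementTree as ET


def wrap_images(root):
    """Wrap every <img> in <div><p>...</p></div>, keeping its position."""
    # Collect (parent, img) pairs first so the tree is not modified
    # while it is being walked.
    pairs = [(parent, child)
             for parent in root.iter()
             for child in list(parent)
             if child.tag == 'img']
    for parent, img in pairs:
        index = list(parent).index(img)
        parent.remove(img)
        div = ET.Element('div')
        p = ET.SubElement(div, 'p')
        # The text that followed the image stays outside the wrapper.
        div.tail, img.tail = img.tail, None
        p.append(img)
        parent.insert(index, div)
    return root


# Hypothetical article cell, loosely modelled on the markup in this thread.
cell = ET.fromstring(
    '<td class="txt_articulo"><img src="/images/ejemplo.jpg" />'
    'Texto del articulo.</td>')
wrap_images(cell)
print(ET.tostring(cell, encoding='unicode'))
```

The image ends up inside the new div and p, and the article text follows the wrapper instead of sitting directly after the img.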

Starson17 · 09-29-2010, 12:28 PM

Quote:

Originally Posted by TonytheBookworm

All right, time to call in the pro. Starson17, if you have a second, can you look at this and tell me why the image doesn't show up?

It's staring you in the face, but you probably haven't run into it before.

Quote:

Not sure what the heck I'm doing wrong, because it looks like it should work.
When I used print statements, it showed my newtag as <p> </p>, but for whatever reason it never inserts the image into that tag.
Thanks in advance for the help.

It's here (last two characters of "/>"):

Code:

data is:  <img style="float: left;" alt="ivision-marrojo" height="225" width="300" src="/images/stories/historia/ivision-marrojo.jpg" />

For some reason, this self-closing tag format makes Beautiful Soup very unhappy. Try this change to "data" in your recipe (it's a bit of a kludge to work into whatever you're doing):

Code:

            data = img_tag
            new_img_tag = Tag(soup,'img')
            new_img_tag['src'] = img_tag['src']
            data = new_img_tag

And don't call img_tag.extract(). I doubt that's the most efficient way to do this, but I'm not really sure what you're doing with the recipe.
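
Note that the snippet above copies only `src`, so `alt`, `width` and `height` are not carried over to the new tag. A hedged sketch of a variant that copies every attribute except `style` follows. It uses the standard library's `xml.etree.ElementTree` as a stand-in for Beautiful Soup so that it runs on its own, and the helper name `clean_copy` is hypothetical.

```python
import xml.etree.ElementTree as ET


def clean_copy(img, drop=('style',)):
    """Build a fresh <img> with every attribute except those in drop."""
    new_img = ET.Element('img')
    for name, value in img.attrib.items():
        if name not in drop:
            new_img.set(name, value)
    return new_img


# The img tag printed earlier in this thread.
original = ET.fromstring(
    '<img style="float: left;" alt="ivision-marrojo" height="225" '
    'width="300" src="/images/stories/historia/ivision-marrojo.jpg" />')
copy = clean_copy(original)
print(sorted(copy.attrib))
```

Leaving out `style` also drops the inline `float: left;`, which is the CSS that lets text flow beside the image.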

TonytheBookworm · 09-29-2010, 01:34 PM

Thanks, that worked for the data part of it. At least it shows correctly in the print statements. However, when I print the soup, I see no change whatsoever that I can detect. It's as if it isn't being inserted into the parent tag. If you get time, could you look at this? I would really like to know how to fix it so I can use the knowledge in the future. I thought it had something to do with the tables, so I linearized them. That didn't work, so I then renamed the table, tr and td tags to 'div', and that still didn't work.

So could you spoon-feed me just a little bit more? (I'm not full yet.)

Thanks.

Edit: the soup looks like this even after the changes:

Spoiler:

Code:

parent tag is:  <td valign="top" colspan="2" class="txt_articulo">
<img style="float: left;" alt="boton-rojo" src="/images/stories/historia/boton-rojo.jpg" width="300" height="225" />Pocas veces cae el destino del mundo en las manos de un solo hombre. La media noche del <strong>26 de septiembre de 1983</strong> pudo ser la última para millones de personas si no hubiera sido por <strong>Stanislav Petrov</strong>. En una época llena de tensiones provocadas por la<a href="/tag/Guerra Fría "> Guerra Fría </a>y el miedo a un Apocalipsis nuclear, <strong>mantuvo la calma cuando las alarmas de un satélite de la URSS avisaron de un <a href="/tag/ataque nuclear">ataque nuclear</a> inminente</strong>. Se trataba del <strong>hombre que tenía a su alcance el “botón rojo</strong>”. <br /><br /> Orbitando sobre la Tierra, los satélites de alerta temprana rusos estaban preparados para detectar cualquier proyectil que se elevase sobre la línea del horizonte. Aquella noche, Petrov, teniente coronel de la Fuerza de misiles estratégicos del <a href="/tag/Ejercito ruso">Ejercito ruso</a>, se encontraba al mando del bunker Serpukhov-15 en Moscú. A las 00.14 de la noche saltaron todos los indicadores alertando de una fuente de calor que ascendía por el este. Sus características correspondían con las de un <a href="/tag/misil nuclear">misil nuclear</a> intercontinental.  <br /><br /> A pesar de la alarma que resonó en todo el bunker, Petrov se mantuvo escéptico. Podía ser un error, así que ordenó suspender la alarma y esperar. Sin embargo, poco después volvieron a sonar las sirenas cuando los satélites detectaron cuatro fuentes de calor más. Ya había perdido mucho tiempo y como declaró en el diario <em>Moscow News</em>: “No se pueden analizar bien las cosas en sólo un par de minutos, todo lo que se puede hacer es confiar en la intuición. Tenía dos opciones: o pensar que los ataques con misiles no parten de una sola base, o que el ordenador ha perdido la cabeza”. Optó por la segunda opción y esperó unos minutos más. 
<br /><br /> La tremenda tensión que “atenazaba a todos los presentes” desapareció de golpe cuando las alarmas cesaron. <strong>Lo que en realidad ocurrió es que, en estas fechas próximas al equinoccio de otoño, los satélites, la Tierra y el Sol se alinearon provocando un extraño error en los detectores</strong>. El Sol se había elevado sobre el horizonte en el ángulo exacto para que los <a target="_blank" href="/tag/satélites">satélites</a> interpretaran sus señales térmicas como un ataque de misiles.  <br /><br /> Después de esto, Stanislav Petrov fue relegado a un puesto inferior por desacatar las normas, y el error fue ocultado por el gobierno de la <a href="/tag/URSS">URSS</a>. El reconocimiento de su hazaña, en el que más tarde se llamó <strong>“Incidente del Equinoccio de Otoño”</strong>, no vino hasta mucho tiempo después cuando recibió su primer premio, "World Citizen Award", el 21 de mayo de 2004. En 2006 viajó a EEUU y fue homenajeado por las Naciones Unidas por su valiente actuación. A pesar de todo, cada vez que se entrevistó a Petrov siempre comentaba: “En todo este tiempo no me he considerado un héroe, sólo alguien que hizo su trabajo y lo hizo bien”. <br /><br /><strong><span style="color: #888888;"> Diego López Donaire</span></strong><br /><br /><div class="article_autor">Muy Interesante</div><div class="article_fecha">29/09/2010</div></td>

Notice that the image tag at the beginning is still unchanged. Shouldn't it be <div><p><img ....... ></p></div>?

Starson17 · 09-29-2010, 02:21 PM

I'm not sure what you're asking. The images appear in the html produced with your code and my changes - they don't appear in your code without them. The img tag appears in my print of the newdiv tag with my changes, but not with your code. Do you want me to post your code with my changes, as tested?

TonytheBookworm · 09-29-2010, 02:49 PM

Quote:

Originally Posted by Starson17

I'm not sure what you're asking. The images appear in the html produced with your code and my changes - they don't appear in your code without them. The img tag appears in my print of the newdiv tag with my changes, but not with your code. Do you want me to post your code with my changes, as tested?

If you don't mind, because I would like to see what I'm doing wrong. Thanks. As for the issue at hand: the dang image has the text wrapped around it, like the original poster showed in his screenshot. I figured that to solve the problem I would simply remove the tables and then enclose the image tag inside a div or p tag. That didn't work very well.

Here is the code I am using:

Spoiler:

Code:

from calibre.web.feeds.recipes import BasicNewsRecipe
from BeautifulSoup import BeautifulSoup, Tag

class RevistaMuyInteresante(BasicNewsRecipe):

    title       = 'Revista Muy Interesante'
    __author__  = 'Jefferson Frantz'
    description = 'Revista de divulgacion'
    timefmt = ' [%d %b, %Y]'
    language = 'es_ES'
    conversion_options = {'linearize_tables' : True}
    keep_only_tags = [dict(name='div', attrs={'class':['article']}),dict(name='td', attrs={'class':['txt_articulo']})]
    remove_tags        = [
                             dict(name=['object','link','script','ul'])
                            ,dict(name='div', attrs={'id':['comment']})
                            ,dict(name='td', attrs={'class':['buttonheading']})
                            ,dict(name='div', attrs={'class':['tags_articles']})
                         ]

    remove_tags_after = dict(name='div', attrs={'class':'tags_articles'})


    


    def nz_parse_section(self, url):
            soup = self.index_to_soup(url)
            div = soup.find(attrs={'class':'contenido'})

            current_articles = []
            for x in div.findAllNext(attrs={'class':['headline']}):
                    a = x.find('a', href=True)
                    if a is None:
                        continue
                    title = self.tag_to_string(a)
                    url = a.get('href', False)
                    if not url or not title:
                        continue
                    if url.startswith('/'):
                         url = 'http://www.muyinteresante.es'+url
                    self.log('\t\tFound article:', title)
                    self.log('\t\t\t', url)
                    current_articles.append({'title': title, 'url':url,
                        'description':'', 'date':''})

            return current_articles


    def parse_index(self):
            feeds = []
            for title, url in [
                ('Historia',
                 'http://www.muyinteresante.es/historia-articulos'),
             ]:
               articles = self.nz_parse_section(url)
               if articles:
                   feeds.append((title, articles))
            return feeds
    
    def preprocess_html(self, soup):
        
        for img_tag in soup.findAll('img'):
            parent_tag = img_tag.parent
            data = img_tag
            new_img_tag = Tag(soup,'img')
            new_img_tag['src'] = img_tag['src']
            data = new_img_tag
           
            
            newdiv = Tag(soup,'div')
            newtag = Tag(soup,'p')
            newtag.insert(0,data)
            newdiv.insert(0,newtag)
            parent_tag.insert(0,newdiv)
            print 'parent tag is: ', parent_tag
            print 'newdiv is: ', newdiv
            print 'data is: ',data
            print 'newtag is: ', newtag
            print 'the soup is: ', soup
            
            
            
        return soup
    
    def postprocess_html(self, soup, first):
        for tag in soup.findAll(name=['table', 'tr', 'td']):
            tag.name = 'div'
        return soup
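
What the `postprocess_html` step above does to the markup can be sketched on its own. The snippet uses the standard library's `xml.etree.ElementTree` instead of Beautiful Soup (an assumption, so that it runs without calibre), and the function name `tables_to_divs` and the sample markup are hypothetical.

```python
import xml.etree.ElementTree as ET

TABLE_TAGS = {'table', 'tr', 'td'}  # the same three names the recipe renames


def tables_to_divs(root):
    """Rename table, tr and td elements to div, leaving content untouched."""
    for element in root.iter():
        if element.tag in TABLE_TAGS:
            element.tag = 'div'
    return root


page = ET.fromstring(
    '<table><tr><td class="txt_articulo">Texto</td></tr></table>')
tables_to_divs(page)
print(ET.tostring(page, encoding='unicode'))
```

Renaming changes only the element names. Attributes such as the image's inline `float: left;` style are left alone, which may be why the text still wrapped around the image after this step.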

jefferson_frantz · 10-01-2010, 01:49 AM

Thanks Tony for your help!!!
I tried your first suggestion, but the text didn't move below the image.

But this gave me an idea, hehe.
So, after some trial and error, I found the solution.

... possibly not the finest, but it works for me.

Here is my new recipe:

Spoiler:

Code:

from calibre.web.feeds.recipes import BasicNewsRecipe
from BeautifulSoup import BeautifulSoup, Tag

class RevistaMuyInteresante(BasicNewsRecipe):

    title       = 'Revista Muy Interesante'
    __author__  = 'Jefferson Frantz'
    description = 'Revista de divulgacion'
    timefmt = ' [%d %b, %Y]'
    language = 'es_ES'

    no_stylesheets = True


    #then we add our own style(s) like this:
    extra_css = '''
                       .contentheading{font-weight: bold}
                       p {font-size: 4px;font-family: Times New Roman;}
                    '''

    ###########################################################
    # Python keeps only the last definition of a method, so the inline-style
    # cleanup and the image move are combined in a single preprocess_html.
    ###########################################################
    def preprocess_html(self, soup):
            # Get rid of the inline styles that often prevent extra_css
            # from taking effect.
            for item in soup.findAll(style=True):
               del item['style']
            # Move the article image out of the text cell and into the
            # article_category div, so the text starts below the image.
            for img_tag in soup.findAll('img'):
                parent_tag = img_tag.parent
                if parent_tag.name == 'td':
                    if not parent_tag.get('class') == 'txt_articulo': break
                    imagen = img_tag
                    new_tag = Tag(soup,'p')
                    img_tag.replaceWith(new_tag)
                    div = soup.find(attrs={'class':'article_category'})
                    div.insert(0,imagen)
            return soup

    keep_only_tags = [dict(name='div', attrs={'class':['article']}),dict(name='td', attrs={'class':['txt_articulo']})]

    remove_tags        = [
                             dict(name=['object','link','script','ul'])
                            ,dict(name='div', attrs={'id':['comment']})
                            ,dict(name='td', attrs={'class':['buttonheading']})
                            ,dict(name='div', attrs={'class':['tags_articles']})
                         ]

    remove_tags_after = dict(name='div', attrs={'class':'tags_articles'})


    #TO GET ARTICLES IN SECTION
    def nz_parse_section(self, url):
            soup = self.index_to_soup(url)
            div = soup.find(attrs={'class':'contenido'})
            current_articles = []
            for x in div.findAllNext(attrs={'class':['headline']}):
                    a = x.find('a', href=True)
                    if a is None:
                        continue
                    title = self.tag_to_string(a)
                    url = a.get('href', False)
                    if not url or not title:
                        continue
                    if url.startswith('/'):
                         url = 'http://www.muyinteresante.es'+url
#                    self.log('\t\tFound article:', title)
#                    self.log('\t\t\t', url)
                    current_articles.append({'title': title, 'url':url,
                        'description':'', 'date':''})

            return current_articles


    # To GET SECTIONS
    def parse_index(self):
            feeds = []
            for title, url in [
                ('Historia',
                 'http://www.muyinteresante.es/historia-articulos'),
             ]:
               articles = self.nz_parse_section(url)
               if articles:
                   feeds.append((title, articles))
            return feeds

Thanks again!

PS: The solution for the title with the extra_css works like a charm

TonytheBookworm · 10-01-2010, 04:31 PM

I'm glad you figured it out. I battled with it for a while and never could get the image to move, but I see how you did it. What I don't understand is why creating the div tags like I was doing didn't work, but whatever. If one bullet doesn't work and the other one does, then that is all that matters.

zeener · 11-21-2010, 11:20 AM

Here you have my recipe for Muy Interesante magazine:

http://gazambuja.pastebin.com/t33w40SF

kovidgoyal · 11-21-2010, 11:34 AM

@zeener: Are there some improvements in your version that should be merged into the builtin recipe?

zeener · 11-22-2010, 07:44 AM

Quote:

Originally Posted by kovidgoyal

@zeener: Are there some improvements in your version that should be merged into the builtin recipe?

I think so. Please, I encourage @jefferson_frantz to test my version.

kovidgoyal · 11-22-2010, 12:56 PM

Can you tell us what the improvements are? That makes it easier to test.

zeener · 11-22-2010, 02:40 PM

Quote:

Originally Posted by kovidgoyal

Can you tell us what the improvements are? That makes it easier to test.

Uses RSS.
Gets the cover from the website.
Better "look & feel".

kovidgoyal · 11-22-2010, 03:06 PM

Well, I've added the get_cover to the builtin recipe. For the rest, let's wait for comments from jefferson_frantz.
