08-27-2011, 05:15 PM   #1
macpablus
Enthusiast
Posts: 25
Karma: 1896
Join Date: Aug 2011
Device: Kindle 3
Request: Non RSS site recipe (Argentinean newspaper)

Hi everyone.

First of all, thanks very much for the marvellous job with Calibre.

One of the things that attracts me the most is the possibility of accessing various sources of information through "Fetch news".

Now, going straight to the point, one of the sites I read every day is that of the Argentinean newspaper PAGINA12. But I'm not convinced by the way the default recipe handles its content, for one reason: the first section that appears ("Edición Impresa") usually contains too many articles, which in fact belong to actual (and different) sections of the newspaper.

So I decided to at least try to make a recipe of my own, using the one for THE ATLANTIC as a starting point. No success so far. :-(

The index file for PAGINA12 is this, and for THE ATLANTIC is this

The basic problem, I think, is that I cannot manage to "translate" the HTML tags that point to the different sections. I understand that these lines of code are the key...

Quote:
for section in soup.findAll('div', attrs={'class':'magazineSection'}):
    section_title = self.tag_to_string(section.find('h2'))
Checking the index from THE ATLANTIC, I soon realized that each section is contained in a DIV with the class magazineSection, and that the section name is held in an H2 tag.

In PAGINA12's index, the section DIVs have the class seccionx and (here's the thing) the section names sit inside an a tag. Here's an example:

Quote:
<div class="seccionx">

<div class="desplegable_titulo on_principal right"><a href="/diario/economia/index-2011-08-26.html" title="">ECONOMIA</a></div>
<div class="desplegable_boton boton_cerrar" onclick="_toggle('indice_economia')" id="boton_indice_economia">&nbsp;indice</div>
I've tried different options, but the sections aren't detected (nor are the articles, but let's put that aside for now).
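To make the target concrete, here is a minimal standalone sketch of the lookup described above: for each DIV whose class list includes desplegable_titulo, take the text and href of the a tag inside it. It uses only Python's standard html.parser rather than Calibre's recipe API, and it assumes the real index looks like the excerpt (the closing tag for the outer seccionx DIV is added here so the fragment is complete). The names SAMPLE, SectionTitleParser and find_sections are made up for illustration.

```python
from html.parser import HTMLParser

# Markup copied from the excerpt above; the final </div> is an assumption,
# added only so that the fragment is well-formed.
SAMPLE = """
<div class="seccionx">
<div class="desplegable_titulo on_principal right"><a href="/diario/economia/index-2011-08-26.html" title="">ECONOMIA</a></div>
<div class="desplegable_boton boton_cerrar" onclick="_toggle('indice_economia')" id="boton_indice_economia">&nbsp;indice</div>
</div>
"""


class SectionTitleParser(HTMLParser):
    """Collect (title, href) for each <a> found inside a title DIV."""

    def __init__(self):
        super().__init__()
        self.sections = []    # collected (title, href) tuples
        self._div_stack = []  # one flag per open <div>: is it a title DIV?
        self._href = None     # href of the <a> being read, or None
        self._text = []       # text pieces of the <a> being read

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == 'div':
            # The class attribute holds several names, so split it first.
            classes = (attrs.get('class') or '').split()
            self._div_stack.append('desplegable_titulo' in classes)
        elif tag == 'a' and self._div_stack and self._div_stack[-1]:
            self._href = attrs.get('href') or ''
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == 'a' and self._href is not None:
            title = ''.join(self._text).strip()
            self.sections.append((title, self._href))
            self._href = None
        elif tag == 'div' and self._div_stack:
            self._div_stack.pop()


def find_sections(html):
    """Return a list of (section title, section index href) tuples."""
    parser = SectionTitleParser()
    parser.feed(html)
    parser.close()
    return parser.sections


if __name__ == '__main__':
    for title, href in find_sections(SAMPLE):
        print(title, href)
```

Inside a recipe, the equivalent step would presumably replace section.find('h2') with a lookup of the a tag inside the title DIV, passed to self.tag_to_string as in the quoted lines. Note that the class attribute here carries three names ("desplegable_titulo on_principal right"), so a match on 'desplegable_titulo' alone may or may not succeed depending on how the parser compares class values.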

Any ideas?

Last edited by macpablus; 08-28-2011 at 11:47 PM.