02-13-2013, 05:40 PM   #4
JonathanL
Junior Member
Posts: 3
Karma: 10
Join Date: Feb 2013
Device: Kindle Keyboard
A little victory

Revisiting this, I think I have achieved what I wanted. I still have some questions, though, and would appreciate anyone shedding some light on the subject.

The solution
All the information I wanted was there in the HTML of the article page. I kept auto_cleanup, since it had been working fine, and used auto_cleanup_keep to preserve all the tags around that information, which were nested three levels deep in some cases. One of the tags I wanted was <abbr>; since it is an unusual tag, I replaced it with a wildcard (*), because I suspect it might have been what caused my earlier attempts to fail. I also had to select one <span> by an unusual attribute (rel), since it had no 'id' or 'class' and its title was too specific.
To achieve all this I had to set use_embedded_content=False.
Here's how it came out:

from calibre.web.feeds.news import BasicNewsRecipe


class Politikon(BasicNewsRecipe):
    title = u'Politikon - Pol\xedtica, econom\xeda, sociedad y actualidad. (extended)'
    oldest_article = 7
    max_articles_per_feed = 20

    feeds = [
        (u'Politikon - Pol\xedtica, econom\xeda, sociedad y actualidad.', u'http://politikon.es/feed/'),
        (u'Ahora', 'http://politikon.es/category/ahora/feed/'),
        (u'Pol\xedtica', 'http://politikon.es/category/politica/feed/'),
        (u'Econom\xeda', 'http://politikon.es/category/economia/feed/'),
        (u'Internacional', 'http://politikon.es/category/internacional/feed/'),
        (u'Sociedad', 'http://politikon.es/category/sociedad/feed/'),
    ]

    no_stylesheets = True

    # Download the full article pages instead of using the feed content.
    use_embedded_content = False

    reverse_article_order = True

    auto_cleanup = True

    # Keep title, author and date; the wildcard (*) stands in for <abbr>.
    auto_cleanup_keep = '//*[@class="title"]|//div[@class="post-meta"]|//span[@class="author vcard"]|//span[@class="fn"]|//span[@rel="author"]|//*[@class="date time published"]'

    extra_css = '.title {font-size: 150%; text-align:center}'

    remove_empty_feeds = True

    ignore_duplicate_articles = {'title', 'url'}
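
To see which elements an auto_cleanup_keep expression like the one above would select, the XPath can be tried outside Calibre. The following is only a rough sketch: the sample markup, the names SAMPLE and kept_elements, and the author and date values are all invented for illustration and are not the real Politikon page. It also uses Python's standard xml.etree.ElementTree, which understands only a subset of XPath, so the expression is split on "|" and each alternative is run separately. Calibre itself may evaluate the expression differently.

```python
import xml.etree.ElementTree as ET

# Hypothetical article markup, invented for illustration only.
SAMPLE = """
<html><body>
  <h1 class="title">Example headline</h1>
  <div class="post-meta">
    <span class="author vcard"><span class="fn"><span rel="author">Jane Doe</span></span></span>
    <abbr class="date time published" title="2013-02-13T17:40:00">13 Feb 2013</abbr>
  </div>
  <div class="sidebar">Unrelated links</div>
</body></html>
"""

# Same expression as in the recipe above.
AUTO_CLEANUP_KEEP = (
    '//*[@class="title"]|//div[@class="post-meta"]|'
    '//span[@class="author vcard"]|//span[@class="fn"]|'
    '//span[@rel="author"]|//*[@class="date time published"]'
)


def kept_elements(markup, keep_expr):
    """Return (tag, class-or-rel) pairs for every element an alternative matches."""
    root = ET.fromstring(markup)
    matches = []
    for alternative in keep_expr.split('|'):
        # ElementTree only accepts relative paths, so '//x' becomes './/x'.
        for el in root.findall('.' + alternative):
            matches.append((el.tag, el.get('class') or el.get('rel')))
    return matches


matches = kept_elements(SAMPLE, AUTO_CLEANUP_KEEP)
for tag, marker in matches:
    print(tag, marker)
```

In this sample the wildcard alternative picks up the <abbr> element by its class alone, and the rel="author" alternative picks up the innermost <span>, while the sidebar <div> is not matched by anything.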


The problems
As I said, I had to force Calibre not to use the embedded content, even though everything is there and I can easily identify the bits of information I want in the source of the RSS feed. Applying the same technique to the feed content, however, does not yield the results I want. I don't understand how Calibre picks up and uses the tags from the RSS source. I am not a programmer, and what I have read does not tell me enough about what is going on behind the scenes. Keeping a few HTML tags I can manage, but the RSS content surely requires more processing.
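
One possible reason, offered only as a guess: in a typical WordPress-style feed the title, author and date sit in their own XML elements, and the embedded HTML body contains only the article text, so XPath expressions written against the page's class names would have nothing to match there. The sketch below assumes such a layout. The feed text, SAMPLE_FEED, NS and describe_items are invented for illustration; the real Politikon feed may be structured differently, and this says nothing about how Calibre processes embedded content internally.

```python
import xml.etree.ElementTree as ET

# Hypothetical WordPress-style feed item, invented for illustration only.
SAMPLE_FEED = """<rss version="2.0"
  xmlns:content="http://purl.org/rss/1.0/modules/content/"
  xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <item>
      <title>Example headline</title>
      <dc:creator>Jane Doe</dc:creator>
      <pubDate>Wed, 13 Feb 2013 17:40:00 +0000</pubDate>
      <content:encoded><![CDATA[<p>Body text only.</p>]]></content:encoded>
    </item>
  </channel>
</rss>"""

# Namespace prefixes used by the sample feed above.
NS = {
    'content': 'http://purl.org/rss/1.0/modules/content/',
    'dc': 'http://purl.org/dc/elements/1.1/',
}


def describe_items(feed_xml):
    """Collect title, author, date and embedded HTML body for each feed item."""
    root = ET.fromstring(feed_xml)
    items = []
    for item in root.iter('item'):
        items.append({
            'title': item.findtext('title'),
            'author': item.findtext('dc:creator', namespaces=NS),
            'date': item.findtext('pubDate'),
            'body_html': item.findtext('content:encoded', namespaces=NS),
        })
    return items


items = describe_items(SAMPLE_FEED)
for entry in items:
    print(entry['title'], '|', entry['author'], '|', entry['date'])
    # In this sample the embedded body carries none of the page's class names.
    print('class="title"' in entry['body_html'])
```

If the real feed looks like this, the metadata is visible in the feed source but lives outside the embedded HTML, which would be consistent with the same auto_cleanup_keep expression working on the downloaded page and not on the feed content.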

Thanks in advance for any advice or pointers regarding the RSS issue. Since the data is in the feed, it seems preferable to use it from there.