MobileRead Forums - Include Author and Publication Date from feed

JonathanL · 02-13-2013, 05:40 PM

Revisiting this, I think I have achieved what I wanted. But I still have some questions and would appreciate anyone shedding some light on the subject.

The solution
All the information I wanted was there in the HTML of the article page. I kept auto_cleanup, since it had been working fine, and added auto_cleanup_keep to preserve all the tags wrapped around that information, which were nested three levels deep in some cases. One of the tags I wanted was <abbr>; since it is an unusual tag, I replaced the tag name with a wildcard (*), as I suspect the tag name was what made it fail previously. I also had to match one <span> on an unusual attribute (rel), since it had no 'id' or 'class' and its title was too specific.
To achieve all this I had to set use_embedded_content=False.
Here's how it came out:

Spoiler:

The problems
As I said, I had to force Calibre not to use the embedded content, even though everything is there and I can easily identify the bits of information I want in the source of the RSS feed. Applying the same technique to the embedded content, however, does not yield the results I want. I don't understand how Calibre picks up and uses the tags from the RSS source. I am not a programmer, and from what I have read I cannot understand enough of what is going on behind the scenes. Keeping a few HTML tags I understand, but the RSS content surely requires more processing.

Cheers for any advice/pointers regarding the RSS issue. Since the data is in the feed it seems preferable to use it from there.
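
To make "the data is in the feed" concrete, here is a minimal sketch of reading the author, date and embedded body straight out of an RSS item with Python's standard library. The sample feed, the author name and the helper item_metadata are all invented for illustration, and it assumes the feed uses the common dc:creator and content:encoded elements. It is not a description of how Calibre processes embedded content internally.

```python
import xml.etree.ElementTree as ET
from email.utils import parsedate_to_datetime

# Namespace URIs of the Dublin Core and RSS content modules.
NS = {
    "dc": "http://purl.org/dc/elements/1.1/",
    "content": "http://purl.org/rss/1.0/modules/content/",
}

# Invented sample feed; a real feed would be downloaded instead.
SAMPLE_FEED = """
<rss version="2.0"
     xmlns:dc="http://purl.org/dc/elements/1.1/"
     xmlns:content="http://purl.org/rss/1.0/modules/content/">
  <channel>
    <title>Example feed</title>
    <item>
      <title>Example headline</title>
      <link>http://example.com/post/1</link>
      <pubDate>Wed, 13 Feb 2013 17:40:00 +0000</pubDate>
      <dc:creator>Jane Doe</dc:creator>
      <content:encoded><![CDATA[<p>Full article text.</p>]]></content:encoded>
    </item>
  </channel>
</rss>
"""


def item_metadata(feed_xml):
    """Return title, author, publication date and embedded body per item."""
    root = ET.fromstring(feed_xml)
    rows = []
    for item in root.iterfind("./channel/item"):
        # pubDate uses the RFC 822 date format, which email.utils can parse.
        published = parsedate_to_datetime(item.findtext("pubDate"))
        rows.append({
            "title": item.findtext("title"),
            "author": item.findtext("dc:creator", namespaces=NS),
            "published": published.date().isoformat(),
            "body": item.findtext("content:encoded", namespaces=NS),
        })
    return rows


if __name__ == "__main__":
    for row in item_metadata(SAMPLE_FEED):
        print(row)
```

Note that in a feed like this the author and date sit in their own elements next to the article body, not inside its HTML, which may be why keeping HTML tags alone does not bring them along.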

Post #4: A little victory

Recipe from the spoiler above:

    from calibre.web.feeds.news import BasicNewsRecipe

    class Politikon(BasicNewsRecipe):
        title = u'Politikon - Pol\xedtica, econom\xeda, sociedad y actualidad. (extended)'
        oldest_article = 7
        max_articles_per_feed = 20
        feeds = [
            (u'Politikon - Pol\xedtica, econom\xeda, sociedad y actualidad.', u'http://politikon.es/feed/'),
            (u'Ahora', 'http://politikon.es/category/ahora/feed/'),
            (u'Pol\xedtica', 'http://politikon.es/category/politica/feed/'),
            (u'Econom\xeda', 'http://politikon.es/category/economia/feed/'),
            (u'Internacional', 'http://politikon.es/category/internacional/feed/'),
            (u'Sociedad', 'http://politikon.es/category/sociedad/feed/'),
        ]
        no_stylesheets = True
        # Download each article page instead of using the content embedded in the feed
        use_embedded_content = False
        reverse_article_order = True
        auto_cleanup = True
        # XPath union of the elements that auto_cleanup must keep (title, author, date)
        auto_cleanup_keep = (
            '//*[@class="title"]'
            '|//div[@class="post-meta"]'
            '|//span[@class="author vcard"]'
            '|//span[@class="fn"]'
            '|//span[@rel="author"]'
            '|//*[@class="date time published"]'
        )
        extra_css = '.title {font-size: 150%; text-align:center}'
        remove_empty_feeds = True
        ignore_duplicate_articles = {'title', 'url'}
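
To see what a keep expression of this shape selects, here is a minimal sketch using only Python's standard library, with the wildcard (*) written out and plain | separators. The sample HTML, the author name and the helper matched_elements are invented for illustration. ElementTree supports only a subset of XPath and has no | union, so the expression is split by hand; this is not how Calibre itself evaluates auto_cleanup_keep.

```python
import xml.etree.ElementTree as ET

# Invented, well-formed sample page mimicking the structure described in the post.
SAMPLE = """
<html><body>
  <h1 class="title">Example headline</h1>
  <div class="post-meta">
    <span class="author vcard"><span class="fn"><span rel="author">Jane Doe</span></span></span>
    <abbr class="date time published" title="2013-02-13T17:40:00">13 Feb 2013</abbr>
  </div>
  <div class="sidebar">Unrelated widget</div>
</body></html>
"""

KEEP = (
    '//*[@class="title"]'
    '|//div[@class="post-meta"]'
    '|//span[@class="author vcard"]'
    '|//span[@class="fn"]'
    '|//span[@rel="author"]'
    '|//*[@class="date time published"]'
)


def matched_elements(markup, keep_expr):
    """Return the elements matched by any alternative of the union expression."""
    root = ET.fromstring(markup)
    found = []
    # Naive split: fine here because no '|' appears inside a quoted value.
    for part in keep_expr.split("|"):
        # ElementTree paths are relative, so '//x' becomes './/x'.
        for el in root.findall("." + part):
            if el not in found:
                found.append(el)
    return found


if __name__ == "__main__":
    for el in matched_elements(SAMPLE, KEEP):
        print(el.tag, el.attrib)
```

The wildcard alternatives match on the class attribute alone, so the <abbr> is selected without naming the tag, and the innermost <span> is reached through its rel attribute.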