Old 02-11-2013, 11:58 AM   #1
JonathanL
Junior Member
JonathanL began at the beginning.
 
Posts: 3
Karma: 10
Join Date: Feb 2013
Device: Kindle Keyboard
Include Author and Publication Date from feed

I have finally decided to start improving on the simple, fully automatic recipes that Calibre generates. The first thing I would like to do is have each article show the author's name and the publication date after the title (or in some other logical location, for reference).

I believe this information is included in the feed data and in the page of the article itself. The feed in question is: http://politikon.es/feed/

Many thanks for your help. Perhaps seeing how this is done will help me make heads or tails of the rest.
Old 02-12-2013, 06:37 AM   #2
JonathanL
First efforts

Here is my first effort. The articles are pretty clean and auto cleanup works fine. I just want to add a line under the title with the date and author. That information appears to live in a div with class "post-meta", but auto_cleanup_keep alone does not seem to be enough to pull it in.

class Politikon(BasicNewsRecipe):
    title = u'Politikon - Pol\xedtica, econom\xeda, sociedad y actualidad. (extended)'
    oldest_article = 7
    max_articles_per_feed = 20
    auto_cleanup = True

    feeds = [
        (u'Politikon - Pol\xedtica, econom\xeda, sociedad y actualidad.', u'http://politikon.es/feed/'),
        (u'Ahora', 'http://politikon.es/category/ahora/feed/'),
        (u'Pol\xedtica', 'http://politikon.es/category/politica/feed/'),
        (u'Econom\xeda', 'http://politikon.es/category/economia/feed/'),
        (u'Internacional', 'http://politikon.es/category/internacional/feed/'),
        (u'Sociedad', 'http://politikon.es/category/sociedad/feed/'),
    ]

    no_stylesheets = True

    auto_cleanup_keep = '//div[@class="post-meta"]|//abbr[@class="date time published"]|//span[@class="author vcard"]'

    ignore_duplicate_articles = {'title', 'url'}


Thanks for any help
Old 02-12-2013, 06:49 AM   #3
kovidgoyal
creator of calibre
Posts: 43,839
Karma: 22666666
Join Date: Oct 2006
Location: Mumbai, India
Device: Various
If you can't get it to work with auto_cleanup_keep, you will have to clean up manually using the remove_tags and keep_only_tags directives instead.
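
For readers following along, here is a minimal sketch of what that manual cleanup might look like. keep_only_tags and remove_tags are standard BasicNewsRecipe attributes, and the "title" and "post-meta" class names come from this thread. The "entry-content" and "sharedaddy" class names are assumptions for illustration only and would need checking against the real page source. The try/except fallback is there only so the snippet can be read and run outside calibre.

```python
try:
    from calibre.web.feeds.news import BasicNewsRecipe
except ImportError:
    # Outside calibre: a stand-in base class so the sketch still runs.
    BasicNewsRecipe = object


class PolitikonManual(BasicNewsRecipe):
    title = 'Politikon (manual cleanup sketch)'
    oldest_article = 7
    max_articles_per_feed = 20
    no_stylesheets = True
    use_embedded_content = False  # download the article pages themselves
    auto_cleanup = False          # cleanup is done by the two lists below

    feeds = [('Politikon', 'http://politikon.es/feed/')]

    # Keep only these elements from each downloaded page.
    keep_only_tags = [
        dict(attrs={'class': 'title'}),
        dict(name='div', attrs={'class': 'post-meta'}),
        # ASSUMPTION: hypothetical class name for the article body.
        dict(name='div', attrs={'class': 'entry-content'}),
    ]

    # Then drop unwanted elements from what was kept.
    remove_tags = [
        # ASSUMPTION: hypothetical class name for a share widget.
        dict(name='div', attrs={'class': 'sharedaddy'}),
    ]
```

With auto_cleanup off, nothing is detected automatically, so the article body has to be listed explicitly in keep_only_tags or it will be dropped along with everything else.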
Old 02-13-2013, 05:40 PM   #4
JonathanL
A little victory

Revisiting this, I think I have achieved what I wanted. I still have some questions, though, and would appreciate anyone shedding some light on the subject.

The solution
All the information I wanted was there in the HTML of the article page. I kept auto_cleanup, since it had been working fine, and used auto_cleanup_keep to include all the tags wrapped around that information, which were nested three levels deep in some cases. One of the tags I wanted was <abbr>; since it is an unusual tag, I replaced it with a wildcard (*), as I suspect it may have been what caused the earlier failure. I also had to match one <span> on an unusual attribute (rel), since it had no 'id' or 'class' and its title was too specific.
To achieve all this I had to set use_embedded_content=False.
Here's how it came out:
class Politikon(BasicNewsRecipe):
    title = u'Politikon - Pol\xedtica, econom\xeda, sociedad y actualidad. (extended)'
    oldest_article = 7
    max_articles_per_feed = 20

    feeds = [
        (u'Politikon - Pol\xedtica, econom\xeda, sociedad y actualidad.', u'http://politikon.es/feed/'),
        (u'Ahora', 'http://politikon.es/category/ahora/feed/'),
        (u'Pol\xedtica', 'http://politikon.es/category/politica/feed/'),
        (u'Econom\xeda', 'http://politikon.es/category/economia/feed/'),
        (u'Internacional', 'http://politikon.es/category/internacional/feed/'),
        (u'Sociedad', 'http://politikon.es/category/sociedad/feed/'),
    ]

    no_stylesheets = True

    use_embedded_content = False

    reverse_article_order = True

    auto_cleanup = True

    auto_cleanup_keep = '//*[@class="title"]|//div[@class="post-meta"]|//span[@class="author vcard"]|//span[@class="fn"]|//span[@rel="author"]|//*[@class="date time published"]'

    extra_css = '.title {font-size: 150%; text-align:center}'

    remove_empty_feeds = True

    ignore_duplicate_articles = {'title', 'url'}
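
To make it easier to see what the auto_cleanup_keep expression above selects, here is a small standalone sketch. It is not how calibre evaluates the expression internally: it uses Python's standard xml.etree.ElementTree, which does not support the "|" union operator, so each alternative is run separately. The sample HTML is made up to mirror the nesting described in the post and is an assumption, not the real Politikon markup.

```python
import xml.etree.ElementTree as ET

# The keep expression from the recipe above.
KEEP = ('//*[@class="title"]|//div[@class="post-meta"]|'
        '//span[@class="author vcard"]|//span[@class="fn"]|'
        '//span[@rel="author"]|//*[@class="date time published"]')

# ASSUMPTION: invented markup that imitates the structure described in the post.
SAMPLE = """<html><body>
<h1 class="title">Example headline</h1>
<div class="post-meta">
  <abbr class="date time published" title="2013-02-13T10:00:00">13 Feb 2013</abbr>
  <span class="author vcard"><span class="fn"><span rel="author">Jane Doe</span></span></span>
</div>
<div class="sidebar">Unrelated</div>
</body></html>"""


def matched_elements(root, union_expr):
    """Return the elements matched by any alternative of a '|' union."""
    seen = []
    for part in union_expr.split('|'):
        # ElementTree needs a relative path, so '//x' becomes './/x'.
        for el in root.findall('.' + part):
            if el not in seen:
                seen.append(el)
    return seen


if __name__ == '__main__':
    root = ET.fromstring(SAMPLE)
    for el in matched_elements(root, KEEP):
        print(el.tag, el.attrib)
```

Note how the wildcard alternatives match on the class attribute regardless of tag name, which is why the abbr element is picked up without naming it.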


The problems
As I said, I had to force Calibre not to use the embedded content, even though everything is there and I can easily identify the bits of information I want in the source of the RSS feed. Applying the same technique to the feed content, however, does not yield the results I want. I don't understand how Calibre picks up and uses the tags from the RSS source. I am not a programmer, and from what I have read I cannot work out enough of what is going on behind the scenes. Keeping a few HTML tags I understand, but the RSS content surely requires more processing.
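
As a rough illustration of where author and date typically sit in an RSS 2.0 feed, the sketch below reads them with the Python standard library only. The sample feed is invented; it assumes the author is carried in a dc:creator element and the date in pubDate, which is common for blog feeds but has not been verified against the Politikon feed.

```python
import xml.etree.ElementTree as ET
from email.utils import parsedate_to_datetime

# Dublin Core namespace, commonly used for <dc:creator> in blog feeds.
NS = {'dc': 'http://purl.org/dc/elements/1.1/'}

# ASSUMPTION: invented sample item, not copied from the real feed.
SAMPLE_RSS = """<rss version="2.0" xmlns:dc="http://purl.org/dc/elements/1.1/">
<channel>
<title>Example feed</title>
<item>
<title>Example headline</title>
<link>http://example.com/post</link>
<pubDate>Wed, 13 Feb 2013 10:00:00 +0000</pubDate>
<dc:creator>Jane Doe</dc:creator>
</item>
</channel>
</rss>"""


def bylines(rss_text):
    """Return (title, author, ISO date) for every item in the feed."""
    root = ET.fromstring(rss_text)
    result = []
    for item in root.iter('item'):
        title = item.findtext('title', default='')
        author = item.findtext('dc:creator', default='', namespaces=NS)
        when = parsedate_to_datetime(item.findtext('pubDate'))
        result.append((title, author, when.strftime('%Y-%m-%d')))
    return result


if __name__ == '__main__':
    for title, author, date in bylines(SAMPLE_RSS):
        print(title, '|', author, '|', date)
```

The point is that these values are separate feed fields, not HTML inside the article body, which may be why an XPath over the embedded content does not find them.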

Thanks for any advice or pointers regarding the RSS issue. Since the data is in the feed, it seems preferable to use it from there.
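
One possible way to use the feed-supplied values inside a recipe is to override parse_feeds and fold the author and date into each article title. This is only a sketch under assumptions: it relies on the parsed article objects exposing author and localtime attributes, it puts the byline in the title rather than on a separate line under it, and the helper title_with_byline is a hypothetical name introduced here. The fallback base class exists only so the snippet runs outside calibre.

```python
try:
    from calibre.web.feeds.news import BasicNewsRecipe
except ImportError:
    # Outside calibre: a stand-in so the sketch can still be run.
    class BasicNewsRecipe(object):
        def parse_feeds(self):
            return []


def title_with_byline(title, author, when):
    """Hypothetical helper: append whichever of author/date is available."""
    parts = [p for p in (author, when) if p]
    return '%s (%s)' % (title, ', '.join(parts)) if parts else title


class PolitikonByline(BasicNewsRecipe):
    title = 'Politikon (byline sketch)'
    oldest_article = 7
    max_articles_per_feed = 20
    feeds = [('Politikon', 'http://politikon.es/feed/')]

    def parse_feeds(self):
        # Let calibre parse the feeds, then adjust each article's title.
        feeds = BasicNewsRecipe.parse_feeds(self)
        for feed in feeds:
            for article in feed.articles:
                # ASSUMPTION: articles carry 'author' and 'localtime'.
                author = getattr(article, 'author', '') or ''
                stamp = getattr(article, 'localtime', None)
                when = stamp.strftime('%d %b %Y') if stamp else ''
                article.title = title_with_byline(article.title, author, when)
        return feeds
```

Because the change happens at feed-parsing time, it should apply whether or not the embedded content is used.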
Tags
author, publication date
