Changing article titles in recipes

tbaac · 12-15-2010, 04:16 PM

I have a new Kindle 3. I have used Calibre before (with my old PRS 505), but it works so well with the Kindle. I set it to download The Guardian and email it to my Kindle, and it magically appeared on my device.

I have done some searching and understand that there is no TOC on the Kindle for Periodicals. I want to keep the recipe as a periodical as that way the Kindle handles new versions. But I'd also like some indication in the article title of which feed the article came from.

The feeds in the built in recipe are:

feeds = [
    ('Front Page', 'http://www.guardian.co.uk/rss'),
    ('Business', 'http://www.guardian.co.uk/business/rss'),
    ('Sport', 'http://www.guardian.co.uk/sport/rss'),
    ('Culture', 'http://www.guardian.co.uk/culture/rss'),
    ('Money', 'http://www.guardian.co.uk/money/rss'),
    ('Life & Style', 'http://www.guardian.co.uk/lifeandstyle/rss'),
    ('Travel', 'http://www.guardian.co.uk/travel/rss'),
    ('Environment', 'http://www.guardian.co.uk/environment/rss'),
    ('Comment', 'http://www.guardian.co.uk/commentisfree/rss'),
]

And I'd like to append the feed name to the end of the article name, as an indication as to which section it came from. Is this easy to do?

I tried adding the url to the end of title in this section:

yield {
    'title': title, 'url': url, 'description': desc,
    'date': strftime('%a, %d %b'),
}

But perhaps I didn't do it right, or perhaps it's the wrong section.

So has anyone tried this in recipes, do they have a better suggestion or is it a silly idea?
I was trying to use the url, but the name of the feed would be better if possible.

Thank you.

Starson17 · 12-16-2010, 02:31 PM

Quote:

Originally Posted by tbaac

I'd also like some indication in the article title of which feed the article came from....

So has anyone tried this in recipes, do they have a better suggestion or is it a silly idea?

There are two places you could modify the article title by inserting the feed title. You could do this in parse_feeds or in populate_article_metadata. Take a look at this post:
https://www.mobileread.com/forums/sho...62&postcount=6
It will show you how to access the article title. I'm pretty sure you can grab the feed title there as well, then concatenate it onto the article title.

You can probably also do this in populate_article_metadata, but it's not used much. Look at the API for info on it.

tbaac · 12-18-2010, 08:20 PM

Thanks for that Starson17.

So you're saying that within parse_feeds I should be able to retrieve and then set the value of article.title?

Hopefully there is a variable feed.title which contains the feed name to add to the article title.

I won't get to my laptop until Monday but I'll give it a go then. Thanks again :-)

tbaac · 12-21-2010, 09:40 AM

Thanks for that. I ended up with this:

    def parse_feeds(self):
        feeds = BasicNewsRecipe.parse_feeds(self)
        for feed in feeds:
            for article in feed.articles[:]:
                feedps = feed.title + ' '
                newtitle = feedps + article.title
                article.title = newtitle
                print 'New article title is: ', article.title
        return feeds

Edit: The indenting looks correct in the above code segment but not when it's shown in the post...

Unfortunately it seems to have no effect and the print line doesn't seem to get run.

I've noticed that in the existing Guardian recipe and in the Wikileaks recipe that someone nicely posted I get an error:

Parsing feed_1/index.html ...
Initial parse failed:
Traceback (most recent call last):
  File "/usr/lib/calibre/calibre/ebooks/oeb/base.py", line 816, in first_pass
    data = etree.fromstring(data, parser=parser)
  File "lxml.etree.pyx", line 2532, in lxml.etree.fromstring (src/lxml/lxml.etree.c:48634)
  File "parser.pxi", line 1545, in lxml.etree._parseMemoryDocument (src/lxml/lxml.etree.c:72245)
  File "parser.pxi", line 1417, in lxml.etree._parseDoc (src/lxml/lxml.etree.c:71041)
  File "parser.pxi", line 898, in lxml.etree._BaseParser._parseUnicodeDoc (src/lxml/lxml.etree.c:67581)
  File "parser.pxi", line 539, in lxml.etree._ParserContext._handleParseResultDoc (src/lxml/lxml.etree.c:64257)
  File "parser.pxi", line 625, in lxml.etree._handleParseResult (src/lxml/lxml.etree.c:65178)
  File "parser.pxi", line 565, in lxml.etree._raiseParseError (src/lxml/lxml.etree.c:64521)
XMLSyntaxError: Opening and ending tag mismatch: hr line 38 and div, line 39, column 7

I'm running the latest version on Ubuntu Linux. Has anyone had this error and found a solution?
The recipes still create output.

Thanks.

Starson17 · 12-21-2010, 10:44 AM

Quote:

Originally Posted by tbaac

Thanks for that. I ended up with this:

Code:

    def parse_feeds (self): 
          feeds = BasicNewsRecipe.parse_feeds(self) 
          for feed in feeds:
              for article in feed.articles[:]:
                 feedps = feed.title + ' '
                 newtitle = feedps + article.title
                 article.title = newtitle
                 print 'New article title is: ', article.title
          return feeds

Edit: The indenting looks correct in the above code segment but not when it's shown in the post...

It's correct now. Use the hash/pound (#) button to wrap code in CODE tags, which preserves the indenting.

Quote:

Unfortunately it seems to have no effect and the print line doesn't seem to get run.

I pasted your code into a working recipe and it worked perfectly. The article titles had the feed title prepended, and the print worked.

Quote:

I've noticed that in the existing Guardian recipe and in the Wikileaks recipe that someone nicely posted I get an error:

Quote:

XMLSyntaxError: Opening and ending tag mismatch: hr line 38 and div, line 39, column 7

I'm running the latest version on Ubuntu Linux. Has anyone had this error and found a solution?
The recipes still create output.

It looks like a bad page with bad html, not a recipe error.
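
For anyone wanting to see why that message appears, here is a minimal sketch. It assumes the cause is an unclosed <hr> inside a <div>, which is normal in HTML but not well-formed XML. It uses Python's standard xml.etree.ElementTree as a stand-in for the lxml call in the traceback, so the exception name and wording differ from calibre's. The helper name is_well_formed and the sample fragments are made up for illustration.

```python
import xml.etree.ElementTree as ET


def is_well_formed(markup):
    """Return True if markup parses as strict XML, False otherwise."""
    try:
        ET.fromstring(markup)
        return True
    except ET.ParseError:
        return False


# An unclosed <hr> is ordinary HTML, but a strict XML parser reaches
# </div> while <hr> is still open and reports a tag mismatch.
html_style = '<div><hr></div>'

# Self-closing the tag makes the same fragment well-formed XML.
xml_style = '<div><hr/></div>'

html_ok = is_well_formed(html_style)
xml_ok = is_well_formed(xml_style)
```

The "Initial parse failed" wording, together with the fact that the recipes still create output, suggests a more forgiving parse is attempted afterwards, but that is an inference from the log rather than something confirmed in this thread.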

tbaac · 12-21-2010, 10:51 AM

Thanks for the swift reply Starson17.

I agree (with my limited experience) that it appeared to not be a recipe problem because it happens with the built in Guardian recipe and with the Wikileaks recipe posted in this forum. It just seemed to be something which in my case was preventing the added code from working.
I'd never noticed the error before; to see it I had to look at the job details.

When you say "page with bad html", do you mean that the html from the RSS feed is bad and that there's nothing that I can do about it?

Starson17 · 12-21-2010, 11:27 AM

Quote:

Originally Posted by tbaac

When you say "page with bad html" you mean that the html from the RSS feed is bad and that there's nothing that I can do about it?

That is what I meant. I've seen that error before and ignored it. I can't tell you if what I meant is in fact correct. It was just a guess, and without doing some tests I don't really know if my guess is correct. Feel free to test more and report back.

So why did your code not work for you?

tbaac · 12-22-2010, 10:53 AM

Hi again Starson17. I was assuming that as the error referred to a problem parsing and the code that I added related to parsing, it had fallen over after retrieving all the articles but prior to doing any parsing. Which recipe did you try the code with if you don't mind me asking? I had the error with the base Guardian recipe, although it did not cause any problem with the retrieval.

I also noticed that the error seems to mention the "div" tag, and the div tag is mentioned in the "remove_tags" part of the recipe, so I wondered if it was a slight (but usually non-problematic) problem with the recipe?

I also had a look at populate_article_metadata but couldn't see whether I'd have access to the feed name at that point.
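
The thread does not settle whether the feed name is available inside populate_article_metadata. One possible workaround, sketched below, is to record it while the feeds are still at hand (for example from a parse_feeds override) and look it up later by article URL. This is only a sketch under assumptions: build_feed_lookup is a hypothetical helper, the objects are plain stand-ins so it runs outside calibre, and it assumes each article exposes a url attribute alongside title.

```python
from types import SimpleNamespace


def build_feed_lookup(feeds):
    """Map each article URL to the title of the feed it came from."""
    lookup = {}
    for feed in feeds:
        for article in feed.articles:
            # If the same URL shows up in several feeds, keep the first one seen.
            lookup.setdefault(article.url, feed.title)
    return lookup


# Stand-in data (hypothetical titles and URLs) for a check outside calibre.
demo_feeds = [
    SimpleNamespace(title='Business', articles=[
        SimpleNamespace(title='Markets rise', url='http://example.com/a'),
    ]),
    SimpleNamespace(title='Sport', articles=[
        SimpleNamespace(title='Match report', url='http://example.com/b'),
        SimpleNamespace(title='Markets rise', url='http://example.com/a'),
    ]),
]

feed_by_url = build_feed_lookup(demo_feeds)
```

In a recipe, the resulting dictionary could be stored on self in parse_feeds and consulted later with the article's URL. Whether the URL seen later matches the one recorded here would need testing.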

Thanks again.

Starson17 · 12-22-2010, 12:03 PM

Quote:

Originally Posted by tbaac

Which recipe did you try the code with if you don't mind me asking?

I keep a test recipe and batch file all set up to paste code into when trying to give assistance. I just pasted your code into the end of whatever was already in that recipe. It happened to be SkepticBlog. I knew it worked before pasting in your code. Feel free to test it yourself.

Code:

#!/usr/bin/env  python
__license__   = 'GPL v3'
import re
from calibre.web.feeds.news import BasicNewsRecipe

class SkepticBlog(BasicNewsRecipe):
    oldest_article        = 5
    max_articles_per_feed = 15
    no_stylesheets        = True
    use_embedded_content  = False
    encoding              = 'utf-8'
    publisher             = 'Skeptic Magazine'
    category              = 'science, pseudoscience'

    def get_browser(self):
        br = BasicNewsRecipe.get_browser(self)
        br.addheaders = [('Accept', 'text/html')]
        return br

    feeds = [(u'SkepticBlog', u'http://skepticblog.org/feed')]

    def parse_feeds (self): 
          feeds = BasicNewsRecipe.parse_feeds(self) 
          for feed in feeds:
              for article in feed.articles[:]:
                 print 'New1 article title is: ', article.title
                 feedps = feed.title + ' '
                 newtitle = feedps + article.title
                 article.title = newtitle
                 print 'New2 article title is: ', article.title
          return feeds
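
The override above puts the feed title in front of the article title, whereas the original question asked for it at the end. A minimal sketch of that variant follows. The helper name append_feed_titles, the ' - ' separator and the demo data are all made up for illustration, and the stand-in objects only mimic the title and articles attributes used in the thread's code.

```python
from types import SimpleNamespace


def append_feed_titles(feeds, separator=' - '):
    """Append each feed's title to the titles of its articles, in place."""
    for feed in feeds:
        suffix = separator + feed.title
        for article in feed.articles:
            # Skip titles that already carry the suffix, so calling the
            # helper twice does not tag an article twice.
            if not article.title.endswith(suffix):
                article.title = article.title + suffix
    return feeds


# Inside a recipe this would be called from the override, e.g.:
#     def parse_feeds(self):
#         feeds = BasicNewsRecipe.parse_feeds(self)
#         return append_feed_titles(feeds)

# Stand-in objects (hypothetical data) for a quick check outside calibre.
demo = [
    SimpleNamespace(title='Sport', articles=[
        SimpleNamespace(title='Match report'),
    ]),
]
append_feed_titles(demo)
```

With the feed name at the end, the article's own title stays readable even if the Kindle truncates long titles, which may or may not be what you want.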
