Help for populate_article_metadata

ajmoraal · 11-15-2010, 02:53 PM

I'm working on a recipe for a site that has the publication date and author only on the article pages, not on the index page.
So I thought I could override populate_article_metadata to set this data on the article object, like this:

Code:

def populate_article_metadata(self, article, soup, first):
	article.date = soup.find('div', {"class": "date"}).contents[0].strip()
	article.author = soup.find('div', {"class": "author"}).contents[0].strip()

It doesn't work, however: I now get the following error for every article it tries to download:

Code:

3% Article download failed: u'Some article'
Could not fetch link http://www.somedomain.com/somearticle

Any idea what I'm doing wrong?

Starson17 · 11-15-2010, 03:24 PM

Quote:

Originally Posted by ajmoraal

It doesn't work ..
Any idea what I'm doing wrong?

populate_article_metadata is used in the _postprocess_html method of news.py. At that point, I'm pretty sure the feeds have already been parsed and the index page for each feed has already been constructed. The news system is at the end of processing an article page. I don't think you can expect to set article.date for the index page at this point and have it appear on the index page for the associated article. (If that's what you are trying to do)

ajmoraal · 11-15-2010, 03:38 PM

Quote:

Originally Posted by Starson17

I don't think you can expect to set article.date for the index page at this point and have it appear on the index page for the associated article. (If that's what you are trying to do)

Sort of. I was trying to filter out all articles that are older than 2 days. As there's no indication of the age on the index page, I had to take it from the article page.

Anyway, thanks for the clarification.
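For reference, the "older than 2 days" test itself needs nothing from calibre once the publication date is available as a datetime. A minimal sketch, assuming the date has already been parsed; the name is_too_old and its parameters are hypothetical, not part of the recipe API:

```python
from datetime import datetime, timedelta


def is_too_old(published, now=None, max_age_days=2):
    """Return True if `published` lies more than `max_age_days` before `now`.

    `published` and `now` are assumed to be naive datetimes in the same
    timezone; `now` defaults to the current local time.
    """
    if now is None:
        now = datetime.now()
    return now - published > timedelta(days=max_age_days)
```

How an article that fails this check is then dropped from the download is a separate question and depends on where in the recipe the check runs.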

Starson17 · 11-15-2010, 03:53 PM

Quote:

Originally Posted by ajmoraal

Sort of. I was trying to filter out all articles that are older than 2 days. As there's no indication of the age on the index page, I had to take it from the article page.

Anyway, thanks for the clarification.

I'm not sure I clarified it much, but if it's of any help, I've never seen populate_article_metadata used in any recipe. The Article object has these elements:

Code:

Title       : 
URL         :
Author      :
Summary     :
Date        :
Has content :

I'm not totally sure how you would use populate_article_metadata. For example, I wouldn't think you could change the URL, since you'd have already used it to download the article.

I have seen information passed from the index page to the article, but not the other way around. It should be possible, but I'd think you would have to construct the feed manually using parse_feeds by first parsing the feed, then grabbing any additional info you want from each article page, then building the Feed object you want and returning that.

Edit: Perhaps Kovid will give us some insight.

kovidgoyal · 11-16-2010, 11:57 AM

The use of populate_article_metadata is correct. You need to post the full error message for me to help you (i.e. the traceback, not just the message saying the article failed to download).

ajmoraal · 11-16-2010, 03:34 PM

Unfortunately it doesn't generate a stack trace, which is why I'm a bit stuck. It just prints a warning for every article and then goes on to the next one.

I've attached the recipe, in case you want to try it.
I'm running it on Calibre 0.7.7 as packaged in Debian Squeeze.

kovidgoyal · 11-16-2010, 03:38 PM

Stack traces are only printed if you run in verbose mode (i.e. with -vv).

ajmoraal · 11-16-2010, 03:56 PM

Thanks, I managed to solve it now by looking at the stack trace.
The issue was that I was searching for HTML tags that had already been stripped (via remove_tags_before and remove_tags_after), so soup.find returned None and contents[0] was accessed on a NoneType.
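One way to avoid this kind of failure is to check the result of soup.find before touching contents. A minimal sketch, not the code from the attached recipe: the helpers tag_text and fill_author are hypothetical names, and the only assumption about soup is that it has a BeautifulSoup-style find(name, attrs) that returns None when nothing matches.

```python
def tag_text(soup, name, attrs):
    """Return the stripped first child of the first matching tag, or None.

    None is returned when the tag is missing (for example because it was
    already stripped from the page) or when it has no children.
    """
    tag = soup.find(name, attrs)
    if tag is None or not tag.contents:
        return None
    return str(tag.contents[0]).strip()


def fill_author(article, soup):
    """Set article.author only if the author div is actually present."""
    author = tag_text(soup, 'div', {'class': 'author'})
    if author is not None:
        article.author = author
```

With a guard like this, a page whose author div has been removed simply leaves article.author untouched instead of raising an exception.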

Btw, shouldn't "ebook-convert -?" return info about the -vv option?

Starson17 · 11-16-2010, 03:57 PM

Quote:

Originally Posted by kovidgoyal

The use of populate_article_metadata is correct

This is interesting. I had never seen populate_article_metadata used in anything, so I played around with it a bit. You certainly can modify the index page. I easily changed article.title and article.text_summary. Those changes appeared on the index page.

I could also change article.author (although I don't see it used anywhere, so I'm not sure why you would want to change it.) When I tried to change article.date, it seemed to accept it (no errors), but it didn't appear on the index page. I used:

Code:

article.date = datetime.datetime.now()

When I intentionally used the wrong date format, I got "Could not fetch link" errors. I suspect the date format was wrong when using:

Code:

article.date = soup.find('div', {"class": "date"}).contents[0].strip()

Even if the date format were correct, I'm not sure it would change the index page. Nothing I did changed it.
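If the recipe should hold a datetime rather than the raw scraped string (an assumption, based only on the datetime.datetime.now() test above), the string can be converted first. A minimal sketch: parse_article_date is a hypothetical helper, and the two formats are placeholders that would have to be replaced with whatever the site actually prints.

```python
from datetime import datetime


def parse_article_date(text, formats=('%d-%m-%Y', '%Y-%m-%d')):
    """Try each strptime format in turn; return a datetime, or None if none fit."""
    cleaned = text.strip()
    for fmt in formats:
        try:
            return datetime.strptime(cleaned, fmt)
        except ValueError:
            continue  # this format did not match, try the next one
    return None
```

Returning None for an unparseable string lets the caller leave the date alone instead of assigning a value in an unexpected format.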

kovidgoyal · 11-16-2010, 04:12 PM

@ajmoraal: ebook-convert test.recipe .epub -h

is what you are looking for.

@starson17: article author info is used when creating special periodical downloads for the Kindle and/or SONY. As for date not changing, could be any number of things, I haven't got the time right now to look into it

Starson17 · 11-16-2010, 04:24 PM

Quote:

Originally Posted by kovidgoyal

@starson17: article author info is used when creating special periodical downloads for the Kindle and/or SONY.

Thanks for that bit of info.

Quote:

As for date not changing, could be any number of things, I haven't got the time right now to look into it

I have no immediate use for this anyway. For all I know, I did something wrong during the testing.
