Article Dates with parse_index

EnergyLens · 03-28-2010, 09:02 PM

I've been hunting for a recipe example that takes a date parsed from html and converts it into the proper format so that the article date displays correctly.

It seems that all the examples append
'date':''

I can't find anything in the documentation that specifies what format to use, and it doesn't work when I append, for example:

articles.append({'title':title, 'url':url, 'description':desc, 'date':'Thursday, July 12, 2007'})

kovidgoyal · 03-29-2010, 05:10 AM

The date you set above is, IIRC, used only in the index of articles in any given section. What date are you trying to set? The date used in the title of the downloaded ebook?

EnergyLens · 03-29-2010, 07:32 AM

I am trying to set the date for the article that is shown after the title in the article index... but it always shows the time of creation rather than a date that I attempt to set. I assumed that I had an incorrect date format and that was why it was not being set.

Ultimately, I am hoping that the dates I set for the articles can be used by the recipe (oldest_article) to determine whether or not to include an article from the "feed" I've created with parse_index.

kovidgoyal · 03-29-2010, 12:55 PM

oldest_article is only used for RSS processing. If you are writing a parse_index yourself, just compare the dates and skip the old articles yourself.
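
A minimal sketch of that manual comparison, assuming each article dict carries a 'date' string in the 'Thursday, July 12, 2007' style quoted earlier in the thread. The helper name keep_recent and its parameters are hypothetical and not part of calibre; in a recipe it would be called on the articles list before parse_index returns.

```python
from datetime import datetime, timedelta


def keep_recent(articles, oldest_article_days, now=None,
                date_format='%A, %B %d, %Y'):
    """Return the articles whose 'date' falls within the last N days.

    keep_recent is a hypothetical helper, not a calibre API.
    """
    now = now or datetime.now()
    cutoff = now - timedelta(days=oldest_article_days)
    recent = []
    for article in articles:
        try:
            published = datetime.strptime(article['date'], date_format)
        except (KeyError, ValueError):
            # Missing or unparseable date: keep the article rather than
            # drop it silently.
            recent.append(article)
            continue
        if published >= cutoff:
            recent.append(article)
    return recent
```

Keeping articles whose date cannot be parsed is a design choice for this sketch; a stricter recipe could skip them instead.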

EnergyLens · 04-13-2010, 11:52 AM

Back to the original topic, it doesn't appear that:

articles.append({'title':title, 'url':url, 'description':desc, 'date':'Thursday, July 12, 2007'})

actually sets the date in the index of articles. The index of articles, at least using parse_index, *always* uses the date/time at the moment of creation.

------------------------------
FYI: I'm taking a directory of saved web pages and using ebook-convert to convert them all into an epub:
------------------------------

#!/usr/bin/env python

__license__ = 'GPL v3'
'''
Directory to Epub
'''
import string
import time

from calibre.web.feeds.news import BasicNewsRecipe
from calibre.ebooks.BeautifulSoup import Tag, NavigableString

class ImportDirectory(BasicNewsRecipe):

    title = 'Energy Bulletin'
    description = 'EnergyBulletin.net is a clearinghouse for information regarding the peak in global energy supply.'
    INDEX = 'http://localhost/~myaccount/Scrapbook/'
    language = 'en'
    keep_only_tags = [dict(id='main_content')]
    remove_tags = [dict(name='div', attrs={'class':'links'})]

    no_stylesheets = True

    def parse_index(self):
        articles = []

        # Directory listing of the saved pages
        soup = self.index_to_soup(self.INDEX)

        feeds = []
        for node in soup.findAll('tr'):
            # Only rows that show a folder icon hold a saved article
            x = node.find('img',attrs={'src':'/icons/folder.gif'})
            a = node.find('a', href=True)
            if a is not None and x is not None:
                # Build the article URL from INDEX instead of a second hard-coded path
                url = self.INDEX + a['href']
                desc = None
                newsoup = self.index_to_soup(url)
                if newsoup is not None:
                    atitle = newsoup.find('title')
                    title = self.tag_to_string(atitle)
                    # Page date such as "Jul 12 2007", reformatted for the article index
                    adate = newsoup.find('span',attrs={'class':'date-display-single'})
                    pubdt = self.tag_to_string(adate)
                    mytime = time.strptime(pubdt,"%b %d %Y")
                    dt = time.strftime('%A, %d %B, %Y',mytime)
                    origin = newsoup.find('div',attrs={'class':'origin'})
                    author = self.tag_to_string(origin)
                    self.log('\tFound article ',title,' at ', url, 'origin: ',author)
                    articles.append({'title':title, 'url':url, 'description':'','date':dt})

        # A single section containing every article found
        feeds.append(('Articles', articles))

        return feeds
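
In the recipe above, time.strptime raises ValueError when the scraped text is empty or does not match "%b %d %Y", which would abort parse_index. A minimal sketch of a more defensive conversion follows; format_article_date is a hypothetical helper, not a calibre function, and the two format strings are simply the ones used in the recipe.

```python
import time


def format_article_date(raw, in_format='%b %d %Y',
                        out_format='%A, %d %B, %Y'):
    """Convert a scraped date string to display form.

    Returns '' when the input is missing or cannot be parsed.
    format_article_date is a hypothetical helper, not a calibre API.
    """
    if not raw:
        return ''
    try:
        parsed = time.strptime(raw.strip(), in_format)
    except ValueError:
        # Unexpected format: fall back to an empty date instead of failing.
        return ''
    return time.strftime(out_format, parsed)
```

With the default C locale, format_article_date('Jul 12 2007') gives 'Thursday, 12 July, 2007'. This only makes the conversion safer; it does not change how the article index displays the value.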

kovidgoyal · 04-14-2010, 01:56 AM

Maybe that's the case, I'll have to look at the code to be sure. Open a ticket and I'll get to it when I have some time.

EnergyLens · 04-21-2010, 10:13 PM

I'm beginning to suspect that ebook-convert also ignores the --level1-toc= and related directives when parse_index is used. I've gotten --levelX-toc to work fine when converting .txt documents and individual .html documents to .epub, but cannot make it work with recipes that use parse_index.

Perhaps I'm not understanding something, but I expected it to build a TOC from the XPath matches in each article returned in feeds.

P.S. Am I right that --foo= is the only command line argument that ebook-convert will accept apart from those documented? I was trying to pass a command line date to my recipes and just happened to use --foo= the first time, and it worked. All other attempts to pass command line variables cause ebook-convert to stop with an exception that there is no such option. Ah, the power of FOO! (Please don't remove --foo=, as that is my only way to pass my own command line arguments !-)
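
One possible alternative to relying on --foo=, not taken from this thread: read the value from an environment variable. This is a sketch under the untested assumption that the recipe code runs in a process that inherits the environment of the shell that launched ebook-convert. The variable name RECIPE_CUTOFF_DATE and the helper cutoff_from_env are hypothetical.

```python
import os
from datetime import datetime


def cutoff_from_env(var_name='RECIPE_CUTOFF_DATE', default=None,
                    environ=None):
    """Read a YYYY-MM-DD date from an environment variable.

    Returns `default` when the variable is unset or malformed.
    Both the variable name and this helper are hypothetical.
    """
    environ = os.environ if environ is None else environ
    raw = environ.get(var_name)
    if not raw:
        return default
    try:
        return datetime.strptime(raw, '%Y-%m-%d')
    except ValueError:
        return default
```

The recipe would call cutoff_from_env() inside parse_index and compare each article's date against the result.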

EnergyLens's post of 04-13-2010, 11:52 AM was last edited 04-14-2010 at 07:34 AM.
