#1 |
Member
Posts: 21
Karma: 10
Join Date: Oct 2014
Device: Android
|
My first (non-RSS) recipe is almost working. It is a little odd because it is not pulling up an article per se but IMDB search results.
Setting auto_cleanup to False made me appreciate its value. Now that I use it, I find I need auto_cleanup_keep, but for some reason the line
Code:

    auto_cleanup_keep = "//div[@id='titleStoryLine']"

also keeps //div[@id='titleDidYouKnow'], as noted in the comment in the recipe:
Code:

    from calibre.web.feeds.news import BasicNewsRecipe

    class IMDBTopMoviesHi9999(BasicNewsRecipe):
        language = 'en'
        __author__ = 'ireadtheinternet'
        oldest_article = 999999
        max_articles_per_feed = 9999
        no_stylesheets = True
        no_javascript = True
        auto_cleanup = True
        auto_cleanup_keep = "//div[@id='titleStoryLine']"
        # For some unknown reason this also keeps //div[@id='titleDidYouKnow']
        conversion_options = {'linearize_tables' : True}

        def parse_index(self):
            toc = self.index_to_soup('http://www.imdb.com/search/title?languages=hi&num_votes=500,&production_status=released&sort=year,desc&title_type=feature&user_rating=7.0,10')
            articles = []
            for row in toc.findAll('td', attrs={'class':'title'}):
                release_year = self.tag_to_string(row.find('span', attrs={'class':'year_type'}))
                link = row.find('a')
                title = self.tag_to_string(link) + " " + release_year
                url = 'http://www.imdb.com' + link['href']
                self.log('Found article:', link)
                self.log('\t', url)
                articles.append({'title':title, 'url':url, 'description':''})
            self.title = self.tag_to_string(toc.find('h1'))
            # only one section, so...
            return [(self.title, articles)]

|
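One way to narrow down the auto_cleanup_keep puzzle is to check the XPath expression on its own, outside calibre. The following is a minimal sketch using the standard library's ElementTree on a made-up snippet. The markup and the names `SAMPLE` and `matched_ids` are assumptions for illustration, not IMDb's real page or anything calibre does internally. If the expression alone selects only the storyline div, the extra content is presumably coming from auto_cleanup's own heuristics rather than from the expression.

```python
import xml.etree.ElementTree as ET

# Made-up, well-formed stand-in for a title page; the real IMDb markup differs.
SAMPLE = """
<html><body>
  <div id="titleStoryLine"><p>Storyline text</p></div>
  <div id="titleDidYouKnow"><p>Trivia text</p></div>
</body></html>
"""

root = ET.fromstring(SAMPLE)

# ElementTree supports only a subset of XPath and expects a relative path,
# so the recipe's "//div[...]" is written as ".//div[...]" here.
matches = root.findall(".//div[@id='titleStoryLine']")
matched_ids = [div.get("id") for div in matches]
print(matched_ids)
```

On this sample only the div with id titleStoryLine is selected, so the expression itself does not pull in titleDidYouKnow.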
#2 |
creator of calibre
Posts: 45,345
Karma: 27182818
Join Date: Oct 2006
Location: Mumbai, India
Device: Various
|
auto_cleanup is a little hard to control. You'll probably find it easier to just use keep_only_tags and remove_tags.
|
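For anyone following along, the suggestion above might look roughly like this in a recipe. This is only a sketch: the class name `CleanupSketch` and the class value 'example-unwanted' are hypothetical, the id 'titleStoryLine' is borrowed from the first post, and the fallback base class exists only so the snippet can be loaded outside calibre. auto_cleanup is simply left at its default of False.

```python
try:
    from calibre.web.feeds.news import BasicNewsRecipe
except ImportError:
    # Stand-in so the sketch can be read and imported outside calibre.
    class BasicNewsRecipe(object):
        pass


class CleanupSketch(BasicNewsRecipe):  # hypothetical recipe name
    # auto_cleanup is not set, so it stays at its default (False)
    # and the explicit rules below decide what is kept.
    no_stylesheets = True

    # Keep only the element(s) that hold the content you want.
    keep_only_tags = [
        dict(name='div', attrs={'id': 'titleStoryLine'}),
    ]

    # Then drop anything unwanted that is still inside what was kept.
    remove_tags = [
        dict(name='div', attrs={'class': 'example-unwanted'}),  # hypothetical class
    ]
```

Each entry is a plain dict naming a tag and the attributes to match, so the lists can be adjusted one element at a time while checking the output.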
#3 |
Member
Posts: 21
Karma: 10
Join Date: Oct 2014
Device: Android
|
Thanks, Kovid, this helped a lot. I am still working on the overall recipe, but this part is out of the way now.
For no particular reason, I thought I would post what I ended up with. Not the whole recipe, because it is still in progress, just the attributes mentioned. Code:

    keep_only_tags = [
        dict(name='div', attrs={'id': ['title-overview-widget']}),
    ]
    remove_tags = [
        dict(name='div', attrs={'id': ['meterChangeRow']}),
        dict(name='div', attrs={'id': ['meterHeaderBox']}),
        dict(name='div', attrs={'id': ['meterSeeMoreRow']}),
        dict(name='div', attrs={'class': ['star-box-rating-widget']}),
        dict(name='div', attrs={'class': ['star-box-details']}),
        dict(name='td', attrs={'id': ['overview-bottom']}),
        dict(name='div', attrs={'class': ['pro-title-link text-center']})
    ]

Last edited by ireadtheinternet; 10-31-2014 at 07:57 PM. |
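The remove_tags list above repeats the same dict shape seven times. If it keeps growing, a small helper could build it instead. This is just a sketch: `specs` is a hypothetical helper, not part of calibre, and it is meant to produce exactly the same list as the hand-written version.

```python
def specs(name, attr, values):
    """Return one tag spec dict per value (hypothetical helper, not a calibre API)."""
    return [dict(name=name, attrs={attr: [value]}) for value in values]


# Same entries, in the same order, as the hand-written remove_tags list.
remove_tags = (
    specs('div', 'id', ['meterChangeRow', 'meterHeaderBox', 'meterSeeMoreRow'])
    + specs('div', 'class', ['star-box-rating-widget', 'star-box-details'])
    + specs('td', 'id', ['overview-bottom'])
    + specs('div', 'class', ['pro-title-link text-center'])
)
```

Inside a recipe the helper would sit at module level, above the class, with the assignment to remove_tags moved into the class body.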