#1
Member
Posts: 21
Karma: 10
Join Date: Oct 2014
Device: Android
My first (non-RSS) recipe is almost working. It is a little odd because it is not pulling up an article per se but IMDB search results.
When I had auto_cleanup set to False, it made me appreciate the value of auto_cleanup. So now that I use it, I find I need auto_cleanup_keep, but for some reason the line
Code:
auto_cleanup_keep = "//div[@id='titleStoryLine']"
also keeps //div[@id='titleDidYouKnow']. The recipe so far:
Code:
from calibre.web.feeds.news import BasicNewsRecipe


class IMDBTopMoviesHi9999(BasicNewsRecipe):
    language = 'en'
    __author__ = 'ireadtheinternet'
    oldest_article = 999999
    max_articles_per_feed = 9999
    no_stylesheets = True
    no_javascript = True
    auto_cleanup = True
    auto_cleanup_keep = "//div[@id='titleStoryLine']"
    # For some unknown reason this also keeps //div[@id='titleDidYouKnow']
    conversion_options = {'linearize_tables': True}

    def parse_index(self):
        # Fetch the IMDB search results page and treat each result row as an article.
        toc = self.index_to_soup('http://www.imdb.com/search/title?languages=hi&num_votes=500,&production_status=released&sort=year,desc&title_type=feature&user_rating=7.0,10')
        articles = []
        for row in toc.findAll('td', attrs={'class': 'title'}):
            release_year = self.tag_to_string(row.find('span', attrs={'class': 'year_type'}))
            link = row.find('a')
            title = self.tag_to_string(link) + " " + release_year
            # Links in the results are site-relative, so prepend the host.
            url = 'http://www.imdb.com' + link['href']
            self.log('Found article:', link)
            self.log('\t', url)
            articles.append({'title': title, 'url': url, 'description': ''})
        # Use the page heading as the title of the generated book.
        self.title = self.tag_to_string(toc.find('h1'))
        # only one section, so...
        return [(self.title, articles)]
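Editorial aside: the search URL in parse_index packs several filters into one long query string, which is easy to mistype. A minimal sketch, assuming nothing beyond the Python standard library, that rebuilds the same URL from a dict so individual filters can be changed in one place. The names BASE, SEARCH_FILTERS and build_search_url are hypothetical and not part of the original recipe.

```python
from urllib.parse import urlencode

# Host and path copied from the URL used in the recipe.
BASE = 'http://www.imdb.com/search/title'

# Filter names and values copied verbatim from the recipe's URL.
# The dict layout itself is only an illustration (hypothetical name).
SEARCH_FILTERS = {
    'languages': 'hi',
    'num_votes': '500,',
    'production_status': 'released',
    'sort': 'year,desc',
    'title_type': 'feature',
    'user_rating': '7.0,10',
}


def build_search_url(filters):
    """Join the filters into a query string appended to BASE."""
    # safe=',' leaves commas unescaped so the result matches the
    # hand-written URL character for character.
    return BASE + '?' + urlencode(filters, safe=',')


print(build_search_url(SEARCH_FILTERS))
```

In the recipe, the result of build_search_url(SEARCH_FILTERS) could be passed to self.index_to_soup in place of the literal string.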
#2
creator of calibre
Posts: 45,681
Karma: 28549304
Join Date: Oct 2006
Location: Mumbai, India
Device: Various
auto_cleanup is a little hard to control. You'll probably find it easier to just use keep_only_tags and remove_tags.
#3
Member
Posts: 21
Karma: 10
Join Date: Oct 2014
Device: Android
Thanks, Kovid, this helped a lot. I am still working on the overall recipe, but this part is out of the way now.
For what it's worth, here is what I ended up with. Not the whole recipe, because it is still in progress, just the attributes mentioned.
Code:
keep_only_tags = [
    dict(name='div', attrs={'id': ['title-overview-widget']}),
]
remove_tags = [
    dict(name='div', attrs={'id': ['meterChangeRow']}),
    dict(name='div', attrs={'id': ['meterHeaderBox']}),
    dict(name='div', attrs={'id': ['meterSeeMoreRow']}),
    dict(name='div', attrs={'class': ['star-box-rating-widget']}),
    dict(name='div', attrs={'class': ['star-box-details']}),
    dict(name='td', attrs={'id': ['overview-bottom']}),
    dict(name='div', attrs={'class': ['pro-title-link text-center']})
]
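Editorial aside: for anyone unsure how the two lists interact, here is a rough standard-library sketch of the keep-then-remove idea. It is not calibre's implementation. The PAGE markup, the matches and keep_then_remove helpers, and the exact-match rule for attribute values are all assumptions made for illustration; only the dict(name=..., attrs=...) shape comes from the post above.

```python
import xml.etree.ElementTree as ET

# Hypothetical, heavily simplified stand-in for an IMDB title page.
PAGE = """
<html><body>
  <div id="nav">site navigation</div>
  <div id="title-overview-widget">
    <h1>Some Film</h1>
    <div class="star-box-details">ratings clutter</div>
    <p>Plot summary.</p>
  </div>
  <div id="footer">footer links</div>
</body></html>
"""

keep_only_tags = [dict(name='div', attrs={'id': ['title-overview-widget']})]
remove_tags = [dict(name='div', attrs={'class': ['star-box-details']})]


def matches(elem, spec):
    """True if elem has the tag name and one of the listed values per attribute."""
    if elem.tag != spec['name']:
        return False
    return all(elem.get(key) in allowed for key, allowed in spec['attrs'].items())


def keep_then_remove(root, keep_specs, remove_specs):
    """Select the subtrees matching keep_specs, then prune remove_specs from them."""
    kept = [e for e in root.iter() if any(matches(e, s) for s in keep_specs)]
    for subtree in kept:
        # ElementTree has no parent pointers, so build a child -> parent map.
        parents = {child: parent for parent in subtree.iter() for child in parent}
        for elem in list(subtree.iter()):
            if elem is not subtree and any(matches(elem, s) for s in remove_specs):
                parents[elem].remove(elem)
    return kept


kept = keep_then_remove(ET.fromstring(PAGE), keep_only_tags, remove_tags)
print(''.join(kept[0].itertext()))
```

The point of the sketch is the ordering: everything outside the kept div is discarded first, so remove_tags only needs to list clutter that lives inside it.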
Last edited by ireadtheinternet; 10-31-2014 at 08:57 PM.