#1 |
Member
Posts: 22
Karma: 10
Join Date: Aug 2015
Device: Kobo Aura H2O
|
Article title appears at head
Hello, I modified the Cracked.com recipe to customise it but now the heading for each article appears at the end. I've tried editing it but can't get it to work. Any advice?
Thanks. Code:
from calibre.web.feeds.news import BasicNewsRecipe


class Cracked(BasicNewsRecipe):
    title = u'Cracked.com Weekly download'
    __author__ = 'Update Sept 2017'
    language = 'en'
    description = "America's Only HumorSite since 1958"
    publisher = 'Cracked'
    category = 'comedy, lists'
    oldest_article = 9  # days
    max_articles_per_feed = 100
    no_stylesheets = True
    encoding = 'utf-8'
    remove_javascript = True
    use_embedded_content = False
    recursions = 11
    remove_attributes = ['size', 'style']

    feeds = [(u'Articles', u'http://feeds.feedburner.com/CrackedRSS/')]

    conversion_options = {
        'comment': description,
        'tags': category,
        'publisher': publisher,
        'language': language
    }

    keep_only_tags = [
        dict(name='article', attrs={'class': 'module article dropShadowBottomCurved'}),
        dict(name='article', attrs={'class': 'module blog dropShadowBottomCurved'}),
        dict(name='div', attrs={'class': 'content-content'}),
        dict(name='div', attrs={'class': 'content-header'})
    ]

    remove_tags = [
        dict(name='section', attrs={'class': ['socialTools', 'quickFixModule', 'continue-reading']}),
        dict(attrs={'class': ['socialShareAfterContent', 'socialShareModule', 'continue-reading', 'social-share-bottom list-inline']}),
        dict(name='div', attrs={'id': ['relatedArticle']}),
        dict(name='div', attrs={'class': ['bannerAd hidden-sm hidden-md hidden-lg introAd']})
    ]

    def is_link_wanted(self, url, a):
        return a['class'] == 'next' and a.findParent('nav', attrs={'class': 'PaginationContent'}) is not None

    def preprocess_html(self, soup):
        for img in soup.findAll('img', attrs={'data-img': True}):
            img['src'] = img['data-img']
        for img in soup.findAll('img', attrs={'data-original': True}):
            img['src'] = img['data-original']
        return soup

    def postprocess_html(self, soup, first_fetch):
        for div in soup.findAll(attrs={'class': 'PaginationContent'}):
            div.extract()
        if not first_fetch:
            for div in soup.findAll(attrs={'class': 'meta'}):
                div.extract()
            for h1 in soup.findAll('h1'):
                h1.extract()
            for title in soup.findAll('title'):
                title.extract()
        return soup
#2 |
Wizard
Posts: 1,166
Karma: 1410083
Join Date: Nov 2010
Location: Germany
Device: Sony PRS-650
|
No, they don't appear at the end. You have two issues.

First, take a look at postprocess_html. Code:
for h1 in soup.findAll('h1'):
    h1.extract()
This loop strips the h1 headings, and the article title with them.

Second, the sequence. This is the first time I have seen a mixed-up sequence, and I learned a little lesson from it. You can move the parts around by rearranging the entries in keep_only_tags. Put content-header first and content-content after it: Code:
keep_only_tags = [
    dict(name='div', attrs={'class': 'content-header'}),
    dict(name='div', attrs={'class': 'content-content'}),
    dict(name='article', attrs={'class': 'module article dropShadowBottomCurved'}),
    dict(name='article', attrs={'class': 'module blog dropShadowBottomCurved'}),
]
Code:
keep_only_tags = [
    dict(name='div', attrs={'class': [
        'content-content',
        'content-header',
    ]}),
    dict(name='article', attrs={'class': [
        'module article dropShadowBottomCurved',
        'module blog dropShadowBottomCurved',
    ]}),
]
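As a rough illustration of why the order of the entries matters, here is a small sketch that does not use calibre at all. It assumes that keep_only_tags is applied one entry at a time, in list order, and that the matches of a single entry keep the order they have in the page. The helpers `matches` and `keep_only` and the sample `page` are hypothetical stand-ins for illustration, not calibre's API.

```python
def matches(element, spec):
    """Return True if a (tag name, class) pair satisfies one spec."""
    name, css_class = element
    wanted = spec["attrs"]["class"]
    if isinstance(wanted, str):
        wanted = [wanted]  # a single class behaves like a one-item list
    return name == spec["name"] and css_class in wanted


def keep_only(elements, specs):
    """Collect matches spec by spec; within one spec, keep page order."""
    kept = []
    for spec in specs:
        for element in elements:
            if matches(element, spec):
                kept.append(element)
    return kept


# Hypothetical page in which the header precedes the article body.
page = [
    ("div", "content-header"),
    ("div", "sidebar"),
    ("div", "content-content"),
]

# Body listed before header, as in the recipe from the first post.
body_first = [
    dict(name="div", attrs={"class": "content-content"}),
    dict(name="div", attrs={"class": "content-header"}),
]
# Header listed before body.
header_first = list(reversed(body_first))
# Both classes in a single spec: page order decides.
one_spec = [
    dict(name="div", attrs={"class": ["content-content", "content-header"]}),
]

result_body_first = keep_only(page, body_first)
result_header_first = keep_only(page, header_first)
result_one_spec = keep_only(page, one_spec)

print(result_body_first)    # body, then header
print(result_header_first)  # header, then body
print(result_one_spec)      # header, then body (page order)
```

Under these assumptions, listing content-content before content-header as separate entries pushes the header to the end, while listing the header first, or putting both classes into one entry, keeps the header on top.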
#3 |
Member
Posts: 22
Karma: 10
Join Date: Aug 2015
Device: Kobo Aura H2O
|
Thank you!
I did not know that; the recipe was a bit of a Frankenstein's monster of code. I have tidied up some of the superfluous metadata, in case anyone else wants it: Code:
from calibre.web.feeds.news import BasicNewsRecipe


class Cracked(BasicNewsRecipe):
    title = u'Cracked.com Weekly download'
    __author__ = 'Update Sept 2017'
    language = 'en'
    description = "America's Only HumorSite since 1958"
    publisher = 'Cracked'
    category = 'comedy, lists'
    oldest_article = 9  # days
    max_articles_per_feed = 100
    no_stylesheets = True
    encoding = 'utf-8'
    remove_javascript = True
    use_embedded_content = False
    recursions = 11
    remove_attributes = ['size', 'style']

    feeds = [(u'Articles', u'http://feeds.feedburner.com/CrackedRSS/')]

    conversion_options = {
        'comment': description,
        'tags': category,
        'publisher': publisher,
        'language': language
    }

    # Both div classes share one entry, and so do both article classes.
    keep_only_tags = [
        dict(name='div', attrs={'class': [
            'content-content',
            'content-header',
        ]}),
        dict(name='article', attrs={'class': [
            'module article dropShadowBottomCurved',
            'module blog dropShadowBottomCurved',
        ]}),
    ]

    remove_tags = [
        dict(name='section', attrs={'class': ['socialTools', 'quickFixModule', 'continue-reading']}),
        dict(attrs={'class': ['socialShareAfterContent', 'socialShareModule', 'continue-reading', 'social-share-bottom list-inline']}),
        dict(name='div', attrs={'id': ['relatedArticle']}),
        dict(name='ul', attrs={'id': [
            'breadcrumbs',
            'socialShare',
        ]}),
        dict(name='div', attrs={'class': ['bannerAd hidden-sm hidden-md hidden-lg introAd']})
    ]

    def is_link_wanted(self, url, a):
        # Follow only the "next" link inside the pagination nav.
        return a['class'] == 'next' and a.findParent('nav', attrs={'class': 'PaginationContent'}) is not None

    def preprocess_html(self, soup):
        # Copy the image URLs held in data-img / data-original into src.
        for img in soup.findAll('img', attrs={'data-img': True}):
            img['src'] = img['data-img']
        for img in soup.findAll('img', attrs={'data-original': True}):
            img['src'] = img['data-original']
        return soup

    def postprocess_html(self, soup, first_fetch):
        # Drop the pagination block on every page.
        for div in soup.findAll(attrs={'class': 'PaginationContent'}):
            div.extract()
        # On pages after the first, also drop the 'meta' elements.
        if not first_fetch:
            for div in soup.findAll(attrs={'class': 'meta'}):
                div.extract()
        return soup
#4 |
Wizard
Posts: 1,166
Karma: 1410083
Join Date: Nov 2010
Location: Germany
Device: Sony PRS-650
|
You are welcome.
Best regards, DD