Article title appears at head

Phoebus · 09-04-2017, 08:33 AM

Hello, I modified the Cracked.com recipe to customise it but now the heading for each article appears at the end. I've tried editing it but can't get it to work. Any advice?

Thanks.

Code:

from calibre.web.feeds.news import BasicNewsRecipe


class Cracked(BasicNewsRecipe):
    title = u'Cracked.com Weekly download'
    __author__ = 'Update Sept 2017'
    language = 'en'
    description = "America's Only HumorSite since 1958"
    publisher = 'Cracked'
    category = 'comedy, lists'
    oldest_article = 9  # days
    max_articles_per_feed = 100
    no_stylesheets = True
    encoding = 'utf-8'
    remove_javascript = True
    use_embedded_content = False
    recursions = 11
    remove_attributes = ['size', 'style']

    feeds = [(u'Articles', u'http://feeds.feedburner.com/CrackedRSS/')]

    conversion_options = {
        'comment': description, 'tags': category, 'publisher': publisher, 'language': language
    }

    keep_only_tags = [dict(name='article', attrs={'class': 'module article dropShadowBottomCurved'}),
                        dict(name='article', attrs={'class': 'module blog dropShadowBottomCurved'}),
                        dict(name='div', attrs={'class': 'content-content'}),
                        dict(name='div', attrs={'class': 'content-header'})]

    remove_tags = [
        dict(name='section', attrs={'class': ['socialTools', 'quickFixModule', 'continue-reading']}),
        dict(attrs={'class':['socialShareAfterContent', 'socialShareModule', 'continue-reading', 'social-share-bottom list-inline']}),
        dict(name='div', attrs={'id': ['relatedArticle']}),
        dict(name='div', attrs={'class': ['bannerAd hidden-sm hidden-md hidden-lg introAd']})
    ]

    def is_link_wanted(self, url, a):
        return a['class'] == 'next' and a.findParent('nav', attrs={'class':'PaginationContent'}) is not None

    def preprocess_html(self, soup):
        for img in soup.findAll('img', attrs={'data-img':True}):
            img['src'] = img['data-img']
        for img in soup.findAll('img', attrs={'data-original':True}):
            img['src'] = img['data-original']                     
        return soup
    
    def postprocess_html(self, soup, first_fetch):
        for div in soup.findAll(attrs={'class':'PaginationContent'}):
            div.extract()
        if not first_fetch:
            for div in soup.findAll(attrs={'class':'meta'}):
                div.extract()
        for h1 in soup.findAll('h1'):
            h1.extract()
        for title in soup.findAll('title'):
            title.extract()
        return soup

Divingduck · 09-05-2017, 06:30 AM

No, they do not appear at the end. You have two issues.

First, take a look at postprocess_html.

Code:

for h1 in soup.findAll('h1'):
    h1.extract()

This statement tells calibre to delete all h1 tags, which means no h1 headings at all.

Second, the order of the sections is mixed up. This is the first time I have seen a mixed-up sequence, and I learned a little lesson from it.

You can change the order by rearranging the entries in keep_only_tags. List content-header first and then content-content:

Code:

    keep_only_tags = [
        dict(name='div', attrs={'class': 'content-header'}),
        dict(name='div', attrs={'class': 'content-content'}),
        dict(name='article', attrs={'class': 'module article dropShadowBottomCurved'}),
        dict(name='article', attrs={'class': 'module blog dropShadowBottomCurved'}),
    ]

But the better way is to combine all similar dict entries into one entry:

Code:

    keep_only_tags = [
        dict(name='div', attrs={'class': [
            'content-content',
            'content-header',
        ]}),
        dict(name='article', attrs={'class': [
            'module article dropShadowBottomCurved',
            'module blog dropShadowBottomCurved',
        ]}),
    ]
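
To see why the order of the entries matters, here is a simplified model in plain Python. It is only a sketch, not calibre's actual code: it assumes calibre collects the matching tags one keep_only_tags entry at a time and, within a single entry, in page order. The functions matches and keep_only, the page list, and the assumption that the header comes before the content on the page are all made up for illustration.

```python
# Simplified, hypothetical model of keep_only_tags. This is NOT calibre's code.
# Each page element is a (tag name, class attribute) pair, in page order.


def matches(element, spec):
    """Return True if the element satisfies one keep_only_tags-style entry."""
    name, css_class = element
    wanted = spec['attrs']['class']
    if isinstance(wanted, str):
        wanted = [wanted]
    return name == spec['name'] and css_class in wanted


def keep_only(elements, specs):
    """Collect matches entry by entry; within one entry, keep page order."""
    kept = []
    for spec in specs:
        for element in elements:
            if matches(element, spec) and element not in kept:
                kept.append(element)
    return kept


# Assumed page layout: the header comes before the article body.
page = [
    ('div', 'content-header'),
    ('div', 'bannerAd'),
    ('div', 'content-content'),
]

# As in the original recipe: content-content is listed before content-header.
separate_specs = [
    dict(name='div', attrs={'class': 'content-content'}),
    dict(name='div', attrs={'class': 'content-header'}),
]

# Merged entry: one dict that lists both classes.
merged_specs = [
    dict(name='div', attrs={'class': ['content-content', 'content-header']}),
]

separate_result = keep_only(page, separate_specs)
merged_result = keep_only(page, merged_specs)

print(separate_result)  # the header ends up after the content
print(merged_result)    # page order is preserved
```

In this model the merged entry does not depend on the order in which the classes are written, which is why it is the more robust choice.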

Phoebus · 09-06-2017, 03:10 PM

Thank you!

I did not know that; it was a bit of a Frankenstein's monster of code. I have tidied up some of the superfluous metadata, in case anyone else wants it:

Code:

from calibre.web.feeds.news import BasicNewsRecipe


class Cracked(BasicNewsRecipe):
    title = u'Cracked.com Weekly download'
    __author__ = 'Update Sept 2017'
    language = 'en'
    description = "America's Only HumorSite since 1958"
    publisher = 'Cracked'
    category = 'comedy, lists'
    oldest_article = 9  # days
    max_articles_per_feed = 100
    no_stylesheets = True
    encoding = 'utf-8'
    remove_javascript = True
    use_embedded_content = False
    recursions = 11
    remove_attributes = ['size', 'style']
  

    feeds = [(u'Articles', u'http://feeds.feedburner.com/CrackedRSS/')]

    conversion_options = {
        'comment': description, 'tags': category, 'publisher': publisher, 'language': language
    }
   
    keep_only_tags = [
        dict(name='div', attrs={'class': [
            'content-content',
            'content-header',
        ]}),
        dict(name='article', attrs={'class': [
            'module article dropShadowBottomCurved',
            'module blog dropShadowBottomCurved',
        ]}),
    ]

    remove_tags = [
        dict(name='section', attrs={'class': ['socialTools', 'quickFixModule', 'continue-reading']}),
        dict(attrs={'class':['socialShareAfterContent', 'socialShareModule', 'continue-reading', 'social-share-bottom list-inline']}),
        dict(name='div', attrs={'id': ['relatedArticle']}),
        dict(name='ul', attrs={'id': ['breadcrumbs', 'socialShare']}),
        dict(name='div', attrs={'class': ['bannerAd hidden-sm hidden-md hidden-lg introAd']})
    ]

    def is_link_wanted(self, url, a):
        return a['class'] == 'next' and a.findParent('nav', attrs={'class':'PaginationContent'}) is not None

    def preprocess_html(self, soup):
        for img in soup.findAll('img', attrs={'data-img':True}):
            img['src'] = img['data-img']
        for img in soup.findAll('img', attrs={'data-original':True}):
            img['src'] = img['data-original']                     
        return soup
    
    def postprocess_html(self, soup, first_fetch):
        for div in soup.findAll(attrs={'class':'PaginationContent'}):
            div.extract()
        if not first_fetch:
            for div in soup.findAll(attrs={'class':'meta'}):
                div.extract()
 
        return soup

Divingduck · 09-07-2017, 04:30 AM

You are welcome.
Best regards, DD
