Register Guidelines E-Books Today's Posts Search

Go Back   MobileRead Forums > E-Book Software > Calibre > Recipes

Notices

Reply
 
Thread Tools Search this Thread
Old 09-04-2017, 08:33 AM   #1
Phoebus
Member
Phoebus began at the beginning.
 
Posts: 22
Karma: 10
Join Date: Aug 2015
Device: Kobo Aura H2O
Article title appears at head

Hello, I modified the Cracked.com recipe to customise it but now the heading for each article appears at the end. I've tried editing it but can't get it to work. Any advice?

Thanks.

Code:
from calibre.web.feeds.news import BasicNewsRecipe


class Cracked(BasicNewsRecipe):
    title = u'Cracked.com Weekly download'
    __author__ = 'Update Sept 2017'
    language = 'en'
    description = "America's Only HumorSite since 1958"
    publisher = 'Cracked'
    category = 'comedy, lists'
    oldest_article =9  # days
    max_articles_per_feed = 100
    no_stylesheets = True
    encoding = 'utf-8'
    remove_javascript = True
    use_embedded_content = False
    recursions = 11
    remove_attributes = ['size', 'style']

    feeds = [(u'Articles', u'http://feeds.feedburner.com/CrackedRSS/')]

    conversion_options = {
        'comment': description, 'tags': category, 'publisher': publisher, 'language': language
    }

    keep_only_tags = [dict(name='article', attrs={'class': 'module article dropShadowBottomCurved'}),
                        dict(name='article', attrs={'class': 'module blog dropShadowBottomCurved'}),
                        dict(name='div', attrs={'class': 'content-content'}),
                        dict(name='div', attrs={'class': 'content-header'})]

    remove_tags = [
        dict(name='section', attrs={'class': ['socialTools', 'quickFixModule', 'continue-reading']}),
        dict(attrs={'class':['socialShareAfterContent', 'socialShareModule', 'continue-reading', 'social-share-bottom list-inline']}),
        dict(name='div', attrs={'id': ['relatedArticle']}),
        dict(name='div', attrs={'class': ['bannerAd hidden-sm hidden-md hidden-lg introAd']})
    ]

    def is_link_wanted(self, url, a):
        return a['class'] == 'next' and a.findParent('nav', attrs={'class':'PaginationContent'}) is not None

    def preprocess_html(self, soup):
        for img in soup.findAll('img', attrs={'data-img':True}):
            img['src'] = img['data-img']
        for img in soup.findAll('img', attrs={'data-original':True}):
            img['src'] = img['data-original']                     
        return soup
    
    def postprocess_html(self, soup, first_fetch):
        for div in soup.findAll(attrs={'class':'PaginationContent'}):
            div.extract()
        if not first_fetch:
            for div in soup.findAll(attrs={'class':'meta'}):
                div.extract()
        for h1 in soup.findAll('h1'):
                h1.extract()
        for title in soup.findAll('title'):
                title.extract()   
        return soup
Phoebus is offline   Reply With Quote
Old 09-05-2017, 06:30 AM   #2
Divingduck
Wizard
Divingduck ought to be getting tired of karma fortunes by now.Divingduck ought to be getting tired of karma fortunes by now.Divingduck ought to be getting tired of karma fortunes by now.Divingduck ought to be getting tired of karma fortunes by now.Divingduck ought to be getting tired of karma fortunes by now.Divingduck ought to be getting tired of karma fortunes by now.Divingduck ought to be getting tired of karma fortunes by now.Divingduck ought to be getting tired of karma fortunes by now.Divingduck ought to be getting tired of karma fortunes by now.Divingduck ought to be getting tired of karma fortunes by now.Divingduck ought to be getting tired of karma fortunes by now.
 
Posts: 1,166
Karma: 1410083
Join Date: Nov 2010
Location: Germany
Device: Sony PRS-650
No, they not appears at the end. You have 2 issues

First take a look at postprocess_html.
Code:
for h1 in soup.findAll('h1'):
                h1.extract()
The statement push calibre to delete all h1 tags and this means no h1 headings at all.

This is the first time I saw a mixed up sequence and had a little lesson learned on this.
You can move around when you rearrange the section keep_only_tags. Take in the sequence first content-header and then content-content:
Code:
    keep_only_tags = [  
                      dict(name='div', attrs={'class': ['content-header'}),
                      dict(name='div', attrs={'class': 'content-content'}),
                      dict(name='article', attrs={'class': 'module article dropShadowBottomCurved'}),
                      dict(name='article', attrs={'class': 'module blog dropShadowBottomCurved'}),
                      ]
But the better way around is to put all similar dict statements in one statement:

Code:
    keep_only_tags = [  
                      dict(name='div', attrs={'class': [
                                                'content-content',
                                                'content-header',
                                                        ]}),
                      dict(name='article', attrs={'class': [
                                                'module article dropShadowBottomCurved',
                                                'module blog dropShadowBottomCurved',
                                                            ]}),
                      ]
Divingduck is offline   Reply With Quote
Advert
Old 09-06-2017, 03:10 PM   #3
Phoebus
Member
Phoebus began at the beginning.
 
Posts: 22
Karma: 10
Join Date: Aug 2015
Device: Kobo Aura H2O
Thank you!

I did not know that, it was a bit of a Frankenstein's monster of code. I have tidied up a little of the superfluous metadata if anyone else wanted:

Code:
from calibre.web.feeds.news import BasicNewsRecipe


class Cracked(BasicNewsRecipe):
    title = u'Cracked.com Weekly download'
    __author__ = 'Update Sept 2017'
    language = 'en'
    description = "America's Only HumorSite since 1958"
    publisher = 'Cracked'
    category = 'comedy, lists'
    oldest_article =9  # days
    max_articles_per_feed = 100
    no_stylesheets = True
    encoding = 'utf-8'
    remove_javascript = True
    use_embedded_content = False
    recursions = 11
    remove_attributes = ['size', 'style']
  

    feeds = [(u'Articles', u'http://feeds.feedburner.com/CrackedRSS/')]

    conversion_options = {
        'comment': description, 'tags': category, 'publisher': publisher, 'language': language
    }
   
    keep_only_tags = [  
                    dict(name='div', attrs={'class': [
                                                'content-content',
                                                'content-header',
                                                        ]}),
                    dict(name='article', attrs={'class': [
                                                'module article dropShadowBottomCurved',
                                                'module blog dropShadowBottomCurved',
                                                            ]}),
                      ]

    remove_tags = [
        dict(name='section', attrs={'class': ['socialTools', 'quickFixModule', 'continue-reading']}),
        dict(attrs={'class':['socialShareAfterContent', 'socialShareModule', 'continue-reading', 'social-share-bottom list-inline']}),
        dict(name='div', attrs={'id': ['relatedArticle']}),
        dict(name='ul', attrs={'id': [
                                'breadcrumbs',
                                'socialShare',
                                ]}),       
        dict(name='div', attrs={'class': ['bannerAd hidden-sm hidden-md hidden-lg introAd']})
    ]

    def is_link_wanted(self, url, a):
        return a['class'] == 'next' and a.findParent('nav', attrs={'class':'PaginationContent'}) is not None

    def preprocess_html(self, soup):
        for img in soup.findAll('img', attrs={'data-img':True}):
            img['src'] = img['data-img']
        for img in soup.findAll('img', attrs={'data-original':True}):
            img['src'] = img['data-original']                     
        return soup
    
    def postprocess_html(self, soup, first_fetch):
        for div in soup.findAll(attrs={'class':'PaginationContent'}):
            div.extract()
        if not first_fetch:
            for div in soup.findAll(attrs={'class':'meta'}):
                div.extract()
 
        return soup
Phoebus is offline   Reply With Quote
Old 09-07-2017, 04:30 AM   #4
Divingduck
Wizard
Divingduck ought to be getting tired of karma fortunes by now.Divingduck ought to be getting tired of karma fortunes by now.Divingduck ought to be getting tired of karma fortunes by now.Divingduck ought to be getting tired of karma fortunes by now.Divingduck ought to be getting tired of karma fortunes by now.Divingduck ought to be getting tired of karma fortunes by now.Divingduck ought to be getting tired of karma fortunes by now.Divingduck ought to be getting tired of karma fortunes by now.Divingduck ought to be getting tired of karma fortunes by now.Divingduck ought to be getting tired of karma fortunes by now.Divingduck ought to be getting tired of karma fortunes by now.
 
Posts: 1,166
Karma: 1410083
Join Date: Nov 2010
Location: Germany
Device: Sony PRS-650
You are welcome.
Best regards, DD
Divingduck is offline   Reply With Quote
Reply


Forum Jump

Similar Threads
Thread Thread Starter Forum Replies Last Post
Bug in 9.1 not adding title to head eggheadbooks1 Sigil 41 12-15-2015 05:24 AM
Book Title as it appears in Marvin jgt1942 Marvin 1 04-27-2014 06:36 PM
Title appears at the top of every page peter1212 Sigil 6 10-18-2013 08:24 PM
Article on the Kobo from the head of Indigo PeterT Kobo Reader 28 05-28-2010 04:23 PM
Article about R&D head at E Ink TadW News 0 12-12-2007 04:06 AM


All times are GMT -4. The time now is 10:51 PM.


MobileRead.com is a privately owned, operated and funded community.