|  09-04-2017, 08:33 AM | #1 | 
| Member  Posts: 22 Karma: 10 Join Date: Aug 2015 Device: Kobo Aura H2O | 
				
				Article title appears at head
			 
			
			Hello, I modified the Cracked.com recipe to customise it but now the heading for each article appears at the end. I've tried editing it but can't get it to work. Any advice? Thanks. Code: 
from calibre.web.feeds.news import BasicNewsRecipe
class Cracked(BasicNewsRecipe):
    title = u'Cracked.com Weekly download'
    __author__ = 'Update Sept 2017'
    language = 'en'
    description = "America's Only HumorSite since 1958"
    publisher = 'Cracked'
    category = 'comedy, lists'
    oldest_article = 9  # days
    max_articles_per_feed = 100
    no_stylesheets = True
    encoding = 'utf-8'
    remove_javascript = True
    use_embedded_content = False
    recursions = 11
    remove_attributes = ['size', 'style']
    feeds = [(u'Articles', u'http://feeds.feedburner.com/CrackedRSS/')]
    conversion_options = {
        'comment': description, 'tags': category, 'publisher': publisher, 'language': language
    }
    keep_only_tags = [dict(name='article', attrs={'class': 'module article dropShadowBottomCurved'}),
                        dict(name='article', attrs={'class': 'module blog dropShadowBottomCurved'}),
                        dict(name='div', attrs={'class': 'content-content'}),
                        dict(name='div', attrs={'class': 'content-header'})]
    remove_tags = [
        dict(name='section', attrs={'class': ['socialTools', 'quickFixModule', 'continue-reading']}),
        dict(attrs={'class':['socialShareAfterContent', 'socialShareModule', 'continue-reading', 'social-share-bottom list-inline']}),
        dict(name='div', attrs={'id': ['relatedArticle']}),
        dict(name='div', attrs={'class': ['bannerAd hidden-sm hidden-md hidden-lg introAd']})
    ]
    def is_link_wanted(self, url, a):
        return a['class'] == 'next' and a.findParent('nav', attrs={'class':'PaginationContent'}) is not None
    def preprocess_html(self, soup):
        for img in soup.findAll('img', attrs={'data-img':True}):
            img['src'] = img['data-img']
        for img in soup.findAll('img', attrs={'data-original':True}):
            img['src'] = img['data-original']                     
        return soup
    
    def postprocess_html(self, soup, first_fetch):
        for div in soup.findAll(attrs={'class':'PaginationContent'}):
            div.extract()
        if not first_fetch:
            for div in soup.findAll(attrs={'class':'meta'}):
                div.extract()
        for h1 in soup.findAll('h1'):
                h1.extract()
        for title in soup.findAll('title'):
                title.extract()   
        return soup | 
|   |   | 
|  09-05-2017, 06:30 AM | #2 | 
| Wizard            Posts: 1,166 Karma: 1410083 Join Date: Nov 2010 Location: Germany Device: Sony PRS-650 | 
			
			No, they do not appear at the end. You have two issues. First, take a look at postprocess_html. Code: for h1 in soup.findAll('h1'):
                h1.extract()
This loop removes every h1 heading from the article. Second, the kept sections are in a mixed-up sequence. This is the first time I have seen that, and I learned a little lesson from it. You can change the order of the sections by rearranging the entries in keep_only_tags. List content-header first and then content-content: Code:     keep_only_tags = [  
                      dict(name='div', attrs={'class': 'content-header'}),
                      dict(name='div', attrs={'class': 'content-content'}),
                      dict(name='article', attrs={'class': 'module article dropShadowBottomCurved'}),
                      dict(name='article', attrs={'class': 'module blog dropShadowBottomCurved'}),
                      ]
Or, grouped by tag name: Code:     keep_only_tags = [  
                      dict(name='div', attrs={'class': [
                                                'content-content',
                                                'content-header',
                                                        ]}),
                      dict(name='article', attrs={'class': [
                                                'module article dropShadowBottomCurved',
                                                'module blog dropShadowBottomCurved',
                                                            ]}),
                      ] | 
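To illustrate why the order of the entries matters, here is a minimal sketch in plain Python. It does not use calibre or BeautifulSoup; it only imitates what I understand calibre to do with keep_only_tags, namely walk the list entry by entry and append each entry's matches to a new body. The `page` data, the `matches` and `keep_only` helpers, and the assumption that the header precedes the body text in the HTML are all hypothetical, made up for this example.

```python
def matches(element, spec):
    """Return True if a page element satisfies one keep_only_tags-style spec."""
    if spec.get('name') not in (None, element['name']):
        return False
    wanted = spec.get('attrs', {}).get('class')
    if wanted is None:
        return True
    if isinstance(wanted, str):
        wanted = [wanted]
    return element['class'] in wanted


def keep_only(page, specs):
    """Collect matching elements spec by spec, in the order the specs are listed."""
    kept = []
    for spec in specs:
        for element in page:  # document order within a single spec
            # An element that was already moved cannot be matched a second time.
            if matches(element, spec) and element not in kept:
                kept.append(element)
    return [element['text'] for element in kept]


# Hypothetical page: the header comes before the body text in the HTML.
page = [
    {'name': 'div', 'class': 'content-header', 'text': 'Article title'},
    {'name': 'div', 'class': 'content-content', 'text': 'Article body'},
]

body_first = [
    dict(name='div', attrs={'class': 'content-content'}),
    dict(name='div', attrs={'class': 'content-header'}),
]
header_first = list(reversed(body_first))
combined = [dict(name='div', attrs={'class': ['content-content', 'content-header']})]

print(keep_only(page, body_first))    # ['Article body', 'Article title']
print(keep_only(page, header_first))  # ['Article title', 'Article body']
print(keep_only(page, combined))      # ['Article title', 'Article body']
```

Under these assumptions, separate entries come out in list order, while a single entry with a list of classes keeps the page's own order. That would explain why the grouped form works even though content-content is written first.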
|   |   | 
|  09-06-2017, 03:10 PM | #3 | 
| Member  Posts: 22 Karma: 10 Join Date: Aug 2015 Device: Kobo Aura H2O | 
			
			Thank you! I did not know that; it was a bit of a Frankenstein's monster of code. I have tidied up a little of the superfluous metadata, in case anyone else wants it: Code: 
from calibre.web.feeds.news import BasicNewsRecipe
class Cracked(BasicNewsRecipe):
    title = u'Cracked.com Weekly download'
    __author__ = 'Update Sept 2017'
    language = 'en'
    description = "America's Only HumorSite since 1958"
    publisher = 'Cracked'
    category = 'comedy, lists'
    oldest_article = 9  # days
    max_articles_per_feed = 100
    no_stylesheets = True
    encoding = 'utf-8'
    remove_javascript = True
    use_embedded_content = False
    recursions = 11
    remove_attributes = ['size', 'style']
  
    feeds = [(u'Articles', u'http://feeds.feedburner.com/CrackedRSS/')]
    conversion_options = {
        'comment': description, 'tags': category, 'publisher': publisher, 'language': language
    }
   
    keep_only_tags = [  
                    dict(name='div', attrs={'class': [
                                                'content-content',
                                                'content-header',
                                                        ]}),
                    dict(name='article', attrs={'class': [
                                                'module article dropShadowBottomCurved',
                                                'module blog dropShadowBottomCurved',
                                                            ]}),
                      ]
    remove_tags = [
        dict(name='section', attrs={'class': ['socialTools', 'quickFixModule', 'continue-reading']}),
        dict(attrs={'class':['socialShareAfterContent', 'socialShareModule', 'continue-reading', 'social-share-bottom list-inline']}),
        dict(name='div', attrs={'id': ['relatedArticle']}),
        dict(name='ul', attrs={'id': [
                                'breadcrumbs',
                                'socialShare',
                                ]}),       
        dict(name='div', attrs={'class': ['bannerAd hidden-sm hidden-md hidden-lg introAd']})
    ]
    def is_link_wanted(self, url, a):
        return a['class'] == 'next' and a.findParent('nav', attrs={'class':'PaginationContent'}) is not None
    def preprocess_html(self, soup):
        for img in soup.findAll('img', attrs={'data-img':True}):
            img['src'] = img['data-img']
        for img in soup.findAll('img', attrs={'data-original':True}):
            img['src'] = img['data-original']                     
        return soup
    
    def postprocess_html(self, soup, first_fetch):
        for div in soup.findAll(attrs={'class':'PaginationContent'}):
            div.extract()
        if not first_fetch:
            for div in soup.findAll(attrs={'class':'meta'}):
                div.extract()
 
        return soup | 
|   |   | 
|  09-07-2017, 04:30 AM | #4 | 
| Wizard            Posts: 1,166 Karma: 1410083 Join Date: Nov 2010 Location: Germany Device: Sony PRS-650 | 
			
			You are welcome. Best regards, DD | 
|   |   | 