MobileRead Forums - View Single Post

Divingduck · 09-05-2017, 07:30 AM

No, they do not appear at the end. You have two issues.

First, take a look at postprocess_html.

Code:

for h1 in soup.findAll('h1'):
    h1.extract()

This statement makes calibre delete every h1 tag, which means you get no h1 headings at all.

The second issue is the order. This is the first time I have seen a mixed-up sequence, and I learned a little lesson from it.

You can change the order by rearranging the entries in keep_only_tags. Put content-header first and content-content second:

Code:

    keep_only_tags = [
        # content-header is listed first so the header comes before the content
        dict(name='div', attrs={'class': 'content-header'}),
        dict(name='div', attrs={'class': 'content-content'}),
        dict(name='article', attrs={'class': 'module article dropShadowBottomCurved'}),
        dict(name='article', attrs={'class': 'module blog dropShadowBottomCurved'}),
    ]

But the better way is to combine all similar dict entries into one:

Code:

    keep_only_tags = [
        dict(name='div', attrs={'class': [
            'content-content',
            'content-header',
        ]}),
        dict(name='article', attrs={'class': [
            'module article dropShadowBottomCurved',
            'module blog dropShadowBottomCurved',
        ]}),
    ]
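
To see why the order of the entries matters, here is a small toy model. It is only a sketch and not calibre's real code. It assumes that the entries in keep_only_tags are processed one after another and that each entry collects its matches in page order. The page data and the helper names (PAGE, matches, keep_only) are made up for illustration.

```python
# Toy model of the keep_only_tags ordering. NOT calibre's actual code.
# Assumption: entries are processed in list order, and each entry
# collects its matching tags in the order they occur on the page.

# Hypothetical page: (tag name, class attribute, text) in page order.
PAGE = [
    ('div', 'content-header', 'Headline'),
    ('div', 'content-content', 'Article text'),
    ('div', 'sidebar', 'Navigation'),
]


def matches(element, spec):
    """Simplified matching: the whole class string must equal a wanted value."""
    name, cls, _ = element
    wanted = spec['attrs']['class']
    if isinstance(wanted, str):
        wanted = [wanted]
    return name == spec['name'] and cls in wanted


def keep_only(page, specs):
    """Return the kept texts in the order this toy model would emit them."""
    kept = []
    for spec in specs:            # one entry after another
        for element in page:      # matches of one entry in page order
            if matches(element, spec):
                kept.append(element[2])
    return kept


# content-content listed before content-header: the header ends up last.
separate_wrong = [
    dict(name='div', attrs={'class': 'content-content'}),
    dict(name='div', attrs={'class': 'content-header'}),
]

# Entries rearranged: header first.
separate_fixed = [
    dict(name='div', attrs={'class': 'content-header'}),
    dict(name='div', attrs={'class': 'content-content'}),
]

# One combined entry: the order inside the class list does not matter,
# the matches come out in page order.
combined = [
    dict(name='div', attrs={'class': ['content-content', 'content-header']}),
]

print(keep_only(PAGE, separate_wrong))  # ['Article text', 'Headline']
print(keep_only(PAGE, separate_fixed))  # ['Headline', 'Article text']
print(keep_only(PAGE, combined))        # ['Headline', 'Article text']
```

In this model the combined entry keeps the tags in the order of the page itself, so you do not have to get the sequence right by hand.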
