09-05-2017, 06:30 AM   #2
Divingduck
No, they do not appear at the end. You have two issues.

First, take a look at postprocess_html.
Code:
for h1 in soup.findAll('h1'):
    h1.extract()
This statement makes calibre delete every h1 tag, which means no h1 headings are left at all.
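
If the h1 headings are wanted, one option is simply to drop that loop. A minimal sketch, assuming the method does nothing else: RecipeSketch is a hypothetical stand-in for your recipe class (in a real recipe this would be your BasicNewsRecipe subclass).

```python
class RecipeSketch:
    """Hypothetical stand-in for the recipe class, for illustration only."""

    def postprocess_html(self, soup, first_fetch):
        # The loop that called h1.extract() on every h1 is gone,
        # so the headings are left in place. The soup is returned unchanged.
        return soup
```

If the method had no other job, it could also be removed from the recipe entirely.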

This is the first time I have seen a mixed-up sequence, so I learned a little lesson here.
You can change the order by rearranging the entries in keep_only_tags. Put content-header first and content-content second:
Code:
    keep_only_tags = [
        # Header first, then the body content
        dict(name='div', attrs={'class': 'content-header'}),
        dict(name='div', attrs={'class': 'content-content'}),
        dict(name='article', attrs={'class': 'module article dropShadowBottomCurved'}),
        dict(name='article', attrs={'class': 'module blog dropShadowBottomCurved'}),
    ]
But the better way is to combine all similar dict entries into a single entry:

Code:
    keep_only_tags = [
        dict(name='div', attrs={'class': [
            'content-content',
            'content-header',
        ]}),
        dict(name='article', attrs={'class': [
            'module article dropShadowBottomCurved',
            'module blog dropShadowBottomCurved',
        ]}),
    ]
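
To see why the order of the entries matters, here is a small stand-alone model. It is only a sketch and not calibre's actual code: it assumes that each entry is processed in list order and that matches within one entry keep their page order. The page layout (header before content) and the helper names matches and keep_only are made up for the illustration, and class values are compared as whole strings.

```python
# Simplified model of a page: (tag name, class attribute) in page order.
# This layout is an assumption for the demonstration.
page = [
    ("div", "content-header"),
    ("div", "content-content"),
]


def matches(element, spec):
    """Return True if the element fits one keep_only_tags-style entry."""
    name, cls = element
    wanted = spec["attrs"]["class"]
    if isinstance(wanted, str):
        wanted = [wanted]
    return name == spec["name"] and cls in wanted


def keep_only(page, specs):
    """Collect matching elements: entry by entry, page order within an entry."""
    kept = []
    for spec in specs:
        for element in page:
            if matches(element, spec):
                kept.append(element)
    return kept


# Separate entries listed in the "wrong" order: content ends up before header.
separate = [
    dict(name="div", attrs={"class": "content-content"}),
    dict(name="div", attrs={"class": "content-header"}),
]

# One combined entry: the order inside the class list does not matter here,
# the elements come out in page order.
combined = [
    dict(name="div", attrs={"class": ["content-content", "content-header"]}),
]

print(keep_only(page, separate))
print(keep_only(page, combined))
```

In this model the separate entries return content-content before content-header, while the combined entry returns them in page order, which is why the combined form avoids the mixed-up sequence.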