Old 11-22-2018, 10:44 AM   #4
lui1
Quote:
Originally Posted by NSILMike View Post
Thanks, but which lines changed?
The main problem was the XPath expression on line 49 of the original recipe, which describes the pattern used to find the magazine cover. The original only matched an img that is a direct child of the link; the fix uses "//img" so that it matches at any depth below the link. The other changes are just cosmetic improvements that remove things that don't belong to the articles, e.g. ads and unrelated material.
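
To illustrate the difference between "/img" and "//img", here is a minimal sketch using Python's standard-library ElementTree rather than the lxml tree the recipe works on, so the paths are written as ".//a" instead of "descendant::a". The markup, the wrapper element, and the URLs are made up for the example; they are an assumption about the kind of nesting that breaks the old expression, not a copy of the actual Newsweek page.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup: the cover <img> sits inside an extra wrapper
# element, so it is no longer a direct child of the <a> link.
snippet = """
<li>
  <a href="/2018/11/30/issue.html">
    <picture>
      <img src="https://example.com/cover.jpg"/>
    </picture>
  </a>
</li>
"""
li = ET.fromstring(snippet)

# '/' only matches <img> elements that are direct children of <a>.
direct = li.findall('.//a[@href]/img[@src]')
# '//' matches <img> elements at any depth below <a>.
nested = li.findall('.//a[@href]//img[@src]')

print(len(direct))           # 0, so indexing with [0] would raise IndexError
print(nested[0].get('src'))  # the cover URL from the nested <img>
```

With the single slash the result list is empty, which is why taking element [0] fails in the recipe; the double slash still finds the image when the site adds a wrapper around it.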

Here's a diff of the original and modified versions:
Code:
--- original	2018-11-22 07:22:37.923388857 -0800
+++ modified	2018-11-22 07:24:10.971801489 -0800
@@ -1,7 +1,7 @@
 from calibre.web.feeds.news import BasicNewsRecipe
 from collections import defaultdict
 
-BASE = 'http://www.newsweek.com'
+BASE = 'https://www.newsweek.com'
 
 
 def href_to_url(a, add_piano=False):
@@ -23,15 +23,18 @@
     no_stylesheets = True
     requires_version = (1, 40, 0)
 
-    keep_only_tags = class_sels(
-        'article-header', 'article-body', 'header-image')
+    keep_only_tags = [
+        dict(id='block-nw-magazine-article-header'),
+        class_sels('article-header', 'article-body')
+    ]
     remove_tags = [
-        dict(name='meta'),
+        dict(name=['aside', 'meta', 'source']),
         class_sels(
             'block-openadstream', 'block-ibtmedia-social', 'issue-next',
             'most-popular', 'ibt-media-stories', 'user-btn-group',
             'trial-link', 'trc_related_container',
             'block-ibtmedia-top-stories', 'videocontent', 'newsletter-signup',
+            'in-text-slideshows', 'content-correction', 'article-navigation'
         ),
         dict(id=['taboola-below-main-column', 'piano-root',
                  'block-nw-magazine-magazine-more-from-issue']),
@@ -46,7 +49,7 @@
         a = li.xpath('descendant::a[@href]')[0]
         url = href_to_url(a, add_piano=True)
         self.timefmt = self.tag_to_string(a)
-        img = li.xpath('descendant::a[@href]/img[@src]')[0]
+        img = li.xpath('descendant::a[@href]//img[@src]')[0]
         self.cover_url = img.get('src')
         root = self.index_to_soup(url, as_tree=True)
         features = []