03-01-2011, 03:17 PM | #1 |
Connoisseur
Posts: 62
Karma: 46
Join Date: Feb 2011
Device: Kindle 3 (cracked screen!); PW1; Oasis
Recipe for Helsingin Sanomat
Certainly a minority linguistic interest, since no Finnish news source is yet included with Calibre, but this may also be of use as an example to anyone whose new recipes run into problems with HTML tables in the feed content.
Helsingin Sanomat places the feed content inside HTML <table> tags. Without the "'linearize_tables': True" conversion_options entry below, the resulting mobi e-book shows only a single page per article, both on the Kindle and in the MobiPocket reader for PC, losing whatever follows the part that fits on that first page. The recipe also illustrates how to handle printable page versions (the "tulosta" URLs below) when the RSS feeds supply the page URL in two different forms, with or without a trailing "?ref=rss". Code:
from calibre.web.feeds.news import BasicNewsRecipe


class AdvancedUserRecipe1298137661(BasicNewsRecipe):
    title = u'Helsingin Sanomat'
    oldest_article = 7
    max_articles_per_feed = 100
    no_stylesheets = True
    remove_javascript = True
    conversion_options = {'linearize_tables': True}

    remove_tags = [
        dict(name='a', attrs={'id': 'articleCommentUrl'}),
        dict(name='p', attrs={'class': 'newsSummary'}),
        dict(name='div', attrs={'class': 'headerTools'})
    ]

    feeds = [
        (u'Uutiset - HS.fi', u'http://www.hs.fi/uutiset/rss/'),
        (u'Politiikka - HS.fi', u'http://www.hs.fi/politiikka/rss/'),
        (u'Ulkomaat - HS.fi', u'http://www.hs.fi/ulkomaat/rss/'),
        (u'Kulttuuri - HS.fi', u'http://www.hs.fi/kulttuuri/rss/'),
        (u'Kirjat - HS.fi', u'http://www.hs.fi/kulttuuri/kirjat/rss/'),
        (u'Elokuvat - HS.fi', u'http://www.hs.fi/kulttuuri/elokuvat/rss/')
    ]

    def print_version(self, url):
        j = url.rfind("/")
        s = url[j:]
        i = s.rfind("?ref=rss")
        if i > 0:
            s = s[:i]
        return "http://www.hs.fi/tulosta" + s
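For anyone adapting this pattern, here is the print_version URL rewriting from the recipe as a minimal standalone sketch, runnable outside Calibre (the sample article IDs below are hypothetical, chosen only to show the two feed URL forms):

```python
def print_version(url):
    # Keep everything from the last "/" onward...
    j = url.rfind("/")
    s = url[j:]
    # ...drop a trailing "?ref=rss" if the feed appended one...
    i = s.rfind("?ref=rss")
    if i > 0:
        s = s[:i]
    # ...and point at the printable-page endpoint instead.
    return "http://www.hs.fi/tulosta" + s


# Both URL forms the feeds supply map to the same printable page:
print(print_version("http://www.hs.fi/uutiset/artikkeli/1135263797032"))
print(print_version("http://www.hs.fi/uutiset/artikkeli/1135263797032?ref=rss"))
```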
10-12-2011, 06:24 AM | #2 |
Junior Member
Posts: 2
Karma: 10
Join Date: May 2011
Device: Kindle 3 Wifi
|
This recipe no longer works, as Helsingin Sanomat has changed its website structure: the print versions of pages are now generated with JavaScript.
10-14-2011, 10:48 AM | #3 | |
Connoisseur
Posts: 62
Karma: 46
Join Date: Feb 2011
Device: Kindle 3 (cracked screen!); PW1; Oasis
|
The revision removes the remove_tags lines, adds a keep_only_tags line, and removes the print_version definition. I have kept the removed lines as comments, and commented out the feeds that are not currently working. I'll post a new version if I can make those feeds work with the same recipe that now works for the main news feed.
10-14-2011, 11:16 AM | #4 | |
Connoisseur
Posts: 62
Karma: 46
Join Date: Feb 2011
Device: Kindle 3 (cracked screen!); PW1; Oasis
|
Code:
keep_only_tags = [
    dict(name='div', attrs={'id': 'main-content'}),
    dict(name='div', attrs={'class': 'contentNewsArticle'})
]
All sections except politics (Politiikka) now extract correctly. As the Politiikka feed has no content at present, I hope it too will extract once content appears.
09-10-2021, 08:34 AM | #5 |
Connoisseur
Posts: 62
Karma: 46
Join Date: Feb 2011
Device: Kindle 3 (cracked screen!); PW1; Oasis
|
Updated recipes for Helsingin Sanomat and Аргументы и Факты
NOTE THAT THE UPDATED RECIPE FOR Аргументы и Факты REQUIRES TWO SMALL CHANGES TO CALIBRE SOURCE CODE, DISCUSSED BELOW
Helsingin Sanomat:
========================================
This recipe provides four sections of the paper (five on Sunday).
========================================
Code:
#!/usr/bin/env python2
# vim:fileencoding=utf-8
from __future__ import unicode_literals, division, absolute_import, print_function

from datetime import date

from calibre.web.feeds.news import BasicNewsRecipe


class AdvancedUserRecipe1631181034(BasicNewsRecipe):
    title = 'Helsingin Sanomat'
    language = 'fi'
    oldest_article = 7
    max_articles_per_feed = 200
    auto_cleanup = True

    feeds = [
        ('Helsingin Sanomat', 'https://www.hs.fi'),
    ]

    INDEX = 'https://www.hs.fi/'

    def do_Section(self, nxtINDEX, section_title, feeds):
        articles = []
        soup = self.index_to_soup(nxtINDEX)
        ii = 0
        for section in soup.findAll('a', attrs={'class': 'block'}):
            if section is not None:
                ii = ii + 1
                z = section.findAll('h2')
                try:
                    z = z[0].get_text()  # strip=True
                    link = section['href']
                    if link[0:1] == '/':
                        link = 'https://www.hs.fi' + link
                    articles.append({u'title': z, u'url': link})
                except Exception as inst:
                    self.log("exception handled")
        if articles:
            feeds.append((section_title, articles))
        return feeds

    def parse_index(self):
        feeds = []
        self.do_Section('https://www.hs.fi/', u'Etusivi', feeds)
        self.do_Section('https://www.hs.fi/kotimaa/', u'Kotimaa', feeds)
        self.do_Section('https://www.hs.fi/kulttuuri/', u'Kulttuuri', feeds)
        self.do_Section('https://www.hs.fi/ulkomaat/', u'Ulkomaat', feeds)
        if date.weekday(date.today()) == 6:
            self.do_Section('https://www.hs.fi/sunnuntai/', u'Sunnuntai', feeds)
        return feeds
========================================
Аргументы и Факты:
========================================
The distributed recipe runs, but provides no content. The recipe below runs and provides content. However, some Unicode directory and file names turn up as type 'bytes' rather than type 'str', and two small modifications to news.py are needed to handle this. The modified code handles both 'str' and 'bytes' types.

I will suggest these changes to the development forum for inclusion in Calibre, but if you have local development code and need the Аргументы и Факты recipe, you need only make the changes below. I will also try to tidy the recipe further now that it is working, and post a tidied version.

1) In canonicalize_internal_url(self, url, is_link=True), replace
Code:
return frozenset([(parts.netloc, (parts.path or '').rstrip('/'))])
with
Code:
zzp = parts.path
zzn = parts.netloc
if type(zzp) != type(' '):  # i.e. "<class 'bytes'>"
    zzp = parts.path.decode("utf-8")
    zzn = parts.netloc.decode("utf-8")
return frozenset([(zzn, (zzp or '').rstrip('/'))])
2) In article_downloaded(self, request, result), replace
Code:
index = os.path.join(os.path.dirname(result[0]), 'index.html')
with
Code:
zzr = result[0]
if type(zzr) != type(' '):
    zzr = result[0].decode("utf-8")
index = os.path.join(os.path.dirname(zzr), 'index.html')
========================================
Code:
#!/usr/bin/env python
# vim:fileencoding=utf-8
from __future__ import with_statement, unicode_literals

import os
import sys
import string as st

import calibre.web.feeds.news
from calibre.web.feeds.news import BasicNewsRecipe
from calibre.web.fetch.simple import (
    AbortArticle, RecursiveFetcher, option_parser as web2disk_option_parser
)

dir(BeautifulSoup)  # BeautifulSoup is injected into the recipe namespace


class AdvancedUserRecipe1592177429(BasicNewsRecipe):
    title = 'Аргументы и Факты'
    encoding = 'utf8'
    language = 'ru'
    oldest_article = 7
    max_articles_per_feed = 25
    auto_cleanup = True
    verbose = 3

    feeds = [
        ('AIF', 'https://www.aif.ru/rss/all.php'),
    ]

    INDEX = 'https://www.aif.ru/rss/all.php'

    def preprocess_html(self, soup):
        soup = BasicNewsRecipe.preprocess_html(self, soup)
        return soup

    def preprocess_raw_html(self, raw_html, url):
        raw_html = BasicNewsRecipe.preprocess_raw_html(self, raw_html, url)
        return raw_html

    def fetch_article(self, url, dir_, f, a, num_of_feeds):
        br = self.browser
        if hasattr(self.get_browser, 'is_base_class_implementation'):
            # We are using the default get_browser, which means no need to clone
            br = BasicNewsRecipe.get_browser(self)
        else:
            br = self.clone_browser(self.browser)
        self.web2disk_options.browser = br
        fetcher = RecursiveFetcher(self.web2disk_options, self.log,
                                   self.image_map, self.css_map,
                                   (url, f, a, num_of_feeds))
        fetcher.browser = br
        fetcher.base_dir = dir_
        fetcher.current_dir = dir_
        fetcher.show_progress = False
        fetcher.image_url_processor = self.image_url_processor
        res, path, failures = fetcher.start_fetch(url.decode()), fetcher.downloaded_paths, fetcher.failed_links
        res = res.encode("utf-8")
        path[0] = path[0].encode()
        if not res or not os.path.exists(res):
            msg = _('Could not fetch article.') + ' '
            if self.debug:
                msg += _('The debug traceback is available earlier in this log')
            else:
                msg += _('Run with -vv to see the reason')
            raise Exception(msg)
        return res, path, failures

    def parse_index(self):
        feeds = []
        section_title = u'aif'
        articles = []
        soup = self.index_to_soup(self.INDEX)
        ii = 0
        for item in soup.findAll('item'):
            if ii < self.max_articles_per_feed:
                try:
                    ii = ii + 1
                    A = str(item)
                    i = A.find(u'link')
                    j = A.find(u'description')
                    ZZ = item.find('description')
                    ZZ1 = str(ZZ)
                    ZZ2 = ZZ1[24:-19]
                    AB = A
                    AB1 = AB[i:j].encode()
                    AU = AB1
                    try:
                        articles.append({'url': AU[6:-2], 'title': ZZ2})
                    except Exception as inst:
                        self.log("Exception handled!")
                except Exception as inst:
                    self.log("Exception handled!")
        if articles:
            feeds.append((section_title, articles))
        return feeds
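The two news.py patches above share one idea: normalize a value that may be bytes to str before building paths or URL keys. A standalone sketch of that pattern follows; the helper name as_str is my own, not Calibre's, and canonical_key only mimics the shape of canonicalize_internal_url, it is not Calibre's implementation:

```python
from urllib.parse import urlsplit


def as_str(x):
    # Mirror the patch: leave str alone, decode bytes as UTF-8.
    if isinstance(x, bytes):
        return x.decode("utf-8")
    return x


def canonical_key(url):
    # Normalize first, so str and bytes inputs produce identical keys.
    parts = urlsplit(as_str(url))
    return frozenset([(parts.netloc, (parts.path or '').rstrip('/'))])


# The same URL as str and as bytes now yields the same key:
print(canonical_key("https://www.aif.ru/society/article/"))
print(canonical_key(b"https://www.aif.ru/society/article/"))
```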
09-10-2021, 08:58 AM | #6 |
creator of calibre
Posts: 43,850
Karma: 22666666
Join Date: Oct 2006
Location: Mumbai, India
Device: Various
|
You should not be passing bytes to those functions. Don't encode things in your recipe.
09-10-2021, 09:13 AM | #7 |
Connoisseur
Posts: 62
Karma: 46
Join Date: Feb 2011
Device: Kindle 3 (cracked screen!); PW1; Oasis
|
Unfortunately the byte strings seem to arise from links within the downloaded articles, not from the links to the articles generated by my recipe. The recipe for Аргументы и Факты currently distributed with Calibre generates only a table of contents, with no article content; it has not worked since Calibre moved to Python 3. I had it working with the old Python 2 Calibre without needing to handle byte strings, though I think the distributed recipe was not working then either when I first tried it, and I had needed to rewrite it as well.
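For anyone puzzled by the failure mode being discussed: in Python 3, bytes and str never compare equal, so a URL key built from bytes will never match the same key built from str, and set-based duplicate detection silently misses the match. A minimal illustration (the host and path are just examples):

```python
# The same (netloc, path) pair as str and as bytes:
s_key = ("www.aif.ru", "/society/article")
b_key = (b"www.aif.ru", b"/society/article")

# In Python 3 these are simply different values...
print(s_key == b_key)                # False

# ...so a set of already-seen URL keys never matches the bytes form:
seen = {frozenset([s_key])}
print(frozenset([b_key]) in seen)    # False
print(frozenset([s_key]) in seen)    # True
```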
09-10-2021, 09:28 AM | #8 |
creator of calibre
Posts: 43,850
Karma: 22666666
Join Date: Oct 2006
Location: Mumbai, India
Device: Various
|
I have modified canonicalize_internal_url to handle byte strings; however, I really don't see how fetch_article could be returning byte strings unless your recipe is doing so, and looking at your recipe source, you are indeed encoding things to bytes.
09-10-2021, 11:51 AM | #9 |
Connoisseur
Posts: 62
Karma: 46
Join Date: Feb 2011
Device: Kindle 3 (cracked screen!); PW1; Oasis
|
Thanks. I've now removed all the encode() and decode() calls as well as my modified fetch_article, and tested without the suggested modifications to news.py: everything now runs successfully. The need to handle byte strings must have arisen during development of the recipe, but was not necessary in the final version. The simplified recipe follows below. I'll try to tidy it up further and post an update in the next day or two.
Code:
#!/usr/bin/env python
# vim:fileencoding=utf-8
from __future__ import with_statement, unicode_literals

import os
import sys
import string as st

import calibre.web.feeds.news
from calibre.web.feeds.news import BasicNewsRecipe
from calibre.web.fetch.simple import (
    AbortArticle, RecursiveFetcher, option_parser as web2disk_option_parser
)

dir(BeautifulSoup)  # BeautifulSoup is injected into the recipe namespace


class AdvancedUserRecipe1592177429(BasicNewsRecipe):
    title = 'Аргументы и Факты'
    encoding = 'utf8'
    language = 'ru'
    oldest_article = 7
    max_articles_per_feed = 25
    auto_cleanup = True
    verbose = 3

    feeds = [
        ('AIF', 'https://www.aif.ru/rss/all.php'),
    ]

    INDEX = 'https://www.aif.ru/rss/all.php'

    def preprocess_html(self, soup):
        soup = BasicNewsRecipe.preprocess_html(self, soup)
        return soup

    def preprocess_raw_html(self, raw_html, url):
        raw_html = BasicNewsRecipe.preprocess_raw_html(self, raw_html, url)
        return raw_html

    def parse_index(self):
        feeds = []
        section_title = u'aif'
        articles = []
        soup = self.index_to_soup(self.INDEX)
        ii = 0
        for item in soup.findAll('item'):
            if ii < self.max_articles_per_feed:
                try:
                    ii = ii + 1
                    A = str(item)
                    i = A.find(u'link')
                    j = A.find(u'description')
                    ZZ = item.find('description')
                    ZZ1 = str(ZZ)
                    ZZ2 = ZZ1[24:-19]
                    AB = A
                    AB1 = AB[i:j]
                    AU = AB1
                    try:
                        articles.append({'url': AU[6:-2], 'title': ZZ2})
                    except Exception as inst:
                        self.log("Exception handled!")
                except Exception as inst:
                    self.log("Exception handled!")
        if articles:
            feeds.append((section_title, articles))
        return feeds
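As an aside, the string-slicing in parse_index above depends on exactly how BeautifulSoup happens to serialize each <item>. A sturdier alternative, not what the recipe uses, is to parse the RSS with the standard library's xml.etree.ElementTree, which yields link and description text directly (the feed snippet below is a made-up example in the shape of an AIF item):

```python
import xml.etree.ElementTree as ET

rss = """<rss><channel>
<item>
  <title>Заголовок</title>
  <link>https://www.aif.ru/society/example</link>
  <description>Краткое описание</description>
</item>
</channel></rss>"""

articles = []
for item in ET.fromstring(rss).iter('item'):
    # findtext returns the element's text content; CDATA sections
    # in real feeds are unwrapped transparently by the parser.
    articles.append({
        'url': item.findtext('link'),
        'title': item.findtext('description'),
    })

print(articles)
```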
09-10-2021, 01:09 PM | #10 |
Connoisseur
Posts: 62
Karma: 46
Join Date: Feb 2011
Device: Kindle 3 (cracked screen!); PW1; Oasis
|
Now tidied and posted as a new thread at https://www.mobileread.com/forums/sh...96#post4153196