Old 03-01-2011, 03:17 PM   #1
oneillpt
Recipe for Helsingin Sanomat

Certainly a minority linguistic interest, as no Finnish news source is included with Calibre yet, but this may also be useful as an example for anyone whose new recipe runs into problems with HTML tables in the feed content.

Helsingin Sanomat places the feed content within HTML <table> tags. Without the 'linearize_tables': True entry in conversion_options below, the resulting MOBI e-book shows only a single page for each article, both on the Kindle and in the MobiPocket reader for PC, losing whatever part of the article does not fit on that first page.

The recipe also illustrates how to handle printable page versions (the "tulosta" URL below) when the RSS feeds supply the page URL in two different forms, with or without a trailing "?ref=rss".


Code:
from calibre.web.feeds.news import BasicNewsRecipe

class AdvancedUserRecipe1298137661(BasicNewsRecipe):
  title          = u'Helsingin Sanomat'
  oldest_article = 7
  max_articles_per_feed = 100
  no_stylesheets = True
  remove_javascript     = True
  conversion_options = {
                         'linearize_tables' : True
                       }
  remove_tags = [
                  dict(name='a', attrs={'id':'articleCommentUrl'}),
                  dict(name='p', attrs={'class':'newsSummary'}),
                  dict(name='div', attrs={'class':'headerTools'})
                ]

  feeds          = [(u'Uutiset - HS.fi', u'http://www.hs.fi/uutiset/rss/'),
                    (u'Politiikka - HS.fi', u'http://www.hs.fi/politiikka/rss/'),
                    (u'Ulkomaat - HS.fi', u'http://www.hs.fi/ulkomaat/rss/'),
                    (u'Kulttuuri - HS.fi', u'http://www.hs.fi/kulttuuri/rss/'),
                    (u'Kirjat - HS.fi', u'http://www.hs.fi/kulttuuri/kirjat/rss/'),
                    (u'Elokuvat - HS.fi', u'http://www.hs.fi/kulttuuri/elokuvat/rss/')
                   ]

  def print_version(self, url):
    # Map the feed URL to its printable ("tulosta") version, dropping
    # the "?ref=rss" suffix when the feed supplies one.
    j = url.rfind("/")
    s = url[j:]          # keep everything from the last "/" onwards
    i = s.rfind("?ref=rss")
    if i > 0:
      s = s[:i]          # strip the "?ref=rss" suffix if present
    return "http://www.hs.fi/tulosta" + s
Old 10-12-2011, 06:24 AM   #2
Tragos
This recipe is no longer working, as Helsingin Sanomat has changed its website structure; the print versions of the pages are now created with JavaScript.
Old 10-14-2011, 10:48 AM   #3
oneillpt
Quote:
Originally Posted by Tragos View Post
This recipe is no longer working, as Helsingin Sanomat has changed its website structure; the print versions of the pages are now created with JavaScript.
Here is a revised version, which extracts the main news (Uutiset) section. However, the book (Kirjat) and cinema (Elokuvat) sections, which the original version was still extracting, are broken by this revision.

Code:
from calibre.web.feeds.news import BasicNewsRecipe

class AdvancedUserRecipe1298137661(BasicNewsRecipe):
  title          = u'Helsingin Sanomat'
  __author__ = 'oneillpt custom'
  language              = 'fi'
  oldest_article = 7
  max_articles_per_feed = 100
  no_stylesheets = True
  remove_javascript     = True
  conversion_options = {
                         'linearize_tables' : True 
                       }
  #remove_tags = [
  #                dict(name='a', attrs={'id':'articleCommentUrl'}),
  #                dict(name='p', attrs={'class':'newsSummary'}),
  #                dict(name='div', attrs={'class':'headerTools'})
  #              ]
  keep_only_tags = [dict(name='div', attrs={'id':'main-content'})]

  feeds          = [(u'Uutiset - HS.fi', u'http://www.hs.fi/uutiset/rss/')
#, (u'Politiikka - HS.fi', u'http://www.hs.fi/politiikka/rss/'),
#                     (u'Ulkomaat - HS.fi', u'http://www.hs.fi/ulkomaat/rss/'), #(u'Kulttuuri - HS.fi', u'http://www.hs.fi/kulttuuri/rss/'),
#                     (u'Kirjat - HS.fi', u'http://www.hs.fi/kulttuuri/kirjat/rss/'), #(u'Elokuvat - HS.fi', u'http://www.hs.fi/kulttuuri/elokuvat/rss/')
                     ]

  #def print_version(self, url):
  #  j = url.rfind("/")
  #  s = url[j:]
  #  i = s.rfind("?ref=rss")
  #  if i > 0:
  #    s = s[:i]
  #  return "http://www.hs.fi/tulosta" + s


The revision removes the remove_tags lines, adds a keep_only_tags line, and drops the print_version definition. I have retained the removed lines as comments, and commented out the feeds that are not working now. I'll post a new version if I can make those feeds work with the recipe that now works for the main news feed.
Old 10-14-2011, 11:16 AM   #4
oneillpt
Quote:
Originally Posted by oneillpt View Post
I'll post a new version if I can make those feeds work with the recipe that now works for the main news feed.
Change the keep_only_tags to:

Code:
keep_only_tags = [dict(name='div', attrs={'id':'main-content'}),
    dict(name='div', attrs={'class':'contentNewsArticle'})]
and remove the commenting from the remaining feeds.
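
For reference, the feeds list with the comments removed then reads (restored from the original recipe above):

Code:
feeds          = [(u'Uutiset - HS.fi', u'http://www.hs.fi/uutiset/rss/'),
                  (u'Politiikka - HS.fi', u'http://www.hs.fi/politiikka/rss/'),
                  (u'Ulkomaat - HS.fi', u'http://www.hs.fi/ulkomaat/rss/'),
                  (u'Kulttuuri - HS.fi', u'http://www.hs.fi/kulttuuri/rss/'),
                  (u'Kirjat - HS.fi', u'http://www.hs.fi/kulttuuri/kirjat/rss/'),
                  (u'Elokuvat - HS.fi', u'http://www.hs.fi/kulttuuri/elokuvat/rss/')]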

All sections except politics (Politiikka) now extract. As there is currently no content in the Politiikka feed, I hope it too will extract once content appears.
Old 09-10-2021, 08:34 AM   #5
oneillpt
Updated recipes for Helsingin Sanomat and Аргументы и Факты

NOTE THAT THE UPDATED RECIPE FOR Аргументы и Факты REQUIRES TWO SMALL CHANGES TO CALIBRE SOURCE CODE, DISCUSSED BELOW

Helsingin Sanomat:
========================================
This recipe provides four sections of the paper (five on Sunday)
========================================

#!/usr/bin/env python
# vim:fileencoding=utf-8
from __future__ import unicode_literals, division, absolute_import, print_function
from datetime import date
from calibre.web.feeds.news import BasicNewsRecipe

class AdvancedUserRecipe1631181034(BasicNewsRecipe):
    title = 'Helsingin Sanomat'
    language = 'fi'
    oldest_article = 7
    max_articles_per_feed = 200
    auto_cleanup = True

    feeds = [
        ('Helsingin Sanomat', 'https://www.hs.fi'),
    ]
    INDEX = 'https://www.hs.fi/'

    def do_Section(self, nxtINDEX, section_title, feeds):
        # Collect the articles linked from one section front page and
        # append them to feeds as a (section_title, articles) pair.
        articles = []
        soup = self.index_to_soup(nxtINDEX)
        ii = 0
        for section in soup.findAll('a', attrs={'class':'block'}):
            if section is not None:
                ii = ii + 1
                z = section.findAll('h2')
                try:
                    z = z[0].get_text()  # strip=True
                    link = section['href']
                    if link[0:1] == '/':
                        link = 'https://www.hs.fi' + link
                    articles.append({u'title':z, u'url':link})
                except Exception as inst:
                    self.log("exception handled")
        if articles:
            feeds.append((section_title, articles))
        return feeds

    def parse_index(self):
        feeds = []
        self.do_Section('https://www.hs.fi/', u'Etusivi', feeds)
        self.do_Section('https://www.hs.fi/kotimaa/', u'Kotimaa', feeds)
        self.do_Section('https://www.hs.fi/kulttuuri/', u'Kulttuuri', feeds)
        self.do_Section('https://www.hs.fi/ulkomaat/', u'Ulkomaat', feeds)
        if date.weekday(date.today()) == 6:  # Sunday: add the fifth section
            self.do_Section('https://www.hs.fi/sunnuntai/', u'Sunnuntai', feeds)
        return feeds
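
For anyone adapting this: parse_index must return a list of (section title, list of article dicts) pairs, so the structure built above looks roughly like this (values illustrative, not real articles):

feeds = [
    ('Etusivi', [{'title': '...', 'url': 'https://www.hs.fi/...'}, ...]),
    ('Kotimaa', [{'title': '...', 'url': 'https://www.hs.fi/kotimaa/...'}, ...]),
]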



========================================
Аргументы и Факты:
========================================
The distributed recipe runs, but provides no content. The recipe
below runs and provides content. However, some Unicode directory
and file names are found as type 'bytes' rather than as type 'str',
and two small modifications to news.py are needed to handle this.
The modified code handles both 'str' and 'bytes' types. I will suggest
these changes to the development forum for inclusion in Calibre, but
if you have local development code and need the Аргументы и Факты
recipe, you need only make the changes below. I will also try to tidy
the recipe further now that it is working, and post a tidied version.

1) in canonicalize_internal_url(self, url, is_link=True):
replace

    return frozenset([(parts.netloc, (parts.path or '').rstrip('/'))])

by

    zzp = parts.path
    zzn = parts.netloc
    if not isinstance(zzp, str):  # the path arrived as bytes
        zzp = parts.path.decode("utf-8")
        zzn = parts.netloc.decode("utf-8")
    return frozenset([(zzn, (zzp or '').rstrip('/'))])

2) In article_downloaded(self, request, result):
replace

    index = os.path.join(os.path.dirname(result[0]), 'index.html')

by

    zzr = result[0]
    if not isinstance(zzr, str):  # the path arrived as bytes
        zzr = result[0].decode("utf-8")
    index = os.path.join(os.path.dirname(zzr), 'index.html')
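
Both patches apply the same guard. As a sketch, the shared idea could be factored into a small helper (hypothetical, not part of Calibre; it assumes UTF-8, matching the patches above):

def as_unicode(value):
    # Normalize a bytes path or netloc to str; pass str through unchanged.
    return value.decode('utf-8') if isinstance(value, bytes) else value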
========================================

#!/usr/bin/env python
# vim:fileencoding=utf-8
from __future__ import with_statement, unicode_literals
from calibre.web.feeds.news import BasicNewsRecipe
from calibre.web.fetch.simple import (
    AbortArticle, RecursiveFetcher, option_parser as web2disk_option_parser
)
from calibre.ebooks.BeautifulSoup import BeautifulSoup  # needed by the dir() check below
import string as st
import calibre.web.feeds.news
import os, sys
dir(BeautifulSoup)

class AdvancedUserRecipe1592177429(BasicNewsRecipe):
    title = 'Аргументы и Факты'
    encoding = 'utf8'
    language = 'ru'
    oldest_article = 7
    max_articles_per_feed = 25
    auto_cleanup = True
    verbose = 3

    feeds = [
        ('AIF', 'https://www.aif.ru/rss/all.php'),
    ]
    INDEX = 'https://www.aif.ru/rss/all.php'

    def preprocess_html(self, soup):
        soup = BasicNewsRecipe.preprocess_html(self, soup)
        return soup

    def preprocess_raw_html(self, raw_html, url):
        raw_html = BasicNewsRecipe.preprocess_raw_html(self, raw_html, url)
        return raw_html

    def fetch_article(self, url, dir_, f, a, num_of_feeds):
        br = self.browser
        if hasattr(self.get_browser, 'is_base_class_implementation'):
            # We are using the default get_browser, which means no need to
            # clone
            br = BasicNewsRecipe.get_browser(self)
        else:
            br = self.clone_browser(self.browser)
        self.web2disk_options.browser = br
        fetcher = RecursiveFetcher(self.web2disk_options, self.log,
                                   self.image_map, self.css_map,
                                   (url, f, a, num_of_feeds))
        fetcher.browser = br
        fetcher.base_dir = dir_
        fetcher.current_dir = dir_
        fetcher.show_progress = False
        fetcher.image_url_processor = self.image_url_processor
        # The URL arrives here as bytes, hence the decode/encode round trip
        # (see the discussion following this post).
        res, path, failures = fetcher.start_fetch(url.decode()), fetcher.downloaded_paths, fetcher.failed_links
        res = res.encode("utf-8")
        path[0] = path[0].encode()
        if not res or not os.path.exists(res):
            msg = _('Could not fetch article.') + ' '
            if self.debug:
                msg += _('The debug traceback is available earlier in this log')
            else:
                msg += _('Run with -vv to see the reason')
            raise Exception(msg)

        return res, path, failures

    def parse_index(self):
        feeds = []
        section_title = u'aif'
        articles = []
        soup = self.index_to_soup(self.INDEX)
        ii = 0
        for item in soup.findAll('item'):
            if ii < self.max_articles_per_feed:
                try:
                    ii = ii + 1
                    A = str(item)
                    i = A.find(u'link')
                    j = A.find(u'description')
                    ZZ = item.find('description')
                    ZZ1 = str(ZZ)
                    # strip the <description><![CDATA[ ... ]]></description> wrapper
                    ZZ2 = ZZ1[24:-19]
                    AB = A
                    # slice out the text between <link> and <description>
                    AB1 = AB[i:j].encode()
                    AU = AB1
                    try:
                        # drop the remaining tag characters around the URL
                        articles.append({'url':AU[6:-2], 'title':ZZ2})
                    except Exception as inst:
                        self.log("Exception handled!")
                except Exception as inst:
                    self.log("Exception handled!")
        if articles:
            feeds.append((section_title, articles))
        return feeds
Old 09-10-2021, 08:58 AM   #6
kovidgoyal
creator of calibre
You should not be passing bytes to those functions. Don't encode things in your recipe.
Old 09-10-2021, 09:13 AM   #7
oneillpt
Unfortunately the byte strings seem to arise from links within the downloaded articles, not from the links to the downloaded articles generated by my Calibre recipe. The recipe for Аргументы и Факты currently distributed with Calibre generates only a table of contents, with no article content; it has not worked since Calibre moved to Python 3. I had it working with the old Python 2 Calibre without needing to handle byte strings, though I think the distributed recipe was not working then either when I first tried it, and I had to rewrite it then as well.
Old 09-10-2021, 09:28 AM   #8
kovidgoyal
creator of calibre
I have modified canonicalize_internal_url to handle byte strings. However, I really don't see how fetch_article could be returning byte strings unless your recipe is doing so, and looking at your recipe source, you are indeed encoding things to bytes.
Old 09-10-2021, 11:51 AM   #9
oneillpt
Thanks. I've now removed all encodes and decodes as well as my modified fetch_article, and tested without the suggested modifications to news.py: everything now runs successfully. The need to handle byte strings must have arisen during development of the recipe but was not necessary in the final version. The simplified recipe follows below. I'll tidy it further and post an update in the next day or two.

#!/usr/bin/env python
# vim:fileencoding=utf-8
from __future__ import with_statement, unicode_literals
from calibre.web.feeds.news import BasicNewsRecipe
from calibre.web.fetch.simple import (
    AbortArticle, RecursiveFetcher, option_parser as web2disk_option_parser
)
from calibre.ebooks.BeautifulSoup import BeautifulSoup  # needed by the dir() check below
import string as st
import calibre.web.feeds.news
import os, sys
dir(BeautifulSoup)

class AdvancedUserRecipe1592177429(BasicNewsRecipe):
    title = 'Аргументы и Факты'
    encoding = 'utf8'
    language = 'ru'
    oldest_article = 7
    max_articles_per_feed = 25
    auto_cleanup = True
    verbose = 3

    feeds = [
        ('AIF', 'https://www.aif.ru/rss/all.php'),
    ]
    INDEX = 'https://www.aif.ru/rss/all.php'

    def preprocess_html(self, soup):
        soup = BasicNewsRecipe.preprocess_html(self, soup)
        return soup

    def preprocess_raw_html(self, raw_html, url):
        raw_html = BasicNewsRecipe.preprocess_raw_html(self, raw_html, url)
        return raw_html

    def parse_index(self):
        feeds = []
        section_title = u'aif'
        articles = []
        soup = self.index_to_soup(self.INDEX)
        ii = 0
        for item in soup.findAll('item'):
            if ii < self.max_articles_per_feed:
                try:
                    ii = ii + 1
                    A = str(item)
                    i = A.find(u'link')
                    j = A.find(u'description')
                    ZZ = item.find('description')
                    ZZ1 = str(ZZ)
                    # strip the <description><![CDATA[ ... ]]></description> wrapper
                    ZZ2 = ZZ1[24:-19]
                    AB = A
                    # the text between <link> and <description>, now kept as str
                    AB1 = AB[i:j]
                    AU = AB1
                    try:
                        # drop the remaining tag characters around the URL
                        articles.append({'url':AU[6:-2], 'title':ZZ2})
                    except Exception as inst:
                        self.log("Exception handled!")
                except Exception as inst:
                    self.log("Exception handled!")
        if articles:
            feeds.append((section_title, articles))
        return feeds
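
If you want to try a recipe like this from the command line before adding it to Calibre, something like "ebook-convert aif.recipe aif.epub --test -vv" should work; the --test flag fetches only a couple of articles per feed, which makes iterating on parse_index much faster. (The file name aif.recipe is just an example.)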
Old 09-10-2021, 01:09 PM   #10
oneillpt
Now tidied and posted as a new thread at https://www.mobileread.com/forums/sh...96#post4153196