Fix a recipe

bosplans · 03-29-2011, 08:49 AM

Hi,

The recipe I made a few months ago no longer works: the newspaper switched to a new feed service, "feedsportal.com", which mangles the links to the articles and forwards the visitor to ads ...

I figured out how to solve the problem in theory, but unfortunately I do not know regex. The idea was to use def print_version(self, url) to convert the link from:

http://rss.feedsportal.com/c/32276/f...24ore0N0Cart0Cnotizie0C20A110E0A30E290Clampedusa0Eabitanti0Eoccupano0Emunicipio0E11570A60Bshtml0Duuid0FAauakRKD/story01.htm

To:

http://www.ilsole24ore.com/art/notizie/2011-03-29/lampedusa-abitanti-occupano-municipio-115706_PRN.shtml

As you can see, the first link contains all the info needed for the conversion ... but I have no idea how to make the magic happen! Can someone help me, or tell me where to find a good resource to learn the principles? Is this possible, or is there an easier workaround?

Thanks in advance!

The original recipe:

Code:

__author__    = 'Marco Saraceno'
__copyright__ = '2010, Marco Saraceno <marcosaraceno at gmail.com>'
description   = 'Italian daily newspaper - v 1.1 (Mar14,2011)'

'''
http://www.ilsole24ore.com
'''

class IlSole24Ore(BasicNewsRecipe):
    __author__        = 'Marco Saraceno'
    description   = 'Italian financial daily newspaper'

    cover_url      = 'http://www.shopping24.ilsole24ore.com/ProductRelated/rds/img/logo_sole.gif'
    title          = u'Il Sole 24 Ore'
    publisher      = 'Gruppo editoriale GRUPPO 24ORE'
    category       = 'News, politics, culture, economy, financial, Italian'

    language       = 'it'
    timefmt        = '[%a, %d %b, %Y]'

    oldest_article = 2
    max_articles_per_feed = 100
    use_embedded_content  = False
    recursion             = 10
    extra_css      = '.headline {font-size: x-large;} \n .fact { padding-top: 10pt  }'

         
    remove_tags = [
                            dict(name='div', attrs={'class':['header','titolo']}),
                            dict(name='table', attrs={'class':['footer1024','footerdown']}),
                           ]

    feeds = [
                  (u'Notizie Italia', u'http://www.ilsole24ore.com/rss/notizie/italia.xml'),
                  (u'Notizie Europa', u'http://www.ilsole24ore.com/rss/notizie/europa.xml'),
                  (u'Notizie USA', u'http://www.ilsole24ore.com/rss/notizie/usa.xml'),
                  (u'Notizie Americhe', u'http://www.ilsole24ore.com/rss/notizie/americhe.xml'),
                  (u'Notizie Medio Oriente e Africa', u'http://www.ilsole24ore.com/rss/notizie/medio-oriente-e-africa.xml'),
                  (u'Notizie Asia e Oceania', u'http://www.ilsole24ore.com/rss/notizie/asia-e-oceania.xml'),
                  (u'Commenti', u'http://www.ilsole24ore.com/rss/commenti-e-idee.xml'),
                  (u'Norme e tributi', u'http://www.ilsole24ore.com/rss/norme-e-tributi.xml'),
                  (u'Finanza', u'http://www.ilsole24ore.com/rss/finanza-e-mercati.xml'),
                  (u'Economia', u'http://www.ilsole24ore.com/rss/economia.xml'),
                  (u'Tecnologia', u'http://www.ilsole24ore.com/rss/tecnologie.xml'),
                  (u'Cultura', u'http://www.ilsole24ore.com/rss/cultura.xml'),
                ]

    def print_version(self, url):
          return url.replace('.shtml', '_PRN.shtml')
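
A minimal sketch of the conversion asked about above, for anyone who wants to try the regex route. The escape table is not documented anywhere in this thread; it is inferred by lining up the two links in the post (0A for 0, 0B for ., 0C for /, 0E for -, 0N for .com), and 0D and 0F are guesses based on the trailing uuid part. The names decode_feedsportal, to_print_version and FEEDSPORTAL_ESCAPES are made up for illustration, and only the visible part of the truncated link is decoded.

```python
import re

# Escape table inferred from the two links in the post (an assumption).
# '0D' and '0F' are guesses: the decoded link would end in "?uuid=...".
FEEDSPORTAL_ESCAPES = {
    '0A': '0',
    '0B': '.',
    '0C': '/',
    '0D': '?',
    '0E': '-',
    '0F': '=',
    '0N': '.com',
}


def decode_feedsportal(fragment):
    """Decode a feedsportal path fragment in one left-to-right pass.

    A single re.sub pass is used instead of chained str.replace calls so
    that a '0' produced by one substitution cannot start a new escape.
    Unknown escapes are left untouched.
    """
    return re.sub(
        r'0[A-Z]',
        lambda m: FEEDSPORTAL_ESCAPES.get(m.group(0), m.group(0)),
        fragment,
    )


def to_print_version(url):
    """Drop the query string and point at the _PRN print page."""
    return url.split('?', 1)[0].replace('.shtml', '_PRN.shtml')


# Visible tail of the link from the post (the part after "f..." only).
visible = ('24ore0N0Cart0Cnotizie0C20A110E0A30E290Clampedusa0Eabitanti'
           '0Eoccupano0Emunicipio0E11570A60Bshtml0Duuid0FAauakRKD')
decoded = decode_feedsportal(visible)
print(decoded)
print(to_print_version(decoded))
```

The second printed line ends with the same path as the target link in the post. In a recipe, this logic would sit in get_article_url or print_version, applied to the long path segment just before /story01.htm.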

kovidgoyal · 03-29-2011, 10:28 AM

You might find it easier to extract the correct link from the descriptions in the RSS feed. To do that implement the get_article_url function in your recipe.

Something like

from calibre import browser

Code:

def get_article_url(self, article):
   original_url = BasicNewsRecipe.get_article_url(self, article)
   raw = browser().open_novisit(original_url).read()
   soup = BeautifulSoup(raw)
   # Find the link to the actual article in the soup and return that

bosplans · 03-29-2011, 12:45 PM

Thank you for your prompt reply. I'll try it right away!
By the way, where can I find a good how-to on coding recipes and the syntax of all the functions? I am a newbie :-(

Starson17 · 03-29-2011, 01:52 PM

Quote:

Originally Posted by bosplans

where can I find a good how-to on coding recipes and the syntax of all the functions?

Start here:
http://calibre-ebook.com/user_manual/news.html
Also, see the links at the end of that page.

bosplans · 03-29-2011, 07:00 PM

Quote:

Originally Posted by kovidgoyal

You might find it easier to extract the correct link from the descriptions in the RSS feed. To do that implement the get_article_url function in your recipe.

Something like

from calibre import browser

Code:

def get_article_url(self, article):
   original_url = BasicNewsRecipe.get_article_url(self, article)
   raw = browser().open_novisit(original_url).read()
   soup = BeautifulSoup(raw)
   # Find the link to the actual article in the soup and return that

I tried several implementations of get_article_url, but none of them worked... it looks like it cannot find the real link to the article...

Here is the log:

Code:

Resolved conversion options
calibre version: 0.7.50
{'asciiize': False,
 'author_sort': None,
 'authors': None,
 'base_font_size': 0,
 'book_producer': None,
 'change_justification': 'original',
 'chapter': None,
 'chapter_mark': 'pagebreak',
 'comments': None,
 'cover': None,
 'debug_pipeline': None,
 'dehyphenate': True,
 'delete_blank_paragraphs': True,
 'disable_font_rescaling': False,
 'dont_download_recipe': False,
 'enable_heuristics': False,
 'extra_css': None,
 'fix_indents': True,
 'font_size_mapping': None,
 'format_scene_breaks': True,
 'html_unwrap_factor': 0.4,
 'input_encoding': None,
 'input_profile': <calibre.customize.profiles.InputProfile object at 0x1087c0510>,
 'insert_blank_line': False,
 'insert_metadata': False,
 'isbn': None,
 'italicize_common_cases': True,
 'keep_ligatures': False,
 'language': None,
 'level1_toc': None,
 'level2_toc': None,
 'level3_toc': None,
 'line_height': 0,
 'linearize_tables': False,
 'lrf': False,
 'margin_bottom': 5.0,
 'margin_left': 5.0,
 'margin_right': 5.0,
 'margin_top': 5.0,
 'markup_chapter_headings': True,
 'max_toc_links': 50,
 'minimum_line_height': 120.0,
 'no_chapters_in_toc': False,
 'no_inline_navbars': False,
 'output_profile': <calibre.customize.profiles.OutputProfile object at 0x1087c08d0>,
 'page_breaks_before': None,
 'password': None,
 'prefer_metadata_cover': False,
 'pretty_print': True,
 'pubdate': None,
 'publisher': None,
 'rating': None,
 'read_metadata_from_opf': None,
 'remove_first_image': False,
 'remove_paragraph_spacing': False,
 'remove_paragraph_spacing_indent_size': 1.5,
 'renumber_headings': True,
 'replace_scene_breaks': '',
 'series': None,
 'series_index': None,
 'smarten_punctuation': False,
 'sr1_replace': '',
 'sr1_search': '',
 'sr2_replace': '',
 'sr2_search': '',
 'sr3_replace': '',
 'sr3_search': '',
 'tags': None,
 'test': True,
 'timestamp': None,
 'title': None,
 'title_sort': None,
 'toc_filter': None,
 'toc_threshold': 6,
 'unwrap_lines': True,
 'use_auto_toc': False,
 'username': None,
 'verbose': 2}
1% Conversione dell'input in HTML...
InputFormatPlugin: Recipe Input running
1% Scaricamento feed...
1% Scaricamento feed Notizie Italia...
1% Scaricamento feed Notizie Europa...
1% Tentativo di scaricamento della copertina...
34% Scaricamento copertina da http://www.shopping24.ilsole24ore.com/ProductRelated/rds/img/logo_sole.gif
1% Preparazione dell'immagine principale in corso
Synthesizing mastheadImage
1% Inizio scaricamento [4 articoli]...
34% Feed scaricati in /private/var/folders/2u/2uk5pqalE4KW7TyOCw848k+++TI/-Tmp-/calibre_0.7.50_tmp_vNAf0b/calibre_0.7.50_s70ldH_plumber/index.html
34% Scaricamento completato
Parsing all content...
Parsing index.html ...
Forcing index.html into XHTML namespace
Parsing feed_0/index.html ...
Initial parse failed:
Traceback (most recent call last):
  File "site-packages/calibre/ebooks/oeb/base.py", line 886, in first_pass
  File "lxml.etree.pyx", line 2532, in lxml.etree.fromstring (src/lxml/lxml.etree.c:48634)
  File "parser.pxi", line 1545, in lxml.etree._parseMemoryDocument (src/lxml/lxml.etree.c:72245)
  File "parser.pxi", line 1417, in lxml.etree._parseDoc (src/lxml/lxml.etree.c:71041)
  File "parser.pxi", line 898, in lxml.etree._BaseParser._parseUnicodeDoc (src/lxml/lxml.etree.c:67581)
  File "parser.pxi", line 539, in lxml.etree._ParserContext._handleParseResultDoc (src/lxml/lxml.etree.c:64257)
  File "parser.pxi", line 625, in lxml.etree._handleParseResult (src/lxml/lxml.etree.c:65178)
  File "parser.pxi", line 565, in lxml.etree._raiseParseError (src/lxml/lxml.etree.c:64521)
XMLSyntaxError: Opening and ending tag mismatch: hr line 30 and div, line 31, column 7

Parsing file 'feed_0/index.html' as HTML
Forcing feed_0/index.html into XHTML namespace
Parsing feed_1/index.html ...
Initial parse failed:
Traceback (most recent call last):
  File "site-packages/calibre/ebooks/oeb/base.py", line 886, in first_pass
  File "lxml.etree.pyx", line 2532, in lxml.etree.fromstring (src/lxml/lxml.etree.c:48634)
  File "parser.pxi", line 1545, in lxml.etree._parseMemoryDocument (src/lxml/lxml.etree.c:72245)
  File "parser.pxi", line 1417, in lxml.etree._parseDoc (src/lxml/lxml.etree.c:71041)
  File "parser.pxi", line 898, in lxml.etree._BaseParser._parseUnicodeDoc (src/lxml/lxml.etree.c:67581)
  File "parser.pxi", line 539, in lxml.etree._ParserContext._handleParseResultDoc (src/lxml/lxml.etree.c:64257)
  File "parser.pxi", line 625, in lxml.etree._handleParseResult (src/lxml/lxml.etree.c:65178)
  File "parser.pxi", line 565, in lxml.etree._raiseParseError (src/lxml/lxml.etree.c:64521)
XMLSyntaxError: Opening and ending tag mismatch: hr line 30 and div, line 31, column 7

Parsing file 'feed_1/index.html' as HTML
Forcing feed_1/index.html into XHTML namespace
Reading TOC from NCX...
34% Transcodifica di un ebook in corso...
Merging user specified metadata...
Detecting structure...
Flattening CSS and remapping font sizes...
Source base font size is 12.00000pt
Cleaning up manifest...
Trimming unused files from manifest...
Parsing stylesheet.css ...
Creating OEB Output...
67% Creazione in corso OEB Output
The cover image has an id != "cover". Renaming to work around bug in Nook Color
OEB output written to /Users/marco/Dropbox/Public/Ricette Calibre/output_dir
Output salvato in   /Users/marco/Dropbox/Public/Ricette Calibre/output_dir
Mac-mini-di-Marco-Saraceno:Ricette Calibre marco$

And here is the recipe code I last used:

Code:

__author__    = 'Marco Saraceno'
__copyright__ = '2010, Marco Saraceno <marcosaraceno at gmail.com>'
description   = 'Italian daily newspaper - v 1.1 (Mar14,2011)'

'''
http://www.ilsole24ore.com
'''
from calibre import browser
from calibre.web.feeds.recipes import BasicNewsRecipe
from calibre.ebooks.BeautifulSoup import BeautifulSoup
class IlSole24Ore(BasicNewsRecipe):
    __author__        = 'Marco Saraceno'
    description   = 'Italian financial daily newspaper'

    cover_url      = 'http://www.shopping24.ilsole24ore.com/ProductRelated/rds/img/logo_sole.gif'
    title          = u'Il Sole 24 Ore'
    publisher      = 'Gruppo editoriale GRUPPO 24ORE'
    category       = 'News, politics, culture, economy, financial, Italian'

    language       = 'it'
    timefmt        = '[%a, %d %b, %Y]'

    oldest_article = 2
    max_articles_per_feed = 100
    use_embedded_content  = False
    extra_css      = '.headline {font-size: x-large;} \n .fact { padding-top: 10pt  }'

         
    remove_tags = [
                            dict(name='div', attrs={'class':['header','titolo']}),
                            dict(name='table', attrs={'class':['footer1024','footerdown']}),
                            ]
                            

    def get_article_url(self, article):
        original_url = BasicNewsRecipe.get_article_url(self, article)
        raw = browser().open_novisit(original_url).read()
        soup = BeautifulSoup(raw)
        # Find the link to the actual article in the soup and return that

    feeds = [
                  (u'Notizie Italia', u'http://www.ilsole24ore.com/rss/notizie/italia.xml'),
                  (u'Notizie Europa', u'http://www.ilsole24ore.com/rss/notizie/europa.xml'),
                  (u'Notizie USA', u'http://www.ilsole24ore.com/rss/notizie/usa.xml'),
                  (u'Notizie Americhe', u'http://www.ilsole24ore.com/rss/notizie/americhe.xml'),
                  (u'Notizie Medio Oriente e Africa', u'http://www.ilsole24ore.com/rss/notizie/medio-oriente-e-africa.xml'),
                  (u'Notizie Asia e Oceania', u'http://www.ilsole24ore.com/rss/notizie/asia-e-oceania.xml'),
                  (u'Commenti', u'http://www.ilsole24ore.com/rss/commenti-e-idee.xml'),
                  (u'Norme e tributi', u'http://www.ilsole24ore.com/rss/norme-e-tributi.xml'),
                  (u'Finanza', u'http://www.ilsole24ore.com/rss/finanza-e-mercati.xml'),
                  (u'Economia', u'http://www.ilsole24ore.com/rss/economia.xml'),
                  (u'Tecnologia', u'http://www.ilsole24ore.com/rss/tecnologie.xml'),
                  (u'Cultura', u'http://www.ilsole24ore.com/rss/cultura.xml'),
                  ]

Any suggestions?
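
One thing worth noting about the recipe above: get_article_url ends at the comment and has no return statement, so Python returns None from it. A rough sketch of the missing step follows, written against the Python 3 standard library so it can be tried outside calibre. The assumption that the intermediate page contains an <a> tag whose href starts with http://www.ilsole24ore.com/art/ is untested, and find_article_link and ARTICLE_HREF are hypothetical names.

```python
import re
from html.parser import HTMLParser

# Assumed shape of a real article link; adjust after inspecting the page.
ARTICLE_HREF = re.compile(r'^https?://www\.ilsole24ore\.com/art/')


class _LinkCollector(HTMLParser):
    """Collect the href of every <a> tag seen while parsing."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == 'a':
            href = dict(attrs).get('href')
            if href:
                self.hrefs.append(href)


def find_article_link(html, fallback=None):
    """Return the first href matching ARTICLE_HREF, else the fallback."""
    collector = _LinkCollector()
    collector.feed(html)
    for href in collector.hrefs:
        if ARTICLE_HREF.match(href):
            return href
    return fallback


# Made-up page used only to exercise the function.
page = ('<html><body><a href="http://ads.example.com/x">ad</a>'
        '<a href="http://www.ilsole24ore.com/art/notizie/2011-03-29/'
        'lampedusa-abitanti-occupano-municipio-115706.shtml">story</a>'
        '</body></html>')
print(find_article_link(page))
```

Inside the recipe, the equivalent step would be to search the soup for such a link and return it, falling back to original_url when nothing matches so the method never returns None.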

bosplans · 03-31-2011, 08:15 AM

Looking into the feed code, I figured out that the link to the actual article (in bold) is present, but it is encoded (I guess). Any idea how I can grab it?
Thanks

kovidgoyal · 03-31-2011, 11:17 AM

from urllib import unquote

unquote(encoded_url)
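
If the link in the feed is percent-encoded (an assumption, since the feed snippet is not shown here), the standard library's unquote function decodes it. The encoded_url value below is a made-up sample built from the article address in the first post, not text taken from the real feed.

```python
try:
    from urllib.parse import unquote  # Python 3
except ImportError:
    from urllib import unquote  # Python 2

# Hypothetical sample; the real feed may encode the link differently.
encoded_url = ('http%3A%2F%2Fwww.ilsole24ore.com%2Fart%2Fnotizie%2F'
               '2011-03-29%2Flampedusa-abitanti-occupano-municipio-115706.shtml')

decoded_url = unquote(encoded_url)
print(decoded_url)
```

This only covers percent-encoding. The 0A/0C style escapes seen in the feedsportal link at the top of the thread are a different scheme and are not touched by unquote.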
