MobileRead Forums - View Single Post

bosplans · 03-29-2011, 07:00 PM

Quote:

Originally Posted by kovidgoyal

You might find it easier to extract the correct link from the descriptions in the RSS feed. To do that, implement the get_article_url function in your recipe.

Something like

from calibre import browser

Code:

def get_article_url(self, article):
   original_url = BasicNewsRecipe.get_article_url(self, article)
   raw = browser().open_novisit(original_url).read()
   soup = BeautifulSoup(raw)
   # Find the link to the actual article in the soup and return that
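The "find the link" step above is left open, so here is a minimal sketch of one way it might look. It assumes Python 3 (current calibre releases) and uses only the standard library so it can be tried outside calibre. The helper name pick_article_link and the must_contain filter are hypothetical and not part of calibre; the sample URLs in the usage lines are made up for illustration.

```python
from html.parser import HTMLParser


class _LinkCollector(HTMLParser):
    """Collects the href of every <a> tag, in document order."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == 'a':
            href = dict(attrs).get('href')
            if href:
                self.links.append(href)


def pick_article_link(html, must_contain='ilsole24ore.com'):
    """Return the first link whose URL contains must_contain, else None.

    Hypothetical helper: the substring filter is an assumption about how
    the real article link can be told apart from tracking/redirect links.
    """
    collector = _LinkCollector()
    collector.feed(html or '')
    collector.close()
    for href in collector.links:
        if must_contain in href:
            return href
    return None


# Usage with a made-up description snippet:
sample = ('<p>Teaser <a href="http://feeds.example.com/track?id=1">x</a> '
          '<a href="http://www.ilsole24ore.com/art/esempio.shtml">Leggi</a></p>')
print(pick_article_link(sample))

# Inside a recipe this might be wired up roughly as follows (untested;
# assumes the feed entry exposes its description as 'summary', as
# feedparser entries usually do):
#
#     def get_article_url(self, article):
#         link = pick_article_link(article.get('summary', ''))
#         return link or BasicNewsRecipe.get_article_url(self, article)
```

Whatever extraction is used, the method has to end with a return statement: a Python function that falls off the end returns None, so the recipe would get no URL for the article.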

I tried several different implementations of get_article_url, but none of them worked. It looks like it cannot find the real link to the article.

Here is the log:

Code:

Resolved conversion options
calibre version: 0.7.50
{'asciiize': False,
 'author_sort': None,
 'authors': None,
 'base_font_size': 0,
 'book_producer': None,
 'change_justification': 'original',
 'chapter': None,
 'chapter_mark': 'pagebreak',
 'comments': None,
 'cover': None,
 'debug_pipeline': None,
 'dehyphenate': True,
 'delete_blank_paragraphs': True,
 'disable_font_rescaling': False,
 'dont_download_recipe': False,
 'enable_heuristics': False,
 'extra_css': None,
 'fix_indents': True,
 'font_size_mapping': None,
 'format_scene_breaks': True,
 'html_unwrap_factor': 0.4,
 'input_encoding': None,
 'input_profile': <calibre.customize.profiles.InputProfile object at 0x1087c0510>,
 'insert_blank_line': False,
 'insert_metadata': False,
 'isbn': None,
 'italicize_common_cases': True,
 'keep_ligatures': False,
 'language': None,
 'level1_toc': None,
 'level2_toc': None,
 'level3_toc': None,
 'line_height': 0,
 'linearize_tables': False,
 'lrf': False,
 'margin_bottom': 5.0,
 'margin_left': 5.0,
 'margin_right': 5.0,
 'margin_top': 5.0,
 'markup_chapter_headings': True,
 'max_toc_links': 50,
 'minimum_line_height': 120.0,
 'no_chapters_in_toc': False,
 'no_inline_navbars': False,
 'output_profile': <calibre.customize.profiles.OutputProfile object at 0x1087c08d0>,
 'page_breaks_before': None,
 'password': None,
 'prefer_metadata_cover': False,
 'pretty_print': True,
 'pubdate': None,
 'publisher': None,
 'rating': None,
 'read_metadata_from_opf': None,
 'remove_first_image': False,
 'remove_paragraph_spacing': False,
 'remove_paragraph_spacing_indent_size': 1.5,
 'renumber_headings': True,
 'replace_scene_breaks': '',
 'series': None,
 'series_index': None,
 'smarten_punctuation': False,
 'sr1_replace': '',
 'sr1_search': '',
 'sr2_replace': '',
 'sr2_search': '',
 'sr3_replace': '',
 'sr3_search': '',
 'tags': None,
 'test': True,
 'timestamp': None,
 'title': None,
 'title_sort': None,
 'toc_filter': None,
 'toc_threshold': 6,
 'unwrap_lines': True,
 'use_auto_toc': False,
 'username': None,
 'verbose': 2}
1% Conversione dell'input in HTML...
InputFormatPlugin: Recipe Input running
1% Scaricamento feed...
1% Scaricamento feed Notizie Italia...
1% Scaricamento feed Notizie Europa...
1% Tentativo di scaricamento della copertina...
34% Scaricamento copertina da http://www.shopping24.ilsole24ore.com/ProductRelated/rds/img/logo_sole.gif
1% Preparazione dell'immagine principale in corso
Synthesizing mastheadImage
1% Inizio scaricamento [4 articoli]...
34% Feed scaricati in /private/var/folders/2u/2uk5pqalE4KW7TyOCw848k+++TI/-Tmp-/calibre_0.7.50_tmp_vNAf0b/calibre_0.7.50_s70ldH_plumber/index.html
34% Scaricamento completato
Parsing all content...
Parsing index.html ...
Forcing index.html into XHTML namespace
Parsing feed_0/index.html ...
Initial parse failed:
Traceback (most recent call last):
  File "site-packages/calibre/ebooks/oeb/base.py", line 886, in first_pass
  File "lxml.etree.pyx", line 2532, in lxml.etree.fromstring (src/lxml/lxml.etree.c:48634)
  File "parser.pxi", line 1545, in lxml.etree._parseMemoryDocument (src/lxml/lxml.etree.c:72245)
  File "parser.pxi", line 1417, in lxml.etree._parseDoc (src/lxml/lxml.etree.c:71041)
  File "parser.pxi", line 898, in lxml.etree._BaseParser._parseUnicodeDoc (src/lxml/lxml.etree.c:67581)
  File "parser.pxi", line 539, in lxml.etree._ParserContext._handleParseResultDoc (src/lxml/lxml.etree.c:64257)
  File "parser.pxi", line 625, in lxml.etree._handleParseResult (src/lxml/lxml.etree.c:65178)
  File "parser.pxi", line 565, in lxml.etree._raiseParseError (src/lxml/lxml.etree.c:64521)
XMLSyntaxError: Opening and ending tag mismatch: hr line 30 and div, line 31, column 7

Parsing file 'feed_0/index.html' as HTML
Forcing feed_0/index.html into XHTML namespace
Parsing feed_1/index.html ...
Initial parse failed:
Traceback (most recent call last):
  File "site-packages/calibre/ebooks/oeb/base.py", line 886, in first_pass
  File "lxml.etree.pyx", line 2532, in lxml.etree.fromstring (src/lxml/lxml.etree.c:48634)
  File "parser.pxi", line 1545, in lxml.etree._parseMemoryDocument (src/lxml/lxml.etree.c:72245)
  File "parser.pxi", line 1417, in lxml.etree._parseDoc (src/lxml/lxml.etree.c:71041)
  File "parser.pxi", line 898, in lxml.etree._BaseParser._parseUnicodeDoc (src/lxml/lxml.etree.c:67581)
  File "parser.pxi", line 539, in lxml.etree._ParserContext._handleParseResultDoc (src/lxml/lxml.etree.c:64257)
  File "parser.pxi", line 625, in lxml.etree._handleParseResult (src/lxml/lxml.etree.c:65178)
  File "parser.pxi", line 565, in lxml.etree._raiseParseError (src/lxml/lxml.etree.c:64521)
XMLSyntaxError: Opening and ending tag mismatch: hr line 30 and div, line 31, column 7

Parsing file 'feed_1/index.html' as HTML
Forcing feed_1/index.html into XHTML namespace
Reading TOC from NCX...
34% Transcodifica di un ebook in corso...
Merging user specified metadata...
Detecting structure...
Flattening CSS and remapping font sizes...
Source base font size is 12.00000pt
Cleaning up manifest...
Trimming unused files from manifest...
Parsing stylesheet.css ...
Creating OEB Output...
67% Creazione in corso OEB Output
The cover image has an id != "cover". Renaming to work around bug in Nook Color
OEB output written to /Users/marco/Dropbox/Public/Ricette Calibre/output_dir
Output salvato in   /Users/marco/Dropbox/Public/Ricette Calibre/output_dir
Mac-mini-di-Marco-Saraceno:Ricette Calibre marco$

And here is the recipe code I last used:

Code:

__author__    = 'Marco Saraceno'
__copyright__ = '2010, Marco Saraceno <marcosaraceno at gmail.com>'
description   = 'Italian daily newspaper - v 1.1 (Mar14,2011)'

'''
http://www.ilsole24ore.com
'''
from calibre import browser
from calibre.web.feeds.recipes import BasicNewsRecipe
from calibre.ebooks.BeautifulSoup import BeautifulSoup
class IlSole24Ore(BasicNewsRecipe):
    __author__        = 'Marco Saraceno'
    description   = 'Italian financial daily newspaper'

    cover_url      = 'http://www.shopping24.ilsole24ore.com/ProductRelated/rds/img/logo_sole.gif'
    title          = u'Il Sole 24 Ore'
    publisher      = 'Gruppo editoriale GRUPPO 24ORE'
    category       = 'News, politics, culture, economy, financial, Italian'

    language       = 'it'
    timefmt        = '[%a, %d %b, %Y]'

    oldest_article = 2
    max_articles_per_feed = 100
    use_embedded_content  = False
    extra_css      = '.headline {font-size: x-large;} \n .fact { padding-top: 10pt  }'

         
    remove_tags = [
        dict(name='div', attrs={'class': ['header', 'titolo']}),
        dict(name='table', attrs={'class': ['footer1024', 'footerdown']}),
    ]
                            

    def get_article_url(self, article):
        original_url = BasicNewsRecipe.get_article_url(self, article)
        raw = browser().open_novisit(original_url).read()
        soup = BeautifulSoup(raw)
        # Find the link to the actual article in the soup and return that

    feeds = [
                  (u'Notizie Italia', u'http://www.ilsole24ore.com/rss/notizie/italia.xml'),
                  (u'Notizie Europa', u'http://www.ilsole24ore.com/rss/notizie/europa.xml'),
                  (u'Notizie USA', u'http://www.ilsole24ore.com/rss/notizie/usa.xml'),
                  (u'Notizie Americhe', u'http://www.ilsole24ore.com/rss/notizie/americhe.xml'),
                  (u'Notizie Medio Oriente e Africa', u'http://www.ilsole24ore.com/rss/notizie/medio-oriente-e-africa.xml'),
                  (u'Notizie Asia e Oceania', u'http://www.ilsole24ore.com/rss/notizie/asia-e-oceania.xml'),
                  (u'Commenti', u'http://www.ilsole24ore.com/rss/commenti-e-idee.xml'),
                  (u'Norme e tributi', u'http://www.ilsole24ore.com/rss/norme-e-tributi.xml'),
                  (u'Finanza', u'http://www.ilsole24ore.com/rss/finanza-e-mercati.xml'),
                  (u'Economia', u'http://www.ilsole24ore.com/rss/economia.xml'),
                  (u'Tecnologia', u'http://www.ilsole24ore.com/rss/tecnologie.xml'),
                  (u'Cultura', u'http://www.ilsole24ore.com/rss/cultura.xml'),
                  ]

Any suggestions?