03-29-2011, 08:49 AM | #1 |
Member
Posts: 11
Karma: 10
Join Date: Mar 2011
Device: kindle 3
|
Fix a recipe
Hi,
The recipe I made some months ago no longer works, because the newspaper adopted a new feed service called "feedsportal.com", which breaks the links to the proper articles and forwards the visitor to ads. I have figured out how to solve the problem in theory, but unfortunately I do not know regex. The idea is to use def print_version(self, url) to convert a link from:
http://rss.feedsportal.com/c/32276/f...24ore0N0Cart0Cnotizie0C20A110E0A30E290Clampedusa0Eabitanti0Eoccupano0Emunicipio0E11570A60Bshtml0Duuid0FAauakRKD/story01.htm
To:
http://www.ilsole24ore.com/art/notizie/2011-03-29/lampedusa-abitanti-occupano-municipio-115706_PRN.shtml
As you can see, the first link contains all the information needed for the conversion, but I have no idea how to make the magic happen! Can someone help me, or tell me where to find a good resource to learn the principles? Is this possible, or is there an easier workaround? Thanks in advance!
The original recipe:
Code:
__author__ = 'Marco Saraceno'
__copyright__ = '2010, Marco Saraceno <marcosaraceno at gmail.com>'
description = 'Italian daily newspaper - v 1.1 (Mar14,2011)'
'''
http://www.ilsole24ore.com
'''

class IlSole24Ore(BasicNewsRecipe):
    __author__ = 'Marco Saraceno'
    description = 'Italian financial daily newspaper'
    cover_url = 'http://www.shopping24.ilsole24ore.com/ProductRelated/rds/img/logo_sole.gif'
    title = u'Il Sole 24 Ore'
    publisher = 'Gruppo editoriale GRUPPO 24ORE'
    category = 'News, politics, culture, economy, financial, Italian'
    language = 'it'
    timefmt = '[%a, %d %b, %Y]'
    oldest_article = 2
    max_articles_per_feed = 100
    use_embedded_content = False
    recursion = 10
    extra_css = '.headline {font-size: x-large;} \n .fact { padding-top: 10pt }'

    remove_tags = [
        dict(name='div', attrs={'class':['header','titolo']}),
        dict(name='table', attrs={'class':['footer1024','footerdown']}),
    ]

    feeds = [
        (u'Notizie Italia', u'http://www.ilsole24ore.com/rss/notizie/italia.xml'),
        (u'Notizie Europa', u'http://www.ilsole24ore.com/rss/notizie/europa.xml'),
        (u'Notizie USA', u'http://www.ilsole24ore.com/rss/notizie/usa.xml'),
        (u'Notizie Americhe', u'http://www.ilsole24ore.com/rss/notizie/americhe.xml'),
        (u'Notizie Medio Oriente e Africa', u'http://www.ilsole24ore.com/rss/notizie/medio-oriente-e-africa.xml'),
        (u'Notizie Asia e Oceania', u'http://www.ilsole24ore.com/rss/notizie/asia-e-oceania.xml'),
        (u'Commenti', u'http://www.ilsole24ore.com/rss/commenti-e-idee.xml'),
        (u'Norme e tributi', u'http://www.ilsole24ore.com/rss/norme-e-tributi.xml'),
        (u'Finanza', u'http://www.ilsole24ore.com/rss/finanza-e-mercati.xml'),
        (u'Economia', u'http://www.ilsole24ore.com/rss/economia.xml'),
        (u'Tecnologia', u'http://www.ilsole24ore.com/rss/tecnologie.xml'),
        (u'Cultura', u'http://www.ilsole24ore.com/rss/cultura.xml'),
    ]

    def print_version(self, url):
        return url.replace('.shtml', '_PRN.shtml')
|
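The conversion asked about in this post can be sketched with very little regex. What follows is only a minimal sketch, assuming the feedsportal escape codes can be read off the single pair of links in the post (0A for 0, 0B for ., 0C for /, 0D for ?, 0E for -, 0F for =, 0N for .com). The table is inferred from that one example and is probably incomplete. The names FEEDSPORTAL_CODES, decode_feedsportal and to_print_version are made up for illustration, and because the example link in the post is truncated ("f...24ore"), the sketch only decodes the visible tail.

```python
import re

# Escape table inferred from the single example in the post (assumption:
# feedsportal likely uses more codes; unknown ones are left untouched).
FEEDSPORTAL_CODES = {
    'A': '0',
    'B': '.',
    'C': '/',
    'D': '?',
    'E': '-',
    'F': '=',
    'N': '.com',
}


def decode_feedsportal(fragment):
    """Decode '0X' escapes in one left-to-right pass (hypothetical helper)."""
    def replace(match):
        return FEEDSPORTAL_CODES.get(match.group(1), match.group(0))
    return re.sub(r'0([A-Z])', replace, fragment)


def to_print_version(decoded):
    """Drop the query string and switch to the _PRN print page."""
    path = decoded.split('?', 1)[0]
    return path.replace('.shtml', '_PRN.shtml')


# Visible tail of the encoded link from the post (its start is truncated there).
encoded_tail = ('24ore0N0Cart0Cnotizie0C20A110E0A30E290Clampedusa0Eabitanti'
                '0Eoccupano0Emunicipio0E11570A60Bshtml0Duuid0FAauakRKD')
decoded_tail = decode_feedsportal(encoded_tail)
print(to_print_version(decoded_tail))
```

A single pass with re.sub is used instead of chained str.replace calls so that a character produced by one substitution can never be mistaken for the start of another escape. In a real recipe the feedsportal prefix and the trailing /story01.htm would still have to be stripped before decoding.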
03-29-2011, 10:28 AM | #2 |
creator of calibre
Posts: 43,850
Karma: 22666666
Join Date: Oct 2006
Location: Mumbai, India
Device: Various
|
You might find it easier to extract the correct link from the descriptions in the RSS feed. To do that, implement the get_article_url function in your recipe.
Something like this, with from calibre import browser at the top of the recipe:
Code:
def get_article_url(self, article):
    original_url = BasicNewsRecipe.get_article_url(self, article)
    raw = browser().open_novisit(original_url).read()
    soup = BeautifulSoup(raw)
    # Find the link to the actual article in the soup and return that
Last edited by kovidgoyal; 03-29-2011 at 10:35 AM. |
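The "find the link" step in the comment is left open in this reply. One way it might be done is sketched here, assuming the fetched page contains an ordinary href pointing at www.ilsole24ore.com, which nothing in this thread confirms. The sketch uses a plain regular expression from the standard library instead of BeautifulSoup so that it runs outside calibre; find_article_link and the sample HTML are hypothetical.

```python
import re


def find_article_link(raw_html, domain='www.ilsole24ore.com'):
    """Return the first href that points at `domain`, or None if absent."""
    pattern = r'href=["\'](https?://' + re.escape(domain) + r'/[^"\']+)["\']'
    match = re.search(pattern, raw_html)
    return match.group(1) if match else None


# Made-up sample page, only to exercise the helper.
sample = ('<a href="http://ads.example.com/x">ad</a>'
          '<a href="http://www.ilsole24ore.com/art/notizie/sample.shtml">story</a>')
print(find_article_link(sample))
```

Inside get_article_url the result would be returned in place of the comment. Whatever approach is used, the method needs an explicit return, since a Python function without one returns None.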
03-29-2011, 12:45 PM | #3 |
Member
Posts: 11
Karma: 10
Join Date: Mar 2011
Device: kindle 3
|
Thank you for your prompt reply. I'll try it right away!
Btw, where can I find a good how-to on coding recipes and the syntax of all the functions? I am a newbie :-( |
03-29-2011, 01:52 PM | #4 | |
Wizard
Posts: 4,004
Karma: 177841
Join Date: Dec 2009
Device: WinMo: IPAQ; Android: HTC HD2, Archos 7o; Java:Gravity T
|
Quote:
http://calibre-ebook.com/user_manual/news.html Also, see the links at the end of that page. |
|
03-29-2011, 07:00 PM | #5 | |
Member
Posts: 11
Karma: 10
Join Date: Mar 2011
Device: kindle 3
|
Quote:
Here is the log:
Code:
Resolved conversion options
calibre version: 0.7.50
{'asciiize': False, 'author_sort': None, 'authors': None, 'base_font_size': 0, 'book_producer': None, 'change_justification': 'original', 'chapter': None, 'chapter_mark': 'pagebreak', 'comments': None, 'cover': None, 'debug_pipeline': None, 'dehyphenate': True, 'delete_blank_paragraphs': True, 'disable_font_rescaling': False, 'dont_download_recipe': False, 'enable_heuristics': False, 'extra_css': None, 'fix_indents': True, 'font_size_mapping': None, 'format_scene_breaks': True, 'html_unwrap_factor': 0.4, 'input_encoding': None, 'input_profile': <calibre.customize.profiles.InputProfile object at 0x1087c0510>, 'insert_blank_line': False, 'insert_metadata': False, 'isbn': None, 'italicize_common_cases': True, 'keep_ligatures': False, 'language': None, 'level1_toc': None, 'level2_toc': None, 'level3_toc': None, 'line_height': 0, 'linearize_tables': False, 'lrf': False, 'margin_bottom': 5.0, 'margin_left': 5.0, 'margin_right': 5.0, 'margin_top': 5.0, 'markup_chapter_headings': True, 'max_toc_links': 50, 'minimum_line_height': 120.0, 'no_chapters_in_toc': False, 'no_inline_navbars': False, 'output_profile': <calibre.customize.profiles.OutputProfile object at 0x1087c08d0>, 'page_breaks_before': None, 'password': None, 'prefer_metadata_cover': False, 'pretty_print': True, 'pubdate': None, 'publisher': None, 'rating': None, 'read_metadata_from_opf': None, 'remove_first_image': False, 'remove_paragraph_spacing': False, 'remove_paragraph_spacing_indent_size': 1.5, 'renumber_headings': True, 'replace_scene_breaks': '', 'series': None, 'series_index': None, 'smarten_punctuation': False, 'sr1_replace': '', 'sr1_search': '', 'sr2_replace': '', 'sr2_search': '', 'sr3_replace': '', 'sr3_search': '', 'tags': None, 'test': True, 'timestamp': None, 'title': None, 'title_sort': None, 'toc_filter': None, 'toc_threshold': 6, 'unwrap_lines': True, 'use_auto_toc': False, 'username': None, 'verbose': 2}
1% Conversione dell'input in HTML...
InputFormatPlugin: Recipe Input running
1% Scaricamento feed...
1% Scaricamento feed Notizie Italia...
1% Scaricamento feed Notizie Europa...
1% Tentativo di scaricamento della copertina...
34% Scaricamento copertina da http://www.shopping24.ilsole24ore.com/ProductRelated/rds/img/logo_sole.gif
1% Preparazione dell'immagine principale in corso
Synthesizing mastheadImage
1% Inizio scaricamento [4 articoli]...
34% Feed scaricati in /private/var/folders/2u/2uk5pqalE4KW7TyOCw848k+++TI/-Tmp-/calibre_0.7.50_tmp_vNAf0b/calibre_0.7.50_s70ldH_plumber/index.html
34% Scaricamento completato
Parsing all content...
Parsing index.html ...
Forcing index.html into XHTML namespace
Parsing feed_0/index.html ...
Initial parse failed:
Traceback (most recent call last):
  File "site-packages/calibre/ebooks/oeb/base.py", line 886, in first_pass
  File "lxml.etree.pyx", line 2532, in lxml.etree.fromstring (src/lxml/lxml.etree.c:48634)
  File "parser.pxi", line 1545, in lxml.etree._parseMemoryDocument (src/lxml/lxml.etree.c:72245)
  File "parser.pxi", line 1417, in lxml.etree._parseDoc (src/lxml/lxml.etree.c:71041)
  File "parser.pxi", line 898, in lxml.etree._BaseParser._parseUnicodeDoc (src/lxml/lxml.etree.c:67581)
  File "parser.pxi", line 539, in lxml.etree._ParserContext._handleParseResultDoc (src/lxml/lxml.etree.c:64257)
  File "parser.pxi", line 625, in lxml.etree._handleParseResult (src/lxml/lxml.etree.c:65178)
  File "parser.pxi", line 565, in lxml.etree._raiseParseError (src/lxml/lxml.etree.c:64521)
XMLSyntaxError: Opening and ending tag mismatch: hr line 30 and div, line 31, column 7
Parsing file 'feed_0/index.html' as HTML
Forcing feed_0/index.html into XHTML namespace
Parsing feed_1/index.html ...
Initial parse failed:
Traceback (most recent call last):
  File "site-packages/calibre/ebooks/oeb/base.py", line 886, in first_pass
  File "lxml.etree.pyx", line 2532, in lxml.etree.fromstring (src/lxml/lxml.etree.c:48634)
  File "parser.pxi", line 1545, in lxml.etree._parseMemoryDocument (src/lxml/lxml.etree.c:72245)
  File "parser.pxi", line 1417, in lxml.etree._parseDoc (src/lxml/lxml.etree.c:71041)
  File "parser.pxi", line 898, in lxml.etree._BaseParser._parseUnicodeDoc (src/lxml/lxml.etree.c:67581)
  File "parser.pxi", line 539, in lxml.etree._ParserContext._handleParseResultDoc (src/lxml/lxml.etree.c:64257)
  File "parser.pxi", line 625, in lxml.etree._handleParseResult (src/lxml/lxml.etree.c:65178)
  File "parser.pxi", line 565, in lxml.etree._raiseParseError (src/lxml/lxml.etree.c:64521)
XMLSyntaxError: Opening and ending tag mismatch: hr line 30 and div, line 31, column 7
Parsing file 'feed_1/index.html' as HTML
Forcing feed_1/index.html into XHTML namespace
Reading TOC from NCX...
34% Transcodifica di un ebook in corso...
Merging user specified metadata...
Detecting structure...
Flattening CSS and remapping font sizes...
Source base font size is 12.00000pt
Cleaning up manifest...
Trimming unused files from manifest...
Parsing stylesheet.css ...
Creating OEB Output...
67% Creazione in corso OEB Output
The cover image has an id != "cover". Renaming to work around bug in Nook Color
OEB output written to /Users/marco/Dropbox/Public/Ricette Calibre/output_dir
Output salvato in /Users/marco/Dropbox/Public/Ricette Calibre/output_dir
Mac-mini-di-Marco-Saraceno:Ricette Calibre marco$
Code:
__author__ = 'Marco Saraceno'
__copyright__ = '2010, Marco Saraceno <marcosaraceno at gmail.com>'
description = 'Italian daily newspaper - v 1.1 (Mar14,2011)'
'''
http://www.ilsole24ore.com
'''

from calibre import browser
from calibre.web.feeds.recipes import BasicNewsRecipe
from calibre.ebooks.BeautifulSoup import BeautifulSoup

class IlSole24Ore(BasicNewsRecipe):
    __author__ = 'Marco Saraceno'
    description = 'Italian financial daily newspaper'
    cover_url = 'http://www.shopping24.ilsole24ore.com/ProductRelated/rds/img/logo_sole.gif'
    title = u'Il Sole 24 Ore'
    publisher = 'Gruppo editoriale GRUPPO 24ORE'
    category = 'News, politics, culture, economy, financial, Italian'
    language = 'it'
    timefmt = '[%a, %d %b, %Y]'
    oldest_article = 2
    max_articles_per_feed = 100
    use_embedded_content = False
    extra_css = '.headline {font-size: x-large;} \n .fact { padding-top: 10pt }'

    remove_tags = [
        dict(name='div', attrs={'class':['header','titolo']}),
        dict(name='table', attrs={'class':['footer1024','footerdown']}),
    ]

    def get_article_url(self, article):
        original_url = BasicNewsRecipe.get_article_url(self, article)
        raw = browser().open_novisit(original_url).read()
        soup = BeautifulSoup(raw)
        # Find the link to the actual article in the soup and return that

    feeds = [
        (u'Notizie Italia', u'http://www.ilsole24ore.com/rss/notizie/italia.xml'),
        (u'Notizie Europa', u'http://www.ilsole24ore.com/rss/notizie/europa.xml'),
        (u'Notizie USA', u'http://www.ilsole24ore.com/rss/notizie/usa.xml'),
        (u'Notizie Americhe', u'http://www.ilsole24ore.com/rss/notizie/americhe.xml'),
        (u'Notizie Medio Oriente e Africa', u'http://www.ilsole24ore.com/rss/notizie/medio-oriente-e-africa.xml'),
        (u'Notizie Asia e Oceania', u'http://www.ilsole24ore.com/rss/notizie/asia-e-oceania.xml'),
        (u'Commenti', u'http://www.ilsole24ore.com/rss/commenti-e-idee.xml'),
        (u'Norme e tributi', u'http://www.ilsole24ore.com/rss/norme-e-tributi.xml'),
        (u'Finanza', u'http://www.ilsole24ore.com/rss/finanza-e-mercati.xml'),
        (u'Economia', u'http://www.ilsole24ore.com/rss/economia.xml'),
        (u'Tecnologia', u'http://www.ilsole24ore.com/rss/tecnologie.xml'),
        (u'Cultura', u'http://www.ilsole24ore.com/rss/cultura.xml'),
    ]
|
|
03-31-2011, 08:15 AM | #6 | |
Member
Posts: 11
Karma: 10
Join Date: Mar 2011
Device: kindle 3
|
Looking at the feed code, I figured out that the link to the original article (in bold) is present, but it is encoded (I guess). Any idea how I can grab it?
Thanks Quote:
|
|
03-31-2011, 11:17 AM | #7 |
creator of calibre
Posts: 43,850
Karma: 22666666
Join Date: Oct 2006
Location: Mumbai, India
Device: Various
|
from urllib import unquote
unquote(encoded_url) |
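For reference, the standard-library function for undoing percent-encoding is unquote, which lives in urllib on Python 2 (what calibre 0.7.x ran on, as far as I know) and in urllib.parse on Python 3. A minimal sketch follows, assuming the link in the feed is plain percent-encoded; the encoded_url value is a made-up stand-in, since the actual link from the feed is not shown in this thread.

```python
try:
    from urllib.parse import unquote  # Python 3
except ImportError:
    from urllib import unquote  # Python 2

# Hypothetical percent-encoded link, used only for illustration.
encoded_url = ('http%3A%2F%2Fwww.ilsole24ore.com%2Fart%2Fnotizie'
               '%2Fsample.shtml%3Fuuid%3DABC')
decoded_url = unquote(encoded_url)
print(decoded_url)
```

Note that unquote only handles %XX escapes. It would not touch the 0C/0E style codes visible in the feedsportal link from the first post, which would need their own mapping.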
|