Old 03-29-2011, 08:49 AM   #1
bosplans
Fix a recipe

Hi,

The recipe I made some months ago no longer works: the newspaper has switched to a new feed service called "feedsportal.com", which mangles the links to the articles and forwards the visitor to ads ...

I have figured out how to solve the problem in theory, but unfortunately I do not know regular expressions. The idea is to use print_version(self, url) to convert the following link from:

http://rss.feedsportal.com/c/32276/f...24ore0N0Cart0Cnotizie0C20A110E0A30E290Clampedusa0Eabitanti0Eoccupano0Emunicipio0E11570A60Bshtml0Duuid0FAauakRKD/story01.htm

To:

http://www.ilsole24ore.com/art/notizie/2011-03-29/lampedusa-abitanti-occupano-municipio-115706_PRN.shtml

As you can see, the first link contains all the information needed for the conversion ... but I have no idea how to make the magic happen! Can someone help me, or tell me where to find a good resource to learn the principles? Is this possible, or is there an easier workaround?
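
The conversion described above can be sketched without regular expressions. This is only an illustration under stated assumptions, not a tested recipe: the code table is inferred from the single pair of links in this post (for example `0C` for `/` and `0E` for `-`), feedsportal may well use further codes that are not listed, and the names `FEEDSPORTAL_CODES`, `decode_feedsportal` and `to_print_version` are hypothetical. Because the first link is truncated ("f..."), the example decodes only the visible fragment.

```python
# Two-character codes inferred from the example link pair in this thread.
# This table is an assumption and is probably incomplete.
FEEDSPORTAL_CODES = {
    'A': '0',     # 0A -> 0
    'B': '.',     # 0B -> .
    'C': '/',     # 0C -> /
    'D': '?',     # 0D -> ?
    'E': '-',     # 0E -> -
    'F': '=',     # 0F -> =
    'N': '.com',  # 0N -> .com
}


def decode_feedsportal(fragment):
    """Expand '0X' escape pairs in a feedsportal path fragment."""
    out = []
    i = 0
    while i < len(fragment):
        code = fragment[i + 1:i + 2]  # empty string at the end of the input
        if fragment[i] == '0' and code in FEEDSPORTAL_CODES:
            out.append(FEEDSPORTAL_CODES[code])
            i += 2
        else:
            out.append(fragment[i])
            i += 1
    return ''.join(out)


def to_print_version(url):
    """Drop the query string and switch to the _PRN print page."""
    return url.split('?', 1)[0].replace('.shtml', '_PRN.shtml')


fragment = ('24ore0N0Cart0Cnotizie0C20A110E0A30E290Clampedusa0Eabitanti'
            '0Eoccupano0Emunicipio0E11570A60Bshtml0Duuid0FAauakRKD')
decoded = decode_feedsportal(fragment)
print(decoded)
print(to_print_version(decoded))
```

The input is scanned left to right, so that `0A` (a literal zero) is consumed as one unit and the character after it is never mistaken for the start of another code. A chain of `str.replace` calls would not give that guarantee.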

Thanks in advance!

The original recipe:
Code:
__author__    = 'Marco Saraceno'
__copyright__ = '2010, Marco Saraceno <marcosaraceno at gmail.com>'
description   = 'Italian daily newspaper - v 1.1 (Mar14,2011)'

'''
http://www.ilsole24ore.com
'''

from calibre.web.feeds.recipes import BasicNewsRecipe

class IlSole24Ore(BasicNewsRecipe):
    __author__        = 'Marco Saraceno'
    description   = 'Italian financial daily newspaper'

    cover_url      = 'http://www.shopping24.ilsole24ore.com/ProductRelated/rds/img/logo_sole.gif'
    title          = u'Il Sole 24 Ore'
    publisher      = 'Gruppo editoriale GRUPPO 24ORE'
    category       = 'News, politics, culture, economy, financial, Italian'

    language       = 'it'
    timefmt        = '[%a, %d %b, %Y]'

    oldest_article = 2
    max_articles_per_feed = 100
    use_embedded_content  = False
    recursion             = 10
    extra_css      = '.headline {font-size: x-large;} \n .fact { padding-top: 10pt  }'

         
    remove_tags = [
        dict(name='div', attrs={'class':['header','titolo']}),
        dict(name='table', attrs={'class':['footer1024','footerdown']}),
    ]

    feeds = [
        (u'Notizie Italia', u'http://www.ilsole24ore.com/rss/notizie/italia.xml'),
        (u'Notizie Europa', u'http://www.ilsole24ore.com/rss/notizie/europa.xml'),
        (u'Notizie USA', u'http://www.ilsole24ore.com/rss/notizie/usa.xml'),
        (u'Notizie Americhe', u'http://www.ilsole24ore.com/rss/notizie/americhe.xml'),
        (u'Notizie Medio Oriente e Africa', u'http://www.ilsole24ore.com/rss/notizie/medio-oriente-e-africa.xml'),
        (u'Notizie Asia e Oceania', u'http://www.ilsole24ore.com/rss/notizie/asia-e-oceania.xml'),
        (u'Commenti', u'http://www.ilsole24ore.com/rss/commenti-e-idee.xml'),
        (u'Norme e tributi', u'http://www.ilsole24ore.com/rss/norme-e-tributi.xml'),
        (u'Finanza', u'http://www.ilsole24ore.com/rss/finanza-e-mercati.xml'),
        (u'Economia', u'http://www.ilsole24ore.com/rss/economia.xml'),
        (u'Tecnologia', u'http://www.ilsole24ore.com/rss/tecnologie.xml'),
        (u'Cultura', u'http://www.ilsole24ore.com/rss/cultura.xml'),
    ]

    def print_version(self, url):
        # The print-friendly page has the same address with a _PRN suffix
        return url.replace('.shtml', '_PRN.shtml')
Old 03-29-2011, 10:28 AM   #2
kovidgoyal
creator of calibre
You might find it easier to extract the correct link from the descriptions in the RSS feed. To do that, implement the get_article_url function in your recipe.

Something like:

Code:
from calibre import browser
from calibre.ebooks.BeautifulSoup import BeautifulSoup

def get_article_url(self, article):
    original_url = BasicNewsRecipe.get_article_url(self, article)
    raw = browser().open_novisit(original_url).read()
    soup = BeautifulSoup(raw)
    # Find the link to the actual article in the soup and return that

Last edited by kovidgoyal; 03-29-2011 at 10:35 AM.
Old 03-29-2011, 12:45 PM   #3
bosplans
Thank you for your prompt reply. I'll try it right away!
By the way, where can I find a good how-to on coding recipes and the syntax of all the functions? I am a newbie :-(
Old 03-29-2011, 01:52 PM   #4
Starson17
Quote:
Originally Posted by bosplans
Where can I find a good how-to on coding recipes and the syntax of all the functions?
Start here:
http://calibre-ebook.com/user_manual/news.html
Also, see the links at the end of that page.
Old 03-29-2011, 07:00 PM   #5
bosplans
I tried several implementations of get_article_url, but none of them worked. It looks like the recipe cannot find the real link to the article.

Here is the log:
Code:
Resolved conversion options
calibre version: 0.7.50
{'asciiize': False,
 'author_sort': None,
 'authors': None,
 'base_font_size': 0,
 'book_producer': None,
 'change_justification': 'original',
 'chapter': None,
 'chapter_mark': 'pagebreak',
 'comments': None,
 'cover': None,
 'debug_pipeline': None,
 'dehyphenate': True,
 'delete_blank_paragraphs': True,
 'disable_font_rescaling': False,
 'dont_download_recipe': False,
 'enable_heuristics': False,
 'extra_css': None,
 'fix_indents': True,
 'font_size_mapping': None,
 'format_scene_breaks': True,
 'html_unwrap_factor': 0.4,
 'input_encoding': None,
 'input_profile': <calibre.customize.profiles.InputProfile object at 0x1087c0510>,
 'insert_blank_line': False,
 'insert_metadata': False,
 'isbn': None,
 'italicize_common_cases': True,
 'keep_ligatures': False,
 'language': None,
 'level1_toc': None,
 'level2_toc': None,
 'level3_toc': None,
 'line_height': 0,
 'linearize_tables': False,
 'lrf': False,
 'margin_bottom': 5.0,
 'margin_left': 5.0,
 'margin_right': 5.0,
 'margin_top': 5.0,
 'markup_chapter_headings': True,
 'max_toc_links': 50,
 'minimum_line_height': 120.0,
 'no_chapters_in_toc': False,
 'no_inline_navbars': False,
 'output_profile': <calibre.customize.profiles.OutputProfile object at 0x1087c08d0>,
 'page_breaks_before': None,
 'password': None,
 'prefer_metadata_cover': False,
 'pretty_print': True,
 'pubdate': None,
 'publisher': None,
 'rating': None,
 'read_metadata_from_opf': None,
 'remove_first_image': False,
 'remove_paragraph_spacing': False,
 'remove_paragraph_spacing_indent_size': 1.5,
 'renumber_headings': True,
 'replace_scene_breaks': '',
 'series': None,
 'series_index': None,
 'smarten_punctuation': False,
 'sr1_replace': '',
 'sr1_search': '',
 'sr2_replace': '',
 'sr2_search': '',
 'sr3_replace': '',
 'sr3_search': '',
 'tags': None,
 'test': True,
 'timestamp': None,
 'title': None,
 'title_sort': None,
 'toc_filter': None,
 'toc_threshold': 6,
 'unwrap_lines': True,
 'use_auto_toc': False,
 'username': None,
 'verbose': 2}
1% Conversione dell'input in HTML...
InputFormatPlugin: Recipe Input running
1% Scaricamento feed...
1% Scaricamento feed Notizie Italia...
1% Scaricamento feed Notizie Europa...
1% Tentativo di scaricamento della copertina...
34% Scaricamento copertina da http://www.shopping24.ilsole24ore.com/ProductRelated/rds/img/logo_sole.gif
1% Preparazione dell'immagine principale in corso
Synthesizing mastheadImage
1% Inizio scaricamento [4 articoli]...
34% Feed scaricati in /private/var/folders/2u/2uk5pqalE4KW7TyOCw848k+++TI/-Tmp-/calibre_0.7.50_tmp_vNAf0b/calibre_0.7.50_s70ldH_plumber/index.html
34% Scaricamento completato
Parsing all content...
Parsing index.html ...
Forcing index.html into XHTML namespace
Parsing feed_0/index.html ...
Initial parse failed:
Traceback (most recent call last):
  File "site-packages/calibre/ebooks/oeb/base.py", line 886, in first_pass
  File "lxml.etree.pyx", line 2532, in lxml.etree.fromstring (src/lxml/lxml.etree.c:48634)
  File "parser.pxi", line 1545, in lxml.etree._parseMemoryDocument (src/lxml/lxml.etree.c:72245)
  File "parser.pxi", line 1417, in lxml.etree._parseDoc (src/lxml/lxml.etree.c:71041)
  File "parser.pxi", line 898, in lxml.etree._BaseParser._parseUnicodeDoc (src/lxml/lxml.etree.c:67581)
  File "parser.pxi", line 539, in lxml.etree._ParserContext._handleParseResultDoc (src/lxml/lxml.etree.c:64257)
  File "parser.pxi", line 625, in lxml.etree._handleParseResult (src/lxml/lxml.etree.c:65178)
  File "parser.pxi", line 565, in lxml.etree._raiseParseError (src/lxml/lxml.etree.c:64521)
XMLSyntaxError: Opening and ending tag mismatch: hr line 30 and div, line 31, column 7

Parsing file 'feed_0/index.html' as HTML
Forcing feed_0/index.html into XHTML namespace
Parsing feed_1/index.html ...
Initial parse failed:
Traceback (most recent call last):
  File "site-packages/calibre/ebooks/oeb/base.py", line 886, in first_pass
  File "lxml.etree.pyx", line 2532, in lxml.etree.fromstring (src/lxml/lxml.etree.c:48634)
  File "parser.pxi", line 1545, in lxml.etree._parseMemoryDocument (src/lxml/lxml.etree.c:72245)
  File "parser.pxi", line 1417, in lxml.etree._parseDoc (src/lxml/lxml.etree.c:71041)
  File "parser.pxi", line 898, in lxml.etree._BaseParser._parseUnicodeDoc (src/lxml/lxml.etree.c:67581)
  File "parser.pxi", line 539, in lxml.etree._ParserContext._handleParseResultDoc (src/lxml/lxml.etree.c:64257)
  File "parser.pxi", line 625, in lxml.etree._handleParseResult (src/lxml/lxml.etree.c:65178)
  File "parser.pxi", line 565, in lxml.etree._raiseParseError (src/lxml/lxml.etree.c:64521)
XMLSyntaxError: Opening and ending tag mismatch: hr line 30 and div, line 31, column 7

Parsing file 'feed_1/index.html' as HTML
Forcing feed_1/index.html into XHTML namespace
Reading TOC from NCX...
34% Transcodifica di un ebook in corso...
Merging user specified metadata...
Detecting structure...
Flattening CSS and remapping font sizes...
Source base font size is 12.00000pt
Cleaning up manifest...
Trimming unused files from manifest...
Parsing stylesheet.css ...
Creating OEB Output...
67% Creazione in corso OEB Output
The cover image has an id != "cover". Renaming to work around bug in Nook Color
OEB output written to /Users/marco/Dropbox/Public/Ricette Calibre/output_dir
Output salvato in   /Users/marco/Dropbox/Public/Ricette Calibre/output_dir
Mac-mini-di-Marco-Saraceno:Ricette Calibre marco$
And here is the recipe code I last used:
Code:
__author__    = 'Marco Saraceno'
__copyright__ = '2010, Marco Saraceno <marcosaraceno at gmail.com>'
description   = 'Italian daily newspaper - v 1.1 (Mar14,2011)'

'''
http://www.ilsole24ore.com
'''
from calibre import browser
from calibre.web.feeds.recipes import BasicNewsRecipe
from calibre.ebooks.BeautifulSoup import BeautifulSoup
class IlSole24Ore(BasicNewsRecipe):
    __author__        = 'Marco Saraceno'
    description   = 'Italian financial daily newspaper'

    cover_url      = 'http://www.shopping24.ilsole24ore.com/ProductRelated/rds/img/logo_sole.gif'
    title          = u'Il Sole 24 Ore'
    publisher      = 'Gruppo editoriale GRUPPO 24ORE'
    category       = 'News, politics, culture, economy, financial, Italian'

    language       = 'it'
    timefmt        = '[%a, %d %b, %Y]'

    oldest_article = 2
    max_articles_per_feed = 100
    use_embedded_content  = False
    extra_css      = '.headline {font-size: x-large;} \n .fact { padding-top: 10pt  }'

         
    remove_tags = [
                            dict(name='div', attrs={'class':['header','titolo']}),
                            dict(name='table', attrs={'class':['footer1024','footerdown']}),
                            ]
                            

    def get_article_url(self, article):
        original_url = BasicNewsRecipe.get_article_url(self, article)
        raw = browser().open_novisit(original_url).read()
        soup = BeautifulSoup(raw)
        # Find the link to the actual article in the soup and return that

    feeds = [
                  (u'Notizie Italia', u'http://www.ilsole24ore.com/rss/notizie/italia.xml'),
                  (u'Notizie Europa', u'http://www.ilsole24ore.com/rss/notizie/europa.xml'),
                  (u'Notizie USA', u'http://www.ilsole24ore.com/rss/notizie/usa.xml'),
                  (u'Notizie Americhe', u'http://www.ilsole24ore.com/rss/notizie/americhe.xml'),
                  (u'Notizie Medio Oriente e Africa', u'http://www.ilsole24ore.com/rss/notizie/medio-oriente-e-africa.xml'),
                  (u'Notizie Asia e Oceania', u'http://www.ilsole24ore.com/rss/notizie/asia-e-oceania.xml'),
                  (u'Commenti', u'http://www.ilsole24ore.com/rss/commenti-e-idee.xml'),
                  (u'Norme e tributi', u'http://www.ilsole24ore.com/rss/norme-e-tributi.xml'),
                  (u'Finanza', u'http://www.ilsole24ore.com/rss/finanza-e-mercati.xml'),
                  (u'Economia', u'http://www.ilsole24ore.com/rss/economia.xml'),
                  (u'Tecnologia', u'http://www.ilsole24ore.com/rss/tecnologie.xml'),
                  (u'Cultura', u'http://www.ilsole24ore.com/rss/cultura.xml'),
                  ]
Any suggestions?
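
One thing worth checking in the recipe above: get_article_url ends with a comment and never returns anything, and a Python function that falls off its end returns `None`. How calibre treats that value is not shown in this thread, but the method cannot hand back a usable link as written. Below is a minimal sketch of the shape it probably needs, with a fallback to the feed's own link. `find_real_link` is a hypothetical callable standing in for whatever lookup ends up working, and the example.com address is a placeholder.

```python
def broken_get_url(original_url):
    candidate = original_url  # some work happens here...
    # ...but nothing is returned, so the caller receives None


def resolve_article_url(original_url, find_real_link):
    """Return the real article link if one is found, else the feed's link."""
    real_link = find_real_link(original_url)
    return real_link if real_link else original_url


feed_link = 'http://example.com/story01.htm'  # placeholder address
print(broken_get_url(feed_link))                         # prints None
print(resolve_article_url(feed_link, lambda url: None))  # prints the feed link
```

Inside the recipe, the same idea amounts to ending get_article_url with an explicit `return` of either the link found in the soup or `original_url`.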
Old 03-31-2011, 08:15 AM   #6
bosplans
Looking at the feed source, I found that the link to the original article is present (in the link= parameter of the href attributes below), but it seems to be encoded. Any idea how I can grab it?
Thanks

Quote:
<div class="mf-viral">
<table border="0">
<tbody>
<tr>
<td valign="middle">
<a href="http://res.feedsportal.com/viral/sendemail2_it.html?title=Chi+guadagna+e+chi+perde+tra+le+Regioni+con+il+piano+immigrati&link=http%3A%2F%2Fwww.ilsole24ore.com%2Fart%2Fnotizie%2F2011-03-31%2Fecco-come-sono-suddivisi-121704.shtml%3Fuuid%3DAaCRM1KD" target="_blank">
<img src="http://rss.feedsportal.com/images/emailthis2_it.gif" border="0"/>
</a>
</td>
<td valign="middle">
<a href="http://res.feedsportal.com/viral/bookmark_it.cfm?title=Chi+guadagna+e+chi+perde+tra+le+Regioni+con+il+piano+immigrati&link=http%3A%2F%2Fwww.ilsole24ore.com%2Fart%2Fnotizie%2F2011-03-31%2Fecco-come-sono-suddivisi-121704.shtml%3Fuuid%3DAaCRM1KD" target="_blank">
<img src="http://rss.feedsportal.com/images/bookmark_it.gif" border="0"/>
</a>
</td>
</tr>
</tbody>
</table>
</div>
Old 03-31-2011, 11:17 AM   #7
kovidgoyal
from urllib import unquote  # on Python 3: from urllib.parse import unquote

unquote(encoded_url)
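
Applied to the markup quoted in the previous post, the decoding step might look like the sketch below. It assumes Python 3 (`urllib.parse`; on the Python 2 that calibre 0.7.x runs on, `urlsplit` and `parse_qs` live in the `urlparse` module), it assumes the real address always sits in the `link` query parameter of a feedsportal href, and the helper name `real_link_from_href` is hypothetical. `parse_qs` already percent-decodes the values, so no separate unquote call is needed here.

```python
from urllib.parse import urlsplit, parse_qs


def real_link_from_href(href):
    """Return the decoded 'link' query parameter of href, or None."""
    query = parse_qs(urlsplit(href).query)
    values = query.get('link')
    return values[0] if values else None


# href copied from the bookmark link in the quoted feed markup
href = ('http://res.feedsportal.com/viral/bookmark_it.cfm'
        '?title=Chi+guadagna+e+chi+perde+tra+le+Regioni+con+il+piano+immigrati'
        '&link=http%3A%2F%2Fwww.ilsole24ore.com%2Fart%2Fnotizie%2F2011-03-31'
        '%2Fecco-come-sono-suddivisi-121704.shtml%3Fuuid%3DAaCRM1KD')
print(real_link_from_href(href))
```

The helper returns `None` when there is no `link` parameter, so a caller can fall back to the feed's original URL instead of failing.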