View Single Post
Old 06-22-2010, 03:12 PM   #2186
gambarini
Connoisseur
gambarini began at the beginning.
 
Posts: 98
Karma: 22
Join Date: Mar 2010
Device: IRiver Story, Ipod Touch, Android SmartPhone
Quote:
Originally Posted by gambarini View Post
With this feed i have tried two ways, and every one has is pro and cons...

With get.article i can obtain the correct link, but i can't find the title of the article.
With the parse_index ( index_to_soup) i can find the correct "title" but i don't get the link (in the soup there is a malformed "link" tag)
an example of index to soup
Spoiler:

Code:
<item>
<title><![CDATA[Berlusconi: "Siamo il Paese 
più ricco d'Europa"]]></title>
<description><![CDATA[ROMA<BR>Il Premier Silvio Berlusconi continua a confidare su un forte consenso popolare alla sua persona e al suo governo, a dispetto «di tutto il fango che ci buttano addosso». E inivita il centrodestra a «non farsi del male in casa», apprpoffitando semmai di una opposizione che descrive pressochè inesistente. «Nonostante tutto il fango che tentano di buttarci addosso - dice nel suo collegamento  ...(continua)]]></description>
<author><![CDATA[ ]]></author>
<category><![CDATA[POLITICA]]></category>
<pubdate><![CDATA[Sun, 20 Jun 2010 13:34:37 +0200]]></pubdate>
<link />http://www.lastampa.it/redazione/cmsSezioni/politica/201006articoli/56066girata.asp
<enclosure url="http://www.lastampa.it/redazione/cmssezioni/politica/201006images/berlusconi01g.jpg" type="image/jpeg">
<image>
<url>http://www.lastampa.it/redazione/cmssezioni/politica/201006images/berlusconi01g.jpg</url>
<title></title>
<link />
<width></width>
<height></height>
</image>

So is there the possibility to use both solutions together?
Or is there the possibility to extract the link near the malformet tag <link /> ???


p.s.

probably the bug is related to the feed
Spoiler:

Code:
Parsing index.html ...
Initial parse failed:
Traceback (most recent call last):
  File "site-packages\calibre\ebooks\oeb\base.py", line 813, in first_pass
  File "lxml.etree.pyx", line 2538, in lxml.etree.fromstring (src/lxml/lxml.etree.c:48266)
  File "parser.pxi", line 1536, in lxml.etree._parseMemoryDocument (src/lxml/lxml.etree.c:71653)
  File "parser.pxi", line 1408, in lxml.etree._parseDoc (src/lxml/lxml.etree.c:70449)
  File "parser.pxi", line 898, in lxml.etree._BaseParser._parseUnicodeDoc (src/lxml/lxml.etree.c:67144)
  File "parser.pxi", line 539, in lxml.etree._ParserContext._handleParseResultDoc (src/lxml/lxml.etree.c:63820)
  File "parser.pxi", line 625, in lxml.etree._handleParseResult (src/lxml/lxml.etree.c:64741)
  File "parser.pxi", line 565, in lxml.etree._raiseParseError (src/lxml/lxml.etree.c:64084)
XMLSyntaxError: Opening and ending tag mismatch: img line 29 and p, line 29, column 27
ok, this is my solution; i don't use the feed but i try to obtain link directly from the html section of the site.
So this is the code (beta version )

Spoiler:

Code:
 def parse_index(self):
    feeds = []
    for title, url in [
             ("Politica", "http://www.lastampa.it/_web/CMSTP/tmplSezioni/POLITICA/politicaHP.asp")
            ]:

            soup = self.index_to_soup(url)
            soup = soup.find(attrs={'class':'sezione'})

            articles = []

            for article in soup.findAllNext(attrs={'class':'titolo'}):
                title_url = self.tag_to_string(article)
                link = article.get('href', False)
                
                date = ''
                description = ''
#                link = article.link
#                link = article.find ('link />')
                     
                if title_url:
                   articles.append({'title': title_url, 'url': link,'description':'', 'date':date}),


            if articles:
               feeds.append((title, articles))

            return feeds
gambarini is offline