Old 08-26-2010, 10:15 AM   #2528
Starson17
Wizard
Quote:
Originally Posted by naisren View Post
My recipe is
Spoiler:
Code:
import re
from calibre.web.feeds.news import BasicNewsRecipe

class VOA(BasicNewsRecipe):

    title      = 'VOA News'
    __author__ = 'voa'
    description = 'VOA through 51'
    language = 'en'
    remove_javascript = True

    remove_tags_before = dict(id=['rightContainer'])
    remove_tags_after  = dict(id=['listads'])
    remove_tags        = [
                          dict(id=['contentAds']), dict(id=['playbar']), dict(id=['menubar']), 
                         ]    
    no_stylesheets = True
    extra_css = '''
                '''


    def parse_index(self):
        soup = self.index_to_soup('http://www.51voa.com/')
        feeds = []
        section = []
        title = None

        # for x in soup.find(id='list').findAll('a'):
        for x in soup.find(id='rightContainer').findAll('a'):
            if '/VOA_Special_English/' in x['href'] or '/VOA_Standard_English/' in x['href']:
                article = {
                    'url': 'http://www.51voa.com/' + x['href'],
                    'title': self.tag_to_string(x),
                    'date': '',
                    'description': '',
                }
                section.append(article)

        feeds.append(('Newest', section))

        return feeds
OK, your problem was so interesting that I couldn't resist looking at it further. The cause is the bad HTML in your source page, http://www.51voa.com/. Each opening tag ends with ' />' instead of ' >', so every tag is effectively closed twice: once by the slash inside the opening tag and again by its regular closing tag. BeautifulSoup is confused by this and sees the entire page as a single [document] element holding a single NavigableString of text, not as multiple tags. Since there are no tags for BeautifulSoup to find or work with, none of the tag-based search or manipulation commands will work.

To fix this, you first grab that page (as you have already done in your code):

Code:
        soup = self.index_to_soup('http://www.51voa.com/')
Then grab the string held in the contents of that single big [document] element, and search and replace the bad closing brackets as follows:
Code:
        rawc = soup.contents[0].string.replace(' />', ' >')
Now the markup is fixed, but it's still just text, so next you convert the string back into a BeautifulSoup object:
Code:
        soup = BeautifulSoup(rawc, fromEncoding=self.encoding)
(Also, add
Code:
from calibre.ebooks.BeautifulSoup import BeautifulSoup
to your recipe)

That's it! Put the two extra lines above right after your first index_to_soup line. Be aware that the simple search and replace will also mangle any legitimate self-closing tags, such as <img> and <br>. You may have to special-case the tags that are allowed to have a closing slash inside the opening tag so they are left alone.
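
If the mangled <img> and <br> tags turn out to matter, one possible way to special-case them is a regex that only drops the slash on tags that are not allowed to be self-closing. This is just a rough sketch, not something I've tested against the real page: the names VOID_TAGS, SELF_CLOSED and fix_self_closed are made up for illustration, and it assumes the page has no '<' or '>' characters inside attribute values.

```python
import re

# Tags that may legitimately end in '/>' (HTML void elements).
# Hypothetical helper names; not part of calibre or BeautifulSoup.
VOID_TAGS = set(['area', 'base', 'br', 'col', 'embed', 'hr', 'img',
                 'input', 'link', 'meta', 'param', 'source', 'track', 'wbr'])

# An opening tag ending in '/>': group 1 is the tag name,
# group 2 is the attribute text (possibly empty).
SELF_CLOSED = re.compile(r'<([A-Za-z][A-Za-z0-9]*)((?:\s[^<>]*?)?)\s*/>')


def fix_self_closed(raw):
    """Drop the stray slash from non-void tags, leave void tags untouched."""
    def repl(match):
        name = match.group(1)
        if name.lower() in VOID_TAGS:
            # <img ... />, <br /> and friends are fine as they are.
            return match.group(0)
        # Rebuild the tag as a normal opening tag.
        return '<' + name + match.group(2) + '>'
    return SELF_CLOSED.sub(repl, raw)
```

In the recipe, this would take the place of the plain replace, i.e. rawc = fix_self_closed(soup.contents[0].string), with the rest of the steps unchanged.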

Edit: I forgot, you also need this line in your recipe class:
Code:
    encoding = 'utf-8'
or else the final step will fail.

Last edited by Starson17; 08-26-2010 at 04:08 PM.