So I have had a go at setting up a recipe for Cracked.com, as I can't see an existing one out there.
The Cracked.com website is a messy piece of work and I can't for the life of me get it working. The main problems are that I can't get the second pages to append to the first pages, and I can't figure out how to get rid of all the tables so the content breaks into pages I can read on my Sony PRS-650. I used the Adventure Gamers recipe as a starting point and modified it only a little:
Code:
from calibre.web.feeds.news import BasicNewsRecipe

class Cracked(BasicNewsRecipe):
    title                 = u'Cracked.com'
    language              = 'en'
    description           = "America's Only Humor and Video Site, since 1958"
    publisher             = 'Cracked'
    category              = 'comedy, lists'
    oldest_article        = 7
    delay                 = 10
    max_articles_per_feed = 50
    no_stylesheets        = True
    encoding              = 'cp1252'
    remove_javascript     = True
    use_embedded_content  = False
    INDEX                 = u'http://www.cracked.com'

    extra_css = """
        .pageheader_type{font-size: x-large; font-weight: bold; color: #828D74}
        .pageheader_title{font-size: xx-large; color: #394128}
        .pageheader_byline{font-size: small; font-weight: bold; color: #394128}
        .score_bg {display: inline; width: 100%; margin-bottom: 2em}
        .score_column_1{ padding-left: 10px; font-size: small; width: 50%}
        .score_column_2{ padding-left: 10px; font-size: small; width: 50%}
        .score_column_3{ padding-left: 10px; font-size: small; width: 50%}
        .score_header{font-size: large; color: #50544A}
        .bodytext{display: block}
        body{font-family: Helvetica,Arial,sans-serif}
        """

    conversion_options = {
        'comment'   : description,
        'tags'      : category,
        'publisher' : publisher,
        'language'  : language,
    }

    keep_only_tags = [
        dict(name='div', attrs={'class':['Column1']})
    ]

    remove_tags = [
        dict(name='div', attrs={'id':['googlead_1','fb-like-article','comments_section']}),
        dict(name='div', attrs={'class':['share_buttons_col_1','GenericModule1']}),
        # Note: I believe the keep_only/remove tag cleanup runs *before*
        # preprocess_html, so if the 'next page' link sits inside this nav
        # list it will already be gone when append_page looks for it.
        dict(name='ul', attrs={'class':['Nav6']}),
    ]

    remove_tags_after = [dict(name='div', attrs={'id':'fb-like-article'})]
    remove_attributes = ['width','height']

    feeds = [(u'Articles', u'http://feeds.feedburner.com/CrackedRSS')]

    def get_article_url(self, article):
        return article.get('guid', None)

    def append_page(self, soup, appendtag, position):
        # Follow the 'next page' link and splice its article text into
        # the current page, recursing until there are no more pages.
        pager = soup.find('li', attrs={'class':'forward'})
        if pager:
            nexturl = self.INDEX + pager.a['href']
            soup2 = self.index_to_soup(nexturl)
            texttag = soup2.find('div', attrs={'class':'Column1'})
            for it in texttag.findAll(style=True):
                del it['style']
            newpos = len(texttag.contents)
            self.append_page(soup2, texttag, newpos)
            texttag.extract()
            appendtag.insert(position, texttag)

    def preprocess_html(self, soup):
        for item in soup.findAll(style=True):
            del item['style']
        self.append_page(soup, soup.body, 3)
        pager = soup.find('div', attrs={'class':'prev_next'})
        if pager:
            pager.extract()
        return self.adeify_images(soup)
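For the tables, one idea I had was to flatten all the table markup into plain divs inside preprocess_html, so the reader can reflow and paginate the text itself. This is just a sketch I haven't tested against Cracked's current markup (and it uses bs4 for illustration; calibre bundles its own BeautifulSoup, but `findAll` works the same way there):

```python
from bs4 import BeautifulSoup  # calibre's recipes get a bundled BeautifulSoup instead

def flatten_tables(soup):
    # Rename every table-related tag to a plain <div> and drop its
    # attributes, so the text inside can reflow on the reader.
    for name in ('table', 'thead', 'tbody', 'tr', 'td', 'th'):
        for tag in soup.findAll(name):
            tag.name = 'div'
            tag.attrs = {}
    return soup
```

In the recipe this would be called from preprocess_html, something like `soup = flatten_tables(soup)` just before returning.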
If anyone could help me that would be great.
I am also having problems with 'The Onion' recipe. Something in that site's code crashes my Sony PRS-650. I read somewhere else that this is a firmware problem, but that you can just remove whatever bit of markup is causing it. Does anyone know which bit of HTML on 'The Onion' site might be causing the problem? And how do I actually get rid of it using the recipe that comes with calibre?
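My guess, and it is only a guess, is that it's something like an embedded video iframe or a script tag, since I don't know which tag the PRS-650 actually chokes on. If so, I assume the fix is to customize the builtin recipe (Fetch news &gt; Add a custom news source &gt; Customize builtin recipe) and strip the suspect tags, with cleanup logic along these lines (again a bs4 sketch; the tag names are hypothetical):

```python
from bs4 import BeautifulSoup  # in a real recipe, calibre's bundled BeautifulSoup

# Hypothetical list of culprits -- I don't know which one crashes the reader
SUSPECT_TAGS = ('iframe', 'embed', 'object', 'script', 'noscript')

def strip_suspects(soup):
    # Remove each suspect tag (and everything inside it) from the page
    for name in SUSPECT_TAGS:
        for tag in soup.findAll(name):
            tag.extract()
    return soup
```

In a customized recipe the same thing could probably be done declaratively, e.g. by adding `dict(name=['iframe', 'embed', 'object'])` to the recipe's remove_tags list.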
I can get all my serious news fine but if I want to have a light read of some comedy I seem to be out of luck!