Cracked.com and The Onion

limnoski · 01-10-2011, 11:23 AM

So I have had a go at setting up a recipe for Cracked.com as I can't see any existing one out there.

The Cracked.com website is a messy piece of work and I can't for the life of me get the recipe to work. The main problem is that I can't get the second page of an article to append to the first, and I can't figure out how to get rid of all the tables so that the text breaks up into pages I can read on my Sony PRS-650. I used the Adventure Gamers recipe and modified it only a little bit:

Code:

from calibre.web.feeds.news import BasicNewsRecipe

class Cracked(BasicNewsRecipe):
    title                 = u'Cracked.com'
    language              = 'en'
    description           = "America's Only Humor and Video Site, since 1958"
    publisher             = 'Cracked'
    category              = 'comedy, lists'
    oldest_article        = 7
    delay                 = 10
    max_articles_per_feed = 50
    no_stylesheets        = True
    encoding              = 'cp1252'
    remove_javascript     = True
    use_embedded_content  = False
    INDEX                 = u'http://www.cracked.com'
    extra_css             = """
                                .pageheader_type{font-size: x-large; font-weight: bold; color: #828D74}
                                .pageheader_title{font-size: xx-large; color: #394128}
                                .pageheader_byline{font-size: small; font-weight: bold; color: #394128}
                                .score_bg {display: inline; width: 100%; margin-bottom: 2em}
                                .score_column_1{ padding-left: 10px; font-size: small; width: 50%}
                                .score_column_2{ padding-left: 10px; font-size: small; width: 50%}
                                .score_column_3{ padding-left: 10px; font-size: small; width: 50%}
                                .score_header{font-size: large; color: #50544A}
                                .bodytext{display: block}
                                body{font-family: Helvetica,Arial,sans-serif}
                            """

    conversion_options = {
                          'comment'   : description
                        , 'tags'      : category
                        , 'publisher' : publisher
                        , 'language'  : language
                        }

    keep_only_tags    = [
                       dict(name='div', attrs={'class':['Column1']})                  
                        ]

    remove_tags = [
                        dict(name='div', attrs={'id':['googlead_1','fb-like-article','comments_section']})
                       ,dict(name='div', attrs={'class':['share_buttons_col_1','GenericModule1']})
                       ,dict(name='ul', attrs={'class':['Nav6']})
                        ]

    remove_tags_after = [dict(name='div', attrs={'id':'fb-like-article'})]
    remove_attributes = ['width','height']

    feeds = [(u'Articles', u'http://feeds.feedburner.com/CrackedRSS')]

    def get_article_url(self, article):
        return article.get('guid',  None)

    def append_page(self, soup, appendtag, position):
        pager = soup.find('li',attrs={'class':'forward'})
        if pager:
           nexturl = self.INDEX + pager.a['href']
           soup2 = self.index_to_soup(nexturl)
           texttag = soup2.find('div', attrs={'class':'Column1'})
           for it in texttag.findAll(style=True):
               del it['style']
           newpos = len(texttag.contents)
           self.append_page(soup2,texttag,newpos)
           texttag.extract()
           appendtag.insert(position,texttag)

    def preprocess_html(self, soup):
        for item in soup.findAll(style=True):
            del item['style']
        self.append_page(soup, soup.body, 3)
        pager = soup.find('div',attrs={'class':'prev_next'})
        if pager:
           pager.extract()
        return self.adeify_images(soup)

If anyone could help me that would be great.

I am also having problems with 'The Onion' recipe. It seems that there is something in the code for that site that crashes my Sony PRS-650. I read somewhere else that this is a firmware problem but you can just remove whatever bit of code is causing the problem. Does anyone know what the bit of HTML in 'The Onion' site is that might be causing the problem? And how do I actually get rid of it using the recipe that comes with Calibre?

I can get all my serious news fine but if I want to have a light read of some comedy I seem to be out of luck!

Phoul · 01-12-2011, 02:50 AM

The Onion recipe also crashes the PRS-350. I've reported this elsewhere, but it was never paid any attention...

Phoul · 01-13-2011, 02:42 PM

Does anyone have any ideas on The Onion?

Algiedi · 02-22-2011, 09:13 AM

This is very relevant to my interests.

Unfortunately, I have no programming knowledge whatsoever, and managing to clean up the structure and modify some css is the limit of my ability so far.

Please do let us know if you find the solution for Cracked.

kiklop74 · 02-22-2011, 12:04 PM

Onion is already fixed.

Nudgenudge · 03-26-2011, 08:03 PM

So, since I wanted Cracked.com too, I modified the recipe to get a working one:

Code:

from calibre.web.feeds.news import BasicNewsRecipe
import re

class Cracked(BasicNewsRecipe):
    title                 = u'Cracked.com'
    language              = 'en'
    description           = "America's Only Humor and Video Site, since 1958"
    publisher             = 'Cracked'
    category              = 'comedy, lists'
    oldest_article        = 2
    delay                 = 10
    max_articles_per_feed = 2
    no_stylesheets        = True
    encoding              = 'cp1252'
    remove_javascript     = True
    use_embedded_content  = False
    INDEX                 = u'http://www.cracked.com'
    extra_css             = """
                                .pageheader_type{font-size: x-large; font-weight: bold; color: #828D74}
                                .pageheader_title{font-size: xx-large; color: #394128}
                                .pageheader_byline{font-size: small; font-weight: bold; color: #394128}
                                .score_bg {display: inline; width: 100%; margin-bottom: 2em}
                                .score_column_1{ padding-left: 10px; font-size: small; width: 50%}
                                .score_column_2{ padding-left: 10px; font-size: small; width: 50%}
                                .score_column_3{ padding-left: 10px; font-size: small; width: 50%}
                                .score_header{font-size: large; color: #50544A}
                                .bodytext{display: block}
                                body{font-family: Helvetica,Arial,sans-serif}
                            """

    conversion_options = {
                          'comment'   : description
                        , 'tags'      : category
                        , 'publisher' : publisher
                        , 'language'  : language
                        , 'linearize_tables' : True
                        }

    keep_only_tags    =  [
                        dict(name='div', attrs={'class':['Column1']})                  
                        ]

    feeds = [(u'Articles', u'http://feeds.feedburner.com/CrackedRSS')]

    def get_article_url(self, article):
        return article.get('guid',  None)

    def cleanup_page(self, soup):
        for item in soup.findAll(style=True):
            del item['style']
        # Replace text-only links with their plain text.
        for alink in soup.findAll('a'):
            if alink.string is not None:
                tstr = alink.string
                alink.replaceWith(tstr)
        for div_to_remove in soup.findAll('div', attrs={'id':['googlead_1','fb-like-article','comments_section']}):
            div_to_remove.extract()
        for div_to_remove in soup.findAll('div', attrs={'class':['share_buttons_col_1','GenericModule1']}):
            div_to_remove.extract()
        for div_to_remove in soup.findAll('div', attrs={'class':re.compile("prev_next")}):
            div_to_remove.extract()
        for ul_to_remove in soup.findAll('ul', attrs={'class':['Nav6']}):
            ul_to_remove.extract()
        for image in soup.findAll('img', attrs={'alt': 'article image'}):
            image.extract()

    def append_page(self, soup, appendtag, position):
        pager = soup.find('a',attrs={'class':'next_arrow_active'})
        if pager:
            nexturl = self.INDEX + pager['href']
            soup2 = self.index_to_soup(nexturl)
            texttag = soup2.find('div', attrs={'class':re.compile("userStyled")})
            newpos = len(texttag.contents)
            self.append_page(soup2,texttag,newpos)
            texttag.extract()
            self.cleanup_page(appendtag)
            appendtag.insert(position,texttag)
        else:
            self.cleanup_page(appendtag)

    def preprocess_html(self, soup):
        self.append_page(soup, soup.body, 3)
        return self.adeify_images(soup)

Since it took me a while to do it, here it is for other Cracked fans.
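
For anyone trying to follow what append_page is doing, the idea can be sketched outside Calibre. This is only an illustration under assumptions: the PAGES dictionary and the collect_pages helper are made up for the example and stand in for fetching each page (for instance with self.index_to_soup) and looking for its "next" link. It uses a loop rather than the recursion in the recipe, and it stops cleanly when a page or its content container is missing.

```python
# Sketch only: PAGES is a hypothetical stand-in for the site. In a real
# recipe each lookup would be a page fetch plus a search for the pager link.
PAGES = {
    "/article_1.html": {"body": "Page one text.", "next": "/article_1_p2.html"},
    "/article_1_p2.html": {"body": "Page two text.", "next": "/article_1_p3.html"},
    "/article_1_p3.html": {"body": "Page three text.", "next": None},
}


def collect_pages(start_url, fetch, max_pages=20):
    """Follow 'next' links from start_url and return the page bodies in order."""
    bodies = []
    seen = set()
    url = start_url
    while url is not None and url not in seen and len(bodies) < max_pages:
        seen.add(url)
        page = fetch(url)
        if page is None:
            # Page or content container not found: stop instead of crashing.
            break
        bodies.append(page["body"])
        url = page.get("next")
    return bodies


article = collect_pages("/article_1.html", PAGES.get)
print("\n\n".join(article))
```

The seen set and max_pages limit are there so that a pager that loops back on itself cannot keep the download running forever.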

Shelleyleo · 04-19-2011, 03:51 AM

So...did this recipe break or is it just me? I tried the built-in, tried this one...and nada (okay, summary/blurb but no full article).

What I want and what it /looks/ like should happen is the starting menu, the articles menu/summary/blurb, then after that the actual articles in their entirety.

What I get: the starting menu, the summary/blurb bits, and "blank" pages except for header/footer from Calibre and a link to where the article was "downloaded" from.

If I follow the link shown, I go to the Cracked site and the article appears without any sort of errors or issues...but that's on my PC, with a browser and internet connection, I can't do that on my PRS-350.

I have no clue whatsoever about recipes. I've managed to modify a couple of built-ins (mostly commenting/uncommenting the ones that are intended to be modified to suit the user), but trying to work out what is going on in one this complex...so very not my skillset.

There are other recipes that I have this same issue with, and I'm still mucking about with them to see if I can easily work out the problem, but the recipes are all mostly incomprehensible to me, even with the tutorials out there. I am not a programmer. I've tried to learn a number of programming "things" (languages, partial subsets of instructions, and so on) in my past and my brain just looks at it and goes..."interesting...what is it? It seems to be words but they make no sense" (and that's looking at the how-tos and guides, never mind what code samples usually do to my brain).

Someone smarter than I am with programming and recipes...help? Please? I don't even know what info I should give you if it is a problem on my end somehow.

Starson17 · 04-19-2011, 10:13 AM

Quote:

Originally Posted by Shelleyleo

What I want and what it /looks/ like should happen is the starting menu, the articles menu/summary/blurb, then after that the actual articles in their entirety.

What I get: the starting menu, the summary/blurb bits, and "blank" pages except for header/footer from Calibre and a link to where the article was "downloaded" from.

This is normal behavior for a broken recipe. The "feed" is standardized, so it almost always works no matter how much a site changes. The feed supplies "the starting menu" and "the summary/blurb bits." The articles vary widely, and as soon as the site changes, the articles disappear, leaving only a blank page and a Calibre-created link to the article source page.

Shelleyleo · 04-19-2011, 05:47 PM

Quote:

Originally Posted by Starson17

This is normal behavior for a broken recipe. The "feed" is standardized, so it almost always works no matter how much a site changes. The feed supplies "the starting menu" and "the summary/blurb bits." The articles vary widely, and as soon as the site changes, the articles disappear, leaving only a blank page and a Calibre-created link to the article source page.

Lovely...is there a way to resolve it or is it pretty much destined to break every time the feed changes/adds an article? I have found that a decent chunk of the built-in recipes for the news feeds I want to grab are doing this same thing: I get the preview blurbs that are part of the RSS feed itself, but the complete articles are simply a "this article downloaded from /link here/" and no actual article text.

It seems that it /should/ be grabbing the articles, as that is what a number of recipes do and it appears to be the intended behaviour, but for whatever reason they are broken in a way that means the full text of the articles isn't being grabbed. The more important question is...how do I fix it with literally no understanding of the advanced code portion of modifying a recipe? I know the typical answer is probably "learn the code enough to manipulate it", but I do wish to stress that it isn't that I don't /want/ to learn how to manipulate the code. For whatever reason, no matter how badly I wish to learn, my brain is like Teflon when it comes to anything related to programming: the information I read never gets absorbed enough for me to make use of it. And I do come from a technical background, so you'd think I'd at least comprehend the basic gist of it, having been exposed to it so much.

Anyway, I'm having this issue with at least Salon, Jezebel, Cracked, and who knows what others I'd like to read if I could, so it would behoove me to figure out how to fix them myself. Barring that, is anyone willing to take a crack at them and see if they just need a simple sort of fix? If not, I'll cope. I wasn't reading them on my previous eReader, so it isn't the end of the world if I can't get them to work; I'm just all excited that I can read some of my news stuff on my new Sony PRS-350 and I want to read more.

Starson17 · 04-20-2011, 08:55 AM

Quote:

Originally Posted by Shelleyleo

Lovely...is there a way to resolve it

Yes. The recipe has to be fixed. Hopefully the recipe author will do it. If not, you or anyone else can tackle it.

Quote:

or is it pretty much destined to break every time the feed changes/adds an article?

You misunderstood me. The recipe doesn't break because the feed changes. It breaks because the structure of the web page that the feed points to has changed. Feeds are standardized quite well and the recipe system can almost always read them. It's the web pages that change drastically. Yes, we are all doomed to chasing the changes in the web pages. I have recipes I maintain change on me three times in a week.
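
To make the point about page structure concrete, here is a minimal sketch using only Python's standard library rather than Calibre itself. The two HTML snippets are hypothetical, loosely modelled on the selectors that appear in the recipes in this thread (a div with class Column1 versus a section with class body), and count_matches is a made-up helper. It only shows that a selector written for the old layout finds nothing once the layout changes, which is why the article body comes out blank.

```python
from html.parser import HTMLParser


class SelectorCounter(HTMLParser):
    """Count start tags that match a (tag name, class) pair."""

    def __init__(self, name, css_class):
        super().__init__()
        self.name = name
        self.css_class = css_class
        self.matches = 0

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; value may be None.
        classes = (dict(attrs).get("class") or "").split()
        if tag == self.name and self.css_class in classes:
            self.matches += 1


def count_matches(html, name, css_class):
    """Return how many elements in html match the tag name and class."""
    parser = SelectorCounter(name, css_class)
    parser.feed(html)
    return parser.matches


# Hypothetical markup before and after a site redesign.
OLD_PAGE = '<div class="Column1"><p>Article text</p></div>'
NEW_PAGE = '<section class="body"><p>Article text</p></section>'

print(count_matches(OLD_PAGE, "div", "Column1"))   # 1: selector fits the old layout
print(count_matches(NEW_PAGE, "div", "Column1"))   # 0: same selector after the redesign
print(count_matches(NEW_PAGE, "section", "body"))  # 1: selector updated for the new layout
```

Fixing a broken recipe usually starts with the same check: look at the current page source and see whether the tags named in keep_only_tags still exist.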

Quote:

I have found that a decent chunk of the built-in recipes for the news feeds I want to grab are doing this same thing: I get the preview blurbs that are part of the RSS feed itself, but the complete articles are simply a "this article downloaded from /link here/" and no actual article text.

Yes, many recipes are written by authors at the request of others. If the author doesn't read that recipe, he won't notice it has broken. You can PM the author to let him know. The author's name is in the recipe.

Shelleyleo · 04-20-2011, 04:00 PM

Quote:

Originally Posted by Starson17

Yes. The recipe has to be fixed. Hopefully the recipe author will do it. If not, you or anyone else can tackle it.

You misunderstood me. The recipe doesn't break because the feed changes. It breaks because the structure of the web page that the feed points to has changed. Feeds are standardized quite well and the recipe system can almost always read them. It's the web pages that change drastically. Yes, we are all doomed to chasing the changes in the web pages. I have recipes I maintain change on me three times in a week.

Yes, many recipes are written by authors at the request of others. If the author doesn't read that recipe, he won't notice it has broken. You can PM the author to let him know. The author's name is in the recipe.

Ahh, all good to know - thank you very much for your replies, you have been very helpful.

limnoski · 04-29-2011, 11:56 AM

Hi guys,

I have to say thanks to Nudge for getting this working in the first place. I managed to get a couple of downloads before they changed the website!
So I have tried again to get this recipe working. I thought it might be a simple case of changing the tags, but it looks like it is not so simple...

Here is as far as I have got with Nudge's recipe:

Code:

from calibre.web.feeds.news import BasicNewsRecipe
import re

class Cracked(BasicNewsRecipe):
    title                 = u'Cracked.com'
    __author__            = u'Nudgenudge'
    language              = 'en'
    description           = "America's Only Humor and Video Site, since 1958"
    publisher             = 'Cracked'
    category              = 'comedy, lists'
    oldest_article        = 2
    delay                 = 10
    max_articles_per_feed = 2
    no_stylesheets        = True
    encoding              = 'cp1252'
    remove_javascript     = True
    use_embedded_content  = False
    INDEX                 = u'http://www.cracked.com'
    extra_css             = """
                                .pageheader_type{font-size: x-large; font-weight: bold; color: #828D74}
                                .pageheader_title{font-size: xx-large; color: #394128}
                                .pageheader_byline{font-size: small; font-weight: bold; color: #394128}
                                .score_bg {display: inline; width: 100%; margin-bottom: 2em}
                                .score_column_1{ padding-left: 10px; font-size: small; width: 50%}
                                .score_column_2{ padding-left: 10px; font-size: small; width: 50%}
                                .score_column_3{ padding-left: 10px; font-size: small; width: 50%}
                                .score_header{font-size: large; color: #50544A}
                                .bodytext{display: block}
                                body{font-family: Helvetica,Arial,sans-serif}
                            """

    conversion_options = {
                          'comment'   : description
                        , 'tags'      : category
                        , 'publisher' : publisher
                        , 'language'  : language
                        , 'linearize_tables' : True
                        }

    keep_only_tags    =  [
                        dict(name='section', attrs={'class':['body']})
                        ]

    feeds = [(u'Articles', u'http://feeds.feedburner.com/CrackedRSS')]

    def get_article_url(self, article):
        return article.get('guid',  None)

    def cleanup_page(self, soup):
        for item in soup.findAll(style=True):
            del item['style']
        # Replace text-only links with their plain text.
        for alink in soup.findAll('a'):
            if alink.string is not None:
                tstr = alink.string
                alink.replaceWith(tstr)
        for div_to_remove in soup.findAll('div', attrs={'id':['persistent-share','inline-share-buttons']}):
            div_to_remove.extract()
        for div_to_remove in soup.findAll('div', attrs={'class':['FacebookLike','shareBar']}):
            div_to_remove.extract()
        for nav_to_remove in soup.findAll('nav', attrs={'class':re.compile("PaginationContent")}):
            nav_to_remove.extract()
        for image in soup.findAll('img', attrs={'alt': 'article image'}):
            image.extract()

    def append_page(self, soup, appendtag, position):
        pager = soup.find('a',attrs={'class':'next'})
        if pager:
            nexturl = self.INDEX + pager['href']
            soup2 = self.index_to_soup(nexturl)
            texttag = soup2.find('Article', attrs={'class':re.compile("Article Module")})
            newpos = len(texttag.contents)
            self.append_page(soup2,texttag,newpos)
            texttag.extract()
            self.cleanup_page(appendtag)
            appendtag.insert(position,texttag)
        else:
            self.cleanup_page(appendtag)

    def preprocess_html(self, soup):
        self.append_page(soup, soup.body, 3)
        return self.adeify_images(soup)

I have also tried another variation on the recipe that uses the Cracked website instead of the RSS feed, but it has the same problem. Here it is below if anyone is interested, though the RSS feed version is still much neater:

Code:

from calibre.web.feeds.news import BasicNewsRecipe
from calibre.ebooks.BeautifulSoup import BeautifulSoup
from calibre import strftime
import re

class Cracked(BasicNewsRecipe):
    title                 = u'Cracked.com'
    __author__            = u'Nudgenudge'
    language              = 'en'
    description           = "America's Only Humor and Video Site, since 1958"
    publisher             = 'Cracked'
    category              = 'comedy, lists'
    oldest_article        = 2
    delay                 = 10
    max_articles_per_feed = 2
    no_stylesheets        = True
    encoding              = 'cp1252'
    remove_javascript     = True
    use_embedded_content  = False
    INDEX                 = u'http://www.cracked.com'
    extra_css             = """
                                .pageheader_type{font-size: x-large; font-weight: bold; color: #828D74}
                                .pageheader_title{font-size: xx-large; color: #394128}
                                .pageheader_byline{font-size: small; font-weight: bold; color: #394128}
                                .score_bg {display: inline; width: 100%; margin-bottom: 2em}
                                .score_column_1{ padding-left: 10px; font-size: small; width: 50%}
                                .score_column_2{ padding-left: 10px; font-size: small; width: 50%}
                                .score_column_3{ padding-left: 10px; font-size: small; width: 50%}
                                .score_header{font-size: large; color: #50544A}
                                .bodytext{display: block}
                                body{font-family: Helvetica,Arial,sans-serif}
                            """

    conversion_options = {
                          'comment'   : description
                        , 'tags'      : category
                        , 'publisher' : publisher
                        , 'language'  : language
                        , 'linearize_tables' : True
                        }

    keep_only_tags    =  [
                        dict(name='section', attrs={'class':['body']})
                        ]

    def parse_index(self):
        articles = []
        rawc = self.index_to_soup('http://www.cracked.com/funny-articles.html',True)
        soup = BeautifulSoup(rawc,fromEncoding=self.encoding)
        

        for item in soup.findAll(attrs={'class':'content'}):
            description = ''
            title_prefix = ''
            feed_link = item.find('a',href=True)
            descript = item.find('a')
            if descript:
               description = self.tag_to_string(descript)
            if feed_link:
                url   = feed_link['href']
                title = title_prefix + self.tag_to_string(feed_link)
                date  = strftime(self.timefmt)
                articles.append({
                                  'title'      :title
                                 ,'date'       :date
                                 ,'url'        :url
                                 ,'description':description
                                })
        return [(self.tag_to_string(soup.find('title')), articles)]

    def get_article_url(self, article):
        return article.get('guid',  None)

    def cleanup_page(self, soup):
        for item in soup.findAll(style=True):
            del item['style']
        # Replace text-only links with their plain text.
        for alink in soup.findAll('a'):
            if alink.string is not None:
                tstr = alink.string
                alink.replaceWith(tstr)
        for div_to_remove in soup.findAll('div', attrs={'id':['persistent-share','inline-share-buttons']}):
            div_to_remove.extract()
        for div_to_remove in soup.findAll('div', attrs={'class':['FacebookLike','shareBar']}):
            div_to_remove.extract()
        for nav_to_remove in soup.findAll('nav', attrs={'class':re.compile("PaginationContent")}):
            nav_to_remove.extract()
        for image in soup.findAll('img', attrs={'alt': 'article image'}):
            image.extract()

    def append_page(self, soup, appendtag, position):
        pager = soup.find('a',attrs={'class':'next'})
        if pager:
            nexturl = self.INDEX + pager['href']
            soup2 = self.index_to_soup(nexturl)
            texttag = soup2.find('Article', attrs={'class':re.compile("Article Module")})
            newpos = len(texttag.contents)
            self.append_page(soup2,texttag,newpos)
            texttag.extract()
            self.cleanup_page(appendtag)
            appendtag.insert(position,texttag)
        else:
            self.cleanup_page(appendtag)

    def preprocess_html(self, soup):
        self.append_page(soup, soup.body, 3)
        return self.adeify_images(soup)
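
One thing that may be worth checking in the two recipes above is the capitalised tag name in soup2.find('Article', ...). HTML parsers commonly normalise tag names to lower case. Python's standard html.parser does, as the sketch below shows; whether the BeautifulSoup bundled with Calibre behaves the same way is an assumption here, not something confirmed in this thread. If it does, a search for 'Article' would find nothing, texttag would be None, and the next line would fail. The SAMPLE markup and the start_tags helper are made up for the example.

```python
from html.parser import HTMLParser


class TagNames(HTMLParser):
    """Record the name of every start tag as the parser reports it."""

    def __init__(self):
        super().__init__()
        self.names = []

    def handle_starttag(self, tag, attrs):
        self.names.append(tag)


def start_tags(html):
    """Return the start-tag names found in html, in document order."""
    parser = TagNames()
    parser.feed(html)
    return parser.names


# Hypothetical markup written with capitalised tag names.
SAMPLE = '<Article class="Article Module"><P>text</P></Article>'

names = start_tags(SAMPLE)
print(names)                                   # ['article', 'p']
print("Article" in names, "article" in names)  # False True
```

If that turns out to be the cause, searching for 'article' in lower case and checking that the result is not None before using it would be the first things to try.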
