Old 06-30-2011, 06:22 PM   #1
luczak
(broken recipe) Cracked.com not working

The cracked.com recipe is not working correctly. It creates a proper section menu with article titles and summaries, but the articles themselves are blank.

Here is the current code for reference:
Code:
from calibre.web.feeds.news import BasicNewsRecipe
import re

class Cracked(BasicNewsRecipe):
    title                 = u'Cracked.com'
    __author__            = u'Nudgenudge'
    language              = 'en'
    description           = "America's Only Humor and Video Site, since 1958"
    publisher             = 'Cracked'
    category              = 'comedy, lists'
    oldest_article        = 2
    delay                 = 10
    max_articles_per_feed = 2
    no_stylesheets        = True
    encoding              = 'cp1252'
    remove_javascript     = True
    use_embedded_content  = False
    INDEX                 = u''
    extra_css             = """
                                .pageheader_type{font-size: x-large; font-weight: bold; color: #828D74}
                                .pageheader_title{font-size: xx-large; color: #394128}
                                .pageheader_byline{font-size: small; font-weight: bold; color: #394128}
                                .score_bg {display: inline; width: 100%; margin-bottom: 2em}
                                .score_column_1{ padding-left: 10px; font-size: small; width: 50%}
                                .score_column_2{ padding-left: 10px; font-size: small; width: 50%}
                                .score_column_3{ padding-left: 10px; font-size: small; width: 50%}
                                .score_header{font-size: large; color: #50544A}
                                .bodytext{display: block}
                                body{font-family: Helvetica,Arial,sans-serif}
                            """

    conversion_options = {
                          'comment'   : description
                        , 'tags'      : category
                        , 'publisher' : publisher
                        , 'language'  : language
                        , 'linearize_tables' : True
                        }

    keep_only_tags    =  [
                        dict(name='div', attrs={'class':['Column1']})
                        ]

    feeds = [(u'Articles', u'')]

    def get_article_url(self, article):
        return article.get('guid',  None)

    def cleanup_page(self, soup):
        for item in soup.findAll(style=True):
            del item['style']
        for alink in soup.findAll('a'):
            if alink.string is not None:
                tstr = alink.string
                alink.replaceWith(tstr)
        for div_to_remove in soup.findAll('div', attrs={'id':['googlead_1','fb-like-article','comments_section']}):
            div_to_remove.extract()
        for div_to_remove in soup.findAll('div', attrs={'class':['share_buttons_col_1','GenericModule1']}):
            div_to_remove.extract()
        for div_to_remove in soup.findAll('div', attrs={'class':re.compile("prev_next")}):
            div_to_remove.extract()
        for ul_to_remove in soup.findAll('ul', attrs={'class':['Nav6']}):
            ul_to_remove.extract()
        for image in soup.findAll('img', attrs={'alt': 'article image'}):
            image.extract()

    def append_page(self, soup, appendtag, position):
        pager = soup.find('a',attrs={'class':'next_arrow_active'})
        if pager:
            nexturl = self.INDEX + pager['href']
            soup2 = self.index_to_soup(nexturl)
            texttag = soup2.find('div', attrs={'class':re.compile("userStyled")})
            newpos = len(texttag.contents)
            self.append_page(soup2,texttag,newpos)
            texttag.extract()
            self.cleanup_page(appendtag)
            appendtag.insert(position,texttag)
        else:
            self.cleanup_page(appendtag)

    def preprocess_html(self, soup):
        self.append_page(soup, soup.body, 3)
        return self.adeify_images(soup)
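
One possible cause of the blank articles, offered as a guess rather than a confirmed diagnosis: the recipe's keep_only_tags keeps only div elements with class Column1, so if the site's markup no longer uses that class, nothing from the article page survives. The sketch below uses only the standard library to check whether a class name appears anywhere in a piece of HTML. The helper names (ClassCollector, has_class) and the sample markup are made up for illustration and are not part of calibre or of the real site.

```python
from html.parser import HTMLParser


class ClassCollector(HTMLParser):
    """Collects every CSS class name seen on any start tag."""

    def __init__(self):
        super().__init__()
        self.classes = set()

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name == 'class' and value:
                self.classes.update(value.split())


def has_class(html, wanted):
    """Return True if any element in html carries the class `wanted`."""
    collector = ClassCollector()
    collector.feed(html)
    return wanted in collector.classes


# Made-up sample markup; save a real article page and pass its text instead.
sample = '<body><section id="PrimaryContent"><p>text</p></section></body>'
print(has_class(sample, 'Column1'))
```

If a check like this reports that the class is missing from a saved article page, the keep_only_tags entry is the first thing to update.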
Old 07-03-2011, 11:10 PM   #2
UnWeave
I eventually managed to hack something together for Cracked, which works (for now):

Code:
from calibre.web.feeds.news import BasicNewsRecipe
from calibre.ebooks.BeautifulSoup import BeautifulSoup

class Cracked(BasicNewsRecipe):
    title                 = u'Cracked.com'
    language              = 'en'
    description           = "America's Only Humor Site since 1958"
    publisher             = 'Cracked'
    category              = 'comedy, lists'
    oldest_article        = 3 #days
    max_articles_per_feed = 100
    no_stylesheets        = True
    encoding              = 'ascii'
    remove_javascript     = True
    use_embedded_content  = False

    feeds = [ (u'Articles', u'http://feeds.feedburner.com/CrackedRSS/') ]

    conversion_options = {
                          'comment'   : description
                        , 'tags'      : category
                        , 'publisher' : publisher
                        , 'language'  : language
                        }

    remove_tags_before = dict(id='PrimaryContent')
    
    remove_tags_after = dict(name='div', attrs={'class':'shareBar'})
        
    remove_tags = [ dict(name='div', attrs={'class':['social',
                                                     'FacebookLike',
                                                     'shareBar'
                                                     ]}),

                    dict(name='div', attrs={'id':['inline-share-buttons',
                                                  ]}),

                    dict(name='span', attrs={'class':['views',
                                                      'KonaFilter'
                                                      ]}),
                    #dict(name='img'),
                    ]
    
    def appendPage(self, soup, appendTag, position):
        # Check if article has multiple pages
        pageNav = soup.find('nav', attrs={'class':'PaginationContent'})
        if pageNav:
            # Check not at last page
            nextPage = pageNav.find('a', attrs={'class':'next'})
            if nextPage:
                nextPageURL = nextPage['href']
                nextPageSoup = self.index_to_soup(nextPageURL)
                # 8th <section> tag contains article content
                nextPageContent = nextPageSoup.findAll('section')[7]
                newPosition = len(nextPageContent.contents)
                self.appendPage(nextPageSoup,nextPageContent,newPosition)
                nextPageContent.extract()
                pageNav.extract()
                appendTag.insert(position,nextPageContent)

    def preprocess_html(self, soup):
        self.appendPage(soup, soup.body, 3)
        return soup


With all the images in the articles, the resulting file comes to around 4MB, so you may want to change oldest_article to 2 instead. You can also remove the # in front of dict(name='img') to strip all the images. You get a much smaller file and (on my Kindle) the next page loads quicker, but you'll obviously be missing some content, and the image captions will still be there.
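
For reference, here is roughly what the two tweaks look like once applied. This is only an excerpt of the class attributes from the recipe above, written at module level so it can be read on its own; in the real recipe these lines stay inside the Cracked class.

```python
# Excerpt only: in the recipe these are attributes of the Cracked class.
oldest_article = 2  # days; lowered from 3 to keep the download smaller

remove_tags = [
    dict(name='div', attrs={'class': ['social', 'FacebookLike', 'shareBar']}),
    dict(name='div', attrs={'id': ['inline-share-buttons']}),
    dict(name='span', attrs={'class': ['views', 'KonaFilter']}),
    dict(name='img'),  # uncommented: strips every image from the articles
]
```

Either tweak can be used without the other.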

I haven't applied any extra formatting to it, and I haven't tried it on every kind of article, though it does properly stitch together their 2-pagers.
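
The stitching idea, stripped of anything calibre-specific, looks something like the sketch below. It is an iterative variant of the same idea, not the recursive appendPage from the recipe. The names stitch_pages and fetch, and the fake two-page "site", are assumptions made up for the example; in the recipe the fetching is done by index_to_soup and the next link comes from the pagination nav.

```python
def stitch_pages(first_url, fetch):
    """Follow 'next' links from first_url and return page bodies in order.

    fetch(url) is a hypothetical callable returning (body, next_url),
    where next_url is None on the last page.
    """
    bodies = []
    seen = set()  # guards against a page that links back to an earlier one
    url = first_url
    while url is not None and url not in seen:
        seen.add(url)
        body, url = fetch(url)
        bodies.append(body)
    return bodies


# Stand-in for a two-page article; the keys are made-up page identifiers.
FAKE_SITE = {
    'page1': ('First half.', 'page2'),
    'page2': ('Second half.', None),
}

stitched = stitch_pages('page1', FAKE_SITE.__getitem__)
print(stitched)
```

The loop stops on the last page (no next link) or if a page points back to one already visited, so an article with more than two pages would be handled the same way.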

If you find any problems with it, let me know.