Old 01-10-2011, 11:23 AM   #1
limnoski
Junior Member
Posts: 2
Karma: 10
Join Date: Jan 2011
Device: Sony PRS-650
Cracked.com and The Onion

So I have had a go at setting up a recipe for Cracked.com as I can't see any existing one out there.

The Cracked.com website is a messy piece of work and I can't for the life of me get the recipe to work. The main problems are that I can't get the second pages to append to the first pages, and I can't figure out how to get rid of all the tables so that the article breaks up into pages I can read on my Sony PRS-650. I used the Adventure Gamers recipe and modified it only a little bit:

Code:
from calibre.web.feeds.news import BasicNewsRecipe

class Cracked(BasicNewsRecipe):
    title                 = u'Cracked.com'
    language              = 'en'
    description           = "America's Only Humor and Video Site, since 1958"
    publisher             = 'Cracked'
    category              = 'comedy, lists'
    oldest_article        = 7
    delay                 = 10
    max_articles_per_feed = 50
    no_stylesheets        = True
    encoding              = 'cp1252'
    remove_javascript     = True
    use_embedded_content  = False
    INDEX                 = u'http://www.cracked.com'
    extra_css             = """
                                .pageheader_type{font-size: x-large; font-weight: bold; color: #828D74}
                                .pageheader_title{font-size: xx-large; color: #394128}
                                .pageheader_byline{font-size: small; font-weight: bold; color: #394128}
                                .score_bg {display: inline; width: 100%; margin-bottom: 2em}
                                .score_column_1{ padding-left: 10px; font-size: small; width: 50%}
                                .score_column_2{ padding-left: 10px; font-size: small; width: 50%}
                                .score_column_3{ padding-left: 10px; font-size: small; width: 50%}
                                .score_header{font-size: large; color: #50544A}
                                .bodytext{display: block}
                                body{font-family: Helvetica,Arial,sans-serif}
                            """

    conversion_options = {
                          'comment'   : description
                        , 'tags'      : category
                        , 'publisher' : publisher
                        , 'language'  : language
                        }

    keep_only_tags    = [
                       dict(name='div', attrs={'class':['Column1']})                  
                        ]

    remove_tags = [
                       dict(name='div', attrs={'id':['googlead_1','fb-like-article','comments_section']})
                      ,dict(name='div', attrs={'class':['share_buttons_col_1','GenericModule1']})
                      ,dict(name='ul', attrs={'class':['Nav6']})
                        ]

    remove_tags_after = [dict(name='div', attrs={'id':'fb-like-article'})]
    remove_attributes = ['width','height']

    feeds = [(u'Articles', u'http://feeds.feedburner.com/CrackedRSS')]

    def get_article_url(self, article):
        return article.get('guid',  None)

    def append_page(self, soup, appendtag, position):
        pager = soup.find('li',attrs={'class':'forward'})
        if pager:
           nexturl = self.INDEX + pager.a['href']
           soup2 = self.index_to_soup(nexturl)
           texttag = soup2.find('div', attrs={'class':'Column1'})
           for it in texttag.findAll(style=True):
               del it['style']
           newpos = len(texttag.contents)
           self.append_page(soup2,texttag,newpos)
           texttag.extract()
           appendtag.insert(position,texttag)

    def preprocess_html(self, soup):
        for item in soup.findAll(style=True):
            del item['style']
        self.append_page(soup, soup.body, 3)
        pager = soup.find('div',attrs={'class':'prev_next'})
        if pager:
           pager.extract()
        return self.adeify_images(soup)
If anyone could help me that would be great.

I am also having problems with 'The Onion' recipe. It seems that there is something in the code for that site that crashes my Sony PRS-650. I read somewhere else that this is a firmware problem but you can just remove whatever bit of code is causing the problem. Does anyone know what the bit of HTML in 'The Onion' site is that might be causing the problem? And how do I actually get rid of it using the recipe that comes with Calibre?

I can get all my serious news fine but if I want to have a light read of some comedy I seem to be out of luck!
Old 01-12-2011, 02:50 AM   #2
Phoul
Dances with penguins
Posts: 54
Karma: 10
Join Date: Oct 2010
Device: Sony PRS-350
The Onion recipe also crashes the PRS-350. I've reported this elsewhere, but it never got any attention...
Old 01-13-2011, 02:42 PM   #3
Phoul
Dances with penguins
Posts: 54
Karma: 10
Join Date: Oct 2010
Device: Sony PRS-350
Does anyone have any ideas on The Onion?
Old 02-22-2011, 09:13 AM   #4
Algiedi
Overenthusiastic Noob
Posts: 69
Karma: 896
Join Date: Feb 2011
Location: France
Device: Kindle 3
This is very relevant to my interests.

Unfortunately, I have no programming knowledge whatsoever, and managing to clean up the structure and modify some css is the limit of my ability so far.

Please do let us know if you find the solution for Cracked.
Old 02-22-2011, 12:04 PM   #5
kiklop74
Guru
Posts: 800
Karma: 194644
Join Date: Dec 2007
Location: Argentina
Device: Kindle Voyage
Onion is already fixed.
Old 03-26-2011, 08:03 PM   #6
Nudgenudge
Junior Member
Posts: 1
Karma: 32
Join Date: Mar 2011
Device: Kindle DX
So, since I wanted Cracked.com too, I modified the recipe and now have a working one:

Code:
from calibre.web.feeds.news import BasicNewsRecipe
import re

class Cracked(BasicNewsRecipe):
    title                 = u'Cracked.com'
    language              = 'en'
    description           = "America's Only Humor and Video Site, since 1958"
    publisher             = 'Cracked'
    category              = 'comedy, lists'
    oldest_article        = 2
    delay                 = 10
    max_articles_per_feed = 2
    no_stylesheets        = True
    encoding              = 'cp1252'
    remove_javascript     = True
    use_embedded_content  = False
    INDEX                 = u'http://www.cracked.com'
    extra_css             = """
                                .pageheader_type{font-size: x-large; font-weight: bold; color: #828D74}
                                .pageheader_title{font-size: xx-large; color: #394128}
                                .pageheader_byline{font-size: small; font-weight: bold; color: #394128}
                                .score_bg {display: inline; width: 100%; margin-bottom: 2em}
                                .score_column_1{ padding-left: 10px; font-size: small; width: 50%}
                                .score_column_2{ padding-left: 10px; font-size: small; width: 50%}
                                .score_column_3{ padding-left: 10px; font-size: small; width: 50%}
                                .score_header{font-size: large; color: #50544A}
                                .bodytext{display: block}
                                body{font-family: Helvetica,Arial,sans-serif}
                            """

    conversion_options = {
                          'comment'   : description
                        , 'tags'      : category
                        , 'publisher' : publisher
                        , 'language'  : language
                        , 'linearize_tables' : True
                        }

    keep_only_tags    =  [
                        dict(name='div', attrs={'class':['Column1']})                  
                        ]

    feeds = [(u'Articles', u'http://feeds.feedburner.com/CrackedRSS')]

    def get_article_url(self, article):
        return article.get('guid',  None)

    def cleanup_page(self, soup):
        for item in soup.findAll(style=True):
            del item['style']
        # Replace each link with its plain text (runs once, not once per styled tag).
        for alink in soup.findAll('a'):
            if alink.string is not None:
                tstr = alink.string
                alink.replaceWith(tstr)
        for div_to_remove in soup.findAll('div', attrs={'id':['googlead_1','fb-like-article','comments_section']}):
            div_to_remove.extract()
        for div_to_remove in soup.findAll('div', attrs={'class':['share_buttons_col_1','GenericModule1']}):
            div_to_remove.extract()
        for div_to_remove in soup.findAll('div', attrs={'class':re.compile("prev_next")}):
            div_to_remove.extract()
        for ul_to_remove in soup.findAll('ul', attrs={'class':['Nav6']}):
            ul_to_remove.extract()
        for image in soup.findAll('img', attrs={'alt': 'article image'}):
            image.extract()

    def append_page(self, soup, appendtag, position):
        pager = soup.find('a',attrs={'class':'next_arrow_active'})
        if pager:
            nexturl = self.INDEX + pager['href']
            soup2 = self.index_to_soup(nexturl)
            texttag = soup2.find('div', attrs={'class':re.compile("userStyled")})
            newpos = len(texttag.contents)
            self.append_page(soup2,texttag,newpos)
            texttag.extract()
            self.cleanup_page(appendtag)
            appendtag.insert(position,texttag)
        else:
            self.cleanup_page(appendtag)

    def preprocess_html(self, soup):
        self.append_page(soup, soup.body, 3)
        return self.adeify_images(soup)
Since it took me a while to do it, here it is for other Cracked fans.
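
For anyone trying to follow what append_page is doing, here is a minimal standalone sketch of the same idea: start at the first page, follow the "next" link until there is none, and stitch the page texts together in order. It does not use calibre at all. FAKE_SITE, fetch and collect_pages are made-up names for illustration, with fetch standing in for the index_to_soup and find calls in the recipe, and the URLs are invented.

```python
# Hypothetical stand-in for a paginated site: url -> (page text, next url or None).
FAKE_SITE = {
    "/article_1.html": ("Page one text.", "/article_1_p2.html"),
    "/article_1_p2.html": ("Page two text.", "/article_1_p3.html"),
    "/article_1_p3.html": ("Page three text.", None),
}


def fetch(url):
    """Stand-in for downloading a page and locating its text and 'next' link."""
    return FAKE_SITE[url]


def collect_pages(start_url, max_pages=20):
    """Follow 'next' links from start_url and return the page texts in order."""
    texts = []
    seen = set()  # guards against a 'next' link that loops back
    url = start_url
    while url is not None and url not in seen and len(texts) < max_pages:
        seen.add(url)
        text, next_url = fetch(url)
        texts.append(text)
        url = next_url
    return texts


article = "\n".join(collect_pages("/article_1.html"))
print(article)
```

The recipe above does this recursively rather than with a loop; the loop version just makes the stopping conditions (no next link, a repeated URL, or too many pages) easier to see.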
Old 04-19-2011, 03:51 AM   #7
Shelleyleo
Member
Posts: 18
Karma: 716
Join Date: Jun 2009
Location: San Francisco, CA, USA
Device: Astak EZReader, Sony PRS-350
So...did this recipe break or is it just me? I tried the built-in, tried this one...and nada (okay, summary/blurb but no full article).

What I want and what it /looks/ like should happen is the starting menu, the articles menu/summary/blurb, then after that the actual articles in their entirety.

What I get: the starting menu, the summary/blurb bits, and "blank" pages except for header/footer from Calibre and a link to where the article was "downloaded" from.

If I follow the link shown, I go to the Cracked site and the article appears without any sort of errors or issues...but that's on my PC, with a browser and internet connection, I can't do that on my PRS-350.

I have no clue whatsoever about recipes. I've managed to modify a couple of built-ins (mostly commenting/uncommenting the parts that are intended to be modified per user wants), but trying to work out what all is going on in one this complex...so very not my skillset.

There are other recipes that I have this same issue with, and I'm still mucking about with them to see if I can easily work out the problem, but the recipes are all mostly incomprehensible to me. Even with the tutorials out there, I am not a programmer. I've tried to learn a number of programming "things" (languages, partial subsets of instructions, and so on) in my past and my brain just looks at it and goes..."interesting...what is it? It seems to be words but they make no sense" (and that's looking at the how-tos and guides, never mind what code samples usually do to my brain).

Someone smarter than I am with programming and recipes...help? Please? I don't even know what info I should give you if it is a problem on my end somehow.
Old 04-19-2011, 10:13 AM   #8
Starson17
Wizard
Posts: 4,004
Karma: 177841
Join Date: Dec 2009
Device: WinMo: IPAQ; Android: HTC HD2, Archos 7o; Java:Gravity T
Quote:
Originally Posted by Shelleyleo View Post
What I want and what it /looks/ like should happen is the starting menu, the articles menu/summary/blurb, then after that the actual articles in their entirety.

What I get: the starting menu, the summary/blurb bits, and "blank" pages except for header/footer from Calibre and a link to where the article was "downloaded" from.
This is normal behavior for a broken recipe. The "feed" is standardized, so it almost always works no matter how much a site changes. The feed supplies "the starting menu" and "the summary/blurb bits." The articles vary widely, and as soon as the site changes, the articles disappear, leaving only a blank page and a Calibre-created link to the article source page.
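
To make the failure mode concrete, here is a rough standalone sketch (plain Python, no calibre) of what happens when a recipe keeps only the tags matching one selector and the site later renames that container. The two page snippets are made up, and ClassFinder and count_matches are hypothetical helpers that only imitate a single keep_only_tags entry; this is not how calibre implements it.

```python
from html.parser import HTMLParser


class ClassFinder(HTMLParser):
    """Counts elements with a given tag name and class, like one keep_only_tags entry."""

    def __init__(self, tag, cls):
        super().__init__()
        self.tag = tag
        self.cls = cls
        self.matches = 0

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; value can be None.
        classes = (dict(attrs).get("class") or "").split()
        if tag == self.tag and self.cls in classes:
            self.matches += 1


def count_matches(html, tag, cls):
    finder = ClassFinder(tag, cls)
    finder.feed(html)
    return finder.matches


# Made-up markup: the layout a recipe was written for, and a redesigned one.
OLD_PAGE = '<html><body><div class="Column1"><p>Article text</p></div></body></html>'
NEW_PAGE = '<html><body><section class="body"><p>Article text</p></section></body></html>'

print(count_matches(OLD_PAGE, "div", "Column1"))  # the selector finds the article
print(count_matches(NEW_PAGE, "div", "Column1"))  # nothing matches after the redesign
```

When nothing matches, there is nothing left to keep, which is consistent with the blank article pages described above. The feed entries are unaffected because they never depended on the page markup.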
Old 04-19-2011, 05:47 PM   #9
Shelleyleo
Member
Posts: 18
Karma: 716
Join Date: Jun 2009
Location: San Francisco, CA, USA
Device: Astak EZReader, Sony PRS-350
Quote:
Originally Posted by Starson17 View Post
This is normal behavior for a broken recipe. The "feed" is standardized, so it almost always works no matter how much a site changes. The feed supplies "the starting menu" and "the summary/blurb bits." The articles vary widely, and as soon as the site changes, the articles disappear, leaving only a blank page and a Calibre-created link to the article source page.
Lovely...is there a way to resolve it or is it pretty much destined to break every time the feed changes/adds an article? I have found that a decent chunk of the built-in recipes for news feeds I want to grab are doing this same thing: I get the preview blurbs that are part of the RSS feed itself but the complete articles are simply a "this article downloaded from /link here/" and no actual article text.

It seems that it /should/ be grabbing the articles, as that is what a number of recipes grab and it appears to be the intended behaviour, but for whatever reason they are broken in a way that means it isn't grabbing the full text of the articles. The more important question is...how do I fix it with literally no understanding of the advanced code portion of modifying a recipe? I know the typical answer to that is probably "learn the code enough to manipulate it", but I do wish to stress that it isn't that I don't /want/ to learn how to manipulate the code. For whatever reason, no matter how badly I wish to learn, my brain is like Teflon when it comes to anything related to programming: the information I read never gets absorbed enough for me to make use of it. And I do come from a technical background, so you'd think I'd at least comprehend the basic gist of it, having been exposed to it so much.

Anyway, I'm having this issue with at least Salon, Jezebel, Cracked, and who knows what others I'd like to read if I could, so it would behoove me to figure out how to fix them myself. Barring that, is anyone willing to take a crack at 'em and see if they just need a simple sort of fix? If not, I'll cope. I wasn't reading them on my previous eReader, so it isn't the end of the world if I can't get them to work; I'm just all excited that I can read some of my news stuff on my new Sony PRS-350 and I want to read more.
Old 04-20-2011, 08:55 AM   #10
Starson17
Wizard
Posts: 4,004
Karma: 177841
Join Date: Dec 2009
Device: WinMo: IPAQ; Android: HTC HD2, Archos 7o; Java:Gravity T
Quote:
Originally Posted by Shelleyleo View Post
Lovely...is there a way to resolve it
Yes. The recipe has to be fixed. Hopefully the recipe author will do it. If not, you or anyone else can tackle it.
Quote:
or is it pretty much destined to break every time the feed changes/adds an article?
You misunderstood me. The recipe doesn't break because the feed changes. It breaks because the structure of the web page that the feed points to has changed. Feeds are standardized quite well and the recipe system can almost always read them. It's the web pages that change drastically. Yes, we are all doomed to chasing the changes in the web pages. I have recipes I maintain change on me three times in a week.

Quote:
I have found that a decent chunk of the built-in recipes for news feeds I want to grab are doing this same thing: I get the preview blurbs that are part of the RSS feed itself but the complete articles are simply a "this article downloaded from /link here/" and no actual article text.
Yes, many recipes are written by authors at the request of others. If the author doesn't read that recipe, he won't notice it has broken. You can PM the author to let him know. The author's name is in the recipe.
Old 04-20-2011, 04:00 PM   #11
Shelleyleo
Member
Posts: 18
Karma: 716
Join Date: Jun 2009
Location: San Francisco, CA, USA
Device: Astak EZReader, Sony PRS-350
Quote:
Originally Posted by Starson17 View Post
Yes. The recipe has to be fixed. Hopefully the recipe author will do it. If not, you or anyone else can tackle it.

You misunderstood me. The recipe doesn't break because the feed changes. It breaks because the structure of the web page that the feed points to has changed. Feeds are standardized quite well and the recipe system can almost always read them. It's the web pages that change drastically. Yes, we are all doomed to chasing the changes in the web pages. I have recipes I maintain change on me three times in a week.


Yes, many recipes are written by authors at the request of others. If the author doesn't read that recipe, he won't notice it has broken. You can PM the author to let him know. The author's name is in the recipe.
Ahh, all good to know - thank you very much for your replies, you have been very helpful.
Old 04-29-2011, 11:56 AM   #12
limnoski
Junior Member
Posts: 2
Karma: 10
Join Date: Jan 2011
Device: Sony PRS-650
Hi guys,

I have to say thanks to Nudge for getting this working in the first place. I managed to get a couple of downloads before they changed the website!
So I have tried again to get this recipe working. I thought it might be a simple case of changing the tags. However, it looks like it is not so simple...

Here is as far as I have got with Nudge's recipe:
Code:
from calibre.web.feeds.news import BasicNewsRecipe
import re

class Cracked(BasicNewsRecipe):
    title                 = u'Cracked.com'
    __author__            = u'Nudgenudge'
    language              = 'en'
    description           = "America's Only Humor and Video Site, since 1958"
    publisher             = 'Cracked'
    category              = 'comedy, lists'
    oldest_article        = 2
    delay                 = 10
    max_articles_per_feed = 2
    no_stylesheets        = True
    encoding              = 'cp1252'
    remove_javascript     = True
    use_embedded_content  = False
    INDEX                 = u'http://www.cracked.com'
    extra_css             = """
                                .pageheader_type{font-size: x-large; font-weight: bold; color: #828D74}
                                .pageheader_title{font-size: xx-large; color: #394128}
                                .pageheader_byline{font-size: small; font-weight: bold; color: #394128}
                                .score_bg {display: inline; width: 100%; margin-bottom: 2em}
                                .score_column_1{ padding-left: 10px; font-size: small; width: 50%}
                                .score_column_2{ padding-left: 10px; font-size: small; width: 50%}
                                .score_column_3{ padding-left: 10px; font-size: small; width: 50%}
                                .score_header{font-size: large; color: #50544A}
                                .bodytext{display: block}
                                body{font-family: Helvetica,Arial,sans-serif}
                            """

    conversion_options = {
                          'comment'   : description
                        , 'tags'      : category
                        , 'publisher' : publisher
                        , 'language'  : language
                        , 'linearize_tables' : True
                        }

    keep_only_tags    =  [
                        dict(name='section', attrs={'class':['body']})
                        ]

    feeds = [(u'Articles', u'http://feeds.feedburner.com/CrackedRSS')]

    def get_article_url(self, article):
        return article.get('guid',  None)

    def cleanup_page(self, soup):
        for item in soup.findAll(style=True):
            del item['style']
        # Replace each link with its plain text (runs once, not once per styled tag).
        for alink in soup.findAll('a'):
            if alink.string is not None:
                tstr = alink.string
                alink.replaceWith(tstr)
        for div_to_remove in soup.findAll('div', attrs={'id':['persistent-share','inline-share-buttons']}):
            div_to_remove.extract()
        for div_to_remove in soup.findAll('div', attrs={'class':['FacebookLike','shareBar']}):
            div_to_remove.extract()
        for nav_to_remove in soup.findAll('nav', attrs={'class':re.compile("PaginationContent")}):
            nav_to_remove.extract()
        for image in soup.findAll('img', attrs={'alt': 'article image'}):
            image.extract()

    def append_page(self, soup, appendtag, position):
        pager = soup.find('a',attrs={'class':'next'})
        if pager:
            nexturl = self.INDEX + pager['href']
            soup2 = self.index_to_soup(nexturl)
            # Tag names are lowercased by the parser, so search for 'article', not 'Article'.
            texttag = soup2.find('article', attrs={'class':re.compile("Article Module")})
            newpos = len(texttag.contents)
            self.append_page(soup2,texttag,newpos)
            texttag.extract()
            self.cleanup_page(appendtag)
            appendtag.insert(position,texttag)
        else:
            self.cleanup_page(appendtag)

    def preprocess_html(self, soup):
        self.append_page(soup, soup.body, 3)
        return self.adeify_images(soup)
I have also tried another variation on the recipe that uses the Cracked website instead of the RSS feed; however, it has the same problem. Here it is below if anyone is interested, though the RSS feed is still much neater:

Code:
from calibre.web.feeds.news import BasicNewsRecipe
from calibre.ebooks.BeautifulSoup import BeautifulSoup
from calibre import strftime
import re

class Cracked(BasicNewsRecipe):
    title                 = u'Cracked.com'
    __author__            = u'Nudgenudge'
    language              = 'en'
    description           = "America's Only Humor and Video Site, since 1958"
    publisher             = 'Cracked'
    category              = 'comedy, lists'
    oldest_article        = 2
    delay                 = 10
    max_articles_per_feed = 2
    no_stylesheets        = True
    encoding              = 'cp1252'
    remove_javascript     = True
    use_embedded_content  = False
    INDEX                 = u'http://www.cracked.com'
    extra_css             = """
                                .pageheader_type{font-size: x-large; font-weight: bold; color: #828D74}
                                .pageheader_title{font-size: xx-large; color: #394128}
                                .pageheader_byline{font-size: small; font-weight: bold; color: #394128}
                                .score_bg {display: inline; width: 100%; margin-bottom: 2em}
                                .score_column_1{ padding-left: 10px; font-size: small; width: 50%}
                                .score_column_2{ padding-left: 10px; font-size: small; width: 50%}
                                .score_column_3{ padding-left: 10px; font-size: small; width: 50%}
                                .score_header{font-size: large; color: #50544A}
                                .bodytext{display: block}
                                body{font-family: Helvetica,Arial,sans-serif}
                            """

    conversion_options = {
                          'comment'   : description
                        , 'tags'      : category
                        , 'publisher' : publisher
                        , 'language'  : language
                        , 'linearize_tables' : True
                        }

    keep_only_tags    =  [
                        dict(name='section', attrs={'class':['body']})
                        ]

    def parse_index(self):
        articles = []
        rawc = self.index_to_soup('http://www.cracked.com/funny-articles.html',True)
        soup = BeautifulSoup(rawc,fromEncoding=self.encoding)
        

        for item in soup.findAll(attrs={'class':'content'}):
            description = ''
            title_prefix = ''
            feed_link = item.find('a',href=True)
            descript = item.find('a')
            if descript:
               description = self.tag_to_string(descript)
            if feed_link:
                url   = feed_link['href']
                title = title_prefix + self.tag_to_string(feed_link)
                date  = strftime(self.timefmt)
                articles.append({
                                  'title'      :title
                                 ,'date'       :date
                                 ,'url'        :url
                                 ,'description':description
                                })
        return [(self.tag_to_string(soup.find('title')), articles)]

    def get_article_url(self, article):
        return article.get('guid',  None)

    def cleanup_page(self, soup):
        for item in soup.findAll(style=True):
            del item['style']
        # Replace each link with its plain text (runs once, not once per styled tag).
        for alink in soup.findAll('a'):
            if alink.string is not None:
                tstr = alink.string
                alink.replaceWith(tstr)
        for div_to_remove in soup.findAll('div', attrs={'id':['persistent-share','inline-share-buttons']}):
            div_to_remove.extract()
        for div_to_remove in soup.findAll('div', attrs={'class':['FacebookLike','shareBar']}):
            div_to_remove.extract()
        for nav_to_remove in soup.findAll('nav', attrs={'class':re.compile("PaginationContent")}):
            nav_to_remove.extract()
        for image in soup.findAll('img', attrs={'alt': 'article image'}):
            image.extract()

    def append_page(self, soup, appendtag, position):
        pager = soup.find('a',attrs={'class':'next'})
        if pager:
            nexturl = self.INDEX + pager['href']
            soup2 = self.index_to_soup(nexturl)
            # Tag names are lowercased by the parser, so search for 'article', not 'Article'.
            texttag = soup2.find('article', attrs={'class':re.compile("Article Module")})
            newpos = len(texttag.contents)
            self.append_page(soup2,texttag,newpos)
            texttag.extract()
            self.cleanup_page(appendtag)
            appendtag.insert(position,texttag)
        else:
            self.cleanup_page(appendtag)

    def preprocess_html(self, soup):
        self.append_page(soup, soup.body, 3)
        return self.adeify_images(soup)
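
For reference, parse_index is expected to return a list of (feed title, list of article dicts) pairs, which is what the last line of parse_index above builds. The sketch below shows that shape in plain Python without calibre. build_index and the sample links are hypothetical, and turning relative hrefs into absolute URLs with urljoin is an assumption about what the index page might contain, not something confirmed for the Cracked site.

```python
from urllib.parse import urljoin

INDEX = "http://www.cracked.com"


def build_index(feed_title, links, date=""):
    """Turn (title, href) pairs into the list-of-feeds structure parse_index returns."""
    articles = []
    for title, href in links:
        articles.append({
            "title": title,
            # urljoin leaves absolute URLs alone and resolves relative ones against INDEX.
            "url": urljoin(INDEX + "/", href),
            "date": date,
            "description": "",
        })
    return [(feed_title, articles)]


# Hypothetical links, as they might be scraped from an index page.
links = [
    ("Some list article", "/article_123_some-list.html"),
    ("Another article", "http://www.cracked.com/article_456_another.html"),
]
for feed_title, articles in build_index("Cracked articles", links):
    for a in articles:
        print(feed_title, "|", a["title"], "|", a["url"])
```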

Last edited by limnoski; 04-29-2011 at 02:41 PM. Reason: used quotes instead of code