01-10-2011, 11:23 AM | #1
Junior Member
Posts: 2
Karma: 10
Join Date: Jan 2011
Device: Sony PRS-650
Cracked.com and The Onion
So I have had a go at setting up a recipe for Cracked.com, as I can't see an existing one out there.
The Cracked.com website is a messy piece of work and I can't for the life of me get it to work. The main problems are that I can't get the second pages to append to the first pages, and that I can't figure out how to get rid of all the tables so the text breaks up into pages I can read on my Sony PRS-650. I used the Adventure Gamers recipe and modified it only a little bit:
Code:
from calibre.web.feeds.news import BasicNewsRecipe

class Cracked(BasicNewsRecipe):
    title = u'Cracked.com'
    language = 'en'
    description = "America's Only Humor and Video Site, since 1958"
    publisher = 'Cracked'
    category = 'comedy, lists'
    oldest_article = 7
    delay = 10
    max_articles_per_feed = 50
    no_stylesheets = True
    encoding = 'cp1252'
    remove_javascript = True
    use_embedded_content = False
    INDEX = u'http://www.cracked.com'
    extra_css = """
        .pageheader_type{font-size: x-large; font-weight: bold; color: #828D74}
        .pageheader_title{font-size: xx-large; color: #394128}
        .pageheader_byline{font-size: small; font-weight: bold; color: #394128}
        .score_bg {display: inline; width: 100%; margin-bottom: 2em}
        .score_column_1{ padding-left: 10px; font-size: small; width: 50%}
        .score_column_2{ padding-left: 10px; font-size: small; width: 50%}
        .score_column_3{ padding-left: 10px; font-size: small; width: 50%}
        .score_header{font-size: large; color: #50544A}
        .bodytext{display: block}
        body{font-family: Helvetica,Arial,sans-serif}
    """
    conversion_options = {
        'comment': description,
        'tags': category,
        'publisher': publisher,
        'language': language
    }
    keep_only_tags = [
        dict(name='div', attrs={'class': ['Column1']})
    ]
    remove_tags = [
        dict(name='div', attrs={'id': ['googlead_1', 'fb-like-article', 'comments_section']}),
        dict(name='div', attrs={'class': ['share_buttons_col_1', 'GenericModule1']}),
        dict(name='ul', attrs={'class': ['Nav6']})
    ]
    remove_tags_after = [dict(name='div', attrs={'id': 'fb-like-article'})]
    remove_attributes = ['width', 'height']
    feeds = [(u'Articles', u'http://feeds.feedburner.com/CrackedRSS')]

    def get_article_url(self, article):
        return article.get('guid', None)

    def append_page(self, soup, appendtag, position):
        # Follow the 'next page' link and splice its article text into the first page.
        pager = soup.find('li', attrs={'class': 'forward'})
        if pager:
            nexturl = self.INDEX + pager.a['href']
            soup2 = self.index_to_soup(nexturl)
            texttag = soup2.find('div', attrs={'class': 'Column1'})
            for it in texttag.findAll(style=True):
                del it['style']
            newpos = len(texttag.contents)
            self.append_page(soup2, texttag, newpos)
            texttag.extract()
            appendtag.insert(position, texttag)

    def preprocess_html(self, soup):
        for item in soup.findAll(style=True):
            del item['style']
        self.append_page(soup, soup.body, 3)
        pager = soup.find('div', attrs={'class': 'prev_next'})
        if pager:
            pager.extract()
        return self.adeify_images(soup)
I am also having problems with 'The Onion' recipe. It seems there is something in that site's markup that crashes my Sony PRS-650. I read somewhere else that this is a firmware problem, but that you can simply remove whatever bit of code is causing it. Does anyone know which bit of HTML on The Onion's site might be causing the crash, and how do I actually get rid of it using the recipe that comes with Calibre? I can get all my serious news fine, but if I want a light read of some comedy I seem to be out of luck!
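A note on the table problem: calibre has a built-in linearize_tables conversion option that unrolls layout tables into ordinary flowing text, which is usually what a small-screen reader like the PRS-650 needs; the working recipes later in this thread enable it. A minimal sketch of just the relevant option block, assuming the rest of the recipe stays as above:
Code:
# Sketch only: adding 'linearize_tables' to the recipe's conversion options
# flattens Cracked's layout tables into plain paragraphs.
conversion_options = {
    'comment': description,
    'tags': category,
    'publisher': publisher,
    'language': language,
    'linearize_tables': True
}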
01-12-2011, 02:50 AM | #2
Dances with penguins
Posts: 54
Karma: 10
Join Date: Oct 2010
Device: Sony PRS-350
The Onion recipe also crashes the PRS-350. I've reported this elsewhere, but it never got any attention...
01-13-2011, 02:42 PM | #3
Dances with penguins
Posts: 54
Karma: 10
Join Date: Oct 2010
Device: Sony PRS-350
Anyone have any ideas on The Onion?
02-22-2011, 09:13 AM | #4
Overenthusiastic Noob
Posts: 69
Karma: 896
Join Date: Feb 2011
Location: France
Device: Kindle 3
This is very relevant to my interests.
Unfortunately, I have no programming knowledge whatsoever, and cleaning up the structure and modifying some CSS is the limit of my ability so far. Please do let us know if you find a solution for Cracked.
02-22-2011, 12:04 PM | #5
Guru
Posts: 800
Karma: 194644
Join Date: Dec 2007
Location: Argentina
Device: Kindle Voyage
The Onion recipe is already fixed.
03-26-2011, 08:03 PM | #6
Junior Member
Posts: 1
Karma: 32
Join Date: Mar 2011
Device: Kindle DX
So, since I wanted Cracked.com too, I modified the recipe; here is the working one:
Code:
from calibre.web.feeds.news import BasicNewsRecipe
import re

class Cracked(BasicNewsRecipe):
    title = u'Cracked.com'
    language = 'en'
    description = "America's Only Humor and Video Site, since 1958"
    publisher = 'Cracked'
    category = 'comedy, lists'
    oldest_article = 2
    delay = 10
    max_articles_per_feed = 2
    no_stylesheets = True
    encoding = 'cp1252'
    remove_javascript = True
    use_embedded_content = False
    INDEX = u'http://www.cracked.com'
    extra_css = """
        .pageheader_type{font-size: x-large; font-weight: bold; color: #828D74}
        .pageheader_title{font-size: xx-large; color: #394128}
        .pageheader_byline{font-size: small; font-weight: bold; color: #394128}
        .score_bg {display: inline; width: 100%; margin-bottom: 2em}
        .score_column_1{ padding-left: 10px; font-size: small; width: 50%}
        .score_column_2{ padding-left: 10px; font-size: small; width: 50%}
        .score_column_3{ padding-left: 10px; font-size: small; width: 50%}
        .score_header{font-size: large; color: #50544A}
        .bodytext{display: block}
        body{font-family: Helvetica,Arial,sans-serif}
    """
    conversion_options = {
        'comment': description,
        'tags': category,
        'publisher': publisher,
        'language': language,
        'linearize_tables': True
    }
    keep_only_tags = [
        dict(name='div', attrs={'class': ['Column1']})
    ]
    feeds = [(u'Articles', u'http://feeds.feedburner.com/CrackedRSS')]

    def get_article_url(self, article):
        return article.get('guid', None)

    def cleanup_page(self, soup):
        # Strip inline styles, unwrap links to plain text, and drop ads,
        # share buttons, navigation, pager blocks and decorative images.
        for item in soup.findAll(style=True):
            del item['style']
        for alink in soup.findAll('a'):
            if alink.string is not None:
                tstr = alink.string
                alink.replaceWith(tstr)
        for div_to_remove in soup.findAll('div', attrs={'id': ['googlead_1', 'fb-like-article', 'comments_section']}):
            div_to_remove.extract()
        for div_to_remove in soup.findAll('div', attrs={'class': ['share_buttons_col_1', 'GenericModule1']}):
            div_to_remove.extract()
        for div_to_remove in soup.findAll('div', attrs={'class': re.compile("prev_next")}):
            div_to_remove.extract()
        for ul_to_remove in soup.findAll('ul', attrs={'class': ['Nav6']}):
            ul_to_remove.extract()
        for image in soup.findAll('img', attrs={'alt': 'article image'}):
            image.extract()

    def append_page(self, soup, appendtag, position):
        # Recursively follow the 'next' arrow and splice each page's
        # article text into the first page.
        pager = soup.find('a', attrs={'class': 'next_arrow_active'})
        if pager:
            nexturl = self.INDEX + pager['href']
            soup2 = self.index_to_soup(nexturl)
            texttag = soup2.find('div', attrs={'class': re.compile("userStyled")})
            newpos = len(texttag.contents)
            self.append_page(soup2, texttag, newpos)
            texttag.extract()
            self.cleanup_page(appendtag)
            appendtag.insert(position, texttag)
        else:
            self.cleanup_page(appendtag)

    def preprocess_html(self, soup):
        self.append_page(soup, soup.body, 3)
        return self.adeify_images(soup)
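One caveat with append_page as written: if Cracked changes its markup and the find() call comes back empty, the recipe dies with an AttributeError on texttag.contents, which is roughly what the later posts in this thread run into. A hedged, more defensive variant of the same method, assuming the same class names still apply:
Code:
def append_page(self, soup, appendtag, position):
    # Follow the 'next page' arrow, if there is one.
    pager = soup.find('a', attrs={'class': 'next_arrow_active'})
    if pager:
        nexturl = self.INDEX + pager['href']
        soup2 = self.index_to_soup(nexturl)
        texttag = soup2.find('div', attrs={'class': re.compile('userStyled')})
        if texttag is None:
            # Markup changed: keep the pages collected so far instead of crashing.
            self.cleanup_page(appendtag)
            return
        self.append_page(soup2, texttag, len(texttag.contents))
        texttag.extract()
        self.cleanup_page(appendtag)
        appendtag.insert(position, texttag)
    else:
        self.cleanup_page(appendtag)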
04-19-2011, 03:51 AM | #7
Member
Posts: 18
Karma: 716
Join Date: Jun 2009
Location: San Francisco, CA, USA
Device: Astak EZReader, Sony PRS-350
So...did this recipe break or is it just me? I tried the built-in, tried this one...and nada (okay, summary/blurb but no full article).
What I want, and what it /looks/ like should happen: the starting menu, then the articles menu/summary/blurb, then the actual articles in their entirety.
What I get: the starting menu, the summary/blurb bits, and "blank" pages that are empty except for the Calibre header/footer and a link to where the article was "downloaded" from. If I follow that link, I go to the Cracked site and the article appears without any errors or issues... but that's on my PC, with a browser and an internet connection; I can't do that on my PRS-350.
I have no clue whatsoever about recipes. I've managed to modify a couple of built-ins (mostly commenting/uncommenting the parts that are intended to be modified per user wants), but working out everything that is going on in one as complex as this is so very not my skillset.
There are other recipes I have this same issue with, and I'm still mucking about with them to see if I can work out the problem, but recipes are mostly incomprehensible to me, even with the tutorials out there. I am not a programmer; I've tried to learn a number of programming "things" (languages, partial subsets of instructions, and so on) in the past, and my brain just looks at it and goes... "interesting... what is it? It seems to be words but they make no sense" (and that's with the how-tos and guides, never mind what code samples usually do to my brain).
Someone smarter than I am with programming and recipes... help? Please? I don't even know what info I should give you if the problem is somehow on my end.
04-19-2011, 10:13 AM | #8
Wizard
Posts: 4,004
Karma: 177841
Join Date: Dec 2009
Device: WinMo: IPAQ; Android: HTC HD2, Archos 7o; Java:Gravity T
04-19-2011, 05:47 PM | #9
Member
Posts: 18
Karma: 716
Join Date: Jun 2009
Location: San Francisco, CA, USA
Device: Astak EZReader, Sony PRS-350
It seems that it /should/ be grabbing the full articles, since that is what a number of recipes grab and it appears to be the intended behaviour; for whatever reason these are broken in a way that means the full text isn't coming through. The more important question is: how do I fix it with literally no understanding of the advanced code portion of a recipe?
I know the typical answer is probably "learn the code enough to manipulate it", but I do wish to stress that it isn't that I don't /want/ to learn how to manipulate the code. For whatever reason, no matter how badly I wish to learn, my brain is like teflon when it comes to anything related to programming; the information I read never gets absorbed enough for me to make use of it. And I do come from a technical background, so you'd think I'd at least comprehend the basic gist of it, having been exposed to it so much.
Anyway, I'm having this issue with at least Salon, Jezebel, Cracked, and who knows what others I'd like to read if I could, so it would behoove me to figure out how to fix them myself. Barring that, is anyone willing to take a crack at 'em and see if they just need a simple sort of fix?
If not, I'll cope. I hadn't been reading them on my previous eReader until trying now, so it isn't the end of the world if I can't make them work; I'm just all excited that I can read some of my news on my new Sony PRS-350 and I want to read more.
04-20-2011, 08:55 AM | #10
Wizard
Posts: 4,004
Karma: 177841
Join Date: Dec 2009
Device: WinMo: IPAQ; Android: HTC HD2, Archos 7o; Java:Gravity T
Yes. The recipe has to be fixed. Hopefully the recipe author will do it. If not, you or anyone else can tackle it.
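If it helps narrow things down: the "blank page plus a link" symptom usually means keep_only_tags no longer matches anything on the redesigned pages, so nothing survives the filter except calibre's own header/footer. A quick way to test that guess (this is only a diagnostic, not a fix) is to run the recipe once with the filter disabled:
Code:
# Diagnostic only: an empty keep_only_tags list means no filtering, so the
# whole page comes through. If the article text now shows up (buried in
# site chrome), the problem is the selector, not the download.
keep_only_tags = []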
04-20-2011, 04:00 PM | #11
Member
Posts: 18
Karma: 716
Join Date: Jun 2009
Location: San Francisco, CA, USA
Device: Astak EZReader, Sony PRS-350
04-29-2011, 11:56 AM | #12
Junior Member
Posts: 2
Karma: 10
Join Date: Jan 2011
Device: Sony PRS-650
Hi guys,
I have to say thanks to Nudge for getting this working in the first place; I managed to get a couple of downloads before they changed the website! So I have tried again to get this recipe working. I thought it might be a simple case of changing the tags, but it looks like it is not so simple... Here is as far as I have got with Nudge's recipe:
Code:
from calibre.web.feeds.news import BasicNewsRecipe
import re

class Cracked(BasicNewsRecipe):
    title = u'Cracked.com'
    __author__ = u'Nudgenudge'
    language = 'en'
    description = "America's Only Humor and Video Site, since 1958"
    publisher = 'Cracked'
    category = 'comedy, lists'
    oldest_article = 2
    delay = 10
    max_articles_per_feed = 2
    no_stylesheets = True
    encoding = 'cp1252'
    remove_javascript = True
    use_embedded_content = False
    INDEX = u'http://www.cracked.com'
    extra_css = """
        .pageheader_type{font-size: x-large; font-weight: bold; color: #828D74}
        .pageheader_title{font-size: xx-large; color: #394128}
        .pageheader_byline{font-size: small; font-weight: bold; color: #394128}
        .score_bg {display: inline; width: 100%; margin-bottom: 2em}
        .score_column_1{ padding-left: 10px; font-size: small; width: 50%}
        .score_column_2{ padding-left: 10px; font-size: small; width: 50%}
        .score_column_3{ padding-left: 10px; font-size: small; width: 50%}
        .score_header{font-size: large; color: #50544A}
        .bodytext{display: block}
        body{font-family: Helvetica,Arial,sans-serif}
    """
    conversion_options = {
        'comment': description,
        'tags': category,
        'publisher': publisher,
        'language': language,
        'linearize_tables': True
    }
    keep_only_tags = [
        dict(name='section', attrs={'class': ['body']})
    ]
    feeds = [(u'Articles', u'http://feeds.feedburner.com/CrackedRSS')]

    def get_article_url(self, article):
        return article.get('guid', None)

    def cleanup_page(self, soup):
        # Strip inline styles, unwrap links to plain text, and drop share
        # widgets, pagination and decorative images.
        for item in soup.findAll(style=True):
            del item['style']
        for alink in soup.findAll('a'):
            if alink.string is not None:
                tstr = alink.string
                alink.replaceWith(tstr)
        for div_to_remove in soup.findAll('div', attrs={'id': ['persistent-share', 'inline-share-buttons']}):
            div_to_remove.extract()
        for div_to_remove in soup.findAll('div', attrs={'class': ['FacebookLike', 'shareBar']}):
            div_to_remove.extract()
        for nav_to_remove in soup.findAll('nav', attrs={'class': re.compile("PaginationContent")}):
            nav_to_remove.extract()
        for image in soup.findAll('img', attrs={'alt': 'article image'}):
            image.extract()

    def append_page(self, soup, appendtag, position):
        pager = soup.find('a', attrs={'class': 'next'})
        if pager:
            nexturl = self.INDEX + pager['href']
            soup2 = self.index_to_soup(nexturl)
            # BeautifulSoup lower-cases tag names, so search for 'article',
            # not 'Article'.
            texttag = soup2.find('article', attrs={'class': re.compile("Article Module")})
            newpos = len(texttag.contents)
            self.append_page(soup2, texttag, newpos)
            texttag.extract()
            self.cleanup_page(appendtag)
            appendtag.insert(position, texttag)
        else:
            self.cleanup_page(appendtag)

    def preprocess_html(self, soup):
        self.append_page(soup, soup.body, 3)
        return self.adeify_images(soup)
Code:
from calibre.web.feeds.news import BasicNewsRecipe
from calibre.ebooks.BeautifulSoup import BeautifulSoup
from calibre import strftime
import re

class Cracked(BasicNewsRecipe):
    title = u'Cracked.com'
    __author__ = u'Nudgenudge'
    language = 'en'
    description = "America's Only Humor and Video Site, since 1958"
    publisher = 'Cracked'
    category = 'comedy, lists'
    oldest_article = 2
    delay = 10
    max_articles_per_feed = 2
    no_stylesheets = True
    encoding = 'cp1252'
    remove_javascript = True
    use_embedded_content = False
    INDEX = u'http://www.cracked.com'
    extra_css = """
        .pageheader_type{font-size: x-large; font-weight: bold; color: #828D74}
        .pageheader_title{font-size: xx-large; color: #394128}
        .pageheader_byline{font-size: small; font-weight: bold; color: #394128}
        .score_bg {display: inline; width: 100%; margin-bottom: 2em}
        .score_column_1{ padding-left: 10px; font-size: small; width: 50%}
        .score_column_2{ padding-left: 10px; font-size: small; width: 50%}
        .score_column_3{ padding-left: 10px; font-size: small; width: 50%}
        .score_header{font-size: large; color: #50544A}
        .bodytext{display: block}
        body{font-family: Helvetica,Arial,sans-serif}
    """
    conversion_options = {
        'comment': description,
        'tags': category,
        'publisher': publisher,
        'language': language,
        'linearize_tables': True
    }
    keep_only_tags = [
        dict(name='section', attrs={'class': ['body']})
    ]

    def parse_index(self):
        # Build the article list by scraping the index page instead of the RSS feed.
        articles = []
        rawc = self.index_to_soup('http://www.cracked.com/funny-articles.html', True)
        soup = BeautifulSoup(rawc, fromEncoding=self.encoding)
        for item in soup.findAll(attrs={'class': 'content'}):
            description = ''
            title_prefix = ''
            feed_link = item.find('a', href=True)
            descript = item.find('a')
            if descript:
                description = self.tag_to_string(descript)
            if feed_link:
                url = feed_link['href']
                title = title_prefix + self.tag_to_string(feed_link)
                date = strftime(self.timefmt)
                articles.append({
                    'title': title,
                    'date': date,
                    'url': url,
                    'description': description
                })
        return [(self.tag_to_string(soup.find('title')), articles)]

    def get_article_url(self, article):
        return article.get('guid', None)

    def cleanup_page(self, soup):
        # Same cleanup as above: strip styles, unwrap links, drop share
        # widgets, pagination and decorative images.
        for item in soup.findAll(style=True):
            del item['style']
        for alink in soup.findAll('a'):
            if alink.string is not None:
                tstr = alink.string
                alink.replaceWith(tstr)
        for div_to_remove in soup.findAll('div', attrs={'id': ['persistent-share', 'inline-share-buttons']}):
            div_to_remove.extract()
        for div_to_remove in soup.findAll('div', attrs={'class': ['FacebookLike', 'shareBar']}):
            div_to_remove.extract()
        for nav_to_remove in soup.findAll('nav', attrs={'class': re.compile("PaginationContent")}):
            nav_to_remove.extract()
        for image in soup.findAll('img', attrs={'alt': 'article image'}):
            image.extract()

    def append_page(self, soup, appendtag, position):
        pager = soup.find('a', attrs={'class': 'next'})
        if pager:
            nexturl = self.INDEX + pager['href']
            soup2 = self.index_to_soup(nexturl)
            # BeautifulSoup lower-cases tag names, so search for 'article',
            # not 'Article'.
            texttag = soup2.find('article', attrs={'class': re.compile("Article Module")})
            newpos = len(texttag.contents)
            self.append_page(soup2, texttag, newpos)
            texttag.extract()
            self.cleanup_page(appendtag)
            appendtag.insert(position, texttag)
        else:
            self.cleanup_page(appendtag)

    def preprocess_html(self, soup):
        self.append_page(soup, soup.body, 3)
        return self.adeify_images(soup)
Last edited by limnoski; 04-29-2011 at 02:41 PM. Reason: used quotes instead of code
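For anyone picking this up later: when a redesign breaks keep_only_tags like this, the usual first step is to fetch one article and list the block-level containers, to see what actually wraps the article text now. A minimal, self-contained probe sketch (the URL is a placeholder, and nothing in it is specific to the recipe above):
Code:
# Hypothetical probe script for BeautifulSoup 3, the version calibre bundles.
# It prints every div/section/article that carries an id or a class, so you
# can pick a keep_only_tags selector by eye.
import urllib2
from calibre.ebooks.BeautifulSoup import BeautifulSoup

url = 'http://www.cracked.com/some-article.html'  # placeholder: paste a real article URL
html = urllib2.urlopen(url).read()
soup = BeautifulSoup(html)
for tag in soup.findAll(['div', 'section', 'article']):
    ident = tag.get('id', '')
    cls = tag.get('class', '')
    if ident or cls:
        print tag.name, 'id=%r' % ident, 'class=%r' % cls
Running it with calibre-debug -e probe.py keeps it on calibre's bundled Python, so the BeautifulSoup import resolves.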