12-15-2009, 02:35 PM | #1 |
Junior Member
Posts: 9
Karma: 10
Join Date: Dec 2009
Device: Nook
|
Help with Boston Globe RSS recipe
I am having trouble getting started on a Boston Globe (boston.com) recipe.
I would be happy to use the RSS feed version rather than a full custom version, at least to start, but I am having an issue there. On the RSS site, the link pointed to is a "garbled" link such as:
http://feeds.boston.com/click.phdo?i...888976244b67bf
This link resolves to:
http://www.boston.com/news/health/ar...id=Top+Stories
Calibre on its own does not handle this properly, and I don't know how to "substitute" the real link for the "garbled" link. Also, I would really like the print version, which is one step further removed at:
http://www.boston.com/news/health/ar...enough?mode=PF
After I get this working I may try something fancier with a "full custom" version like the NYTimes example on the site. The issue with this is that the classes used on the Globe site are not as clean as those on the Times site. Any help on either mechanism would be appreciated.
Scott |
12-15-2009, 05:24 PM | #2 |
creator of calibre
Posts: 44,337
Karma: 23661992
Join Date: Oct 2006
Location: Mumbai, India
Device: Various
|
Add
Code:
def get_article_url(self, a):
    return a.get('guid').split('?')[0]+'?mode=PF'
|
12-15-2009, 06:45 PM | #3 |
Junior Member
Posts: 9
Karma: 10
Join Date: Dec 2009
Device: Nook
|
Thanks for the help. This does part of what is desired... The issue still remains that the link on the RSS page looks like:
http://feeds.boston.com/click.phdo?i...427913ecb9a0d8
But the link I need to work with looks like:
http://www.boston.com/news/health/ar...id=Top+Stories
Calibre does not follow the top link. The ebook page does list it, and if I click the link in the ebook it is displayed, but the content from the page does not make it into the ebook. Also, there is no way to add the "print only page" to this. Is it possible from the script to resolve the readable link from the click.phdo link listed above?
Thanks again |
12-15-2009, 07:04 PM | #4 |
creator of calibre
Posts: 44,337
Karma: 23661992
Join Date: Oct 2006
Location: Mumbai, India
Device: Various
|
The RSS feed contains both links; the code I posted will use the correct one.
|
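To make the reply above concrete, here is a minimal standalone sketch of what the posted one-liner does. It assumes, as the post states, that each feed entry carries both the redirect link and the real article URL, with the real one in `guid`. The dictionary and both URLs are hypothetical placeholders, and a plain `dict` stands in for the entry object calibre would pass in.

```python
# Hypothetical feed entry; the values are placeholders, not real Globe URLs.
entry = {
    'link': 'http://feeds.boston.com/click.phdo?i=PLACEHOLDER',
    'guid': 'http://www.boston.com/example/article.html?rss_id=Top+Stories',
}


def get_article_url(entry):
    """Take the guid, drop its query string, and ask for the print view."""
    base = entry.get('guid').split('?')[0]
    return base + '?mode=PF'


print(get_article_url(entry))
```

The redirect in `link` is never touched; only `guid` is used to build the URL that gets downloaded.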
12-15-2009, 07:17 PM | #5 |
Junior Member
Posts: 9
Karma: 10
Join Date: Dec 2009
Device: Nook
|
I apologize, I am sure I'm missing something stupid, and my Python is non-existent. Here is my code in total. It does not produce the desired results. What am I doing wrong? I will be moving along to the python tutorial next, so maybe that will give me the answers...
class AdvancedUserRecipe1260919720(BasicNewsRecipe):
    title = u'CCC'
    oldest_article = 7
    max_articles_per_feed = 100
    feeds = [(u'Boston Globe', u'http://feeds.boston.com/boston/topstories')]

    def get_article_url(self, a):
        return a.get('guid').split('?')[0]+'?mode=PF'
|
12-16-2009, 11:16 PM | #6 |
Junior Member
Posts: 9
Karma: 10
Join Date: Dec 2009
Device: Nook
|
OK, so I have this working now, but there is a "new" issue. The Boston Globe is not my friend right now... The code now looks at the "real" link, not the pheedo link. That is good. It also adds the ?mode=PF to the end. The link now looks like:
http://www.boston.com/business/ticke...4.html?mode=PF
If you go to that link, the Globe will strip the ?mode=PF off the back end, leaving you with:
http://www.boston.com/business/ticke...merica_24.html
If you click on the print icon on the web page, it will bring you back to the link I originally wanted to use. Any ideas how to work around this? |
12-17-2009, 06:00 AM | #7 | |
Connoisseur
Posts: 78
Karma: 192
Join Date: Nov 2009
Device: Sony PRS-600
|
Quote:
So adding that header to the browser's request might work, but I cannot find how to do that in the Mechanize docs. |
|
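On the question of adding a request header: the sketch below uses the standard library's `urllib.request` opener rather than Mechanize itself, so it is an analogy and not the Mechanize API. An opener keeps an `addheaders` list that is sent with every request. Mechanize's `Browser` is commonly used with an attribute of the same name, but that should be checked against the installed version. The thread does not show which header is meant, so the `Referer` header and its value here are purely hypothetical.

```python
import urllib.request

# Build an opener; its addheaders list is attached to every request it makes.
opener = urllib.request.build_opener()

# Hypothetical header: the thread does not say which header is required.
extra_header = ('Referer', 'http://www.boston.com/')

# Keep the default headers and append the extra one.
opener.addheaders = opener.addheaders + [extra_header]

print(opener.addheaders)
```

No request is sent here; the sketch only shows where the header would be registered.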
12-17-2009, 07:38 AM | #8 |
Guru
Posts: 800
Karma: 194644
Join Date: Dec 2007
Location: Argentina
Device: Kindle Voyage
|
You all are just complicating things. This is a fully working recipe for boston.com; just fill in the feeds you need.
|
12-17-2009, 08:33 AM | #9 |
Guru
Posts: 800
Karma: 194644
Join Date: Dec 2007
Location: Argentina
Device: Kindle Voyage
|
BTW Kovid, linearize_tables never gives good results for EPUB. Would you consider using something like this as a replacement?
Code:
def preprocess_html(self, soup):
    attribs = [
        'style', 'font', 'valign',
        'colspan', 'width', 'height',
        'rowspan', 'summary', 'align',
        'cellspacing', 'cellpadding',
        'frames', 'rules', 'border'
    ]
    for item in soup.body.findAll(name=['table', 'td', 'tr', 'th', 'caption', 'thead', 'tfoot', 'tbody', 'colgroup', 'col']):
        item.name = 'div'
        for attrib in attribs:
            if item.has_key(attrib):
                del item[attrib]
    return soup
|
12-17-2009, 10:36 AM | #10 |
Junior Member
Posts: 9
Karma: 10
Join Date: Dec 2009
Device: Nook
|
Thanks for the help. This works great on the top stories rss feed, but does not work on any of the other feeds. An example is the "Patriots" feed. Here is the feeds line I used. The Top works great, the Patriots does not work (though the web address is fine in Chrome).
feeds = [
    (u'Top', u'http://feeds.boston.com/boston/topstories'),
    (u'Patriots', u'http://feeds.boston.com/boston/sports/football/patriots')
]
Edit: I found some more information. The top stories feed points to a link like:
http://www.boston.com/......./?rss_id=Top+Stories
while all others point to a link like:
http://www.boston.com/.......?rss_id...+Patriots+news
Notice the missing slash before the ?rss_id. I think I can just change your partition statement to use rss_id as the replacement for /.
Last edited by horsegoalie; 12-17-2009 at 10:49 AM. Reason: More information |
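The proposed fix can be sketched as follows. The recipe's actual partition statement is not shown in the thread, so this is only an illustration of the idea: splitting on `?rss_id` instead of on a slash handles both URL shapes. The helper name `strip_rss_id` and both URLs are hypothetical.

```python
def strip_rss_id(url):
    """Return the part of url that precedes the first '?rss_id' marker."""
    # str.partition returns (head, separator, tail); if the separator is
    # absent, head is the whole string and the other two parts are empty.
    head, _sep, _tail = url.partition('?rss_id')
    return head


# Hypothetical URLs that mimic the two shapes described in the post.
top_story = 'http://www.boston.com/example/story/?rss_id=Top+Stories'
patriots = 'http://www.boston.com/example/story.html?rss_id=Patriots+news'

print(strip_rss_id(top_story))
print(strip_rss_id(patriots))
```

Because the separator no longer includes the slash, the trailing slash in the top stories URL stays in the result, and URLs without one are handled the same way.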
12-17-2009, 10:48 AM | #11 |
creator of calibre
Posts: 44,337
Karma: 23661992
Join Date: Oct 2006
Location: Mumbai, India
Device: Various
|
@darkom: That's pretty much what linearize_tables does currently
Code:
def linearize(self, root):
    for x in XPath('//h:table|//h:td|//h:tr|//h:th|//h:caption|'
                   '//h:tbody|//h:tfoot|//h:thead|//h:colgroup|//h:col')(root):
        x.tag = XHTML('div')
        for attr in ('style', 'font', 'valign',
                     'colspan', 'width', 'height',
                     'rowspan', 'summary', 'align',
                     'cellspacing', 'cellpadding',
                     'frames', 'rules', 'border'):
            if attr in x.attrib:
                del x.attrib[attr]
|
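Both snippets in this exchange apply the same technique: rename every table-related element to `div` and drop its layout attributes. The sketch below reproduces that idea with the standard library's `xml.etree.ElementTree` so it can run outside calibre. It is a stand-in for illustration, not the BeautifulSoup or lxml code shown above, and the input markup is hypothetical.

```python
import xml.etree.ElementTree as ET

TABLE_TAGS = {'table', 'td', 'tr', 'th', 'caption',
              'thead', 'tfoot', 'tbody', 'colgroup', 'col'}
LAYOUT_ATTRS = ('style', 'font', 'valign', 'colspan', 'width', 'height',
                'rowspan', 'summary', 'align', 'cellspacing',
                'cellpadding', 'frames', 'rules', 'border')


def linearize_tables(root):
    """Rename table elements to div and drop layout attributes, in place."""
    for elem in root.iter():
        if elem.tag in TABLE_TAGS:
            elem.tag = 'div'
            for attr in LAYOUT_ATTRS:
                # pop with a default ignores attributes that are not present.
                elem.attrib.pop(attr, None)
    return root


# Hypothetical input markup, used only for illustration.
doc = ET.fromstring(
    '<body><table width="100%" border="1">'
    '<tr><td align="left">Cell text</td></tr>'
    '</table></body>'
)
linearize_tables(doc)
result = ET.tostring(doc, encoding='unicode')
print(result)
```

As in the two originals, attributes are removed only from the renamed table elements; other elements are left untouched.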
12-17-2009, 11:52 AM | #12 |
Junior Member
Posts: 9
Karma: 10
Join Date: Dec 2009
Device: Nook
|
Just wanted to finish up here with the Globe RSS reader. I have this working now; the fix I mentioned above did work. The current version downloads a ton of feeds, so I will probably break this into multiple books, but that is for later. Thanks for all the help.
Scott |
12-17-2009, 12:56 PM | #13 |
Guru
Posts: 800
Karma: 194644
Join Date: Dec 2007
Location: Argentina
Device: Kindle Voyage
|
Here is an updated and optimized recipe for boston.com that works for all feeds.
|
12-17-2009, 12:59 PM | #14 | |
Guru
Posts: 800
Karma: 194644
Join Date: Dec 2007
Location: Argentina
Device: Kindle Voyage
|
Quote:
|
|
12-17-2009, 06:56 PM | #15 |
creator of calibre
Posts: 44,337
Karma: 23661992
Join Date: Oct 2006
Location: Mumbai, India
Device: Various
|
I think it was caused by the fact that linearize_tables was running after the CSS flattening code, so some of the CSS was preserved (moved into a class) even though the attributes were deleted. This will be fixed in the next release.
|
|