Help with Boston Globe RSS recipe

horsegoalie · 12-15-2009, 02:35 PM

I am having trouble getting started on a Boston Globe (boston.com) recipe.

I would probably be happy to use the RSS feed version rather than a full custom version, at least to start, but I am having an issue there. Basically, on the RSS site, the link pointed to is a "garbled" link such as: http://feeds.boston.com/click.phdo?i...888976244b67bf
This link resolves to: http://www.boston.com/news/health/ar...id=Top+Stories

Calibre on its own does not handle this properly, and I don't know how to "substitute" the real link for the "garbled" link. Also, I would really like the print version, which is one step further removed at:
http://www.boston.com/news/health/ar...enough?mode=PF

After I get this working I may try to do something more fancy with a "full custom" version like the NYTimes example on the site. The issue with this is that the classes used on the Globe site are not nice like the Times site. Any help on either mechanism would be appreciated.

Scott

kovidgoyal · 12-15-2009, 05:24 PM

Add

Code:

    def get_article_url(self, a):
        return a.get('guid').split('?')[0]+'?mode=PF'

to your recipe
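
For readers new to Python, here is roughly what that method does to a single feed entry. This is only a standalone sketch: the entry is modelled as a plain dict, the function name is made up for illustration, and the guid is a placeholder rather than a real boston.com URL.

```python
def print_url_from_guid(entry):
    """Mimic the get_article_url method above on a dict-like feed entry."""
    guid = entry.get('guid')
    # Drop any query string, then ask for the printer-friendly page.
    return guid.split('?')[0] + '?mode=PF'


# Hypothetical entry; the guid is a placeholder, not a real article URL.
entry = {'guid': 'http://www.example.com/news/story.html?rss_id=Top+Stories'}
print(print_url_from_guid(entry))
```

The point is that the guid field, not the click.phdo link, is used as the starting point, so no redirect needs to be followed.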

horsegoalie · 12-15-2009, 06:45 PM

Thanks for the help. This does part of what is desired... The issue still remains that the link on the RSS page looks like:
http://feeds.boston.com/click.phdo?i...427913ecb9a0d8
But, the link I need to work with looks like:
http://www.boston.com/news/health/ar...id=Top+Stories

Calibre does not follow the top link. The Ebook page does list it, and if I click in the ebook on the link it is displayed, but the content from the page does not make it into the ebook. Also, there is no way to add the "print only page" to this. Is it possible from the script to resolve the readable link from the click.phdo link listed above?

Thanks again

kovidgoyal · 12-15-2009, 07:04 PM

the rss feed contains both links, the code i posted will use the correct link.

horsegoalie · 12-15-2009, 07:17 PM

I apologize, I am sure I'm missing something stupid, and my Python is non-existent. Here is my code in total. It does not produce the desired results. What am I doing wrong? I will be moving along to the python tutorial next, so maybe that will give me the answers...

class AdvancedUserRecipe1260919720(BasicNewsRecipe):
    title = u'CCC'
    oldest_article = 7
    max_articles_per_feed = 100

    feeds = [(u'Boston Globe', u'http://feeds.boston.com/boston/topstories')]

    def get_article_url(self, a):
        return a.get('guid').split('?')[0]+'?mode=PF'

horsegoalie · 12-16-2009, 11:16 PM

OK, so I have this working now, but there is a "new" issue. The Boston Globe is not my friend right now... The code now looks at the "real" link, not the pheedo link. That is good. It also adds the ?mode=PF to the end. The link now looks like:
http://www.boston.com/business/ticke...4.html?mode=PF
If you go to that link, the Globe will strip the ?mode=PF off the back end, leaving you with:
http://www.boston.com/business/ticke...merica_24.html
If you click on the print icon on the web page, it will bring you back to the link I originally wanted to use. Any ideas how to work around this?

evanmaastrigt · 12-17-2009, 06:00 AM

Quote:

Originally Posted by horsegoalie

If you go to that link, the Globe will strip the ?mode=PF off the back end, leaving you with:
http://www.boston.com/business/ticke...merica_24.html
If you click on the print icon on the web page, it will bring you back to the link I originally wanted to use. Any ideas how to work around this?

They are looking at the Referer header. If it is not set to the URL of the original page, you do not get the print version. I set that header with Tamper Data in Firefox and got to the print version all right.

So adding that header to the browser's request might work, but I cannot find how to do that in the docs for Mechanize.
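
As an illustration of the idea only, the sketch below builds a request for the print page with a Referer header pointing at the article page. It uses urllib.request from the Python standard library, not Mechanize, and it assumes the site only checks that Referer matches the article's own URL. The function name and the example.com URL are hypothetical, and nothing is actually fetched.

```python
import urllib.request


def build_print_request(article_url):
    """Build a print-page request that claims to come from the article page."""
    base = article_url.split('?')[0]
    print_url = base + '?mode=PF'
    # Assumption: the server serves the print view when Referer is the article URL.
    return urllib.request.Request(print_url, headers={'Referer': base})


# Placeholder URL; no network access happens here.
req = build_print_request('http://www.example.com/business/story.html')
print(req.full_url)
print(req.get_header('Referer'))
```

Whether the same header can be passed through the browser object a recipe uses is not confirmed by this sketch.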

kiklop74 · 12-17-2009, 07:38 AM

You all are just complicating things. This is a fully working recipe for boston.com, just fill in feeds you need.

kiklop74 · 12-17-2009, 08:33 AM

BTW Kovid, linearize_tables never gives good results for epub. Would you consider using something like this as a replacement?

Code:

    def preprocess_html(self, soup):
        attribs = [  'style','font','valign'
                    ,'colspan','width','height'
                    ,'rowspan','summary','align'
                    ,'cellspacing','cellpadding'
                    ,'frames','rules','border'
                  ]
        for item in soup.body.findAll(name=['table','td','tr','th','caption','thead','tfoot','tbody','colgroup','col']):
            item.name = 'div'
            for attrib in attribs:
                if item.has_key(attrib):
                    del item[attrib]
        return soup

horsegoalie · 12-17-2009, 10:36 AM

Thanks for the help. This works great on the top stories rss feed, but does not work on any of the other feeds. An example is the "Patriots" feed. Here is the feeds line I used. The Top works great, the Patriots does not work (though the web address is fine in Chrome).

feeds = [
    (u'Top', u'http://feeds.boston.com/boston/topstories'),
    (u'Patriots', u'http://feeds.boston.com/boston/sports/football/patriots')
]

Edit:
I found out some more information. The top stories feed points to a link like:
http://www.boston.com/......./?rss_id=Top+Stories
while all the others point to a link like:
http://www.boston.com/.......?rss_id...+Patriots+news

Notice the missing slash before the ?rss_id. I think I can just change your partition statement to split on rss_id instead of /.
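
To make the difference between the two link shapes concrete, here is a minimal sketch of cutting the URL at the rss_id marker instead of at a slash. It is not the code from the posted recipe (which is not reproduced in this thread); the function name and the example.com URLs are placeholders chosen only to show one link with a trailing slash and one without.

```python
def to_print_url(url):
    """Cut the URL at the rss_id query marker and ask for the print page."""
    # str.partition returns the whole string as the first item
    # when the separator is not found, so plain URLs pass through.
    base = url.partition('?rss_id')[0]
    return base + '?mode=PF'


# Hypothetical links: one with a slash before ?rss_id, one without.
urls = [
    'http://www.example.com/news/story/?rss_id=Top+Stories',
    'http://www.example.com/sports/story.html?rss_id=Patriots',
]
for u in urls:
    print(to_print_url(u))
```

Splitting on the query marker means the result no longer depends on whether a slash happens to precede it.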

kovidgoyal · 12-17-2009, 10:48 AM

@darkom: That's pretty much what linearize_tables does currently

Code:

 def linearize(self, root):
        for x in XPath('//h:table|//h:td|//h:tr|//h:th|//h:caption|'
                '//h:tbody|//h:tfoot|//h:thead|//h:colgroup|//h:col')(root):
            x.tag = XHTML('div')
            for attr in ('style', 'font', 'valign',
                         'colspan', 'width', 'height',
                         'rowspan', 'summary', 'align',
                         'cellspacing', 'cellpadding',
                         'frames', 'rules', 'border'):
                if attr in x.attrib:
                    del x.attrib[attr]
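
Both snippets in this exchange do the same thing: rename every table-related element to div and drop layout attributes. The sketch below shows that technique on a tiny document using only the standard library's xml.etree.ElementTree, so it is neither the BeautifulSoup version nor calibre's own code. The sample markup and names such as TABLE_TAGS are made up for illustration.

```python
import xml.etree.ElementTree as ET

TABLE_TAGS = {'table', 'td', 'tr', 'th', 'caption',
              'thead', 'tfoot', 'tbody', 'colgroup', 'col'}
LAYOUT_ATTRS = ('style', 'font', 'valign', 'colspan', 'width', 'height',
                'rowspan', 'summary', 'align', 'cellspacing', 'cellpadding',
                'frames', 'rules', 'border')


def linearize(root):
    """Rename table elements to div and drop layout attributes, in place."""
    for el in root.iter():
        if el.tag in TABLE_TAGS:
            el.tag = 'div'
            for attr in LAYOUT_ATTRS:
                # pop with a default avoids a KeyError for absent attributes.
                el.attrib.pop(attr, None)
    return root


# Made-up sample markup with a one-cell layout table.
html = ('<body><table width="100%" border="0"><tr>'
        '<td align="left" class="story">Text</td></tr></table></body>')
root = linearize(ET.fromstring(html))
print(ET.tostring(root, encoding='unicode'))
```

Attributes outside the list, such as class, are left alone, so any CSS rules attached to them still apply to the resulting div elements.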

horsegoalie · 12-17-2009, 11:52 AM

Just wanted to finish up here with the globe RSS reader. I have this working now, the fix I mentioned above did work. The current version downloads a ton of feeds, I will probably break this into multiple books, but that is for later. Thanks for all the help.

Scott

kiklop74 · 12-17-2009, 12:56 PM

Here is an updated and optimized recipe for boston.com that works for all feeds.

kiklop74 · 12-17-2009, 12:59 PM

Quote:

Originally Posted by kovidgoyal

@darkom: That's pretty much what linearize_tables does currently

Code:

 def linearize(self, root):
        for x in XPath('//h:table|//h:td|//h:tr|//h:th|//h:caption|'
                '//h:tbody|//h:tfoot|//h:thead|//h:colgroup|//h:col')(root):
            x.tag = XHTML('div')
            for attr in ('style', 'font', 'valign',
                         'colspan', 'width', 'height',
                         'rowspan', 'summary', 'align',
                         'cellspacing', 'cellpadding',
                         'frames', 'rules', 'border'):
                if attr in x.attrib:
                    del x.attrib[attr]

Well, something is not being done right. For example, if you take the boston.com recipe I just posted (which has tables), remove keep_only_tags and add the linearize_tables option, you will see that the generated epub displays incorrectly in Adobe DE. However, if you add the table-removal code I posted, then the generated epub displays correctly in Adobe DE and on the Sony Reader. I suggest you compare the output to see what the difference is and perhaps improve the code.

kovidgoyal · 12-17-2009, 06:56 PM

I think it was caused by the fact that linearize_tables was running after the CSS flattening code, so some of the CSS was preserved (moved into a class) even though the attributes were deleted. Will be fixed in the next release.
