MobileRead Forums > E-Book Software > Calibre

Old 12-15-2009, 02:35 PM   #1
horsegoalie
Junior Member
horsegoalie began at the beginning.
 
Posts: 9
Karma: 10
Join Date: Dec 2009
Device: Nook
Help with Boston Globe RSS recipe

I am having trouble getting started on a Boston Globe (boston.com) recipe.

I would probably be happy to use the RSS feed version rather than a full custom version, at least to start, but I am having an issue there. Basically, on the RSS site, the link pointed to is a "garbled" link such as: http://feeds.boston.com/click.phdo?i...888976244b67bf
This link resolves to: http://www.boston.com/news/health/ar...id=Top+Stories

Calibre on its own does not handle this properly, and I don't know how to "substitute" the real link for the "garbled" link. Also, I would really like the print version, which is one step further removed, at:
http://www.boston.com/news/health/ar...enough?mode=PF

After I get this working I may try to do something more fancy with a "full custom" version like the NYTimes example on the site. The issue with this is that the classes used on the Globe site are not nice like the Times site. Any help on either mechanism would be appreciated.

Scott
Old 12-15-2009, 05:24 PM   #2
kovidgoyal
creator of calibre
kovidgoyal ought to be getting tired of karma fortunes by now.
Posts: 25,622
Karma: 4998447
Join Date: Oct 2006
Location: Mumbai, India
Device: Various
Add

Code:
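def get_article_url(self, a):
    # Use the feed entry's guid (the real article URL), drop its query string,
    # and request the print-friendly version instead.
    return a.get('guid').split('?')[0] + '?mode=PF'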
to your recipe
Old 12-15-2009, 06:45 PM   #3
horsegoalie
Junior Member
horsegoalie began at the beginning.
 
Posts: 9
Karma: 10
Join Date: Dec 2009
Device: Nook
Thanks for the help. This does part of what is desired... The issue still remains that the link on the RSS page looks like:
http://feeds.boston.com/click.phdo?i...427913ecb9a0d8
But, the link I need to work with looks like:
http://www.boston.com/news/health/ar...id=Top+Stories

Calibre does not follow the top link. The Ebook page does list it, and if I click in the ebook on the link it is displayed, but the content from the page does not make it into the ebook. Also, there is no way to add the "print only page" to this. Is it possible from the script to resolve the readable link from the click.phdo link listed above?

Thanks again
Old 12-15-2009, 07:04 PM   #4
kovidgoyal
creator of calibre
kovidgoyal ought to be getting tired of karma fortunes by now.
Posts: 25,622
Karma: 4998447
Join Date: Oct 2006
Location: Mumbai, India
Device: Various
The RSS feed contains both links; the code I posted will use the correct one.
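
To illustrate why the `guid` field helps, here is a minimal sketch that uses a plain dict in place of the parsed feed entry. Both URLs are made-up placeholders, not real Globe links, and the `link`/`guid` layout is assumed from the description in this thread.

```python
# Hypothetical feed entry: 'link' is the redirect, 'guid' is the real article URL.
entry = {
    'link': 'http://feeds.boston.com/click.phdo?i=PLACEHOLDER',
    'guid': 'http://www.boston.com/news/example/story/?rss_id=Top+Stories',
}


def article_url(a):
    # Same expression as in the recipe method: keep everything before the
    # first '?' in the guid, then ask for the print-friendly mode.
    return a.get('guid').split('?')[0] + '?mode=PF'


print(article_url(entry))
```

The redirect in `link` is never touched, so it does not need to be resolved at all.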
Old 12-15-2009, 07:17 PM   #5
horsegoalie
Junior Member
horsegoalie began at the beginning.
 
Posts: 9
Karma: 10
Join Date: Dec 2009
Device: Nook
I apologize, I am sure I'm missing something stupid, and my Python is non-existent. Here is my code in total. It does not produce the desired results. What am I doing wrong? I will be moving along to the python tutorial next, so maybe that will give me the answers...

class AdvancedUserRecipe1260919720(BasicNewsRecipe):
    title = u'CCC'
    oldest_article = 7
    max_articles_per_feed = 100

    feeds = [(u'Boston Globe', u'http://feeds.boston.com/boston/topstories')]

    def get_article_url(self, a):
        return a.get('guid').split('?')[0]+'?mode=PF'
Old 12-16-2009, 11:16 PM   #6
horsegoalie
Junior Member
horsegoalie began at the beginning.
 
Posts: 9
Karma: 10
Join Date: Dec 2009
Device: Nook
OK, so I have this working now, but there is a "new" issue. The Boston Globe is not my friend right now... The code now looks at the "real" link, not the pheedo link. That is good. It also adds the ?mode=PF to the end. The link now looks like:
http://www.boston.com/business/ticke...4.html?mode=PF
If you go to that link, the Globe will strip the ?mode=PF off the back end, leaving you with:
http://www.boston.com/business/ticke...merica_24.html
If you click on the print icon on the web page, it will bring you back to the link I originally wanted to use. Any ideas how to work around this?
Old 12-17-2009, 06:00 AM   #7
evanmaastrigt
Connoisseur
evanmaastrigt doesn't litter
 
Posts: 78
Karma: 192
Join Date: Nov 2009
Device: Sony PRS-600
Quote:
Originally Posted by horsegoalie View Post
If you go to that link, the Globe will strip the ?mode=PF off the back end, leaving you with:
http://www.boston.com/business/ticke...merica_24.html
If you click on the print icon on the web page, it will bring you back to the link I originally wanted to use. Any ideas how to work around this?
They are looking at the Referer header. If it is not set to the URL of the original page, you do not get the print version. I set that header with Tamper Data in Firefox and got to the print version all right.

So adding that header to the browser's request might work, but I cannot find how to do that in the Mechanize docs.
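
As a rough sketch of the idea, the following builds a request for the print version with the Referer set to the original article URL. It uses Python's standard `urllib.request` rather than Mechanize, the helper name `build_print_request` and the example URL are hypothetical, and whether the site accepts such a request is untested here.

```python
from urllib.request import Request


def build_print_request(article_url):
    """Return a Request for the print view that claims to come from the article page."""
    print_url = article_url + '?mode=PF'
    # The Referer header tells the server which page the request came from.
    return Request(print_url, headers={'Referer': article_url})


# Hypothetical article URL, used only to show the resulting request.
req = build_print_request('http://www.boston.com/news/example/story.html')
print(req.full_url)
print(req.get_header('Referer'))
```

Nothing is sent over the network here; passing `req` to `urllib.request.urlopen` would perform the actual fetch.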
Old 12-17-2009, 07:38 AM   #8
kiklop74
Guru
kiklop74 can program the VCR without an owner's manual.
Posts: 780
Karma: 194642
Join Date: Dec 2007
Location: Argentina
Device: Kindle PaperWhite, Motorola Xoom
You all are just complicating things. This is a fully working recipe for boston.com, just fill in feeds you need.
Attached Files
File Type: zip boston.com.zip (932 Bytes, 99 views)
Old 12-17-2009, 08:33 AM   #9
kiklop74
Guru
kiklop74 can program the VCR without an owner's manual.
Posts: 780
Karma: 194642
Join Date: Dec 2007
Location: Argentina
Device: Kindle PaperWhite, Motorola Xoom
BTW Kovid linearize_tables never gives good results for epub. Would you consider using something like this as replacement?

Code:
    def preprocess_html(self, soup):
        attribs = [  'style','font','valign'
                    ,'colspan','width','height'
                    ,'rowspan','summary','align'
                    ,'cellspacing','cellpadding'
                    ,'frames','rules','border'
                  ]
        for item in soup.body.findAll(name=['table','td','tr','th','caption','thead','tfoot','tbody','colgroup','col']):
            item.name = 'div'
            for attrib in attribs:
                if item.has_key(attrib):
                    del item[attrib]
        return soup
Old 12-17-2009, 10:36 AM   #10
horsegoalie
Junior Member
horsegoalie began at the beginning.
 
Posts: 9
Karma: 10
Join Date: Dec 2009
Device: Nook
Thanks for the help. This works great on the top stories rss feed, but does not work on any of the other feeds. An example is the "Patriots" feed. Here is the feeds line I used. The Top works great, the Patriots does not work (though the web address is fine in Chrome).

feeds = [
    (u'Top', u'http://feeds.boston.com/boston/topstories'),
    (u'Patriots', u'http://feeds.boston.com/boston/sports/football/patriots')
]

Edit:
I found out some more information. The Top Stories feed points to a link like:
http://www.boston.com/......./?rss_id=Top+Stories
while all the others point to a link like:
http://www.boston.com/.......?rss_id...+Patriots+news

Notice the missing slash before the ?rss_id. I think I can just change your partition statement to use rss_id as the replacement for /.
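
A minimal sketch of that change, assuming the print URL is formed by cutting the article URL at `?rss_id` and appending `?mode=PF`. The attached recipe is not shown in the thread, so the function name `to_print_url` and both example URLs are hypothetical, and stripping the trailing slash is also an assumption.

```python
def to_print_url(url):
    # partition() splits at the first occurrence of the separator and returns
    # (head, separator, tail); only the part before '?rss_id' is kept.
    head, _sep, _tail = url.partition('?rss_id')
    # Top Stories links end in '/' before the query string, other feeds do not;
    # dropping the slash (an assumption) gives both the same shape.
    return head.rstrip('/') + '?mode=PF'


# Hypothetical URLs in the two shapes described above.
top = to_print_url('http://www.boston.com/news/example/story/?rss_id=Top+Stories')
pats = to_print_url('http://www.boston.com/sports/example/story.html?rss_id=Patriots+news')
print(top)
print(pats)
```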

Last edited by horsegoalie; 12-17-2009 at 10:49 AM. Reason: More information
Old 12-17-2009, 10:48 AM   #11
kovidgoyal
creator of calibre
kovidgoyal ought to be getting tired of karma fortunes by now.
Posts: 25,622
Karma: 4998447
Join Date: Oct 2006
Location: Mumbai, India
Device: Various
@darkom: That's pretty much what linearize_tables does currently

Code:
    def linearize(self, root):
        for x in XPath('//h:table|//h:td|//h:tr|//h:th|//h:caption|'
                '//h:tbody|//h:tfoot|//h:thead|//h:colgroup|//h:col')(root):
            x.tag = XHTML('div')
            for attr in ('style', 'font', 'valign',
                         'colspan', 'width', 'height',
                         'rowspan', 'summary', 'align',
                         'cellspacing', 'cellpadding',
                         'frames', 'rules', 'border'):
                if attr in x.attrib:
                    del x.attrib[attr]
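
For readers without calibre's internals at hand, the same idea can be sketched with the standard library. This is an illustration only: it uses `xml.etree.ElementTree` on un-namespaced markup instead of calibre's `XPath` and `XHTML` helpers, and the function name `linearize_tables` and the sample markup are made up for the example.

```python
import xml.etree.ElementTree as ET

TABLE_TAGS = {'table', 'td', 'tr', 'th', 'caption',
              'tbody', 'tfoot', 'thead', 'colgroup', 'col'}
LAYOUT_ATTRS = ('style', 'font', 'valign', 'colspan', 'width', 'height',
                'rowspan', 'summary', 'align', 'cellspacing', 'cellpadding',
                'frames', 'rules', 'border')


def linearize_tables(root):
    # Walk every element, turn table markup into plain divs, and drop the
    # presentational attributes from those elements.
    for el in root.iter():
        if el.tag in TABLE_TAGS:
            el.tag = 'div'
            for attr in LAYOUT_ATTRS:
                el.attrib.pop(attr, None)
    return root


# Made-up sample markup with a single-cell layout table.
html = '<body><table width="100%"><tr><td align="left">Story text</td></tr></table></body>'
result = ET.tostring(linearize_tables(ET.fromstring(html)), encoding='unicode')
print(result)
```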
Old 12-17-2009, 11:52 AM   #12
horsegoalie
Junior Member
horsegoalie began at the beginning.
 
Posts: 9
Karma: 10
Join Date: Dec 2009
Device: Nook
Just wanted to finish up here with the Globe RSS reader. I have this working now; the fix I mentioned above did work. The current version downloads a ton of feeds. I will probably break this into multiple books, but that is for later. Thanks for all the help.

Scott
Old 12-17-2009, 12:56 PM   #13
kiklop74
Guru
kiklop74 can program the VCR without an owner's manual.
Posts: 780
Karma: 194642
Join Date: Dec 2007
Location: Argentina
Device: Kindle PaperWhite, Motorola Xoom
Here is an updated and optimized recipe for boston.com that works for all feeds.
Attached Files
File Type: zip boston.com.zip (854 Bytes, 91 views)
Old 12-17-2009, 12:59 PM   #14
kiklop74
Guru
kiklop74 can program the VCR without an owner's manual.
Posts: 780
Karma: 194642
Join Date: Dec 2007
Location: Argentina
Device: Kindle PaperWhite, Motorola Xoom
Quote:
Originally Posted by kovidgoyal View Post
@darkom: That's pretty much what linearize_tables does currently

Code:
    def linearize(self, root):
        for x in XPath('//h:table|//h:td|//h:tr|//h:th|//h:caption|'
                '//h:tbody|//h:tfoot|//h:thead|//h:colgroup|//h:col')(root):
            x.tag = XHTML('div')
            for attr in ('style', 'font', 'valign',
                         'colspan', 'width', 'height',
                         'rowspan', 'summary', 'align',
                         'cellspacing', 'cellpadding',
                         'frames', 'rules', 'border'):
                if attr in x.attrib:
                    del x.attrib[attr]
Well, something is not being done right. For example, if you take the boston.com recipe I just posted (which has tables), remove keep_only_tags and add the linearize_tables option, you will see that the generated epub displays incorrectly in Adobe DE. However, if you add the table-removal code I posted, then the generated epub displays correctly in Adobe DE and on the Sony reader. I suggest you compare the output to see what the difference is and perhaps improve the code.
Old 12-17-2009, 06:56 PM   #15
kovidgoyal
creator of calibre
kovidgoyal ought to be getting tired of karma fortunes by now.
Posts: 25,622
Karma: 4998447
Join Date: Oct 2006
Location: Mumbai, India
Device: Various
I think it was caused by the fact that linearize_tables was running after the CSS flattening code, so some of the CSS was preserved (moved into a class) even though the attributes were deleted. Will be fixed in the next release.