Need Help with Recipe

UtahJames · 04-11-2011, 04:08 PM

Hello. I'm trying to get a recipe created for KSL. I need a bit of help and it looks like this is the spot that's giving me the trouble:

The CSS loads the page and each of the news items is under a div labeled
<div class="headlineQueueItem">. The problem is that each of the links to the actual news story uses the tiniest bit of Javascript via an anchor tag <a ...>. Please see below:

<a onclick="s_objectID='Latest Local News 1 title'" href="?nid=148&sid=15103589">Alta High assistant principal takes different job</a>

All this really does is take the part after the question mark in
href="?..." and paste it after the text 'http://www.ksl.com/index.php?', as shown below:
http://www.ksl.com/index.php?nid=148&sid=15103589

Is there any way to make Python turn href="?nid=148&sid=15103589" into href="http://www.ksl.com/index.php?nid=148&sid=15103589" so I don't get dead links when the recipe is downloaded?

Thanks in advance.

James

P.S. - Here's my recipe in case this is helpful:

Code:

from calibre.web.feeds.news import BasicNewsRecipe

class AdvancedUserRecipe1300058293(BasicNewsRecipe):
    title          = u'KSL'
    oldest_article = 1
    max_articles_per_feed = 20

    # No trailing comma here: a trailing comma would turn the dict into a one-element tuple.
    remove_tags_after  = dict(name='div',attrs={'id':'bodyCol1'})

    keep_only_tags = [dict(name='div',attrs={'id':'bodyBlock'})]
    remove_tags    = [
        dict(name='table',attrs={'class':'siteIndex'}),
        dict(name='div',attrs={'class':'roundColWide'}),
        dict(name='div',attrs={'id':'bodyCol2'}),
        dict(name='div',attrs={'id':'bodyCol3'}),
        dict(name='div',attrs={'class':'addthis_toolbox addthis_default_style'}),
        dict(name='embed',attrs={'id':'p1'}),
        dict(name=['script', 'noscript', 'style']),
        ]

    feeds          = [
        (u'Local News and Features', u'http://www.ksl.com/xml/148.rss'),
        (u'Consumer News', u'http://www.ksl.com/xml/172.rss'),
        ]

Starson17 · 04-12-2011, 09:50 AM

I'd use preprocess_regexps to change the <a> tag.

http://calibre-ebook.com/user_manual...rocess_regexps
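
To make that suggestion concrete, here is a minimal sketch of what such a rule could look like. preprocess_regexps takes a list of two-element tuples: a compiled regular expression and a callable that receives the match object and returns the replacement string. The base URL is an assumption taken from the example link in the first post, and apply_rules is only a hypothetical stand-in so the rule can be tried outside calibre; I have not run this against the live site.

```python
import re

# Assumed base URL, taken from the example link in the first post.
KSL_BASE = 'http://www.ksl.com/index.php'

# One (compiled pattern, replacement callable) pair, the shape that
# preprocess_regexps expects. The pattern only matches hrefs that
# start with "?", so absolute links are left alone.
fix_query_only_href = (
    re.compile(r'href="\?([^"]*)"'),
    lambda match: 'href="%s?%s"' % (KSL_BASE, match.group(1)),
)

# In a recipe this list would be a class attribute of the same name.
preprocess_regexps = [fix_query_only_href]


def apply_rules(html, rules):
    """Hypothetical stand-in for calibre: apply each rule in order."""
    for pattern, func in rules:
        html = pattern.sub(func, html)
    return html


sample = '''<a onclick="s_objectID='Latest Local News 1 title'" href="?nid=148&sid=15103589">Alta High assistant principal takes different job</a>'''
fixed = apply_rules(sample, preprocess_regexps)
print(fixed)
```

The pattern assumes double-quoted href values like the one quoted above; single-quoted or unquoted attributes would need a looser expression.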

Edit: Actually, if that didn't work, I'd switch to preprocess_html and run a regex on the <a> tag. I can't recall if preprocess_regexps runs early enough in the process.

And I'm not totally sure where your problem is - if it's in the RSS feed, then you'll need to work even earlier in the process. I'd do that by grabbing the feed page with parse_index and regex fixing the <a> links.
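
Since it isn't clear yet at which stage the bad links show up, a more general sketch may help: instead of hard-coding the "?" case, resolve every href against the page URL with urljoin from the standard library. The names PAGE_URL and absolutize_links are made up for this illustration, and the page URL is assumed from the example in the first post. The same helper could be run on the page text fetched in parse_index, or the urljoin call could be applied to each <a> tag in preprocess_html.

```python
import re
from urllib.parse import urljoin

# Assumed URL of the page the relative links appear on.
PAGE_URL = 'http://www.ksl.com/index.php'

# Matches double-quoted href attributes and captures the value.
HREF_RE = re.compile(r'href="([^"]*)"')


def absolutize_links(html, page_url=PAGE_URL):
    """Resolve every double-quoted href in html against page_url.

    Absolute links pass through unchanged; query-only links such as
    "?nid=..." keep the path of page_url and get the new query string.
    """
    return HREF_RE.sub(
        lambda m: 'href="%s"' % urljoin(page_url, m.group(1)), html)


html = ('<a href="?nid=148&sid=15103589">Story</a> '
        '<a href="http://www.ksl.com/xml/148.rss">Feed</a>')
result = absolutize_links(html)
print(result)
```

Using urljoin rather than string concatenation means other relative forms (for example a bare path) are handled by the same rule, at the cost of touching every link on the page.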

