Quote:
Originally Posted by somedayson
Getting even closer.
I can read all the articles now, but there's stuff before and after them that I'm picking up off the web site. I can't figure out how to
1. Get it to the print only page
2. Get the stuff at the beginning (really disruptive for reading) and the end (not as bad but would love to remove it)
Thanks for any assistance anyone can provide. I certainly wouldn't mind a little .rar pack with the answer in it either!
Grateful either way,
Matt
You said you were getting the print-only page, but I don't think you were actually getting the printer-friendly version for some reason. Anyway, what you need to do is something like this. I haven't fully tested it, but it should work.
Also, in the future, please wrap your code in spoiler and code tags. It makes things easier for all of us here.
Spoiler:
Code:
from calibre.web.feeds.news import BasicNewsRecipe


class AdvancedUserRecipe1282101454(BasicNewsRecipe):
    title = 'FW'
    language = 'en'
    __author__ = 'TonytheBookworm'
    description = 'FW'
    publisher = 'Tony'
    category = 'whateveryouwant'
    oldest_article = 1
    max_articles_per_feed = 100
    no_stylesheets = True
    # Strip the sidebar div from every downloaded page.
    remove_tags = [dict(name='div', attrs={'id': ['sidebar1']})]

    feeds = [
        (u'Opinion', u'http://journalgazette.net/apps/pbcs.dll/section?Category=EDIT&template=blogrss&mime=xml'),
        (u'Local News', u'http://journalgazette.net/apps/pbcs.dll/section?Category=LOCAL&template=blogrss&mime=xml'),
        (u'Sports', u'http://journalgazette.net/apps/pbcs.dll/section?Category=SPORTS&template=blogrss&mime=xml'),
        (u'Features', u'http://journalgazette.net/apps/pbcs.dll/section?Category=FEAT&template=blogrss&mime=xml'),
        (u'Business', u'http://journalgazette.net/apps/pbcs.dll/section?Category=BIZ&template=blogrss&mime=xml'),
        (u'Ice Chips', u'http://journalgazette.net/apps/pbcs.dll/section?Category=BLOGS11&template=blogrss&mime=xml'),
        (u'Entertainment', u'http://journalgazette.net/apps/pbcs.dll/section?Category=ENT&template=blogrss&mime=xml'),
        (u'Food', u'http://journalgazette.net/apps/pbcs.dll/section?Category=FOOD&template=blogrss&mime=xml'),
    ]

    def print_version(self, url):
        # Need to convert the article URL to its print version.
        # Original version is: http://www.journalgazette.net/article/20100905/EDIT10/309059959/1021/EDIT
        # Print version should be: http://www.journalgazette.net/apps/pbcs.dll/article?AID=/20100905/EDIT10/309059959/-1/EDIT01&template=printart
        # Result of the split:
        # [u'http:', u'', u'www.journalgazette.net', u'article', u'20100905', u'EDIT10', u'309059959', u'1021', u'EDIT']
        parts = url.split('/')
        print('THE SPLIT IS: %r' % (parts,))
        host = parts[2]        # www.journalgazette.net
        date = parts[4]        # 20100905
        section = parts[5]     # EDIT10
        article_id = parts[6]  # 309059959
        # The '/-1/EDIT01' suffix is taken from the example above and is applied to every feed as-is.
        print_url = ('http://' + host + '/apps/pbcs.dll/article?AID=/' + date + '/' +
                     section + '/' + article_id + '/-1/EDIT01&template=printart')
        print('THIS URL WILL PRINT: ' + print_url)  # test string to see which URL is returned
        return print_url
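
If you want to check the URL rewrite without running a whole download, here is a minimal standalone sketch of the same split-and-rebuild idea. The function name to_print_url and the section_suffix parameter are made up for this example (they are not part of calibre), and the 'EDIT01' default is only what the example URL above uses, so treat it as an assumption for the other feeds.

```python
def to_print_url(url, section_suffix='EDIT01'):
    """Rebuild a journalgazette.net article URL as its printer-friendly URL.

    Hypothetical helper for testing outside calibre. 'section_suffix' is an
    assumption taken from the single Opinion example in the recipe comments.
    """
    parts = url.split('/')
    # Expected shape: ['http:', '', host, 'article', date, section, id, ...]
    if len(parts) < 7 or parts[3] != 'article':
        # Not an article URL in the expected shape; hand it back unchanged.
        return url
    host, date, section, article_id = parts[2], parts[4], parts[5], parts[6]
    return ('http://' + host + '/apps/pbcs.dll/article?AID=/' + date + '/' +
            section + '/' + article_id + '/-1/' + section_suffix +
            '&template=printart')


if __name__ == '__main__':
    print(to_print_url(
        'http://www.journalgazette.net/article/20100905/EDIT10/309059959/1021/EDIT'))
```

Run it with a few real article links from the feeds and open the results in a browser. If a section needs a different suffix, that will show up here before you rebuild the recipe.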