Quote:
Originally Posted by Ekips
This is my first ever attempt at python so excuse the roughness.
I'm a beginner, too. Kovid's been riding herd on my efforts, but I'll see if I can help you.
Your recipe looks pretty good. Minor cleanup: You might want to change the def print_version to this:
Code:
def print_version(self, url):
    # str.replace() returns a new string, so assign each result back to url
    url = url.replace('?OTC-RSS&ATTR=News', '?print=yes')
    url = url.replace('?OTC-RSS&ATTR=Royals', '?print=yes')
    url = url.replace('?OTC-RSS&ATTR=Gizmo', '?print=yes')
    url = url.replace('?OTC-RSS&ATTR=Boxing', '?print=yes')
    url = url.replace('?OTC-RSS&ATTR=Cricket', '?print=yes')
    url = url.replace('?OTC-RSS&ATTR=Football', '?print=yes')
    url = url.replace('?OTC-RSS&ATTR=Rugby+Union', '?print=yes')
    url = url.replace('?OTC-RSS&ATTR=Tv', '?print=yes')
    url = url.replace('?OTC-RSS&ATTR=Bizarre', '?print=yes')
    url = url.replace('?OTC-RSS&ATTR=Usa', '?print=yes')
    url = url.replace('?OTC-RSS&ATTR=Film', '?print=yes')
    url = url.replace('?OTC-RSS&ATTR=HomePage', '?print=yes')
    return url
Python strings are immutable, so replace() doesn't change url in place; it returns a new string. Assign each result back to url, do the replacements sequentially in the body, and return url at the end, instead of chaining every modification into the return line.
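If the list of feeds keeps growing, a single regular expression might be easier to maintain than one replace() per feed. This is only a sketch: it assumes every article link ends with a ?OTC-RSS&ATTR=<feed name> query like the ones listed above, the example.com address is a placeholder, and to_print_url is a made-up helper name, not something calibre provides.

```python
import re

# Assumption: every feed link ends with ?OTC-RSS&ATTR=<feed name>.
_FEED_QUERY = re.compile(r'\?OTC-RSS&ATTR=[^&#]*$')

def to_print_url(url):
    # sub() returns a new string; a url without the feed query comes back unchanged.
    return _FEED_QUERY.sub('?print=yes', url)

# Placeholder address, only to show the substitution.
print(to_print_url('http://example.com/article.html?OTC-RSS&ATTR=Rugby+Union'))
```

Inside the recipe, print_version(self, url) could then simply return to_print_url(url).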
I ran the recipe in test mode, so I only pulled two feeds with two articles each. I didn't see any references to Flash. I did see some leftover "Advertisement" text and some "Add a Comment" links. Can you tell me exactly which feed/article you want help with?
Add this to your remove_tags to kill the "Add a Comment" links:
Code:
,dict(name='a', attrs={'class':'add_a_comment'})
Do you know the best way to find these?
Use Firefox,
install the Firebug add-on,
open the page you're having trouble with,
find the item you want to remove on the original page (CTRL-F),
right click that item and select "Inspect Element"
Firebug shows you the tag name and the id or class of the element.
Then just put that into your remove_tags list.
The "Add a Comment" junk was in an <a> tag with id='addComment' and class='add_a_comment'. You could remove it by referring to either the id or the class.
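To make the "either the id or the class" point concrete, here is roughly what the two variants might look like. It's a sketch using only the tag name, id, and class mentioned above; in a real recipe you would keep a single remove_tags list in the class body, and remove_tags_by_id is just a name I made up to show the second variant side by side.

```python
# Variant 1: match the link by its class.
remove_tags = [
    dict(name='a', attrs={'class': 'add_a_comment'}),
]

# Variant 2 (hypothetical name, for comparison only): match the same link by its id.
remove_tags_by_id = [
    dict(name='a', attrs={'id': 'addComment'}),
]
```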
Also, you can condense your 3 removes into one. Here is the line:
Code:
dict(name='div', attrs={'class':['slideshow','float-left','ltbx-slideshow ltbx-btn-ss']})
The 3 keeps can be condensed the same way.
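If it helps to see why the condensed form is equivalent, the small helper below merges separate single-class entries into one entry with a class list. It's only an illustration: merge_class_rules is a hypothetical helper (not part of calibre), and the three separate entries are my guess at what the original removes looked like, reconstructed from the condensed line above.

```python
def merge_class_rules(rules):
    """Combine rules that share a tag name into one rule with a class list."""
    merged = {}
    for rule in rules:
        classes = rule['attrs']['class']
        if isinstance(classes, str):
            classes = [classes]  # normalise a single class to a list
        merged.setdefault(rule['name'], []).extend(classes)
    return [dict(name=name, attrs={'class': classes})
            for name, classes in merged.items()]

# Guess at the three separate entries (an assumption, see above).
separate = [
    dict(name='div', attrs={'class': 'slideshow'}),
    dict(name='div', attrs={'class': 'float-left'}),
    dict(name='div', attrs={'class': 'ltbx-slideshow ltbx-btn-ss'}),
]

print(merge_class_rules(separate))
```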
Last comment - I usually add "remove_javascript = True" unless there's some reason not to use it.