Quote:
Originally Posted by schnortz
The recipe I am using is the following (modified with your suggested change, even if it was unsuccessful). Hope I'm not violating etiquette by posting the code.
|
It's not an etiquette violation to post it; that's what this thread is for. However, if you post it again with code tags around it (the hash # button), the indents will be preserved and I'll test it for you.
If you really want to be nice to the thread, also wrap it in the spoiler tag (the eye with an X), which collapses it so it takes less space.
Quote:
And as requested... here is a link
|
I looked at the links. Your preprocess_regexps looks basically correct now, but it differs in some minor ways from how I normally use it. That may be the problem, or it may be caused by the browser. I'm not sure how you viewed the source, but a browser sometimes changes things slightly, so your pattern no longer matches. If you have a problem, the best way to check is to print the soup in postprocess_html. I'll test that if you post your recipe with the indents in code tags.
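To show what I mean by printing the soup, here is a rough sketch, not your recipe. The class name, title, and the "promo" pattern are made up for illustration; only preprocess_regexps and postprocess_html are the real recipe hooks. The try/except is there just so the snippet can be read and run outside calibre.

```python
import re

try:
    from calibre.web.feeds.news import BasicNewsRecipe
except ImportError:
    # Stand-in base class so this sketch also runs outside calibre.
    class BasicNewsRecipe(object):
        pass


class DebugSketch(BasicNewsRecipe):
    # Hypothetical title, for illustration only.
    title = 'Debug sketch'

    # Each entry is (compiled pattern, function returning the replacement).
    # The "promo" div is an invented example, not taken from any real site.
    preprocess_regexps = [
        (re.compile(r'<div class="promo">.*?</div>',
                    re.DOTALL | re.IGNORECASE),
         lambda match: ''),
    ]

    def postprocess_html(self, soup, first_fetch):
        # Print what calibre actually ended up with, so you can compare it
        # with what your browser showed you.
        print(soup)
        return soup
```

If the text you expected to be removed still shows up in the printed output, the pattern did not match the HTML calibre received. Running the recipe with something like `ebook-convert myrecipe.recipe .epub --test -vv` should keep the download small and show the output in the log.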
As for the "Photo" issue: you want to skip articles that have that text in the link. I only know one way to do that; perhaps someone else knows another. Basically, I know two ways to follow articles: follow all the links in the automatically parsed feed, or build your own feed (without the Photo links) with parse_index and then follow all of those links.
If there's another way to follow some links but not others, I don't know it. As I posted, I had hoped at one time that filter_regexps would do that job, but I never got it to work. I suspect that it only works on recursed links, not the main article link.
Do you want details on how to use parse_index? Either way, you should start here.
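In the meantime, here is a minimal sketch of the parse_index idea, under some assumptions: the index URL, feed title, and the helper names keep_article and build_articles are hypothetical, and I'm guessing that "Photo" could appear in either the link text or the URL, so it checks both. The parse_index, index_to_soup, and tag_to_string names are the real recipe methods.

```python
try:
    from calibre.web.feeds.news import BasicNewsRecipe
except ImportError:
    # Stand-in base class so this sketch also runs outside calibre.
    class BasicNewsRecipe(object):
        pass


def keep_article(title, url):
    # Assumption: a "Photo" link has that word in its text or its URL.
    return 'photo' not in title.lower() and 'photo' not in url.lower()


def build_articles(links):
    # links: iterable of (title, url) pairs found on the index page.
    articles = []
    for title, url in links:
        title = title.strip()
        if title and keep_article(title, url):
            articles.append({'title': title, 'url': url,
                             'description': '', 'date': ''})
    return articles


class NoPhotoSketch(BasicNewsRecipe):
    title = 'No-photo sketch'            # hypothetical
    INDEX = 'https://example.com/news'   # hypothetical index page

    def parse_index(self):
        soup = self.index_to_soup(self.INDEX)
        links = [(self.tag_to_string(a), a['href'])
                 for a in soup.findAll('a', href=True)]
        # Return a list of (feed title, list of article dicts).
        return [('News', build_articles(links))]
```

For a real site you would narrow the findAll to the part of the page that holds the article list, and turn relative hrefs into full URLs before returning them. calibre then follows only the links you put in the feed, which is how the Photo ones get skipped.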