Quote:
Originally Posted by TonytheBookworm
1) For whatever reason I always get a full run of the whole page as an article. I'm not sure why, unless it searches for artIntroShort and then the <a> tags and doesn't find any (the webmaster isn't consistent). My guess (I can't seem to find it in my output log) is that link['href'] somehow ends up being None, so the URL ends up just being the index.
|
You've probably got too many print statements in there. You do realize they are only there for debugging, right? Just comment out the ones you are not interested in, and add more where you need them, until you find your problem.
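If your guess about the missing href is right, one way to confirm it is to guard the URL before you build the article entry and print only when a link gets skipped. This is just a sketch in plain Python, not your recipe: INDEX_URL, resolve_article_url and collect_articles are names I made up, and the (title, href) pairs stand in for whatever your link lookup returns.

```python
from urllib.parse import urljoin

INDEX_URL = "http://example.com/news/"  # hypothetical index page


def resolve_article_url(href, index_url=INDEX_URL):
    """Return an absolute article URL, or None if the link is unusable."""
    if not href:  # covers both None and an empty string
        return None
    url = urljoin(index_url, href)
    # A link that resolves back to the index itself is not an article.
    if url.rstrip("/") == index_url.rstrip("/"):
        return None
    return url


def collect_articles(links, index_url=INDEX_URL):
    """Build article dicts from (title, href) pairs, skipping bad links."""
    articles = []
    for title, href in links:
        url = resolve_article_url(href, index_url)
        if url is None:
            # The one debug print worth keeping while you hunt this down.
            print("skipping link with no usable href:", title)
            continue
        articles.append({"title": title, "url": url})
    return articles
```

If the skip message shows up in your log, you know the index is being fetched because of a bad link rather than something else.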
Quote:
2) This one is really the one that is puzzling me the most. I also see that the person who asked for help on this recipe faced a similar problem with the XML (that is why I didn't use the feed and was trying this method to get the thumbnails). For some reason the thumbnails don't come through. I looked in Firebug and they appear to be wrapped inside the mainContent tag. I even went as far as commenting out keep_only_tags and was faced with the same results.
|
I briefly looked at someone's question about missing thumbnail images. I can't tell you (yet) what's going on, but here's my process:
1) If something isn't appearing, make sure your own keep_only_tags or remove_tags aren't stripping it. Try to get it to appear along with all the other junk first.
2) Maybe it's being removed when the scripts are stripped. Look at the page source to see. Try leaving scripts on in your test recipe (remove_javascript = False).
3) If it still looks like the item should be picked up, sometimes the site is protecting the image from scraping. You may need to send the correct user agent, the correct cookie, the correct Referer header, etc. Firefox and Tamper Data help here. There are techniques for simulating each of these. I try to get Firefox to act like Calibre (or vice versa) to verify.
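For steps 1 and 2, it can help to check the raw page source outside the recipe and see where the images actually sit. Here is a rough sketch using only the standard library. ImageLocator is a name I made up, and the sample markup (mainContent as an id, artIntroShort as a class) is an assumption, so paste in the real page source instead.

```python
from html.parser import HTMLParser

# Elements that never get a closing tag, so they are not pushed on the stack.
VOID_TAGS = {"img", "br", "hr", "meta", "link", "input", "area", "base",
             "col", "embed", "source", "track", "wbr"}


class ImageLocator(HTMLParser):
    """Record each <img> src together with the labels of its ancestors."""

    def __init__(self):
        super().__init__()
        self.stack = []   # (tag, label) for every currently open element
        self.images = []  # (src, [ancestor labels]) for every <img> seen

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "img":
            labels = [label for _, label in self.stack]
            self.images.append((attrs.get("src"), labels))
        if tag not in VOID_TAGS:
            label = tag
            if attrs.get("id"):
                label += "#" + attrs["id"]
            if attrs.get("class"):
                label += "." + attrs["class"]
            self.stack.append((tag, label))

    def handle_endtag(self, tag):
        # Pop back to the nearest matching open tag, tolerating sloppy markup.
        for i in range(len(self.stack) - 1, -1, -1):
            if self.stack[i][0] == tag:
                del self.stack[i:]
                break


# Assumed sample markup; replace with the real page source.
sample = ('<div id="mainContent"><p class="artIntroShort">'
          '<img src="thumb.jpg"></p></div>'
          '<script>document.write("<img src=\'x.jpg\'>")</script>')

locator = ImageLocator()
locator.feed(sample)
for src, ancestors in locator.images:
    print(src, "inside", " > ".join(ancestors))
```

An image that only exists inside a script's text is not reported, because the parser treats script contents as plain data. If the thumbnail you see in the browser is missing from this list, that points at step 2 rather than at your tag filters.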
The bottom line is that if Firefox can see it, so can your recipe.
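To show what "acting like the browser" means in practice, here is a minimal sketch of a request that carries the same user agent, Referer and cookie that Tamper Data shows Firefox sending. It uses plain urllib rather than the recipe's own browser, and build_image_request, the URLs and the header values are all placeholders I made up. In a recipe, the equivalent change would go wherever you customize the browser (get_browser, as far as I know).

```python
import urllib.request


def build_image_request(image_url, page_url, user_agent, cookie=None):
    """Build a request that mimics the headers a browser sent for an image."""
    headers = {
        "User-Agent": user_agent,  # copy the string Firefox reports
        "Referer": page_url,       # the article page that embeds the image
    }
    if cookie:
        headers["Cookie"] = cookie  # only if the site sets one you need
    return urllib.request.Request(image_url, headers=headers)


# Placeholder values; substitute what Tamper Data shows for the real site.
req = build_image_request(
    "http://example.com/images/thumb.jpg",
    "http://example.com/news/story1.html",
    "Mozilla/5.0 (placeholder user agent)",
    cookie="session=placeholder",
)
# urllib.request.urlopen(req) would then fetch the image with these headers.
```

Change one header at a time until the image comes back, and you will know which one the site is checking.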