10-03-2010, 09:57 AM   #7
kinurev
Junior Member
kinurev began at the beginning.
 
Posts: 3
Karma: 10
Join Date: Sep 2010
Location: Brisbane, AU
Device: Kindle
Thanks, TonytheBookworm, for helping me with this. The script seems to work for now but, like you said, just when I needed a URL with double quotes to test with, I could not find one.

Well, good news, while writing this I found out that the link

http://tuoitre.vn/Chinh-tri-Xa-hoi/4...-cay-canh.html

and the link

http://tuoitre.vn/Chinh-tri-Xa-hoi/403734/Kiem-lam-va-cong-an-"canh-giu"-doan-xe-tai-cho-cay-canh.html

both worked in my browser (Chrome), and that the script worked fine with or without the code you suggested. It seems that the problem solved itself (hopefully for good). I honestly don't know how that happened, but thanks a lot for your help anyway. I'll still keep your code in the script, just in case.


@Mike L: thanks for your suggestion as well, but I have very little knowledge of Python, so I just don't know how to use &quot;.


Quote:
Originally Posted by TonytheBookworm
Maybe Kovid or Starson or someone else will chime in and answer this for you and me. I don't see why the below wouldn't work, but that's not saying it does, either.

Code:
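def preprocess_html(self, soup):
    # Only visit anchors that actually have an href, to avoid a KeyError.
    for a in soup.findAll('a', href=True):
        # str.replace matches literal text (not a regex), so search for the bare quote.
        a['href'] = a['href'].replace('"', '%22')
    return soup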


Basically, the above SHOULD look for all anchor tags (links) in your soup and then look for every instance of " inside the href attribute. If it finds one, it replaces it with %22, which is the URL (percent) encoding for a double quote. Again, this may not work: I didn't have anything to test it on other than your recipe, and that didn't generate any links with " in them, so I wasn't really able to test it. Give it a shot and see what happens for you.
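For anyone who wants to try the replacement without running a whole recipe, here is a minimal standalone sketch (plain Python 3, no calibre or BeautifulSoup needed). The helper name escape_double_quotes is made up for illustration; the URL is the one from this thread. It is only meant to show that str.replace matches literal text, so a pattern such as r'(")' would look for the three characters (") and most likely leave the link untouched, whereas replacing the bare quote does what was intended.

```python
from urllib.parse import quote


def escape_double_quotes(href):
    """Hypothetical helper: percent-encode every literal double quote in a URL."""
    return href.replace('"', '%22')


url = ('http://tuoitre.vn/Chinh-tri-Xa-hoi/403734/'
       'Kiem-lam-va-cong-an-"canh-giu"-doan-xe-tai-cho-cay-canh.html')

# str.replace is not a regex: r'(")' means the literal text (") with the
# parentheses included, which does not occur in this URL, so nothing changes.
unchanged = url.replace(r'(")', '%22')

# Replacing the bare double quote is what actually rewrites the link.
fixed = escape_double_quotes(url)

# %22 is simply the percent-encoding of the double quote character.
same_encoding = quote('"') == '%22'

print(unchanged == url)
print(fixed)
print(same_encoding)
```

Inside a recipe the same one-line replace would go on a['href'] in preprocess_html; whether that is needed at all depends on how the site emits its links, which I have not been able to confirm.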