Quote:
Originally Posted by kovidgoyal
Stick some print statements into fetch_url to debug the session. Also try customizing get_browser to disable cookies, handle refreshes, etc.
I'm not sure if you saw my edit - the 0A character problem. I hadn't thought of debugging fetch_url. I was going down the road of trying to use preprocess_regexps.
If I understand it correctly, that will let me match and replace some portion of the fetched HTML page before it gets processed. I was thinking I could just remove the 0A character that was causing the problem. (I also have some other uses for regex search-and-replace on the HTML.)
However, the API documentation uses re.compile to compile the regex, so I think I need to import re. Would this approach work, and if so, where do I import re from?
Edit:
OK, I should learn to think before typing. I solved it (with your help). The import format was easy to find: I searched for where you used re.compile and found that the answer was just 'import re'.
The print statement in fetch_url was absolutely vital to let me see that the fetch was getting a '\n' at the broken link point. I was able to remove that char with preprocess_regexps.
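For anyone hitting the same problem, here is a minimal sketch of the idea, not the exact recipe code. It assumes the stray '\n' landed inside a quoted href value, and the sample URL is made up. The list has the same shape as preprocess_regexps (pairs of compiled pattern and replacement function); apply_preprocess_regexps is a hypothetical stand-in so the snippet runs without calibre.

```python
import re

# Same shape as a recipe's preprocess_regexps:
# a list of (compiled pattern, replacement function) pairs.
preprocess_regexps = [
    # Assumption: the 0x0A byte sits inside a quoted href value.
    # Capture the text on either side of the newline and join the two halves.
    (re.compile(r'(href="[^"\n]*)\n([^"\n]*")'),
     lambda match: match.group(1) + match.group(2)),
]


def apply_preprocess_regexps(html, rules):
    """Hypothetical helper: run each (pattern, function) pair over the HTML."""
    for pattern, replacement in rules:
        html = pattern.sub(replacement, html)
    return html


# Made-up sample input with a newline splitting the link.
broken = '<a href="http://example.com/sto\nry.html">Story</a>'
fixed = apply_preprocess_regexps(broken, preprocess_regexps)
print(repr(fixed))
```

This pattern only repairs one newline per link. In a real recipe the list would simply be assigned as a class attribute, with 'import re' at the top of the recipe file.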
Thanks for the help!