Quote:
Originally Posted by Starson17
OK, your problem was so interesting, I couldn't resist looking at it further. The problem is the bad HTML in your source page http://www.51voa.com/. Each opening tag ends with ' />' instead of ' >', so each tag gets closed twice. BeautifulSoup is confused by this and sees the entire page as a single [document] element holding a single NavigableString of text, not as multiple tags, so none of the tag-based search or manipulation commands will work. There are no tags for BeautifulSoup to find or work with.
To fix this, you first grab that page (as you have already done in your code):
Code:
soup = self.index_to_soup('http://www.51voa.com/')
Then, grab the string that is in the contents of the big single [document] element and search and replace the bad closing brackets as follows:
Code:
rawc = soup.contents[0].string.replace(' />', ' >')
Now it's fixed, but it's still text, so you next convert the string back into a BeautifulSoup object:
Code:
soup = BeautifulSoup(rawc, fromEncoding=self.encoding)
(Also, add
Code:
from calibre.ebooks.BeautifulSoup import BeautifulSoup
to your recipe)
That's it! Put the two extra lines above after your first index_to_soup line. Be aware that any legitimate self-closing tags, such as <img> or <br>, will get mangled by the simple search and replace above. You may have to special-case any tags that are allowed to have a closing slash inside the opening tag so they don't get mangled.
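If the mangled <img> or <br> tags turn out to matter, one possible way to special-case them is a small regex pass instead of the plain replace. This is only a rough sketch: the function name fix_closing_brackets and the VOID_TAGS set are made up for illustration (they are not part of calibre or BeautifulSoup), and the list of tags to leave alone is an assumption you would adjust for the page.

```python
import re

# Tags allowed to keep their '/>' ending. This set is an assumption for
# illustration only; extend or trim it to suit the page being fixed.
VOID_TAGS = {'img', 'br', 'hr', 'input', 'meta', 'link'}

# Matches an opening tag that ends in '/>': group 1 is the tag name,
# group 2 is everything between the name and the closing slash.
_SELF_CLOSED = re.compile(r'<([A-Za-z][A-Za-z0-9]*)\b([^<>]*?)\s*/>')

def fix_closing_brackets(raw):
    """Rewrite '<tag ... />' as '<tag ...>' unless tag is in VOID_TAGS."""
    def repl(match):
        name = match.group(1)
        if name.lower() in VOID_TAGS:
            return match.group(0)  # legitimate self-closing tag, keep as-is
        return '<%s%s>' % (name, match.group(2))
    return _SELF_CLOSED.sub(repl, raw)
```

In the recipe, this would stand in for the simple replace, for example rawc = fix_closing_brackets(soup.contents[0].string). Unlike the plain replace, it drops the space before the closing bracket, which should make no difference to the parser.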
Edit: I forgot, you also need this line:
Code:
encoding = 'utf-8'
or else the final step will fail.
|
Thanks very much. I read a lot about BeautifulSoup after reading your post and tried several times, but failed. In the end I want to try BeautifulSoup version 3.07 to avoid the malformed closing tags.
You saved me a lot of trouble.