06-02-2011, 12:10 PM   #2
Starson17
For Arcamax comics, you can try:
Code:
    remove_tags    = [dict(name='a', attrs={'class':'author bio'})]
You might as well remove the entire <a> tag, since nothing is left if you remove only the part you marked in red.

For GoComics, I'll make some suggestions. The BeautifulSoup documentation is where to look to see how to do this. You could remove all span tags, but that's probably too aggressive. Instead, you can try removing only the first span tag in each h1 tag. preprocess_html is the usual place for that kind of change.
Code:
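    def preprocess_html(self, soup):
        # Drop the first span inside each h1 (assumed to hold the byline).
        for h1 in soup.findAll('h1'):
            span = h1.find('span')
            if span:
                span.extract()
        # calibre uses the soup returned from preprocess_html
        return soup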
That's untested code.

Last, you can try using regular expressions in your remove_tags to remove any span tag that contains " by ".
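I haven't tested the regex idea inside remove_tags itself, so here is a rough sketch of the same matching done on the raw HTML instead, using the (pattern, callable) shape that calibre's preprocess_regexps takes. The sample markup, the BY_SPAN name and the apply_rules helper are made up for illustration. The real GoComics markup may differ, and the pattern assumes the span has no nested tags.

```python
import re

# Hypothetical sample markup; the real page structure is an assumption.
SAMPLE_HTML = ('<h1>Calvin and Hobbes<span> by Bill Watterson</span></h1>'
               '<p><span>keep me</span></p>')

# Match a span whose text contains the word "by" with whitespace on both
# sides. Assumes the span holds plain text only (no nested tags).
BY_SPAN = re.compile(r'<span[^>]*>[^<]*\sby\s[^<]*</span>', re.IGNORECASE)

# Same shape as a calibre preprocess_regexps entry:
# (compiled pattern, callable taking the match and returning replacement text)
preprocess_regexps = [(BY_SPAN, lambda match: '')]


def apply_rules(html, rules):
    """Illustrative helper: apply each (pattern, callable) rule in order."""
    for pattern, repl in rules:
        html = pattern.sub(repl, html)
    return html


cleaned = apply_rules(SAMPLE_HTML, preprocess_regexps)
print(cleaned)
```

In a recipe you would only need the preprocess_regexps line as a class attribute; the helper is there just so the pattern can be tried outside calibre.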

Here's some working code you can look over. It searches the soup for br tags and removes any br whose previous and next sibling tags are both br tags as well.

Code:
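    def preprocess_html(self, soup):
        for br in soup.findAll('br'):
            # Nearest sibling tags on either side (True matches any tag)
            prev_tag = br.findPreviousSibling(True)
            if hasattr(prev_tag, 'name') and prev_tag.name == 'br':
                next_tag = br.findNextSibling(True)
                if hasattr(next_tag, 'name') and next_tag.name == 'br':
                    # br sits between two other br tags, so remove it
                    br.extract()
        # calibre expects the processed soup back
        return soup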