12-22-2011, 02:26 PM | #1 |
Member
Posts: 19
Karma: 10
Join Date: Feb 2011
Device: kindle
|
Having problems using h1 tag since website changed
I've been using a recipe to parse jg-tc.com for a while now. Here are the tags I'm keeping/removing:
Code:
keep_only_tags = [
    dict(name='h1'),
    dict(name='span', attrs={'class':'updated'}),
    dict(name='span', attrs={'class':'fn'}),
    dict(name='img', attrs={'id':'img-holder'}),
    dict(name='span', attrs={'id':'gallery-cutline'}),
    dict(name='div', attrs={'id':'blox-story-text'})
]
remove_tags = [
    dict(name='a')
]
The problem is: they changed something with a Facebook login or something, so now the titles of all the stories read like:
Login or Title
Would somebody help me try to pull the title differently, or remove this "Login or" tag or text after the fact? I've tried several things myself and I'm struggling. Thanks!
Here's a sample webpage: http://jg-tc.com/news/local/school-a...871e3ce6c.html |
12-24-2011, 05:07 PM | #2 |
Member
Posts: 19
Karma: 10
Join Date: Feb 2011
Device: kindle
|
I've worked on this thing for a long time now and I can't seem to figure it out. There's a stupid "facebox" tag that's killing me. I would like to keep the h1 tag inside of the blox-story.
This seems like it would work, but it doesn't...
Code:
keep_only_tags = dict(id='blox-story', attrs={'name':'h1'}),
I've tried probably 100 things: keeping tags before, keeping tags after, trying to use the "meta title" instead of the h1, and I feel like I've exhausted my very limited knowledge of how this thing works. I know what I want it to do, I just don't know how to tell it to do it. Please do help! |
12-25-2011, 01:45 PM | #3 |
doofus
Posts: 2,507
Karma: 12615905
Join Date: Sep 2010
Device: Kobo Libra 2, Kindle Voyage
|
I'm not sure what you are trying to accomplish. My guess is the problem is you're trying to keep a tag inside of another tag you are not keeping.
For example, say you have something like
<div id="main">
    <div id="inner">
        <h1>xxxxx</h1>
    </div>
    <div> .... </div>
</div>
You can't just keep the H1; you need to keep the div id="main". In general, you want to keep the innermost container element that holds all the contents you want to keep, like [dict(id='main')], and use remove_tags to trim stuff you don't want. |
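A minimal sketch of that advice, written as the recipe settings it would translate to. The id 'main' comes from the example above; the 'sidebar' class is a hypothetical stand-in for whatever needs trimming and is not taken from the actual site.

```python
# Recipe-level settings. In a real recipe these would be class
# attributes of a BasicNewsRecipe subclass.

# Keep the innermost container that holds everything wanted,
# here the <div id="main"> from the example.
keep_only_tags = [
    dict(id='main'),
]

# Then trim unwanted pieces from inside that container.
# 'sidebar' is a hypothetical class name used only for illustration.
remove_tags = [
    dict(name='div', attrs={'class': 'sidebar'}),
]
```

The h1 survives because it sits inside the kept container, so it never has to be listed on its own.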
12-26-2011, 12:27 AM | #4 |
Member
Posts: 19
Karma: 10
Join Date: Feb 2011
Device: kindle
|
Barty,
Thanks for the help. After some work, I've been able to reorganize my code to make it do what I like, like you said. The only problem now is I've lost my images. The picture is in an img tag, which is buried inside of a link. Since I'm batch removing all links, I am not sure what to do. Here's what I have. Code:
remove_tags_before = dict(name='div', attrs={'id':'blox-left-col'})
remove_tags_after = dict(name='div', attrs={'id':'blox-left-col'})
keep_only_tags = [
    # dict(name='div', attrs={'id':'blox-left-col'}),
    # dict(name='span', attrs={'class':'updated'}),
    # dict(name='span', attrs={'class':'fn'}),
    # dict(name='img', attrs={'id':'img-holder'}),
    # dict(name='span', attrs={'id':'gallery-cutline'}),
    # dict(name='div', attrs={'id':'blox-story-text'})
]
remove_tags = [
    dict(name='a'),
    dict(name='ul', attrs={'id':'blox-body-nav'}),
    dict(name='p', attrs={'class':'story-keywords moz-border'}),
    dict(name='div', attrs={'class':'clear'}),
    dict(name='div', attrs={'class':'hide'}),
    dict(name='p', attrs={'id':'story-tools'}),
    dict(name='div', attrs={'id':'latest-by-section'}),
    dict(name='span', attrs={'class':'bookmark hide'}),
    dict(name='span', attrs={'class':'hide source-org vcard'}),
    dict(name='dl', attrs={'id':'story-font-size'}),
    dict(name='div', attrs={'class':'article-share-top'}),
    dict(name='span', attrs={'id':'pictopiaURL'}),
    dict(name='span', attrs={'id':'siteHost'}),
    dict(name='span', attrs={'id':'mycaptureURL'}),
    dict(name='span', attrs={'id':'mycapturePricingSheet'}),
    dict(name='div', attrs={'class':'photo-cutline'}),
    dict(name='div', attrs={'class':'blox-thumb-container'})
]

<a href="http://bloximages.chicago2.vip.townnews.com/jg-tc.com/content/tncms/assets/v3/editorial/1/93/19381b50-904a-5d15-91d0-9183c727b977/4eee65251d356.image.jpg" rel="facebox">
    <img id="img-holder" src="http://bloximages.chicago2.vip.townnews.com/jg-tc.com/content/tncms/assets/v3/editorial/1/93/19381b50-904a-5d15-91d0-9183c727b977/4eee6525d4f59.preview-300.jpg" alt=" " width="300px"/>
</a>

I used to just pull the "img-holder" tag out by itself, but I can't do that now. Any ideas? |
12-26-2011, 12:21 PM | #5 |
doofus
Posts: 2,507
Karma: 12615905
Join Date: Sep 2010
Device: Kobo Libra 2, Kindle Voyage
|
You could probably do it with preprocess_regexps, but why? Are you sure you want to remove all links? You're going to get missing text, e.g.,
As we (link)argued in this column last month(link), the current situation is...
becomes
As we, the current situation is...
If you want to remove certain links, then target them, for example
remove_tags = [
    dict(name='a', attrs={'href':re.compile(r'doubleclick\.net', re.I)})
]
to remove doubleclick links (this needs import re at the top of the recipe). |
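A small sketch of the targeted approach, runnable outside calibre so the pattern can be checked before it goes into a recipe. The two URLs in the check are made-up examples, and the variable names are only illustrative.

```python
import re

# Match only links whose href points at doubleclick.net (case-insensitive),
# instead of removing every <a> tag.
doubleclick = re.compile(r'doubleclick\.net', re.I)

remove_tags = [
    dict(name='a', attrs={'href': doubleclick}),
]

# Quick check of which hrefs the pattern would catch.
# Both URLs below are invented for the test.
ad_href = 'http://ad.DoubleClick.net/click?id=123'
story_href = 'http://jg-tc.com/news/local/'
print(bool(doubleclick.search(ad_href)))     # True
print(bool(doubleclick.search(story_href)))  # False
```

Ordinary in-text links are left alone this way, so sentences like the one quoted above keep their words.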
12-28-2011, 03:25 PM | #6 | |
Member
Posts: 19
Karma: 10
Join Date: Feb 2011
Device: kindle
|
Quote:
Code:
<a href="http://bloximages.chicago2.vip.townnews.com/jg-tc.com/content/tncms/assets/v3/editorial/0/c1/0c16b29b-e8fc-55a6-8d20-f9ba420f8230/4ef38d670cd77.image.jpg" rel="facebox">
    <img id="img-holder" src="http://bloximages.chicago2.vip.townnews.com/jg-tc.com/content/tncms/assets/v3/editorial/0/c1/0c16b29b-e8fc-55a6-8d20-f9ba420f8230/4ef38d679d43b.preview-300.jpg" alt=" " width="300px">
</a> |
|
12-29-2011, 06:59 PM | #7 |
Junior Member
Posts: 3
Karma: 10
Join Date: Dec 2011
Device: Kindle
|
Hmmm.. You might want to try something like the following, in preprocess_html:
for a in soup.findAll('a'):
    img = a.find('img')
    if img is not None:
        a.replaceWith(img)
    else:
        a.extract()
Haven't tried it, but I'm thinking that it should replace the relevant <a> tags with the straight embedded <img> tags, and delete all the other <a> tags. Give it a shot and let me know if it works....
Last edited by vtblogger; 12-29-2011 at 07:00 PM. Reason: typo |
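For anyone who wants to try the loop above outside calibre first, here is a hedged standalone sketch. It assumes the bs4 package is installed and uses its method names (find_all, replace_with); as far as I know, the findAll and replaceWith spellings in the post are the older BeautifulSoup 3 names for the same operations. The function names and the sample HTML are made up for the test.

```python
def links_to_images(soup):
    """Replace each <a> that wraps an <img> with the bare <img>;
    drop every other <a> together with its text."""
    # find_all returns a list, so editing the tree inside the loop is safe.
    for a in soup.find_all('a'):
        img = a.find('img')
        if img is not None:
            a.replace_with(img)   # keep the picture, lose the link wrapper
        else:
            a.extract()           # no picture inside: remove the whole link
    return soup


def demo():
    # Assumption: bs4 is available for trying this outside calibre.
    from bs4 import BeautifulSoup
    # Invented markup, loosely shaped like the snippet quoted in the thread.
    html = ('<div><a href="big.jpg" rel="facebox">'
            '<img id="img-holder" src="small.jpg"/></a>'
            '<a href="/other">other link</a> story text</div>')
    soup = links_to_images(BeautifulSoup(html, 'html.parser'))
    return str(soup)
```

In a recipe, the body of links_to_images would sit inside preprocess_html(self, soup), which should return the soup when it is done.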
01-03-2012, 11:36 PM | #8 |
Member
Posts: 19
Karma: 10
Join Date: Feb 2011
Device: kindle
|
vtblogger,
That worked very well, thank you! The last problem I'm running into is I have one more link that I want to keep. I think it should be easy enough using an elseif or something, but I seriously struggle with this stuff. I appreciate your help this far. Inside of this mess, I would like to keep the text of the author's name. Before, I was just keeping that span.
keep_only_tags ... dict(name='span', attrs={'class':'fn'}),
Now, this gets deleted with the a.extract()
Code:
<a href="/search/?l=50&sd=desc&s=start_time&f=html&byline=By KURT ERICKSON, JG-TC Springfield Bureau">
    <span class="author vcard"><span class="fn">By KURT ERICKSON, JG-TC Springfield Bureau</span></span>
</a>
<span class="hide source-org vcard"><span class="org fn">JG-TC.com</span></span>

for a in soup.findAll('a'):
    img = a.find('img')
    fn = a.find('fn')
    if img is not None:
        a.replaceWith(img)
    else:
        if fn is not None:
            a.replaceWith(fn)
        else:
            a.extract()
|
01-06-2012, 11:05 AM | #9 |
Junior Member
Posts: 3
Karma: 10
Join Date: Dec 2011
Device: Kindle
|
Glad I could help.
You're almost there with your next hurdle. Try this:
for a in soup.findAll('a'):
    img = a.find('img')
    if img is not None:
        a.replaceWith(img)
    else:
        fn = a.find('span', attrs={'class':'fn'})
        if fn is not None:
            a.replaceWith(fn)
        else:
            a.extract()
|
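The same logic can be flattened with elif (Python's spelling of "elseif"), which avoids the nested else. This is only a sketch under a few assumptions: it uses bs4 method names, RecipeSketch is a hypothetical stand-in for a real BasicNewsRecipe subclass so the snippet runs without calibre, and the sample HTML is a shortened, partly invented version of the byline markup quoted earlier.

```python
def unwrap_links(soup):
    """Keep the image or the author-name span found inside a link,
    and remove every other link."""
    for a in soup.find_all('a'):
        img = a.find('img')
        # Look for <span class="fn">, not a tag literally named 'fn'.
        fn = a.find('span', attrs={'class': 'fn'})
        if img is not None:
            a.replace_with(img)
        elif fn is not None:
            a.replace_with(fn)
        else:
            a.extract()
    return soup


class RecipeSketch(object):
    # Hypothetical stand-in for a BasicNewsRecipe subclass;
    # only the method below matters here.
    def preprocess_html(self, soup):
        return unwrap_links(soup)


def demo():
    # Assumption: bs4 is available for trying this outside calibre.
    from bs4 import BeautifulSoup
    html = ('<p><a href="/search/?byline=x">'
            '<span class="author vcard">'
            '<span class="fn">By KURT ERICKSON</span></span></a>'
            '<a href="/elsewhere">unrelated link</a></p>')
    soup = BeautifulSoup(html, 'html.parser')
    return str(RecipeSketch().preprocess_html(soup))
```

One side effect worth knowing: only the inner fn span is kept, so the outer "author vcard" wrapper goes away with the link.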
01-11-2012, 12:21 AM | #10 |
Member
Posts: 19
Karma: 10
Join Date: Feb 2011
Device: kindle
|
vtblogger,
Thank you so much for your help... that did the trick! I got a new computer so I wasn't able to test this right away, but it works great and I appreciate all of your help. Now reading the paper will be much more enjoyable! |