Old 12-22-2011, 03:26 PM   #1
clintiepoo
Member
clintiepoo began at the beginning.
 
Posts: 19
Karma: 10
Join Date: Feb 2011
Device: kindle
Having problems using h1 tag since website changed

I've been using a recipe to parse jg-tc.com for a while now. Here are the tags I'm keeping/removing:

keep_only_tags = [
    dict(name='h1'),
    dict(name='span', attrs={'class':'updated'}),
    dict(name='span', attrs={'class':'fn'}),
    dict(name='img', attrs={'id':'img-holder'}),
    dict(name='span', attrs={'id':'gallery-cutline'}),
    dict(name='div', attrs={'id':'blox-story-text'})
]
remove_tags = [
    dict(name='a')
]

The problem is that they changed something related to a Facebook login, so now the titles of all the stories read like:

Login or
Title

Would somebody help me try to pull the title differently, or remove this "Login or" tag or text after the fact? I've tried several things myself and I'm struggling. Thanks!

Here's a sample webpage:
http://jg-tc.com/news/local/school-a...871e3ce6c.html
Old 12-24-2011, 06:07 PM   #2
clintiepoo
I've worked on this thing for a long time now and I can't seem to figure it out. There's a stupid "facebox" tag that's killing me. I would like to keep the h1 tag inside of the blox-story.

This seems like it would work, but it doesn't...

keep_only_tags = dict(id='blox-story', attrs={'name':'h1'}),

I've tried probably 100 things, keeping tags before, keeping tags after, trying to use the "meta title" instead of the h1, and I feel like I've exhausted my very limited knowledge of how this thing works. I know what I want it to do, I just don't know how to tell it to do it. Please do help!
Old 12-25-2011, 02:45 PM   #3
Barty
Wizard
Barty ought to be getting tired of karma fortunes by now.
 
Posts: 1,573
Karma: 3139999
Join Date: Sep 2010
Device: Kindle 3, PW2, iPad 3
I'm not sure what you are trying to accomplish. My guess is the problem is you're trying to keep a tag inside of another tag you are not keeping.

For example, say you have something like


<div id="main">
    <div id="inner">
        <h1>xxxxx</h1>
    </div>
    <div> .... </div>
</div>

You can't just keep H1; you need to keep the div id=main. In general, you want to keep the innermost container element that holds all the contents you want to keep, like [dict(id='main')], and use remove_tags to trim stuff you don't want.
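
Applied to this recipe, that advice might look roughly like the sketch below. It assumes the headline and story text both sit inside the element with id blox-story (the id mentioned earlier in the thread), and that the "Login or" text lives in an element with a class like login-prompt. That class name is hypothetical, so check the page source for the real one.

```python
# Keep the innermost container that holds everything wanted, then trim inside it.
keep_only_tags = [
    dict(attrs={'id': 'blox-story'}),  # assumed container id, taken from post #2
]
remove_tags = [
    # Hypothetical selector for the "Login or" prompt; replace with the real one.
    dict(name='span', attrs={'class': 'login-prompt'}),
]
```

The point is that the h1 no longer needs its own entry: it survives because its container is kept, and anything unwanted inside that container is listed in remove_tags.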
Old 12-26-2011, 01:27 AM   #4
clintiepoo
Barty,

Thanks for the help. After some work, I've been able to reorganize my code to make it do what I like, like you said. The only problem now is I've lost my images. The picture is in an img tag, which is buried inside of a link. Since I'm batch removing all links, I am not sure what to do. Here's what I have.

Code:
    remove_tags_before = dict(name='div', attrs={'id':'blox-left-col'})
    remove_tags_after = dict(name='div', attrs={'id':'blox-left-col'})
    keep_only_tags = [
#       dict(name='div', attrs={'id':'blox-left-col'}),
#       dict(name='span', attrs={'class':'updated'}),
#       dict(name='span', attrs={'class':'fn'}),
#       dict(name='img', attrs={'id':'img-holder'}),
#       dict(name='span', attrs={'id':'gallery-cutline'}),
#       dict(name='div', attrs={'id':'blox-story-text'})
    ]
    remove_tags = [
        dict(name='a'),
        dict(name='ul', attrs={'id':'blox-body-nav'}),
        dict(name='p', attrs={'class':'story-keywords moz-border'}),
        dict(name='div', attrs={'class':'clear'}),
        dict(name='div', attrs={'class':'hide'}),
        dict(name='p', attrs={'id':'story-tools'}),
        dict(name='div', attrs={'id':'latest-by-section'}),
        dict(name='span', attrs={'class':'bookmark hide'}),
        dict(name='span', attrs={'class':'hide source-org vcard'}),
        dict(name='dl', attrs={'id':'story-font-size'}),
        dict(name='div', attrs={'class':'article-share-top'}),
        dict(name='span', attrs={'id':'pictopiaURL'}),
        dict(name='span', attrs={'id':'siteHost'}),
        dict(name='span', attrs={'id':'mycaptureURL'}),
        dict(name='span', attrs={'id':'mycapturePricingSheet'}),
        dict(name='div', attrs={'class':'photo-cutline'}),
        dict(name='div', attrs={'class':'blox-thumb-container'})
    ]
Here's how the website has this tag set up:

<a href="http://bloximages.chicago2.vip.townnews.com/jg-tc.com/content/tncms/assets/v3/editorial/1/93/19381b50-904a-5d15-91d0-9183c727b977/4eee65251d356.image.jpg" rel="facebox">

<img id="img-holder" src="http://bloximages.chicago2.vip.townnews.com/jg-tc.com/content/tncms/assets/v3/editorial/1/93/19381b50-904a-5d15-91d0-9183c727b977/4eee6525d4f59.preview-300.jpg" alt=" " width="300px"/>
</a>

I used to just pull the "img-holder" tag out by itself, but I can't do that now. Any ideas?
Old 12-26-2011, 01:21 PM   #5
Barty
You could probably do it with preprocess_regexps, but why? Are you sure you want to remove all links? You're going to get missing text, e.g.,

As we (link)argued in this column last month(link), the current situation is...

becomes

As we, the current situation is...

If you want to remove certain links, then target them, for example

import re
remove_tags = [dict(name='a', attrs={'href': re.compile(r'doubleclick\.net', re.I)})]

to remove doubleclick links
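
For reference, the preprocess_regexps route could look roughly like the sketch below. It has not been tested against jg-tc.com. A recipe's preprocess_regexps is a list of (compiled pattern, replacement function) pairs; the pattern here assumes the image link always carries rel="facebox" as in the sample markup. The apply_regexps helper and the example.com URLs are made up, and exist only so the pairs can be tried outside calibre.

```python
import re

# Each entry: (compiled pattern, function taking the match and returning the replacement).
preprocess_regexps = [
    # Replace <a ... rel="facebox"> <img ...> </a> with just the <img> tag.
    (re.compile(r'<a\b[^>]*rel="facebox"[^>]*>\s*(<img\b[^>]*>)\s*</a>',
                re.I | re.S),
     lambda m: m.group(1)),
]


def apply_regexps(html, rules):
    """Hypothetical helper: apply each rule in order, standing in for calibre."""
    for pattern, func in rules:
        html = pattern.sub(func, html)
    return html


# Placeholder markup shaped like the sample from the site.
sample = ('<a href="http://example.com/full.jpg" rel="facebox">\n'
          '<img id="img-holder" src="http://example.com/preview.jpg" alt=" "/>\n'
          '</a>')
print(apply_regexps(sample, preprocess_regexps))
```

Regexes on HTML are brittle (attribute order, extra whitespace, nested markup), which is one reason to prefer targeting tags where possible.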
Old 12-28-2011, 04:25 PM   #6
clintiepoo
I understand your logic, and now I've been able to get rid of all the links individually by using tags within them. But, I'm still having trouble getting the image out of this:

Code:
<a href="http://bloximages.chicago2.vip.townnews.com/jg-tc.com/content/tncms/assets/v3/editorial/0/c1/0c16b29b-e8fc-55a6-8d20-f9ba420f8230/4ef38d670cd77.image.jpg" rel="facebox">
            
                <img id="img-holder" src="http://bloximages.chicago2.vip.townnews.com/jg-tc.com/content/tncms/assets/v3/editorial/0/c1/0c16b29b-e8fc-55a6-8d20-f9ba420f8230/4ef38d679d43b.preview-300.jpg" alt=" " width="300px">
            </a>
The image is buried inside the link. I used to use "img-holder" to isolate it and just keep that tag, but I'm not sure how to do it now. If I try to keep this whole link (i.e., not remove it), the whole story blows up and fails to download. I'm getting close, just not there.
Old 12-29-2011, 07:59 PM   #7
vtblogger
Junior Member
vtblogger began at the beginning.
 
Posts: 3
Karma: 10
Join Date: Dec 2011
Device: Kindle
Hmmm.. You might want to try something like the following, in preprocess_html:

def preprocess_html(self, soup):
    for a in soup.findAll('a'):
        img = a.find('img')
        if img is not None:
            a.replaceWith(img)
        else:
            a.extract()
    return soup
Haven't tried it, but I'm thinking that it should replace the relevant <a> tags with the straight embedded <img> tags, and delete all the other <a> tags. Give it a shot and let me know if it works....

Last edited by vtblogger; 12-29-2011 at 08:00 PM. Reason: typo
Old 01-04-2012, 12:36 AM   #8
clintiepoo
vtblogger,

That worked very well, thank you! The last problem I'm running into is I have one more link that I want to keep. I think it should be easy enough using an elseif or something, but I seriously struggle with this stuff. I appreciate your help thus far.

Inside of this mess, I would like to keep the text of the author's name. Before, I was just keeping that span.

keep_only_tags ... dict(name='span', attrs={'class':'fn'}),

Now, this gets deleted with the a.extract()


Code:
                <a href="/search/?l=50&sd=desc&s=start_time&f=html&byline=By KURT ERICKSON, JG-TC Springfield Bureau">
                    <span class="author vcard"><span class="fn">By KURT ERICKSON, JG-TC Springfield Bureau</span></span>
                </a>
                <span class="hide source-org vcard"><span class="org fn">JG-TC.com</span></span>
This is my flawed code... any more ideas? Did I mention I appreciated your help!?!

for a in soup.findAll('a'):
    img = a.find('img')
    fn = a.find('fn')
    if img is not None:
        a.replaceWith(img)
    else:
        if fn is not None:
            a.replaceWith(fn)
        else:
            a.extract()
Old 01-06-2012, 12:05 PM   #9
vtblogger
Glad I could help.
You're almost there with your next hurdle. Try this:

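def preprocess_html(self, soup):
    for a in soup.findAll('a'):
        img = a.find('img')
        if img is not None:
            # Link wraps an image: keep the image, drop the link.
            a.replaceWith(img)
        else:
            # Match the byline span by class; find('fn') would look for an <fn> tag.
            fn = a.find('span', attrs={'class': 'fn'})
            if fn is not None:
                a.replaceWith(fn)
            else:
                a.extract()
    return soup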
Old 01-11-2012, 01:21 AM   #10
clintiepoo
vtblogger,

Thank you so much for your help... that did the trick! I got a new computer so I wasn't able to test this right away, but it works great and I appreciate all of your help. Now reading the paper will be much more enjoyable!