Very new to this - please help me parse a local newspaper's RSS

clintiepoo · 02-19-2011, 09:17 PM

Hi,

I'm trying to work on the Herald and Review (herald-review.com). I don't know Python, so I'm starting with the Science Daily recipe and modifying it. Here's what I have so far:

Code:

#!/usr/bin/env  python


'''
http://www.herald-review.com
'''
from calibre.web.feeds.news import BasicNewsRecipe

class DecaturHerald(BasicNewsRecipe):
    title                 = u'Herald and Review'
    __author__            = u'Clint'
    description           = u"Decatur, IL Newspaper"
    oldest_article        = 7
    language = 'en'

    max_articles_per_feed = 100
    no_stylesheets        = True
    use_embedded_content  = False
    
    cover_url = 'http://www.herald-review.com/content/tncms/live/global/resources/images/hr_logo.jpg'
    
    keep_only_tags = [ 
                        dict(name='h1'),
                        dict(name='span', attrs={'class':'updated'}),
                        dict(name='img', attrs={'id':'img-holder'}),                        
                        dict(name='div', attrs={'id':'blox-story-text'}) 
                     ]
           
                     
    feeds       = [ 

                    (u'Local Business ', u'http://www.herald-review.com/search/?f=rss&c[]=business/local&sd=desc&s=start_time')

                    ]

Some problems I have:

The title shows up twice, once as a link. I'm not sure how to fix this.
The picture and the date are on the same line.

Any help is appreciated. This is probably really easy, but I'm not seeing it.

clintiepoo · 02-20-2011, 10:51 PM

I got the double-title to go away with this code.

Code:

    remove_tags = [
                     dict(name='a')
                    
                  ]

I'm still not sure how to get the date and picture to show up on different lines. Anybody?

Eventually, I'd like to format the headline and date fonts to a different format too.

clintiepoo · 02-22-2011, 10:10 PM

Quote:

Originally Posted by clintiepoo

I got the double-title to go away with this code.

Code:

    remove_tags = [
                     dict(name='a')
                    
                  ]

I'm still not sure how to get the date and picture to show up on different lines. Anybody?

Eventually, I'd like to format the headline and date fonts to a different format too.

Guys, please help. How do I put spaces between the different tags I'm using? Right now, everything is stringing together in one big line. This can't be that hard.

How it is:

dateIMAGEcaption

I want:

date
IMAGE
caption

Code:

#!/usr/bin/env  python


'''
http://www.herald-review.com
'''
from calibre.web.feeds.news import BasicNewsRecipe

class DecaturHerald(BasicNewsRecipe):
    title                 = u'Herald and Review'
    __author__            = u'Clint'
    description           = u"Decatur, IL Newspaper"
    oldest_article        = 7
    language = 'en'

    max_articles_per_feed = 100
    no_stylesheets        = True
    use_embedded_content  = False
    extra_css = '''
                 h1               {text-align:left;}
                 .updated         {font-family:monospace;text-align:left;margin-bottom: 1em;}
                 .img             {text-align:center;}
                 .gallery-cutline {text-align:center;font-size:smaller;font-style:italic}
                 .credit          {text-align:right;margin-bottom:0em;font-size:smaller;}
                 .div             {text-align:left;}
                 '''
    
    cover_url = 'http://www.herald-review.com/content/tncms/live/global/resources/images/hr_logo.jpg'
    
    keep_only_tags = [ 
                        dict(name='h1'),
                        dict(name='span', attrs={'class':'updated'}),
                        dict(name='img', attrs={'id':'img-holder'}),
                        dict(name='span', attrs={'id':'gallery-cutline'}),                        
                        dict(name='div', attrs={'id':'blox-story-text'}) 
                     ]
                     
    remove_tags = [
                     dict(name='a')                 
                  ]       
                     
    feeds       = [ 
                    (u'Local News', u'http://www.herald-review.com/search/?f=rss&c[]=news/local&sd=desc&s=start_time'),
#                    (u'Breaking News', u'http://www.herald-review.com/search/?f=rss&k[]=%23breaking&sd=desc&s=start_time'),
#                    (u'State and Regional ', u'http://www.herald-review.com/search/?f=rss&c[]=news/state-and-regional&sd=desc&s=start_time'),
#                    (u'Crime and courts', u'http://www.herald-review.com/search/?f=rss&c[]=news/local/crime-and-courts&sd=desc&s=start_time'),
#                    (u'Local Business ', u'http://www.herald-review.com/search/?f=rss&c[]=business/local&sd=desc&s=start_time'),
#                    (u'Editorials', u'http://www.herald-review.com/search/?f=rss&c[]=news/opinion/editorial&sd=desc&s=start_time'),
#                    (u'Illini News', u'http://www.herald-review.com/search/?f=rss&q=illini&sd=desc&s=start_time')

                    ]

Starson17 · 02-23-2011, 09:51 AM

Quote:

Originally Posted by clintiepoo

Guys, please help. How do I put spaces between the different tags I'm using? Right now, everything is stringing together in one big line. This can't be that hard.

Yes it can

Quote:

How it is:
dateIMAGEcaption

I want:

date
IMAGE
caption

Here's some code from a recipe to find images and put them inside a p tag. You will find lots of recipes where the img is on the same line as the next text to avoid having to deal with this issue.

Code:

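    # Needs Tag from calibre's bundled Beautiful Soup:
    #   from calibre.ebooks.BeautifulSoup import Tag
    def preprocess_html(self,soup):
        for img_tag in soup.findAll('img'):
            parent_tag = img_tag.parent
            if parent_tag.name == 'a':
                # img wrapped in a link: replace the link with a p holding the img
                new_tag = Tag(soup,'p')
                new_tag.insert(0,img_tag)
                parent_tag.replaceWith(new_tag)
            elif parent_tag.name == 'p':
                # img shares a p with text: give the img a p of its own
                if not self.tag_to_string(parent_tag) == '':
                    new_div = Tag(soup,'div')
                    new_tag = Tag(soup,'p')
                    new_tag.insert(0,img_tag)
                    # the div takes the old p's place, then holds both pieces
                    parent_tag.replaceWith(new_div)
                    new_div.insert(0,new_tag)
                    new_div.insert(1,parent_tag)
        return soup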

clintiepoo · 02-23-2011, 11:45 PM

Starson,

Thanks for your help on this. The code, however, hasn't worked for me. I think it's in how I'm chopping up the HTML. Here are my keep tags:

Code:

    keep_only_tags = [ 
                        dict(name='h1'),
                        dict(name='span', attrs={'class':'updated'}),
                        dict(name='img', attrs={'id':'img-holder'}),
                        dict(name='span', attrs={'id':'gallery-cutline'}),                        
                        dict(name='div', attrs={'id':'blox-story-text'}) 
                     ]

Because I'm calling out the img directly, and other things as 'span,' I don't know that the code you gave has the flexibility to work with this. It's running fine, just not doing anything for me. This is my suspicion.

Would you mind looking at http://www.herald-review.com/news/lo...cc4c002e0.html, for example, to see if you can come up with something better for the tags? I'd like to get rid of the spans, but I don't see how.

clintiepoo · 02-24-2011, 11:43 PM

I tried to reply to this yesterday, and apparently it requires moderator approval??

The code appears to run, but it doesn't fix my problem. I think it's in the spans I'm using to parse the page (the keep_only_tags). Would that make a difference?

Starson17 · 02-25-2011, 07:51 AM

Quote:

Originally Posted by clintiepoo

I tried to reply to this yesterday, and apparently it requires moderator approval??

No moderator approval is needed.

Quote:

The code appears to run, but not to fix my problem.

It wasn't intended to solve your problem. It was intended to show you how another recipe author solved his problem and why the problem solution wasn't simple. You'll need to customize for your site.

rylsfan · 02-28-2011, 09:37 AM

One nice thing about this paper is it is (for now) easy to figure out the printable version from the website.

Here is an article's url:
...business/local/article_084a9798-8890-557d-b091-37a611b9337e.html

Here is the printable version of that same article:
.../business/local/article_084a9798-8890-557d-b091-37a611b9337e.html?print=1

The only difference between the two is the second url appends '?print=1' to the end of the article.

You can call print_version and get an easily readable format that way:

    def print_version(self, url):
        return url.replace('.html', '.html?print=1')

It's no silver bullet. The printable version so far as I can tell does not copy graphics. It is readable though so that's something.
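A slightly more defensive take on the same idea, in case an article URL ever arrives with a query string of its own. This is only a sketch: add_print_flag is a hypothetical helper name, the example.com addresses are placeholders, and it assumes the site keeps honouring print=1 as described above.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def add_print_flag(url):
    """Return url with print=1 in its query string, added at most once."""
    parts = urlsplit(url)
    # Keep any parameters the URL already carries.
    query = parse_qsl(parts.query, keep_blank_values=True)
    if ('print', '1') not in query:
        query.append(('print', '1'))
    return urlunsplit(parts._replace(query=urlencode(query)))


# Placeholder URLs, not real article addresses.
print(add_print_flag('http://example.com/business/local/article_1.html'))
print(add_print_flag('http://example.com/business/local/article_1.html?page=2'))
```

Inside a recipe, print_version could simply return add_print_flag(url).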

Starson17 · 02-28-2011, 01:44 PM

Quote:

Originally Posted by rylsfan

The printable version so far as I can tell does not copy graphics. It is readable though so that's something.

If I understood his problem, it was just that he didn't have nice spacing between the parts he kept in the recipe. I'd rather have that problem than miss entire graphics (which is why I seldom use print_version - it often skips important parts of the full page.) I didn't have time to do the job for him, but the code I posted is how it is often done.

rylsfan · 02-28-2011, 04:46 PM

Whoops! It seems I had seriously misinterpreted the situation. Thanks for the help!

clintiepoo · 03-01-2011, 08:56 PM

I spent some time on the code, and I'm just not seeing how to do this. It wraps the parent tag in <p> tags, which I get, but my problem (I think) is that I'm grabbing spans and not divs. These just kind of sit out there inside the body, with no parent tag around them.

Is there a way to put what I grab (for example, dict(name='img', attrs={'id':'img-holder'})) and put a tag around that?

<Newtag><img id="img-holder" src="images/img1.jpg" alt=" " class="calibre12"></Newtag>

I think if I could do that, I could add <p> tags or anything around the content.

Starson17 · 03-02-2011, 09:25 AM

Quote:

Originally Posted by clintiepoo

I spent some time on the code, and I'm just not seeing how to do this. It wraps the parent tag in <p> tags, which I get, but my problem (I think) is that I'm grabbing spans and not divs. These just kind of sit out there inside the body, with no parent tag around them.

HTML is organized as a tree, so all tags have a parent, except the root tag.

Quote:

Is there a way to put what I grab (for example, dict(name='img', attrs={'id':'img-holder'})) and put a tag around that?

You're going to need to look at Beautiful Soup then go back to the code I gave you.

I took a look at your site. The parent of your img tag is the body tag. You can use that tag or play with the next sibling and previous sibling tags. You can't just tell BS to put the img inside another tag. You can create a p tag (with "Tag") and you can put your found img tag inside it, but now it's no longer in your page. It's hanging free and that p tag with the img inside it needs to be put into a tag in the tree forming your page. You'll need to use insert or replaceWith from BS to do that. Study the code I gave you, then study the BS docs at the link above to see how it's done. I suspect you'll come to understand why this is a pain to solve, and that's where we started.

Starson17 · 03-02-2011, 04:26 PM

Quote:

Originally Posted by Starson17

I took a look at your site. The parent of your img tag is the body tag. You can use that tag or play with the next sibling and previous sibling tags. You can't just tell BS to put the img inside another tag. You can create a p tag (with "Tag") and you can put your found img tag inside it, but now it's no longer in your page. It's hanging free and that p tag with the img inside it needs to be put into a tag in the tree forming your page. You'll need to use insert or replaceWith from BS to do that. Study the code I gave you, then study the BS docs at the link above to see how it's done. I suspect you'll come to understand why this is a pain to solve, and that's where we started.

I had a few more minutes, so I thought I'd explain what's happening in the code I posted so you can modify it. BeautifulSoup makes it easy to find your img tag. The problem is what to do with it. Let's say you create a p tag, then put your img tag inside. This actually moves the img tag away from your page, and puts it into your disconnected created-from-scratch p tag. Your problem now is that you've lost track of where the img tag came from. The solution is to figure out where it came from before you remove it.

The code I posted dealt with this issue by replacing the parent_tag (for an img tag embedded in an a tag). The a tag remained in the page soup, and the new p tag, with the img tag inside it, was used to replace the parent a tag.

That was the first case in the if statement. The second case that author dealt with was where the parent tag was a p tag, not an a tag. Now you might ask why he was putting a p tag around the img if the parent tag was already a p tag. That's because the img tag wasn't the only content in the parent p tag. There was also text, and the author was trying to get the img onto a new line, separate from the other text.

To do that, he created a div tag and a p tag. He put the img tag into the new p tag and put that into the new div tag. Now he's got a disconnected div tag with the p tag and img tag inside it. At this point, the img tag (and its image) are no longer on the page. He's going to have to put it back. The text inside the parent tag is still in the page. Next, he replaces the parent tag with the div tag. This deletes the parent tag from the page, but it's not lost. He still has a reference to that parent tag (minus the img tag he previously removed from it.) He sticks the parent tag into the div tag that is now in the page soup with "new_div.insert(1,parent_tag)". The index of 1 means that he puts the parent tag (with the text) in after the img tag (which has index 0.) The result is a new p tag around the img tag, followed by the original text in the original parent tag.

So now you know why I said it's not that easy, and why many recipes don't bother to do this cleanup. All you have to do is figure out how to apply this to your page.
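The second case in that walk-through can be tried outside calibre. The sketch below is not the recipe code: it uses the standard library's xml.dom.minidom as a stand-in for calibre's Beautiful Soup, because minidom's appendChild also moves a node out of its old parent. The function names and the sample markup are made up for illustration.

```python
from xml.dom import minidom


def has_direct_text(node):
    """True if the node has non-blank text directly inside it."""
    return any(
        child.nodeType == child.TEXT_NODE and child.data.strip()
        for child in node.childNodes
    )


def split_img_from_text(doc):
    """Give every <img> that shares a <p> with text a <p> of its own."""
    for img in list(doc.getElementsByTagName('img')):
        parent = img.parentNode
        if parent.nodeName != 'p' or not has_direct_text(parent):
            continue
        new_div = doc.createElement('div')
        new_p = doc.createElement('p')
        new_p.appendChild(img)                           # img leaves the page here
        parent.parentNode.replaceChild(new_div, parent)  # div takes the old p's slot
        new_div.appendChild(new_p)                       # first: the img in its own p
        new_div.appendChild(parent)                      # second: the old p and its text
    return doc


# Made-up markup: an image sharing a paragraph with text.
doc = minidom.parseString('<body><p><img src="a.jpg"/>Some text</p></body>')
split_img_from_text(doc)
print(doc.documentElement.toxml())
```

The order of the calls is the point: the old p is still attached when the div replaces it, so there is always something on the page to hang the new tags on.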

clintiepoo · 03-04-2011, 10:18 PM

Quote:

Originally Posted by Starson17

I had a few more minutes, so I thought I'd explain what's happening in the code I posted so you can modify it. BeautifulSoup makes it easy to find your img tag. The problem is what to do with it. Let's say you create a p tag, then put your img tag inside. This actually moves the img tag away from your page, and puts it into your disconnected created-from-scratch p tag. Your problem now is that you've lost track of where the img tag came from. The solution is to figure out where it came from before you remove it.

The code I posted dealt with this issue by replacing the parent_tag (for an img tag embedded in an a tag). The a tag remained in the page soup, and the new p tag, with the img tag inside it, was used to replace the parent a tag.

That was the first case in the if statement. The second case that author dealt with was where the parent tag was a p tag, not an a tag. Now you might ask why he was putting a p tag around the img if the parent tag was already a p tag. That's because the img tag wasn't the only content in the parent p tag. There was also text, and the author was trying to get the img onto a new line, separate from the other text.

To do that, he created a div tag and a p tag. He put the img tag into the new p tag and put that into the new div tag. Now he's got a disconnected div tag with the p tag and img tag inside it. At this point, the img tag (and its image) are no longer on the page. He's going to have to put it back. The text inside the parent tag is still in the page. Next, he replaces the parent tag with the div tag. This deletes the parent tag from the page, but it's not lost. He still has a reference to that parent tag (minus the img tag he previously removed from it.) He sticks the parent tag into the div tag that is now in the page soup with "new_div.insert(1,parent_tag)". The index of 1 means that he puts the parent tag (with the text) in after the img tag (which has index 0.) The result is a new p tag around the img tag, followed by the original text in the original parent tag.

So now you know why I said it's not that easy, and why many recipes don't bother to do this cleanup. All you have to do is figure out how to apply this to your page.

I can understand what you're saying, but I can't seem to make it work for my website.

I wouldn't be offended if you just went ahead and made it work.

I'm trying something like this.

Code:

    def preprocess_html(self,soup):
        for pix in soup.findAll('img'):
            new_tag=tag(soup,'p')
            new_tag.insert(0,pix)
            pix.replaceWith(new_tag)
        return soup

In my mind, this should first find all the images. For example.

<img id="img-holder" src="xyz.jpg" alt=" " width="300px">

Next, it inserts a <p> tag around the img, returning:

<p><img id="img-holder" src="xyz.jpg" alt=" " width="300px"></p>

Finally, it takes all of this and uses it in place of what the img was before.

Unfortunately, it doesn't work at all. I understand it's not part of the page anymore, but I figured it would stick it on the bottom or something. When I do this, most of the articles fail to even download.

A couple more questions:

Is there a way to step through this code and watch variables? I'm a lot more comfortable with vba, and you can do that easily in there.

I am still thinking the problem is that my img's parent is the body. I don't know how to fix that.

Here's an example what's around the img:

Code:

<div id="blox-story-photo-container">
		<span id="pictopiaURL" title="http://pictopia.com/perl/ptp/heraldreview"></span>
		<span id="siteHost" title="http://www.herald-review.com"></span>
		
		
		
		<div id="blox-large-photo-page">
			<a name="photos"></a>
			
			
			
				<a href="http://bloximages.chicago2.vip.townnews.com/herald-review.com/content/tncms/assets/editorial/5/d5/290/5d529006-4493-11e0-b654-001cc4c002e0-revisions/4d6ddeaceb281.image.jpg" rel="facebox">
			
				<img id="img-holder" src="http://bloximages.chicago2.vip.townnews.com/herald-review.com/content/tncms/assets/editorial/5/d5/290/5d529006-4493-11e0-b654-001cc4c002e0-revisions/4d6ddeacee535.preview-300.jpg" alt=" " width="300px">
			</a>
			
			
			
				<p class="photo-cutline">
					
						
					
					<a id="gallery-buy" href="http://pictopia.com/perl/ptp/heraldreview?photo_name=676ecb86-4493-11e0-8e81-001cc4c002e0&amp;title=Bill Cole, E'Twaun Moore&amp;t_url=http://bloximages.chicago2.vip.townnews.com/herald-review.com/content/tncms/assets/editorial/5/d5/290/5d529006-4493-11e0-b654-001cc4c002e0-revisions/4d6ddeaceb281.image.jpg&amp;fs_url=http://bloximages.chicago2.vip.townnews.com/herald-review.com/content/tncms/assets/editorial/5/d5/290/5d529006-4493-11e0-b654-001cc4c002e0-revisions/4d6ddeacef8bc.hires.jpg&amp;pps=buynow" rel="external"><img src="global/resources/images/buy-photo.gif" alt="buy this photo"></a>
					<span id="gallery-cutline">Purdue guard E'Twaun Moore, right, shoots over Illinois forward Bill Cole in the second half of an NCAA college basketball game in West Lafayette, Ind., Tuesday, March 1, 2011. Purdue defeated Illinois 75-67.</span>
					<span class="clear"></span>
				</p> 
			
		</div>
		
		
		
		
		<div class="clear"></div>
	</div>

Starson17 · 03-05-2011, 09:47 AM

Quote:

Originally Posted by clintiepoo

I can understand what you're saying, but I can't seem to make it work for my website.

I wouldn't be offended if you just went ahead and made it work.

Give a fish versus teach to fish.

Besides, if I just do it, I have to actually make it work, whereas if I stick to telling you how to do it yourself, I can always hide behind the claim that it's your fault it didn't work.

Quote:

I'm trying something like this.

Code:

    def preprocess_html(self,soup):
        for pix in soup.findAll('img'):
            new_tag=tag(soup,'p')
            new_tag.insert(0,pix)
            pix.replaceWith(new_tag)
        return soup

In my mind, this should first find all the images.

Yes, it does.

Quote:

Next, it inserts a <p> tag around the img, returning:

<p><img id="img-holder" src="xyz.jpg" alt=" " width="300px"></p>

Correct.

Quote:

Finally, it takes all of this and uses it in place of what the img was before.

Nope. In my long description:
"Now he's got a disconnected div tag with the p tag and img tag inside it. At this point, the img tag (and its image) are no longer on the page. He's going to have to put it back. "
In your case, you created a "disconnected" p tag as a new tag, then moved the img out of your page and into it. How can you "use it in place of what the img was before" if the page no longer has the img tag on it? It's now in the new p tag. You need a reference to a tag that's still on the page.
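The lost reference is easy to see in a small experiment. This sketch uses the standard library's xml.dom.minidom rather than calibre's Beautiful Soup, so the exact failure will differ from what the recipe does, but the move semantics are analogous. The function name and the sample markup are invented for the illustration.

```python
from xml.dom import NotFoundErr, minidom


def try_replace_after_move(markup):
    """Move the first <img> into a new <p>, then try to swap it in place."""
    doc = minidom.parseString(markup)
    img = doc.getElementsByTagName('img')[0]
    body = img.parentNode
    new_p = doc.createElement('p')
    new_p.appendChild(img)  # this moves the img out of body
    still_in_body = any(child is img for child in body.childNodes)
    try:
        # Too late: body no longer contains img, so there is nothing to replace.
        body.replaceChild(new_p, img)
        replaced = True
    except NotFoundErr:
        replaced = False
    return still_in_body, replaced


# Made-up markup resembling the kept tags: date, image, caption.
sample = '<body><span>date</span><img src="xyz.jpg"/><span>caption</span></body>'
print(try_replace_after_move(sample))
```

Both values come back False: the img is gone from the body the moment it is put into the new p, so the later replace has no position left to work with.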

Quote:

Unfortunately, it doesn't work at all.

Correct

Quote:

I understand it's not part of the page anymore, but I figured it would stick it on the bottom or something.

Nope. You've lost the reference inside the page.

Quote:

Is there a way to step through this code and watch variables?

Yes, but it's not worth the effort. Just add this print statement:

Code:

print 'My variable x is now: ', x

Quote:

I am still thinking the problem is that my img's parent is the body. I don't know how to fix that.

You don't have to use the body. You can use next or previous or next sibling, etc. The only reason the code I posted used the parent was because the author needed to keep a placeholder inside the page, and the parent was still there when the img was removed, but the parent had nothing of value in it, so it could be replaced. Yes, you have to use something on the page. You can't replace the entire parent, (unless you grab everything else you need). You could try inserting into the parent.

Quote:

Here's an example what's around the img:

I already looked at it, so you don't need to post it. Your problem is simple. One option is to put a placeholder into the page with next or previous or nextSibling so you can do a replaceWith or insert into it. Another option is to replace something that you don't need and that's still in the body. Still another option is to just label the parent (body tag) then do an insert (of your p tag with the extracted img) at a numerical position into the body. One of those should work. (I tested one and it worked, but the img was out of order, and I'm not at that machine now.)
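The first option (keep a placeholder via the next sibling) might look roughly like this. Again this is a sketch with the standard library's xml.dom.minidom standing in for calibre's Beautiful Soup; wrap_in_place is a hypothetical helper and the markup is invented, so the calls would need translating before they go into a recipe.

```python
from xml.dom import minidom


def wrap_in_place(doc, node, wrapper_name='p'):
    """Wrap node in a new element without changing its position."""
    parent = node.parentNode
    following = node.nextSibling  # stays on the page after node is removed
    parent.removeChild(node)
    wrapper = doc.createElement(wrapper_name)
    wrapper.appendChild(node)
    # insertBefore appends at the end when following is None.
    parent.insertBefore(wrapper, following)
    return wrapper


# Made-up markup: the img sits directly in the body between two spans.
doc = minidom.parseString(
    '<body><span>date</span><img id="img-holder" src="xyz.jpg"/>'
    '<span>caption</span></body>'
)
wrap_in_place(doc, doc.getElementsByTagName('img')[0])
print([child.nodeName for child in doc.documentElement.childNodes])
```

Because the sibling is recorded before the img is removed, the new p lands exactly where the img was, which avoids the out-of-order result mentioned above.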
