#1
Member
Posts: 19
Karma: 10
Join Date: Feb 2011
Device: kindle
Very new to this - please help me parse a local newspaper's RSS
Hi,
I'm trying to work on the Herald and Review (herald-review.com). I don't know Python, so I'm starting with the Science Daily recipe and modifying it. Here's what I have so far:
Code:
#!/usr/bin/env python
'''
http://www.herald-review.com
'''
from calibre.web.feeds.news import BasicNewsRecipe

class DecaturHerald(BasicNewsRecipe):
    title = u'Herald and Review'
    __author__ = u'Clint'
    description = u"Decatur, IL Newspaper"
    oldest_article = 7
    language = 'en'
    max_articles_per_feed = 100
    no_stylesheets = True
    use_embedded_content = False
    cover_url = 'http://www.herald-review.com/content/tncms/live/global/resources/images/hr_logo.jpg'

    keep_only_tags = [
        dict(name='h1'),
        dict(name='span', attrs={'class':'updated'}),
        dict(name='img', attrs={'id':'img-holder'}),
        dict(name='div', attrs={'id':'blox-story-text'})
    ]

    feeds = [
        (u'Local Business ', u'http://www.herald-review.com/search/?f=rss&c[]=business/local&sd=desc&s=start_time')
    ]
The title shows up twice, once as a link. I'm not sure how to fix this. The picture and the date are on the same line.
Any help is appreciated. This is probably really easy, but I'm not seeing it.
#2
Member
Posts: 19
Karma: 10
Join Date: Feb 2011
Device: kindle
I got the double-title to go away with this code.
Code:
remove_tags = [ dict(name='a') ]
Eventually, I'd like to style the headline and date in different fonts too.
#3
Member
Posts: 19
Karma: 10
Join Date: Feb 2011
Device: kindle
Quote:
How it is:
dateIMAGEcaption
I want:
date
IMAGE
caption
Code:
#!/usr/bin/env python
'''
http://www.herald-review.com
'''
from calibre.web.feeds.news import BasicNewsRecipe

class DecaturHerald(BasicNewsRecipe):
    title = u'Herald and Review'
    __author__ = u'Clint'
    description = u"Decatur, IL Newspaper"
    oldest_article = 7
    language = 'en'
    max_articles_per_feed = 100
    no_stylesheets = True
    use_embedded_content = False
    extra_css = '''
        h1 {text-align:left;}
        .updated {font-family:monospace;text-align:left;margin-bottom: 1em;}
        .img {text-align:center;}
        .gallery-cutline {text-align:center;font-size:smaller;font-style:italic}
        .credit {text-align:right;margin-bottom:0em;font-size:smaller;}
        .div {text-align:left;}
    '''
    cover_url = 'http://www.herald-review.com/content/tncms/live/global/resources/images/hr_logo.jpg'

    keep_only_tags = [
        dict(name='h1'),
        dict(name='span', attrs={'class':'updated'}),
        dict(name='img', attrs={'id':'img-holder'}),
        dict(name='span', attrs={'id':'gallery-cutline'}),
        dict(name='div', attrs={'id':'blox-story-text'})
    ]

    remove_tags = [
        dict(name='a')
    ]

    feeds = [
        (u'Local News', u'http://www.herald-review.com/search/?f=rss&c[]=news/local&sd=desc&s=start_time'),
        # (u'Breaking News', u'http://www.herald-review.com/search/?f=rss&k[]=%23breaking&sd=desc&s=start_time'),
        # (u'State and Regional ', u'http://www.herald-review.com/search/?f=rss&c[]=news/state-and-regional&sd=desc&s=start_time'),
        # (u'Crime and courts', u'http://www.herald-review.com/search/?f=rss&c[]=news/local/crime-and-courts&sd=desc&s=start_time'),
        # (u'Local Business ', u'http://www.herald-review.com/search/?f=rss&c[]=business/local&sd=desc&s=start_time'),
        # (u'Editorials', u'http://www.herald-review.com/search/?f=rss&c[]=news/opinion/editorial&sd=desc&s=start_time'),
        # (u'Illini News', u'http://www.herald-review.com/search/?f=rss&q=illini&sd=desc&s=start_time')
    ]
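One thing that may be worth trying with the recipe above, offered as an untested sketch rather than a known fix: `span` and `img` are inline elements, so they flow onto one line and `text-align` set on them has no effect. Also, `.img` and `.div` select elements whose class is `img` or `div`, not the tags themselves, and the cutline is kept by id, which CSS writes as `#gallery-cutline`. Assuming calibre applies `extra_css` to the kept elements and the reading device honors `display: block`, a replacement value could look like this:

```python
# Hypothetical replacement for the recipe's extra_css attribute.
# Assumption: the kept <span>/<img> elements still carry their class/id
# when the stylesheet is applied, and the device honors display: block.
extra_css = '''
    h1 {text-align: left;}
    span.updated {display: block; font-family: monospace; margin-bottom: 1em;}
    img#img-holder {display: block; margin: 0 auto;}
    span#gallery-cutline {display: block; text-align: center;
                          font-size: smaller; font-style: italic;}
'''
```

Each `display: block` rule forces the date, the image and the caption onto their own lines without touching the HTML tree.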
#4
Wizard
Posts: 4,004
Karma: 177841
Join Date: Dec 2009
Device: WinMo: IPAQ; Android: HTC HD2, Archos 7o; Java:Gravity T
Quote:
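Quote:
Code:
# At the top of the recipe file; Tag comes from calibre's bundled BeautifulSoup.
from calibre.ebooks.BeautifulSoup import Tag

def preprocess_html(self,soup):
    for img_tag in soup.findAll('img'):
        parent_tag = img_tag.parent
        if parent_tag.name == 'a':
            # Image wrapped in a link: replace the link with a <p> holding the image.
            new_tag = Tag(soup,'p')
            new_tag.insert(0,img_tag)
            parent_tag.replaceWith(new_tag)
        elif parent_tag.name == 'p':
            if not self.tag_to_string(parent_tag) == '':
                # Image shares a <p> with text: give the image its own <p>,
                # followed by the original <p> and its text, inside a new <div>.
                new_div = Tag(soup,'div')
                new_tag = Tag(soup,'p')
                new_tag.insert(0,img_tag)
                parent_tag.replaceWith(new_div)
                new_div.insert(0,new_tag)
                new_div.insert(1,parent_tag)
    return soup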
#5
Member
Posts: 19
Karma: 10
Join Date: Feb 2011
Device: kindle
Starson,
Thanks for your help on this. The code, however, hasn't worked for me. I think the problem is in how I'm chopping up the HTML. Here are my keep tags:
Code:
keep_only_tags = [
    dict(name='h1'),
    dict(name='span', attrs={'class':'updated'}),
    dict(name='img', attrs={'id':'img-holder'}),
    dict(name='span', attrs={'id':'gallery-cutline'}),
    dict(name='div', attrs={'id':'blox-story-text'})
]
Would you mind looking at http://www.herald-review.com/news/lo...cc4c002e0.html as an example, to see if you can come up with something better for the tags? I'd like to get rid of the spans, but I don't see how.
#6
Member
Posts: 19
Karma: 10
Join Date: Feb 2011
Device: kindle
I tried to reply to this yesterday, and apparently it requires moderator approval??
The code appears to run, but it doesn't fix my problem. I think the cause is the spans I'm using to pick out the content (the keep_only_tags). Would that make a difference?
#7
Wizard
Posts: 4,004
Karma: 177841
Join Date: Dec 2009
Device: WinMo: IPAQ; Android: HTC HD2, Archos 7o; Java:Gravity T
Quote:
Quote:
#8
Member
Posts: 18
Karma: 10
Join Date: Feb 2010
Device: Kindle2
One nice thing about this paper is it is (for now) easy to figure out the printable version from the website.
Here is an article's URL:
...business/local/article_084a9798-8890-557d-b091-37a611b9337e.html
Here is the printable version of that same article:
.../business/local/article_084a9798-8890-557d-b091-37a611b9337e.html?print=1
The only difference between the two is that the second URL appends '?print=1' to the end of the article URL. You can define print_version and get an easily readable format that way:
    def print_version(self, url):
        return url.replace('.html', '.html?print=1')
It's no silver bullet. As far as I can tell, the printable version does not include graphics. It is readable though, so that's something.
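As a small illustration of the URL rewrite described above, here is a standalone sketch that can be run outside calibre. It is written as a plain function rather than a recipe method, the example URL is hypothetical, and the guard for URLs that do not end in `.html` is an added assumption, not something the site is known to need.

```python
def print_version(url):
    """Return the printable URL for an article URL ending in '.html'."""
    # Only touch the end of the URL, so a '.html' appearing elsewhere
    # in the address is left alone.
    if url.endswith('.html'):
        return url + '?print=1'
    # Assumption: anything else is returned unchanged.
    return url


# Hypothetical article URL, used only for illustration.
example = 'http://example.com/business/local/article_1234.html'
printable = print_version(example)
```

Inside a recipe the same body would sit in `def print_version(self, url):`.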
#9
Wizard
Posts: 4,004
Karma: 177841
Join Date: Dec 2009
Device: WinMo: IPAQ; Android: HTC HD2, Archos 7o; Java:Gravity T
If I understood his problem, it was just that he didn't have nice spacing between the parts he kept in the recipe. I'd rather have that problem than miss entire graphics (which is why I seldom use print_version - it often skips important parts of the full page.) I didn't have time to do the job for him, but the code I posted is how it is often done.
#10
Member
Posts: 18
Karma: 10
Join Date: Feb 2010
Device: Kindle2
Whoops! It seems I had seriously misinterpreted the situation. Thanks for the help!
Last edited by rylsfan; 02-28-2011 at 09:49 PM.
#11
Member
Posts: 19
Karma: 10
Join Date: Feb 2011
Device: kindle
I spent some time on the code, and I'm just not seeing how to do this. It wraps the parent tag in <p> tags, which I get, but my problem (I think) is that I'm grabbing spans and not divs. These just kind of sit out there inside the body, with no parent tag around them.
Is there a way to take what I grab (for example, dict(name='img', attrs={'id':'img-holder'})) and put a tag around it?
<Newtag><img id="img-holder" src="images/img1.jpg" alt=" " class="calibre12"></Newtag>
I think if I could do that, I could add <p> tags or anything else around the content.
#12
Wizard
Posts: 4,004
Karma: 177841
Join Date: Dec 2009
Device: WinMo: IPAQ; Android: HTC HD2, Archos 7o; Java:Gravity T
Quote:
Quote:
I took a look at your site. The parent of your img tag is the body tag. You can use that tag or play with the next sibling and previous sibling tags.
You can't just tell BS to put the img inside another tag. You can create a p tag (with "Tag") and you can put your found img tag inside it, but now it's no longer in your page. It's hanging free and that p tag with the img inside it needs to be put into a tag in the tree forming your page. You'll need to use insert or replaceWith from BS to do that.
Study the code I gave you, then study the BS docs at the link above to see how it's done. I suspect you'll come to understand why this is a pain to solve, and that's where we started.
Last edited by Starson17; 03-02-2011 at 09:36 AM.
#13
Wizard
Posts: 4,004
Karma: 177841
Join Date: Dec 2009
Device: WinMo: IPAQ; Android: HTC HD2, Archos 7o; Java:Gravity T
Quote:
The code I posted dealt with this issue by replacing the parent_tag (for an img tag embedded in an a tag). The a tag remained in the page soup, and the new p tag, with the img tag inside it, was used to replace the parent a tag. That was the first case in the if statement.
The second case that author dealt with was where the parent tag was a p tag, not an a tag. Now you might ask why he was putting a p tag around the img if the parent tag was already a p tag. That's because the img tag wasn't the only content in the parent p tag. There was also text, and the author was trying to get the img onto a new line, separate from the other text.
To do that, he created a div tag and a p tag. He put the img tag into the new p tag and put that into the new div tag. Now he's got a disconnected div tag with the p tag and img tag inside it. At this point, the img tag (and its image) are no longer on the page. He's going to have to put it back. The text inside the parent tag is still in the page.
Next, he replaces the parent tag with the div tag. This deletes the parent tag from the page, but it's not lost. He still has a reference to that parent tag (minus the img tag he previously removed from it.) He sticks the parent tag into the div tag that is now in the page soup with "new_div.insert(1,parent_tag)". The index of 1 means that he puts the parent tag (with the text) in after the img tag (which has index 0.) The result is a new p tag around the img tag, followed by the original text in the original parent tag.
So now you know why I said it's not that easy, and why many recipes don't bother to do this cleanup. All you have to do is figure out how to apply this to your page.
Last edited by Starson17; 03-02-2011 at 04:33 PM.
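The second case walked through above can be tried outside calibre. The sketch below is only an illustration: it uses the standard library's `xml.dom.minidom` as a stand-in for BeautifulSoup (so `appendChild` and `replaceChild` take the place of `insert` and `replaceWith`), and the sample markup is made up. What carries over is the behaviour that matters here: moving a node into a new tag removes it from its old parent.

```python
from xml.dom import minidom

# Made-up page: a <p> that holds both an image and some text.
doc = minidom.parseString(
    '<body><p><img src="a.jpg"/>Some caption text</p></body>')
body = doc.documentElement
parent_tag = body.getElementsByTagName('p')[0]
img_tag = parent_tag.getElementsByTagName('img')[0]

# Build a detached <div><p/></div>; neither tag is part of the page yet.
new_div = doc.createElement('div')
new_tag = doc.createElement('p')
new_div.appendChild(new_tag)

# Moving the image into the new <p> takes it out of the original <p>.
new_tag.appendChild(img_tag)
page_while_detached = body.toxml()  # the image is missing from the page here

# Put the <div> where the original <p> was, then re-attach the original
# <p> (now text only) after the image.
body.replaceChild(new_div, parent_tag)
new_div.appendChild(parent_tag)

result = body.toxml()
print(result)
```

Printing `page_while_detached` next to `result` shows the intermediate state in which the image has left the page, which is the step that is easy to overlook.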
#14
Member
Posts: 19
Karma: 10
Join Date: Feb 2011
Device: kindle
Quote:
I can understand what you're saying, but I can't seem to make it work for my website.
I'm trying something like this.
Code:
def preprocess_html(self,soup):
    for pix in soup.findAll('img'):
        new_tag=tag(soup,'p')
        new_tag.insert(0,pix)
        pix.replaceWith(new_tag)
    return soup
<img id="img-holder" src="xyz.jpg" alt=" " width="300px">
Next, it inserts a <p> tag around the img, returning:
<p><img id="img-holder" src="xyz.jpg" alt=" " width="300px"></p>
Finally, it takes all of this, and uses it in place of what the img before.
Unfortunately, it doesn't work at all. I understand it's not part of the page anymore, but I figured it would stick it on the bottom or something. When I do this, most of the articles fail to even download.
A couple more questions:
Is there a way to step through this code and watch variables? I'm a lot more comfortable with VBA, and you can do that easily in there.
I am still thinking the problem is that my img's parent is the body. I don't know how to fix that. Here's an example of what's around the img:
Code:
<div id="blox-story-photo-container">
  <span id="pictopiaURL" title="http://pictopia.com/perl/ptp/heraldreview"></span>
  <span id="siteHost" title="http://www.herald-review.com"></span>
  <div id="blox-large-photo-page">
    <a name="photos"></a>
    <a href="http://bloximages.chicago2.vip.townnews.com/herald-review.com/content/tncms/assets/editorial/5/d5/290/5d529006-4493-11e0-b654-001cc4c002e0-revisions/4d6ddeaceb281.image.jpg" rel="facebox">
      <img id="img-holder" src="http://bloximages.chicago2.vip.townnews.com/herald-review.com/content/tncms/assets/editorial/5/d5/290/5d529006-4493-11e0-b654-001cc4c002e0-revisions/4d6ddeacee535.preview-300.jpg" alt=" " width="300px">
    </a>
    <p class="photo-cutline">
      <a id="gallery-buy" href="http://pictopia.com/perl/ptp/heraldreview?photo_name=676ecb86-4493-11e0-8e81-001cc4c002e0&title=Bill Cole, E'Twaun Moore&t_url=http://bloximages.chicago2.vip.townnews.com/herald-review.com/content/tncms/assets/editorial/5/d5/290/5d529006-4493-11e0-b654-001cc4c002e0-revisions/4d6ddeaceb281.image.jpg&fs_url=http://bloximages.chicago2.vip.townnews.com/herald-review.com/content/tncms/assets/editorial/5/d5/290/5d529006-4493-11e0-b654-001cc4c002e0-revisions/4d6ddeacef8bc.hires.jpg&pps=buynow" rel="external"><img src="global/resources/images/buy-photo.gif" alt="buy this photo"></a>
      <span id="gallery-cutline">Purdue guard E'Twaun Moore, right, shoots over Illinois forward Bill Cole in the second half of an NCAA college basketball game in West Lafayette, Ind., Tuesday, March 1, 2011. Purdue defeated Illinois 75-67.</span>
      <span class="clear"></span>
    </p>
  </div>
  <div class="clear"></div>
</div>
#15
Wizard
Posts: 4,004
Karma: 177841
Join Date: Dec 2009
Device: WinMo: IPAQ; Android: HTC HD2, Archos 7o; Java:Gravity T
Quote:
Besides, if I just do it, I have to actually make it work, whereas if I stick to telling you how to do it yourself, I can always hide behind the claim that it's your fault it didn't work.
Quote:
"Now he's got a disconnected div tag with the p tag and img tag inside it. At this point, the img tag (and its image) are no longer on the page. He's going to have to put it back. "
In your case, you created a "disconnected" p tag as a new tag, then removed the img from your page. How can you "uses it in place of what the img before." if the page no longer has the img tag on it? It's now in the new p tag. You need a reference to a tag that's still on the page.
Quote:
Code:
print 'My variable x is now: ', x
Last edited by Starson17; 03-05-2011 at 09:52 AM.
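The point about needing a reference that is still on the page comes down to the order of two steps. The sketch below shows that order on made-up markup in which the image sits directly under the body. It is an illustration only and again uses the standard library's `xml.dom.minidom` as a stand-in for BeautifulSoup, so it does not reproduce calibre's API.

```python
from xml.dom import minidom

# Made-up page: date, image and caption all sit directly under <body>.
doc = minidom.parseString(
    '<body><span class="updated">March 2, 2011</span>'
    '<img id="img-holder" src="xyz.jpg"/>'
    '<span id="gallery-cutline">Caption</span></body>')
body = doc.documentElement

for pix in list(body.getElementsByTagName('img')):
    new_tag = doc.createElement('p')
    # Step 1: while the image is still on the page, put the empty <p>
    # in its place. This detaches the image but keeps its position.
    pix.parentNode.replaceChild(new_tag, pix)
    # Step 2: move the detached image into the <p> that is now on the page.
    new_tag.appendChild(pix)

result = body.toxml()
print(result)
```

In BeautifulSoup terms this should correspond to calling `pix.replaceWith(new_tag)` before `new_tag.insert(0, pix)`, the reverse of the order in the attempt from post #14; that mapping is an assumption and has not been tested inside a recipe.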