Old 03-07-2011, 09:53 PM   #16
clintiepoo
Member
clintiepoo began at the beginning.
 
Posts: 19
Karma: 10
Join Date: Feb 2011
Device: kindle
I'm getting worse at this, and I'm worn out from trying. I can't get this to do anything whatsoever. The command prompt won't print my variables and won't download the articles; it just fails. I'd still like to step through and see where it's failing, but I'm about done trying.

Code:
    def preprocess_html(self,soup):
        for pix in soup.findAll('img'):
            next_tag=tag(soup, soup.body.nextSibling.name)
            new_tag=tag(soup,'p')
            new_tag.insert(0,pix)
            next_tag.insert(0,new_tag)
        return soup
Old 03-08-2011, 09:17 AM   #17
Starson17
Wizard
Starson17 can program the VCR without an owner's manual.
 
Posts: 4,004
Karma: 177841
Join Date: Dec 2009
Device: WinMo: IPAQ; Android: HTC HD2, Archos 7o; Java:Gravity T
Quote:
Originally Posted by clintiepoo
I'm about done trying
Try this:

Code:
#!/usr/bin/env  python


'''
http://www.herald-review.com
'''
from calibre.web.feeds.news import BasicNewsRecipe
from calibre.ebooks.BeautifulSoup import Tag

class DecaturHerald(BasicNewsRecipe):
    title                 = u'Herald and Review'
    __author__            = u'Clint and Starson17'
    description           = u"Decatur, IL Newspaper"
    oldest_article        = 7
    language = 'en'

    max_articles_per_feed = 100
    no_stylesheets        = True
    use_embedded_content  = False
    extra_css = '''
                 h1               {text-align:left;}
                 .updated         {font-family:monospace;text-align:left;margin-bottom: 1em;}
                 .img             {text-align:center;}
                 .gallery-cutline {text-align:center;font-size:smaller;font-style:italic}
                 .credit          {text-align:right;margin-bottom:0em;font-size:smaller;}
                 .div             {text-align:left;}
                 '''
    
    cover_url = 'http://www.herald-review.com/content/tncms/live/global/resources/images/hr_logo.jpg'
    
    keep_only_tags = [ 
                        dict(name='h1'),
                        dict(name='span', attrs={'class':'updated'}),
                        dict(name='img', attrs={'id':'img-holder'}),
                        dict(name='span', attrs={'id':'gallery-cutline'}),                        
                        dict(name='div', attrs={'id':'blox-story-text'}) 
                     ]
                     
    remove_tags = [
                     dict(name='a')                 
                  ]       
                     
    feeds       = [ 
                    (u'Local News', u'http://www.herald-review.com/search/?f=rss&c[]=news/local&sd=desc&s=start_time'),
#                    (u'Breaking News', u'http://www.herald-review.com/search/?f=rss&k[]=%23breaking&sd=desc&s=start_time'),
#                    (u'State and Regional ', u'http://www.herald-review.com/search/?f=rss&c[]=news/state-and-regional&sd=desc&s=start_time'),
#                    (u'Crime and courts', u'http://www.herald-review.com/search/?f=rss&c[]=news/local/crime-and-courts&sd=desc&s=start_time'),
#                    (u'Local Business ', u'http://www.herald-review.com/search/?f=rss&c[]=business/local&sd=desc&s=start_time'),
#                    (u'Editorials', u'http://www.herald-review.com/search/?f=rss&c[]=news/opinion/editorial&sd=desc&s=start_time'),
#                    (u'Illini News', u'http://www.herald-review.com/search/?f=rss&q=illini&sd=desc&s=start_time')

                    ]

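    def preprocess_html(self,soup):
        # Debug aid: dump the parsed page; comment this out once the recipe works.
        print 'the soup is: ', soup
        for img_tag in soup.findAll('img'):
            previousSibling_tag = img_tag.previousSibling
            # previousSibling can be None or a text node, so look up .name defensively.
            if getattr(previousSibling_tag, 'name', None) == 'span':
                # Wrap the image in a new <p>, then move it into the preceding span.
                new_tag = Tag(soup,'p')
                new_tag.insert(0,img_tag)
                previousSibling_tag.insert(1,new_tag)
        return soup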

I used previousSibling to find the span tag that preceded the img tag. Since the span tag had useful text (the date), and was still in the soup, I used it as the marker and just put the img tag into it, after putting it into a p tag.
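I didn't look closely at your code, but I did see that it used "tag" instead of "Tag". Python names are case-sensitive, and the class imported from calibre.ebooks.BeautifulSoup is Tag. Note the imports and the print, which you can comment out with "#".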
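
For anyone who wants to try the previousSibling idea outside calibre, here is a minimal sketch. It is not the recipe code: it assumes a made-up page fragment (SAMPLE) and swaps calibre's bundled BeautifulSoup for the standard library's xml.dom.minidom, which also exposes previousSibling. The helper name wrap_imgs_into_previous_span is hypothetical.

```python
from xml.dom import minidom

# Hypothetical article fragment; the real page layout is an assumption.
SAMPLE = ('<body><h1>Headline</h1>'
          '<span class="updated">March 8, 2011</span>'
          '<img id="img-holder" src="photo.jpg"/>'
          '<div id="blox-story-text">Story text</div></body>')


def wrap_imgs_into_previous_span(doc):
    """Wrap each <img> in a <p> and move it into the <span> right before it."""
    moved = 0
    for img in doc.getElementsByTagName('img'):
        prev = img.previousSibling
        # previousSibling may be None or a text node, so check the type first.
        if (prev is not None
                and prev.nodeType == minidom.Node.ELEMENT_NODE
                and prev.tagName == 'span'):
            wrapper = doc.createElement('p')
            wrapper.appendChild(img)   # appendChild detaches img from its old parent
            prev.appendChild(wrapper)  # lands after the date text inside the span
            moved += 1
    return moved


doc = minidom.parseString(SAMPLE)
moved = wrap_imgs_into_previous_span(doc)
span = doc.getElementsByTagName('span')[0]
img = doc.getElementsByTagName('img')[0]
print(span.toxml())
```

The span keeps its date text and gains the wrapped image as its last child, which is the same shape the recipe builds with Tag(soup,'p') and insert().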

Last edited by Starson17; 03-08-2011 at 09:22 AM.
Old 03-09-2011, 05:51 PM   #18
clintiepoo
Starson,

Thanks for helping with that. I was stuck, and I doubt I would have gotten it figured out. This is why you're a wizard and I'm a Jr. Member.

Another question, if you don't mind:

How do I do a similar thing with other tags? I tried adding another for loop before the return soup, and it didn't want to take it. Can you call the preprocess_html twice, or what do you do?

I know you didn't teach me to fish, but maybe I can help take it off the hook once you reel it in.
Old 03-09-2011, 08:55 PM   #19
Starson17
Quote:
Originally Posted by clintiepoo
How do I do a similar thing with other tags?
The same way.
Quote:
I tried adding another for loop before the return soup, and it didn't want to take it.
Post your code. It should have worked.

Quote:
Can you call the preprocess_html twice
No
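
A rough sketch of what "the same way" can look like: every pass goes into the one function, one loop after the other. This is an illustration under assumptions, not recipe code. The fragment in SAMPLE is made up, xml.dom.minidom stands in for calibre's BeautifulSoup, and move_into_previous_span and preprocess are hypothetical names.

```python
from xml.dom import minidom

# Hypothetical article fragment; the real page layout is an assumption.
SAMPLE = ('<body><h1>Headline</h1>'
          '<span class="updated">March 12, 2011</span>'
          '<span class="fn">Staff Writer</span>'
          '<img id="img-holder" src="photo.jpg"/>'
          '<div id="blox-story-text">Story text</div></body>')


def move_into_previous_span(doc, tag_name, class_name=None):
    """Wrap matching elements in <p> and append them to a preceding <span>."""
    moved = 0
    for node in doc.getElementsByTagName(tag_name):
        if class_name is not None and node.getAttribute('class') != class_name:
            continue
        prev = node.previousSibling
        if (prev is not None
                and prev.nodeType == minidom.Node.ELEMENT_NODE
                and prev.tagName == 'span'):
            wrapper = doc.createElement('p')
            wrapper.appendChild(node)  # detaches node from its old parent
            prev.appendChild(wrapper)
            moved += 1
    return moved


def preprocess(doc):
    # Both passes live in one function and run in sequence.
    move_into_previous_span(doc, 'span', 'fn')  # byline first
    move_into_previous_span(doc, 'img')         # then the image
    return doc


doc = preprocess(minidom.parseString(SAMPLE))
updated = doc.getElementsByTagName('span')[0]
print([child.nodeName for child in updated.childNodes])
```

In this made-up fragment the order of the passes matters: once the byline has been moved, the date span becomes the image's previous sibling, so the second pass finds a span to attach to.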
Old 03-12-2011, 02:00 PM   #20
clintiepoo
Quote:
Originally Posted by Starson17

Post your code. It should have worked.
Here are my tags. I'm working on the img and the fn.

Code:
    keep_only_tags = [ 
                        dict(name='h1'),
                        dict(name='span', attrs={'class':'updated'}),
                        dict(name='span', attrs={'class':'fn'}),
                        dict(name='img', attrs={'id':'img-holder'}),
                        dict(name='span', attrs={'id':'gallery-cutline'}),
                        dict(name='div', attrs={'id':'blox-story-text'})

                     ]
These tags are in order, so the previous sibling thing gets a little more confusing. I was trying to insert the fn, then the image. The fn tag works, but the image gets lost.

Code:
    def preprocess_html(self,soup):
#        print 'the soup is: ', soup
        for fn_tag in soup.findAll("span", {"class" : "fn"}):
            previousSibling_tag = fn_tag.previousSibling
            if previousSibling_tag.name == 'span':
                new_tag = Tag(soup,'p')
                new_tag.insert(0,fn_tag)
                previousSibling_tag.insert(1,new_tag)
        for img_tag in soup.findAll('img'):
            previousSibling_tag = img_tag.previousSibling
            if previousSibling_tag.name == 'span':
                new_tag = Tag(soup,'p')
                new_tag.insert(0,img_tag)
                previousSibling_tag.insert(2,new_tag)                
                
                
        return soup
Old 03-12-2011, 08:51 PM   #21
Starson17
Quote:
Originally Posted by clintiepoo
These tags are in order, so the previous sibling thing gets a little more confusing.
I'm not sure what you mean by the "tags are in order." Keeping tags causes them to be rearranged at the output by the order you keep them. preprocess_html works on the input.

Quote:
I was trying to insert the fn, then the image. The fn tag works, but the image gets lost.
Just put some print statements in to find out what the right order is. Put one in before your insert and one after, to check that the tag was inserted where you wanted. I suspect span isn't the previous sibling. You need to find out what it is.
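
As a sketch of that kind of check, the snippet below prints what the previous sibling of an image really is. It is only an illustration: the fragment in PRETTY is made up, xml.dom.minidom stands in for calibre's BeautifulSoup, and describe and previous_element are hypothetical helpers. One possible cause, not confirmed for this site, is a whitespace text node sitting between the two tags.

```python
from xml.dom import minidom

# Hypothetical fragment with line breaks between the tags.
PRETTY = """<body>
  <span class="updated">March 12, 2011</span>
  <img src="photo.jpg"/>
</body>"""


def describe(node):
    """Return a short label for a node, suitable for a debug print."""
    if node is None:
        return 'None'
    if node.nodeType == minidom.Node.TEXT_NODE:
        return 'text %r' % node.data
    return 'tag <%s>' % node.tagName


def previous_element(node):
    """Walk back over text nodes to the nearest preceding element, if any."""
    prev = node.previousSibling
    while prev is not None and prev.nodeType != minidom.Node.ELEMENT_NODE:
        prev = prev.previousSibling
    return prev


doc = minidom.parseString(PRETTY)
img = doc.getElementsByTagName('img')[0]
print('immediate previous sibling:', describe(img.previousSibling))
print('previous element sibling:  ', describe(previous_element(img)))
```

If the first line reports a text node rather than a span, a check such as `previousSibling_tag.name == 'span'` would never match, and skipping over text nodes is one way around it.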
Old 03-13-2011, 01:10 AM   #22
clintiepoo
Edit: actually what I had worked beautifully. Here's the final preprocess code.

Code:
    def preprocess_html(self,soup):
#        print 'the soup is: ', soup
        for fn_tag in soup.findAll("span", {"class" : "fn"}):
            previousSibling_tag = fn_tag.previousSibling
            if previousSibling_tag.name == 'span':
                new_tag = Tag(soup,'p')
                new_tag.insert(0,fn_tag)
                previousSibling_tag.insert(1,new_tag)
        for img_tag in soup.findAll('img'):
            previousSibling_tag = img_tag.previousSibling
#            print 'img previoussibling is: ', previousSibling_tag
#            print 'previousSibling_tag.name is: ', previousSibling_tag.name
            if previousSibling_tag.name == 'span':
                new_tag = Tag(soup,'p')
#                print 'new_tag is: ', new_tag
                new_tag.insert(0,img_tag)
#                print 'new_tag is, after insert: ', new_tag                
                previousSibling_tag.insert(2,new_tag)                
#                print 'img previoussibling is after insert: ', previousSibling_tag
                
                
        return soup
Thank you so much for your help!

Last edited by clintiepoo; 03-13-2011 at 01:15 AM.
Old 03-13-2011, 11:01 AM   #23
Starson17
Quote:
Originally Posted by clintiepoo

Thank you so much for your help!
You're very welcome. I'm glad you got it working.