maya recipe

marbs · 09-28-2010, 03:57 PM

this one is hard, and it is Hebrew only.

i want to do this step by step, so i understand what i am doing.
i want to create a recipe for this page and similar pages.

if you go in to the page, you will see a list of articles (right hand side of the page). the actual link to the article is the 2nd link in each pair. i have recognized that that all the relevant links (and only them) have an id="SubjectHref*" (the * represents some numbers).

the URLs i want to get in stage one is 'http://maya.tase.co.il/' + [the href from tag <a> with id SubjectHref*].
i have then need to do the same in the next page (see the bottom of the page)

this is the code i have so far and i am a little lost now. its built on the NZ herald recipe. can some one tell me if this is the right way?

Spoiler:

Code:

from calibre.web.feeds.recipes import BasicNewsRecipe

class NewZealandHerald(BasicNewsRecipe):

    title       = 'maya recipe'
    __author__  = 'marbs'
    description = 'Daily news'
    timefmt = ' [%d %b, %Y]'
    language = '_Hebrew'

#    no_stylesheets = True
#     remove_tags_before = dict(name='div', attrs={'class':'contentContainer left eight'})
   # remove_tags_after  = dict(name='div', attrs={'class':'callToAction'})
   # remove_tags = [
   #    dict(name='iframe'),
   #    dict(name='div', attrs={'class':['sectionHeader', 'tools','callToAction', 'contentContainer right two nopad relatedColumn']}),
       #dict(name='div', attrs={'id':['shareContainer']}),
       #dict(name='form', attrs={'onsubmit':"return verifySearch(this.w,'Keyword, citation, or #author')"}),
       #dict(name='table', attrs={'cellspacing':'0'}),
#    ]

#    def preprocess_html(self, soup):
#        table = soup.find('table')
#        if table is not None:
#            table.extract()
#        return soup

    #TO GET ARTICLES IN SECTION
    def maya_parse_section(self, url):
            soup = self.index_to_soup(url)
            div = soup.find(attrs={'id':('SubjectHref'+"*")})
            date = div.find(attrs={'href'})

            current_articles = []
            for x in date.findAllNext(attrs = {'id': ('SubjectHref'+"*")}):
                if x = 30: break
                for li in x.findAll('li'):
                    a = li.find('a', href=True)
                    if a is None:
                        continue
                    title = self.tag_to_string(a)
                    url = a.get('href', False)
                    if not url or not title:
                        continue
                    if url.startswith('/'):
                         url = 'http://www.nzherald.co.nz'+url
                    self.log('\t\tFound article:', title)
                    self.log('\t\t\t', url)
                    current_articles.append({'title': title, 'url':url,
                        'description':'', 'date':''})

            return current_articles


    # To GET SECTIONS
    def parse_index(self):
            feeds = ['example feed', 'http://maya.tase.co.il/bursa/index.asp?view=search&company_group=3000&arg_comp=&srh_comp_lb=1007&srh_from=2010-01-01&srh_until=2010-09-28&srh_anaf=-1&srh_event=9999&is_urgent=0&srh_company_press=']

               articles = self.maya_parse_section(url)
               if articles:
                   feeds.append((title, articles))
            return feeds

TonytheBookworm · 09-28-2010, 04:23 PM

On first look at that thing, why not do something in the area of this:
You said you wanted the second link
for example: http://maya.tase.co.il/bursa/report....port_cd=570152

it always has report_cd in it so why not just follow it with a regex match ?

Spoiler:

or maybe use something like this:

Spoiler:

marbs · 09-30-2010, 03:56 PM

the reason i didn't use regex to follow the link is because i haven't wrapped my head around it yet. i don't fully understand the concept. i tryed running your code.
when i used the 1st one i got raw HTML from the feed page.
when i used the 2nd code i got "NameError: global name 're' is not defined"
ill have to read it a bit more (after a good nights sleep.)

i am going to work on it some more....

TonytheBookworm · 09-30-2010, 04:09 PM

Quote:

Originally Posted by marbs

the reason i didn't use regex to follow the link is because i haven't wrapped my head around it yet. i don't fully understand the concept. i tryed running your code.
when i used the 1st one i got raw HTML from the feed page.
when i used the 2nd code i got "NameError: global name 're' is not defined"
ill have to read it a bit more (after a good nights sleep.)

i am going to work on it some more....

you have to import re

the second set of code works. I tested it.
It is up to you to clean it up and get what you want and get rid of what you don't.
but as far as getting the link you wanted here is what i did.

Spoiler:

Code:

from calibre.web.feeds.news import BasicNewsRecipe
from calibre.ebooks.BeautifulSoup import BeautifulSoup, re
class AlisonB(BasicNewsRecipe):
    title      = 'blah'
    __author__ = 'Tonythebookworm'
    description = 'blah'
    language = 'en'
    no_stylesheets = True
    publisher           = 'Tonythebookworm'
    category            = 'column'
    use_embedded_content= False
    no_stylesheets      = True
    oldest_article      = 24
    remove_javascript   = True
    remove_empty_feeds  = True
    
    max_articles_per_feed = 10
    INDEX = '"http://maya.tase.co.il/'
    

    def parse_index(self):
        feeds = []
        for title, url in [
                            (u"Feed", u"http://maya.tase.co.il/bursa/index.asp?view=search&company_group=3000&arg_comp=&srh_comp_lb=1007&srh_from=2010-01-01&srh_until=2010-09-28&srh_anaf=-1&srh_event=9999&is_urgent=0&srh_company_press="),
                            
                            
                             ]:
            articles = self.make_links(url)
            if articles:
                feeds.append((title, articles))
        return feeds
        
    def make_links(self, url):
        title = 'Temp'
        current_articles = []
        soup = self.index_to_soup(url)
        print 'The soup is: ', soup
        for item in soup.findAll('a',attrs={'class':'A3'}):
            print 'item is: ',item
            #link = item.find('a')
            #titlecheck = self.tag_to_string(link)
            #url_test = re.search('javascript', item['href'])
           
            if not re.search('javascript', item['href']):
              print 'FOUND GOOD URL'
              url         = self.INDEX + item['href']
              print 'url is: ', url
              title       = self.tag_to_string(item)
              print 'title is: ', title
            current_articles.append({'title': title, 'url': url, 'description':'', 'date':''}) # append all this
            
           
        return current_articles

marbs · 10-01-2010, 03:12 AM

and you are right, i works.
so i wanted to take it to the next step. on the urls that you found, there is the clean version of the reports i am trying to get.
it is the "src" attr from the iframe tag (in some cases, i want to do this step by step).
so i added a sub function. i gave it all the information in needs to do what you did.

Spoiler:

Code:

from calibre.web.feeds.news import BasicNewsRecipe
from calibre.ebooks.BeautifulSoup import BeautifulSoup, re
class AlisonB(BasicNewsRecipe):
    title      = 'blah'
    __author__ = 'Tonythebookworm'
    description = 'blah'
    language = 'en'
    no_stylesheets = True
    publisher           = 'Tonythebookworm'
    category            = 'column'
    use_embedded_content= False
    no_stylesheets      = True
    oldest_article      = 24
    remove_javascript   = True
    remove_empty_feeds  = True
    
    max_articles_per_feed = 10
    INDEX = '"http://maya.tase.co.il/'

    def make_links1(self, url, title, description, date):
        title = 'Temp1'
        current_articles1 = []
        soup = self.index_to_soup(url)
        for item in soup.findAll('iframe'):
             print 'FOUND GOOD URL'
             url         =  item['src']
             print 'url is: ', url

        current_articles1.append({'title': title, 'url': url, 'description':'', 'date':''}) # append all this           
        return current_articles1
    

    def parse_index(self):
        feeds = []
        for title, url in [
                            (u"Feed", u"http://maya.tase.co.il/bursa/index.asp?view=search&company_group=3000&arg_comp=&srh_comp_lb=1007&srh_from=2010-01-01&srh_until=2010-09-28&srh_anaf=-1&srh_event=9999&is_urgent=0&srh_company_press="),
                            
                            
                             ]:
            articles = self.make_links(url)
            if articles:
                feeds.append((title, articles))
        return feeds
        
    def make_links(self, url):
        title = 'Temp'
        current_articles = []
        soup = self.index_to_soup(url)
        print 'The soup is: ', soup
        for item in soup.findAll('a',attrs={'class':'A3'}):
            print 'item is: ',item
            #link = item.find('a')
            #titlecheck = self.tag_to_string(link)
            #url_test = re.search('javascript', item['href'])
           
            if not re.search('javascript', item['href']):
               title       = self.tag_to_string(item)
               print 'title is: ', title
               current_articles=make_links1(url, title, description, date)
  
        return current_articles

when i run it, i get "NameError: global name 'make_links1' is not defined"
it looks right to me, i have no idea what i did wrong.

TonytheBookworm · 10-01-2010, 03:14 PM

make_links is a built in function

so of course make_links1 is not valid

In other words put the new stuff in with the old stuff because there is no point in reinventing the wheel.

I don't have the time to debug your code but basically do a for loop to find some stuff and then do whatever you need to with it. then do another for loop to find other stuff. and append it to the article list like i showed you. You can even take and rename title to temp1 after the first for loop if you like.

marbs · 10-02-2010, 01:27 PM

i changed it to fit. mu 2nd call of soup it not opening the url (temp2) and souping it(the html file that the url leeds to). it is just souping the url it self.what am i doing wrong?

Spoiler:

Code:

from calibre.web.feeds.news import BasicNewsRecipe
from calibre.ebooks.BeautifulSoup import BeautifulSoup, re
class AlisonB(BasicNewsRecipe):
    title      = 'blah'
    __author__ = 'Tonythebookworm'
    description = 'blah'
    language = 'en'
    no_stylesheets = True
    publisher           = 'Tonythebookworm'
    category            = 'column'
    use_embedded_content= False
    no_stylesheets      = True
    oldest_article      = 24
    remove_javascript   = True
    remove_empty_feeds  = True
    
    max_articles_per_feed = 10
    INDEX = '"http://maya.tase.co.il/'
    

    def parse_index(self):
        feeds = []
        for title, url in [
                            (u"Feed", u"http://maya.tase.co.il/bursa/index.asp?view=search&company_group=3000&arg_comp=&srh_comp_lb=1007&srh_from=2010-01-01&srh_until=2010-09-28&srh_anaf=-1&srh_event=9999&is_urgent=0&srh_company_press="),
                            
                            
                             ]:
            articles = self.make_links(url)
            if articles:
                feeds.append((title, articles))
        return feeds
        
    def make_links(self, url):
        title = 'Temp'
        current_articles = []
        soup = self.index_to_soup(url)
        print 'The soup is: ', soup
        for item in soup.findAll('a',attrs={'class':'A3'}):
            print 'item is: ',item
            #link = item.find('a')
            #titlecheck = self.tag_to_string(link)
            #url_test = re.search('javascript', item['href'])
           
            if not re.search('javascript', item['href']):
              temp2= self.INDEX + item['href']
              print 'url1 is', temp2
              soup1 = self.index_to_soup(temp2)
              print 'the new soup is', temp2
              print '6714' 
              for item1 in soup1.findAll('iframe'):
                 print 'item1 is:' , item1
                 print 'FOUND GOOD URL'
                 url = item1['src']
                 print 'url is: ', url
                 title       = self.tag_to_string(item)
                 print 'title is: ', title
                 current_articles.append({'title': title, 'url': url, 'description':'', 'date':''}) # append all this
            
           
        return current_articles

marbs · 10-07-2010, 04:19 PM

can i soup the url that i found in the 1st soup

Starson17 · 10-07-2010, 09:42 PM

Quote:

Originally Posted by marbs

can i soup the url that i found in the 1st soup

Yes..

marbs · 10-08-2010, 02:08 AM

but i need some help fixing it.
i marked it HERE in the code. i want to format the url as u"www....com" i am giving it a simple string. i tried ' and " and []. still cant get the syntax right. can i get some help with that?

Spoiler:

Code:

from calibre.web.feeds.news import BasicNewsRecipe
from calibre.ebooks.BeautifulSoup import BeautifulSoup, re
class AlisonB(BasicNewsRecipe):
    title      = 'blah'
    __author__ = 'Tonythebookworm'
    description = 'blah'
    language = 'en'
    no_stylesheets = True
    publisher           = 'Tonythebookworm'
    category            = 'column'
    use_embedded_content= False
    no_stylesheets      = True
    oldest_article      = 24
    remove_javascript   = True
    remove_empty_feeds  = True
    
    max_articles_per_feed = 10
    INDEX = '"http://maya.tase.co.il/'
    

    def parse_index(self):
        feeds = []
        for title, url in [
                            (u"Feed", u"http://maya.tase.co.il/bursa/index.asp?view=search&company_group=3000&arg_comp=&srh_comp_lb=1007&srh_from=2010-01-01&srh_until=2010-09-28&srh_anaf=-1&srh_event=9999&is_urgent=0&srh_company_press="),
                            
                            
                             ]:
            articles = self.make_links(url)
            if articles:
                feeds.append((title, articles))
        return feeds
        
    def make_links(self, url):
        title = 'Temp'
        current_articles = []
        soup = self.index_to_soup(url)
        print 'url is', url

        print 'The soup is: ', soup
        for item in soup.findAll('a',attrs={'class':'A3'}):
            print 'item is: ',item
            #link = item.find('a')
            #titlecheck = self.tag_to_string(link)
            #url_test = re.search('javascript', item['href'])
           
            if not re.search('javascript', item['href']):
              temp3= self.INDEX + item['href']                                #HERE
             # temp2=[temp3]
              print 'url1 is', temp2
              soup1 = self.index_to_soup(temp3)                            #AND HERE
              print 'the new soup is', temp2
              print '6714' 
              for item1 in soup1.findAll('iframe'):
                 print 'item1 is:' , item1
                 print 'FOUND GOOD URL'
                 url = item1['src']
                 print 'url is: ', url
                 title       = self.tag_to_string(item)
                 print 'title is: ', title
                 current_articles.append({'title': title, 'url': url, 'description':'', 'date':''}) # append all this
            
           
        return current_articles

Starson17 · 10-08-2010, 09:13 AM

Quote:

Originally Posted by marbs

but i need some help fixing it.
i marked it HERE in the code. i want to format the url as u"www....com" i am giving it a simple string. i tried ' and " and []. still cant get the syntax right. can i get some help with that?

It looks correct to me, except:

Quote:

Code:

    INDEX = '"http://maya.tase.co.il/'

This has an extra leading quote - it should be

Quote:

Code:

    INDEX = 'http://maya.tase.co.il/'

And: don't you want 'the new soup is', soup1 instead of 'the new soup is', temp2
And: I don't see any iframes in soup1?

Quote:

Spoiler:

marbs · 10-09-2010, 05:12 PM

so thanks starson17. it now downloads my articles (there is still a lot of work, but i get news ant the end, and not an error).

i didnt think of this when i started, but and calibre deal with pdf files?
some of the reports come in pdf form. i get gibrish where the pdf use to be. can i do anything about it? does it matter if my output format is pdf?

this is the code:

Spoiler:

Code:

from calibre.web.feeds.news import BasicNewsRecipe
from calibre.ebooks.BeautifulSoup import BeautifulSoup, re
class AlisonB(BasicNewsRecipe):
    title      = 'blah'
    __author__ = 'Tonythebookworm'
    description = 'blah'
    language = 'en'
    no_stylesheets = True
    publisher           = 'Tonythebookworm'
    category            = 'column'
    use_embedded_content= False
    no_stylesheets      = True
    oldest_article      = 24
    remove_javascript   = True
    remove_empty_feeds  = True
    
    max_articles_per_feed = 10
    INDEX = 'http://maya.tase.co.il/'
    

    def parse_index(self):
        feeds = []
        for title, url in [
                            (u"Feed", u"http://maya.tase.co.il/bursa/index.asp?view=search&company_group=3000&arg_comp=&srh_comp_lb=1007&srh_from=2010-01-01&srh_until=2010-09-28&srh_anaf=-1&srh_event=9999&is_urgent=0&srh_company_press="),
                            
                            
                             ]:
            articles = self.make_links(url)
            if articles:
                feeds.append((title, articles))
        return feeds
        
    def make_links(self, url):
        title = 'Temp'
        current_articles = []
        soup = self.index_to_soup(url)
        print 'url is', url

#        print 'The soup is: ', soup
        for item in soup.findAll('a',attrs={'class':'A3'}):
            print 'item is: ',item
            #link = item.find('a')
            #titlecheck = self.tag_to_string(link)
            #url_test = re.search('javascript', item['href'])
           
            if not re.search('javascript', item['href']):
              temp2= self.INDEX + 'bursa/' + item['href']
        #      temp2=[temp3]
              print 'url1 is', temp2
              soup1 = self.index_to_soup(temp2)
  #            print 'the new soup is', soup1
              print '6714' 
              for item1 in soup1.findAll('iframe'):
                 print 'item1 is:' , item1
                 print 'FOUND GOOD URL'
                 url = item1['src']
                 print 'url is: ', url
                 title       = self.tag_to_string(item)
                 print 'title is: ', title
                 current_articles.append({'title': title, 'url': url, 'description':'', 'date':''}) # append all this
            
           
        return current_articles

the 2nd article is a pdf file. (i am working with a feed that is very rarely updated, so i know the page format very well.)
can i import a library that deals with pdf?
thanks for the help.

ps
i also wanted to know if you can add an output file type to the recipe it self that will override the default for calibre (if the default is pdf, but i want one self built recipe to come out as epub?)

Starson17 · 10-09-2010, 09:34 PM

Quote:

Originally Posted by marbs

i didnt think of this when i started, but and calibre deal with pdf files?

I've never seen it done with a recipe.

Quote:

ps
i also wanted to know if you can add an output file type to the recipe it self that will override the default for calibre (if the default is pdf, but i want one self built recipe to come out as epub?)

Again, I'm afraid the answer is no. You could put together a script and do it externally with ebook-convert by specifying the desired format.

TonytheBookworm · 10-09-2010, 10:01 PM

Quote:

Originally Posted by Starson17

Again, I'm afraid the answer is no. You could put together a script and do it externally with ebook-convert by specifying the desired format.

I wish there was more documentation or a way to view the actual source to see what options are available under the the conversion_options function. Not trying to go against what your saying but based on this from the api documentation I'm led to believe that you can over-ride the conversion defaults in code/recipe.

Spoiler:

I'm just not sure what the actual variable name is. Maybe it is 'output_format': epub or something like that. Kovid can you chime in on this one please ?

kovidgoyal · 10-11-2010, 03:19 PM

you cannot override the output format from within a recipe.

Trying to extract text from PDFs is not going to be easy. Just try converting your PDF in calibre to see what will happen.

09-30-2010, 03:56 PM	#3
marbs Zealot Posts: 122 Karma: 10 Join Date: Jul 2010 Device: nook	Hey Tony! the reason i didn't use regex to follow the link is because i haven't wrapped my head around it yet. i don't fully understand the concept. i tryed running your code. when i used the 1st one i got raw HTML from the feed page. when i used the 2nd code i got "NameError: global name 're' is not defined" ill have to read it a bit more (after a good nights sleep.) i am going to work on it some more....

10-01-2010, 03:14 PM	#6
TonytheBookworm Addict Posts: 264 Karma: 62 Join Date: May 2010 Device: kindle 2, kindle 3, Kindle fire	make_links is a built in function so of course make_links1 is not valid In other words put the new stuff in with the old stuff because there is no point in reinventing the wheel. I don't have the time to debug your code but basically do a for loop to find some stuff and then do whatever you need to with it. then do another for loop to find other stuff. and append it to the article list like i showed you. You can even take and rename title to temp1 after the first for loop if you like. Last edited by TonytheBookworm; 10-01-2010 at 03:24 PM.

10-07-2010, 04:19 PM	#8
marbs Zealot Posts: 122 Karma: 10 Join Date: Jul 2010 Device: nook	is it possible to soup twice? can i soup the url that i found in the 1st soup

Similar Threads
Thread	Thread Starter	Forum	Replies	Last Post
New recipe	kiklop74	Recipes	0	10-05-2010 04:41 PM
New recipe	kiklop74	Recipes	0	10-01-2010 02:42 PM
New Title from Book View Cafe: A Princess of Passyunk by Maya Kaathryn Bohnhoff	suelange	Self-Promotions by Authors and Publishers	0	08-11-2010 04:35 PM
Recipe Help	lrain5	Calibre	3	05-09-2010 10:42 PM
Recipe Help Please	estral	Calibre	1	06-11-2009 02:35 PM

10-11-2010, 03:19 PM	#15
kovidgoyal creator of calibre Posts: 43,866 Karma: 22666666 Join Date: Oct 2006 Location: Mumbai, India Device: Various	you cannot override the output format from within a recipe. Trying to extract text from PDFs is not going to be easy. Just try converting your PDF in calibre to see what will happen.

Advert

Advert