maya recipe - Page 2

Starson17 · 10-11-2010, 03:34 PM

Quote:

Originally Posted by TonytheBookworm

I wish there was more documentation or a way to view the actual source to see what options are available under the conversion_options function. Not trying to go against what you're saying, but based on this from the API documentation I'm led to believe that you can override the conversion defaults in code/recipe.

I waited for Kovid to confirm. I've seen the answer to this before, but I didn't want to say the same thing twice. As for seeing the conversion options, I've seen them in the source.

Here's the list in my notes:

Code:

                ['change_justification', 'extra_css', 'base_font_size',
                    'font_size_mapping', 'line_height',
                    'linearize_tables', 'smarten_punctuation',
                    'disable_font_rescaling', 'insert_blank_line',
                    'remove_paragraph_spacing', 'remove_paragraph_spacing_indent_size','input_encoding',
                    'asciiize', 'keep_ligatures']

marbs · 10-11-2010, 03:43 PM

Quote:

Originally Posted by kovidgoyal

you cannot override the output format from within a recipe.

Trying to extract text from PDFs is not going to be easy. Just try converting your PDF in calibre to see what will happen.

Converting in calibre, meaning adding the recipe to my custom recipes and getting news? I did that; the PDF articles come up as gibberish, and very long gibberish at that.

any other ideas?

BTW, thank you very much for the help Kovid.

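Some silly questions that I don't want sidetracking my maya recipe.

Spoiler:

Starson17 (or anyone who knows), what does 'linearize_tables' do and how do you use it?
As a matter of fact, what do all of them do and how do you use them?
And if I am asking silly questions anyway, is there a way to add the description from the RSS feed to the article as a header?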

kovidgoyal · 10-11-2010, 03:55 PM

You're out of luck if your publication is only available as PDF, I'm afraid.

Starson17 · 10-11-2010, 04:18 PM

Quote:

Originally Posted by marbs

Some silly questions that I don't want sidetracking my maya recipe.
Starson17 (or anyone who knows), what does 'linearize_tables' do and how do you use it?
As a matter of fact, what do all of them do and how do you use them?
And if I am asking silly questions anyway, is there a way to add the description from the RSS feed to the article as a header?

You are free to start another thread if you want the question kept separate, but since you didn't ...

Mostly, linearize_tables replaces <table>, <tr> and <td> tags with <div>. Instead of a table, you get a single column. It's handled better by small screen devices. Most of the other options are available on the Conversion screen. You add them to the recipe as follows:
conversion_options = {'linearize_tables':True}
You can add additional options separated by commas.
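
As a rough illustration of "additional options separated by commas", here is a minimal sketch. The class name MayaSketch is made up, and a plain class stands in for BasicNewsRecipe so the snippet runs without calibre (it is written for Python 3, unlike the Python 2 recipes in this thread). The option names are taken from the list posted earlier in the thread; whether a given option does what you want should be checked against the Conversion screen.

```python
# Hypothetical stand-in: in a real recipe this class would subclass
# calibre's BasicNewsRecipe instead of object.
class MayaSketch(object):
    title = 'Maya'

    # Several conversion overrides in one dictionary, separated by commas.
    # Both keys come from the option list quoted earlier in the thread.
    conversion_options = {
        'linearize_tables': True,
        'remove_paragraph_spacing': True,
    }


print(sorted(MayaSketch.conversion_options))
```

The dictionary is a class attribute, so it sits next to title, language and the other recipe settings rather than inside a method.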

marbs · 10-11-2010, 04:39 PM

Quote:

Originally Posted by kovidgoyal

You're out of luck if your publication is only available as PDF, I'm afraid.

I am willing to work hard and rewrite whatever I have to in order to make this work. Can I include a Python library? There must be something that can be done.

kovidgoyal · 10-11-2010, 04:42 PM

PDF is not a format that will convert well, that's just the way it is. Don't think of PDF as an ebook format, think of it as a printed page. Now try to imagine writing an algorithm to convert a printed page to an ebook (that is essentially what all PDF conversion algorithms do).

marbs · 10-11-2010, 05:02 PM

But if my output is PDF in any case, and I think I read somewhere that calibre converts all the articles and then merges them (I think I saw something like that in the log file from the recipe), then why do anything? All I need is something that can merge the PDF files (the HTML articles that were converted and the PDF file) in order. Maybe?

kovidgoyal · 10-11-2010, 05:37 PM

Because calibre's recipe system (and calibre's conversion system) work by first converting the input to HTML.

marbs · 10-12-2010, 04:59 AM

I understand. I'll try to get around it. Thanks Kovid.

So I found this web site:
http://www.pdfdownload.org/free-pdf-to-html.aspx
It converts PDF to pictures, page by page.
You can do it without the form like this:

http://www.pdfdownload.org/pdf2html/pdf2html.php?url=your url here&images=yes

I wrote code to see how this works out. It doesn't.
Any bright thoughts as to why I don't get my new HTML version of my articles?

Spoiler:

Code:

from calibre.web.feeds.news import BasicNewsRecipe
from calibre.ebooks.BeautifulSoup import BeautifulSoup, re
class AlisonB(BasicNewsRecipe):
    title      = 'blah'
    __author__ = 'Tonythebookworm'
    description = 'blah'
    language = 'en'
    no_stylesheets = True
    publisher           = 'Tonythebookworm'
    simultaneous_downloads = 1
    delay                  = 25   
    category            = 'column'
    extra_css='body{direction: rtl;} .article_description{direction: rtl; } a.article{direction: rtl; } .calibre_feed_description{direction: rtl; }'
#    no_stylesheets = True
    use_embedded_content= False
    remove_attributes = ['width','height']
    no_stylesheets      = True
    oldest_article      = 24
    remove_javascript   = True
    remove_empty_feeds  = True
    
    max_articles_per_feed = 5000
    INDEX = 'http://maya.tase.co.il/'
    

    def parse_index(self):
        feeds = []
        for title, url in [
                            (u"Feed", u"http://maya.tase.co.il/bursa/index.asp?view=search&company_group=3000&arg_comp=&srh_comp_lb=1007&srh_from=2010-01-01&srh_until=2010-09-28&srh_anaf=-1&srh_event=9999&is_urgent=0&srh_company_press="),
 #                           (u"Feed2", u"http://maya.tase.co.il/bursa/index.asp?view=yesterday"),                            
                            
                             ]:
            articles = self.make_links(url)
            if articles:
                feeds.append((title, articles))
        return feeds
        
    def make_links(self, url):
        title = 'Temp'
        current_articles = []
        soup = self.index_to_soup(url)
        print 'url is', url

#        print 'The soup is: ', soup
        for item in soup.findAll('a',attrs={'class':'A3'}):
            print 'item is: ',item
            #link = item.find('a')
            #titlecheck = self.tag_to_string(link)
            #url_test = re.search('javascript', item['href'])
           
            if not re.search('javascript', item['href']):
              temp2= self.INDEX + 'bursa/' + item['href']
        #      temp2=[temp3]
              print 'url1 is', temp2
              soup1 = self.index_to_soup(temp2)
  #            print 'the new soup is', soup1
              print '6714' 
              for item1 in soup1.findAll('iframe'):
                 print 'item1 is:' , item1
                 txt= item1['src']
                 print 'FOUND GOOD URL'
                 re1='.*?'	# Non-greedy match on filler
                 re2='(mayafiles)'	# Variable Name 1
                 re3='(.)'	# Any Single Character 1
                 re4='.*?'	# Non-greedy match on filler
                 re5='htm'	# Uninteresting: word
                 re6='.*?'	# Non-greedy match on filler
                 re7='(htm)'	# Word 1
                 
                 rg = re.compile(re1+re2+re3+re4+re5+re6+re7,re.IGNORECASE|re.DOTALL)
                 m = rg.search(txt)
                 if m:
                     var1=m.group(1)
                     c1=m.group(2)
                     word1=m.group(3)
                     print "("+var1+")"+"("+c1+")"+"("+word1+")"+"\n"
                     url = item1['src']
                 else:
                     url = 'http://www.pdfdownload.org/pdf2html/pdf2html.php?url=' + item1['src'] + '&images=yes'

                 print 'url is: ', url
                 title       = self.tag_to_string(item)
                 print 'title is: ', title
                 current_articles.append({'title': title, 'url': url, 'description':'', 'date':''}) # append all this
            
           
        return current_articles
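
One thing that could be checked here, offered as a guess that the thread never confirms: the recipe appends item1['src'] to the converter address unescaped, so any "?" or "&" inside the PDF address may be read as part of the outer query string. A minimal Python 3 sketch of escaping the inner address follows; wrap_pdf_url and the example.com address are hypothetical, and whether the pdfdownload.org service expects an escaped address is an assumption.

```python
from urllib.parse import quote

# Converter address taken from the post above.
SERVICE = 'http://www.pdfdownload.org/pdf2html/pdf2html.php'


def wrap_pdf_url(pdf_url):
    """Build the converter address with the PDF address percent-encoded.

    safe='' makes quote() escape '/', ':', '?' and '&' as well, so the
    inner address cannot be confused with the outer query string.
    """
    return SERVICE + '?url=' + quote(pdf_url, safe='') + '&images=yes'


# Made-up PDF address that carries its own query string.
example = wrap_pdf_url('http://example.com/files/report.pdf?id=1&lang=he')
print(example)
```

It may also be worth printing item1['src'] to see whether it is a relative address, since the converter would need an absolute one.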

marbs · 10-18-2010, 05:01 AM

So I have changed what I want from my recipe. I will try to write the full version in pure Python later, but for now I want to do this as a recipe.

If you take a look here you will see a list of links on the page. I want the article title to be "the 1st link text" - "the 2nd link text".
Right now it is just "the 2nd link text". The id of the 1st link is "CompNmHref"; I just don't know how to do it with the for loop and soup. Is there a "tag before" command in soup? Because we are talking about the tag before the "item" in my code...

Spoiler:

Code:

from calibre.web.feeds.news import BasicNewsRecipe
from calibre.ebooks.BeautifulSoup import BeautifulSoup, re
class AlisonB(BasicNewsRecipe):
    title      ='Maya'
    __author__ = 'marbs'
    description = 'blah'
    language = 'en'
    no_stylesheets = True
    publisher           = 'marbs'
#    simultaneous_downloads = 1
#    delay                  = 25   
    category            = 'column'
    extra_css='body{direction: rtl;} .article_description{direction: rtl; } a.article{direction: rtl; } .calibre_feed_description{direction: rtl; }'
#    no_stylesheets = True
    use_embedded_content= False
    remove_attributes = ['width','height']
    no_stylesheets      = True
    oldest_article      = 24
    remove_javascript   = True
    remove_empty_feeds  = True
    
    max_articles_per_feed = 5000
    INDEX = 'http://maya.tase.co.il/'
    

    def parse_index(self):
        feeds = []
        for title, url in [
                            #(u"Feed", u"http://maya.tase.co.il/bursa/index.asp?view=search&company_group=3000&arg_comp=&srh_comp_lb=1007&srh_from=2010-01-01&srh_until=2010-09-28&srh_anaf=-1&srh_event=9999&is_urgent=0&srh_company_press="),
                            (u"הודעות מאתמול", u"http://maya.tase.co.il/bursa/index.asp?view=yesterday"),                            
                            
                             ]:
            articles = self.make_links(url)
            if articles:
                feeds.append((title, articles))
        return feeds
        
    def make_links(self, url):
        title = 'Temp'
        current_articles = []
        soup = self.index_to_soup(url)
        print 'url is', url

#        print 'The soup is: ', soup
        for item in soup.findAll('a',attrs={'class':'A3'}):
            print 'item is: ',item
            #link = item.find('a')
            #titlecheck = self.tag_to_string(link)
            #url_test = re.search('javascript', item['href'])
           
            if not re.search('javascript', item['href']):
              temp2= self.INDEX + 'bursa/' + item['href']
        #      temp2=[temp3]
              print 'url1 is', temp2
              soup1 = self.index_to_soup(temp2)
  #            print 'the new soup is', soup1
              print '6714' 
              for item1 in soup1.findAll('iframe'):
                 print 'item1 is:' , item1
                 txt= item1['src']
                 print 'FOUND GOOD URL'
                 re1='.*?'	# Non-greedy match on filler
                 re2='(mayafiles)'	# Variable Name 1
                 re3='(.)'	# Any Single Character 1
                 re4='.*?'	# Non-greedy match on filler
                 re5='htm'	# Uninteresting: word
                 re6='.*?'	# Non-greedy match on filler
                 re7='(htm)'	# Word 1
                 
                 rg = re.compile(re1+re2+re3+re4+re5+re6+re7,re.IGNORECASE|re.DOTALL)
                 m = rg.search(txt)
                 if m:
                     var1=m.group(1)
                     c1=m.group(2)
                     word1=m.group(3)
                     print "("+var1+")"+"("+c1+")"+"("+word1+")"+"\n"
                     url = item1['src']
                 else:
                     url = 'http://www.pdfdownload.org/pdf2html/pdf2html.php?url=' + item1['src'] + '&images=yes'

                 print 'url is: ', url
                 title       = self.tag_to_string(item)
                 print 'title is: ', title
                 current_articles.append({'title': title, 'url': url, 'description':'', 'date':''}) # append all this
            
           
        return current_articles

Starson17 · 10-18-2010, 10:58 AM

Quote:

Originally Posted by marbs

Is there a "tag before" command in soup? Because we are talking about the tag before the "item" in my code...

There are two: previous and previousSibling.
See here:
http://www.crummy.com/software/Beaut...reviousSibling

marbs · 10-20-2010, 08:36 AM

If I understand correctly (which I don't), I want something like this:
print soup.item.previous.previousSibling
I want to go to the previous <tr> tag and then I want the sibling before that.
Not working.

Starson17 · 10-20-2010, 09:35 AM

Quote:

Originally Posted by marbs

If I understand correctly (which I don't), I want something like this:
print soup.item.previous.previousSibling
I want to go to the previous <tr> tag and then I want the sibling before that.
Not working.

Why not?

That's the question you're asking, and to answer it, you just print the entire soup, or the previous element, or the previous sibling to figure out where you've gone wrong. Be aware that you should look at the soup, and not just the page source. BeautifulSoup loads the page source into its database, and as it does that, it fixes errors and makes other modifications that may not be apparent in the page itself.

marbs · 10-23-2010, 01:55 PM

Got it!

it works now.

Spoiler:

Code:

from calibre.web.feeds.news import BasicNewsRecipe
from calibre.ebooks.BeautifulSoup import BeautifulSoup, re
class AlisonB(BasicNewsRecipe):
    title      ='Maya'
    __author__ = 'marbs'
    description = 'blah'
    language = 'en'
    no_stylesheets = True
    publisher           = 'marbs'
#    simultaneous_downloads = 1
#    delay                  = 25   
    category            = 'column'
    extra_css='body{direction: rtl;} .article_description{direction: rtl; } a.article{direction: rtl; } .calibre_feed_description{direction: rtl; }'
#    no_stylesheets = True
    use_embedded_content= False
    remove_attributes = ['width','height']
    no_stylesheets      = True
    oldest_article      = 24
    remove_javascript   = True
    remove_empty_feeds  = True
    
    max_articles_per_feed = 5000
    INDEX = 'http://maya.tase.co.il/'
    

    def parse_index(self):
        feeds = []
        for title, url in [
                            #(u"Feed", u"http://maya.tase.co.il/bursa/index.asp?view=search&company_group=3000&arg_comp=&srh_comp_lb=1007&srh_from=2010-01-01&srh_until=2010-09-28&srh_anaf=-1&srh_event=9999&is_urgent=0&srh_company_press="),
                            (u"הודעות מאתמול", u"http://maya.tase.co.il/bursa/index.asp?view=yesterday"),                            
                            
                             ]:
            articles = self.make_links(url)
            if articles:
                feeds.append((title, articles))
        return feeds
        
    def make_links(self, url):
        title = 'Temp'
        current_articles = []
        soup = self.index_to_soup(url)
        print 'url is', url

        #print 'The soup is: ', soup
        for item in soup.findAll('a',attrs={'class':'A3'}):
            print 'item is: ',item
            itemcomp= item.findPrevious('a',attrs={'id':'CompNmHref'})
            #link = item.find('a')
            #titlecheck = self.tag_to_string(link)
            #url_test = re.search('javascript', item['href'])
           
            if not re.search('javascript', item['href']):
              temp2= self.INDEX + 'bursa/' + item['href']
        #      temp2=[temp3]
              print 'url1 is', temp2
              soup1 = self.index_to_soup(temp2)
  #            print 'the new soup is', soup1
              print '6714' 
              for item1 in soup1.findAll('iframe'):
                 print 'item1 is:' , item1
 #                print soup.item.previous.previousSibling
                 txt= item1['src']
                 print 'FOUND GOOD URL'
                 re1='.*?'	# Non-greedy match on filler
                 re2='(mayafiles)'	# Variable Name 1
                 re3='(.)'	# Any Single Character 1
                 re4='.*?'	# Non-greedy match on filler
                 re5='htm'	# Uninteresting: word
                 re6='.*?'	# Non-greedy match on filler
                 re7='(htm)'	# Word 1
                 
                 rg = re.compile(re1+re2+re3+re4+re5+re6+re7,re.IGNORECASE|re.DOTALL)
                 m = rg.search(txt)
                 if m:
                     var1=m.group(1)
                     c1=m.group(2)
                     word1=m.group(3)
                     print "("+var1+")"+"("+c1+")"+"("+word1+")"+"\n"
                     url = item1['src']
                 else:
                     url = 'http://www.pdfdownload.org/pdf2html/pdf2html.php?url=' + item1['src'] + '&images=yes'

                 print 'url is: ', url

                 title       = self.tag_to_string(itemcomp)+ ' - ' + self.tag_to_string(item)
                 print 'title is: ', title
                 current_articles.append({'title': title, 'url': url, 'description':'', 'date':''}) # append all this
            
           
        return current_articles
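
The change that makes the titles work in the recipe above is the findPrevious call: for each class "A3" link it looks backwards in the document for the nearest link with id "CompNmHref". The sketch below illustrates that pairing idea using only Python 3's standard html.parser, so it is not BeautifulSoup's implementation. PairLinks, build_titles and the sample markup are made up for illustration; the real page layout is an assumption.

```python
from html.parser import HTMLParser


class PairLinks(HTMLParser):
    """Pair each class="A3" link with the nearest preceding CompNmHref link."""

    def __init__(self):
        super().__init__()
        self.last_company = ''   # text of the most recent CompNmHref link
        self.pairs = []          # (company text, announcement text)
        self._kind = None        # kind of <a> currently open, if any
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag != 'a':
            return
        attrs = dict(attrs)
        if attrs.get('id') == 'CompNmHref':
            self._kind = 'company'
        elif attrs.get('class') == 'A3':
            self._kind = 'item'
        else:
            self._kind = None
        self._text = []

    def handle_data(self, data):
        if self._kind:           # only collect text inside a tracked link
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag != 'a' or not self._kind:
            return
        text = ''.join(self._text).strip()
        if self._kind == 'company':
            self.last_company = text
        else:
            self.pairs.append((self.last_company, text))
        self._kind = None


def build_titles(html):
    parser = PairLinks()
    parser.feed(html)
    parser.close()
    return [company + ' - ' + item for company, item in parser.pairs]


# Made-up markup loosely shaped like a table of company and report links.
SAMPLE = '''
<table>
<tr><td><a id="CompNmHref" href="c1">Acme</a></td></tr>
<tr><td><a class="A3" href="r1.htm">Quarterly report</a></td></tr>
<tr><td><a id="CompNmHref" href="c2">Bolt</a></td></tr>
<tr><td><a class="A3" href="r2.htm">Dividend notice</a></td></tr>
</table>
'''
titles = build_titles(SAMPLE)
print(titles)
```

Unlike previousSibling, this kind of backwards search does not care how many rows or cells sit between the two links, which is why it is less fragile than chaining previous and previousSibling by hand.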

Now I want to build a loop like in the C language. I'll write it in pseudo code:

index = 1
if the article count reaches 30, then post request rsSearchRes_pgNo=index + 1

My instinct says I would do it with recursion, but I am not sure that is wise in Python...

Can you point me in the right direction?

Starson17 · 10-23-2010, 02:17 PM

Quote:

Originally Posted by marbs

it works now.

Congratulations.

Quote:

Now I want to build a loop like in the C language. I'll write it in pseudo code:

index = 1
if the article count reaches 30, then post request rsSearchRes_pgNo=index + 1

Can you flesh this out a bit? You want to count 30 articles and then do what?
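
One possible reading of the pseudo code, which the thread does not settle: a full page holds 30 articles, and when a page comes back full the next page should be requested by raising rsSearchRes_pgNo. Under that assumption a plain while loop does the job without recursion. In the Python 3 sketch below, collect_all, fetch_page and fake_fetch are hypothetical names; in a recipe, fetch_page would wrap the request that sets rsSearchRes_pgNo and parses the result.

```python
PAGE_SIZE = 30  # from the pseudo code: a full page holds 30 articles


def collect_all(fetch_page, max_pages=50):
    """Collect articles page by page with a loop instead of recursion.

    fetch_page is a hypothetical callable: given a 1-based page number it
    returns the list of articles found on that page.
    """
    articles = []
    page_no = 1
    while page_no <= max_pages:      # safety cap so the loop always ends
        page = fetch_page(page_no)
        articles.extend(page)
        if len(page) < PAGE_SIZE:    # a short page means there is no next page
            break
        page_no += 1
    return articles


# Stand-in data source so the sketch runs without network access.
FAKE = ['article %d' % i for i in range(70)]


def fake_fetch(page_no):
    start = (page_no - 1) * PAGE_SIZE
    return FAKE[start:start + PAGE_SIZE]


result = collect_all(fake_fetch)
print(len(result))
```

The max_pages cap is a design choice for safety: if the site ever keeps returning full pages, the loop still stops.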
