maya recipe - Page 5

marbs · 10-28-2010, 02:50 PM

I can parse page 1 no problem; it's the page that comes up automatically.

I can't get to page 2.

This is as far as my code gets:

Spoiler:

marbs · 10-30-2010, 05:15 PM

Cleared my head a bit and I want to dive back in.

Do you have any ideas about why I can't get to page 2?

Starson17 · 11-01-2010, 11:54 AM

Quote:

Originally Posted by marbs

I'll give it a try.
You haven't shared your idea about maya yet.

Switching back to this thread.
You asked if I had any thoughts - I did - I wondered what the maya site was all about! That's what I shared.

OK, yes, I thought about it a bit.

I still think you need to simplify - get your recipe to pull one page - page 2. If you can't do that, you'll get nowhere. We can talk about the simplified recipe issue, but it's less interesting.

Next, let's talk about get_browser. When you initialize with
br = BasicNewsRecipe.get_browser()
You start a browser session that's used from then on. Normally, you go to one or two pages to set up the login, store cookies, get header info for authentication, etc. From then on that browser session is used. If you retrieved the right cookies, set up login, etc. it all works.

You want to do something a bit different. You don't want the same thing every time (an authentication header sent with each request, or cookies from login stored for each request). You want to do a POST that differs for each page. I think you're creating the POST data, but I'm not convinced you've looked at the site closely enough to be sure of how it works at each step. I know I haven't (and don't plan to - sorry - but this site is not of general enough interest for its complexity).
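
As a rough illustration of "a POST that differs for each page", the form body can be built from a page number. This is only a sketch: the field names rsSearchRes_pgNo and ToPage come from the debug log posted in this thread, but whether those are the fields the server actually checks is an assumption, and build_page_post is a hypothetical helper, not part of calibre or mechanize. It is written for Python 3 (the recipe in this thread is Python 2, where the equivalent call is urllib.urlencode).

```python
from urllib.parse import urlencode, parse_qs

def build_page_post(page_no, base_fields):
    """Return a urlencoded POST body for one results page (hypothetical helper)."""
    fields = dict(base_fields)  # copy so the caller's dict is left untouched
    # Assumption: these two fields select the page; confirm with TamperData.
    fields['rsSearchRes_pgNo'] = str(page_no)
    fields['ToPage'] = str(page_no)
    return urlencode(fields)

# Illustrative subset of the form fields seen in the log.
base = {'view': 'search', 'srh_from': '2010-01-01', 'srh_until': '2010-10-28'}
body_page2 = build_page_post(2, base)
body_page3 = build_page_post(3, base)
```

Each body would then be sent as the data of its own request, so nothing about the page number has to live in the browser session itself.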

Basically, I'd be looking more closely at the first interaction inside Firefox. Suppose you clear the cookies and cache, turn on TamperData and request page 2 before you request page 1. Can you get it? If not, can you get it after getting page 1? Is there any requirement for getting any other page first? Any referer requirement? It's very easy to get confused when using Firefox if it collects cookies that you aren't thinking about, or sends referer info, etc. The bottom line is I always make sure I know the whole detailed interaction in Firefox, then reproduce what it did inside the recipe, or reproduce the recipe function inside Firefox until they match and are doing the same thing.

I've never had that fail, but I've often been confused and thought I was seeing the same thing in each, but was wrong. Eventually the difference gets tracked down and the recipe begins doing what I see happening in Firefox.
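
The "make them match" step can be done mechanically by writing down what each side sent and diffing the two. The sketch below assumes you have copied the headers, cookies and form fields from TamperData and from the recipe's debug output into two plain dicts; diff_requests is a hypothetical helper and the sample values are made up for illustration.

```python
def diff_requests(browser_req, recipe_req):
    """Report keys the recipe is missing, adds, or sends with a different value."""
    missing = sorted(k for k in browser_req if k not in recipe_req)
    extra = sorted(k for k in recipe_req if k not in browser_req)
    changed = sorted(k for k in browser_req
                     if k in recipe_req and browser_req[k] != recipe_req[k])
    return {'missing': missing, 'extra': extra, 'changed': changed}

# Made-up example: what Firefox sent versus what the recipe sent.
firefox = {'Referer': 'http://maya.tase.co.il/bursa/index.asp?view=yesterday',
           'Cookie': 'LBMaya=1',
           'ToPage': '2'}
recipe = {'ToPage': '1'}
report = diff_requests(firefox, recipe)
```

Anything listed under missing or changed is a candidate for the difference that still has to be tracked down.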

marbs · 11-01-2010, 05:10 PM

Good answer!

I'll get on it. This next question is a bit off the recipe topic: how do I create a POST request from thin air in TamperData (if I clear the cookies and cache, then turn on TamperData, where will the POST come from)?
Also, page one comes up just by entering the site; how do I skip to page two right away?

And a final question: TamperData has a friend, HTTP something. Should I try to use that?

marbs · 11-02-2010, 05:40 AM

I can't believe I got there. Thank you very much, Starson!

Now I am not sure what to do with all the new pages I can get. How do I finish append_page? I don't see that it returns anything in this example or any of the others. Some more help?

Edit:
It seems like br.follow_link does not actually open a page in the browser. It gets the response, but I don't know how to make br hold the new page. Is there a way to open the link, or read the response somehow as a web page in the browser?

Spoiler:

Code:

from calibre.web.feeds.news import BasicNewsRecipe
from calibre.ebooks.BeautifulSoup import BeautifulSoup, re
import urllib, mechanize
from calibre import __appname__

class AlisonB(BasicNewsRecipe):
    title      ='Maya v1.0'
    __author__ = 'marbs'
    description = 'blah'
    language = 'en'
    no_stylesheets = True
    publisher           = 'marbs'
#    simultaneous_downloads = 1
#    delay                  = 25   
    category            = 'column'
    extra_css='body{direction: rtl;} .article_description{direction: rtl; } a.article{direction: rtl; } .calibre_feed_description{direction: rtl; }'
#    no_stylesheets = True
    use_embedded_content= False
    remove_attributes = ['width','height']
    no_stylesheets      = True
    oldest_article      = 24
    remove_javascript   = True
    remove_empty_feeds  = True
    rec_index = 0
    max_articles_per_feed = 5000
    INDEX = 'http://maya.tase.co.il/'

    def append_page(self, url, soup, appendtag, articles_left, br,position):
   #     print url, 'the soup is:' ,soup, 'appendtag is',appendtag,  articles_left, position
        articles_left = articles_left - 30
        articlenum= articles_left + (30*position)
        position= position +1
        print  articles_left, articlenum
        if (articles_left <0) :
           print 'do i get this far?'
           # Print HTTP headers.
           br.set_debug_http(True)
  #         br.set_debug_http(True)
#           nexturl = br.open(url, request)
           nexturl = br.follow_link(mechanize.Link(base_url = '', url = url, text = '', tag = '', attrs = [{'id':'BTNNEXT'}]))
           print  'this is ok'
           html = nexturl.read()
  #         print 'got this too', html
           soup2 = self.index_to_soup(html)
           print 'this is my real tesst', soup2
           self.append_page(url,soup2,soup2.body,articles_left,br,position)
           texttag =soup2.body
           appendtag.insert(position,texttag)
######start appending id divMoeny

#           soup3=soup + soup2
#           self.append_page(url,soup, soup.body, report3,1)
           
#           texttag = soup2.find('div', attrs={'class':'bodytext'})
#           for it in texttag.findAll(style=True):
#               del it['style']
#           newpos = len(texttag.contents)
#           self.append_page(soup2,texttag,newpos)
#           texttag.extract()
#           appendtag.insert(position,texttag)
#           self.append_page(url,soup, soup.body, report3,1)
#
    

    def parse_index(self):
        feeds = []
        for title, url in [
                       #     (u"too long",u"http://maya.tase.co.il/bursa/index.asp?view=search&company_group=3000&arg_comp=&srh_comp_lb=&srh_from=2010-01-01&srh_until=2010-10-28&srh_anaf=-1&srh_event=9999&is_urgent=0&srh_company_press="),
                        #(u"Feed", u"http://maya.tase.co.il/bursa/index.asp?view=search&company_group=3000&arg_comp=&srh_comp_lb=1007&srh_from=2010-01-01&srh_until=2010-09-28&srh_anaf=-1&srh_event=9999&is_urgent=0&srh_company_press="),
                            (u"הודעות מאתמול", u"http://maya.tase.co.il/bursa/index.asp?view=yesterday"),                            
                            
                             ]:
            articles = self.make_links(url)
            if articles:
                feeds.append((title, articles))
        return feeds
        
    def make_links(self, url):
        title = 'Temp'
        current_articles = []
        soup = self.index_to_soup(url)
        print 'url is', url

#        print 'The soup is: ', soup
        stop = soup.find('td',attrs={'height':'19'})
        print 'the stop is', stop
        report1 = stop.contents[1].contents 
        print report1
        report2 = report1[0]
        print report2
        report3=int(report2.encode('ascii'))
        print report3
        br = BasicNewsRecipe.get_browser(self)
        br.open(url)

        self.append_page(url,soup, soup.body, report3,br,1)
        for item in soup.findAll('a',attrs={'class':'A3'}):
            print 'item is: ',item
            itemcomp= item.findPrevious('a',attrs={'id':'CompNmHref'})
            itemdate= item.findPrevious('font',attrs={'id':'DateHref'})
            #link = item.find('a')
            #titlecheck = self.tag_to_string(link)
            #url_test = re.search('javascript', item['href'])
           
            if not re.search('javascript', item['href']):
              temp2= self.INDEX + 'bursa/' + item['href']
        #      temp2=[temp3]
              print 'url1 is', temp2
              soup1 = self.index_to_soup(temp2)
  #            print 'the new soup is', soup1
              print '6714' 
              for item1 in soup1.findAll('iframe'):
                 print 'item1 is:' , item1
 #                print soup.item.previous.previousSibling
                 txt= item1['src']
                 print 'FOUND GOOD URL'
                 re1='.*?'	# Non-greedy match on filler
                 re2='(mayafiles)'	# Variable Name 1
                 re3='(.)'	# Any Single Character 1
                 re4='.*?'	# Non-greedy match on filler
                 re5='htm'	# Uninteresting: word
                 re6='.*?'	# Non-greedy match on filler
                 re7='(htm)'	# Word 1
                 
                 rg = re.compile(re1+re2+re3+re4+re5+re6+re7,re.IGNORECASE|re.DOTALL)
                 m = rg.search(txt)
                 if m:
                     var1=m.group(1)
                     c1=m.group(2)
                     word1=m.group(3)
                     print "("+var1+")"+"("+c1+")"+"("+word1+")"+"\n"
                     url = item1['src']
                 else:
                     url = 'http://www.pdfdownload.org/pdf2html/pdf2html.php?url=' + item1['src'] + '&images=yes'

                 print 'url is: ', url

                 title       = self.tag_to_string(itemcomp)+ ' - ' + self.tag_to_string(item)
                 print 'title is: ', title
                 date =self.tag_to_string(itemdate) 
                 current_articles.append({'title': title, 'url': url, 'description':'', 'date':date}) # append all this
            
           
        return current_articles

Starson17 · 11-03-2010, 03:41 PM

Quote:

Originally Posted by marbs

I can't believe I got there. Thank you very much, Starson!

Congratulations!

Quote:

Now I am not sure what to do with all the new pages I can get. How do I finish append_page? I don't see that it returns anything in this example or any of the others. Some more help?

append_page does nothing until it is used in preprocess_html as:
self.append_page(soup, soup.body, 3)
It's recursive, and grabs the current page in soup form from the "soup" parameter of the article being processed in preprocess_html. That page will have a "Next Page" button or equivalent. When append_page is correctly written, it creates a new url from the url in the "Next Page" button, grabs the content of that new page, tacks it on to the bottom of the content in the current page, then recursively does it again, finding the "Next Page" button on page 2 to go to page 3, etc.
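
The shape of that recursion can be shown without calibre at all. The sketch below is a model under stated assumptions, not the append_page used in real recipes: PAGES is a made-up stand-in for the site, fetch is a hypothetical function that a recipe would replace with opening the url and building a soup, and a max_pages guard is added so a page that always offers a "Next Page" link cannot recurse forever.

```python
# Stand-in for the site: url -> (page text, url of the "Next Page" link or None).
PAGES = {
    'page1': ('first page', 'page2'),
    'page2': ('second page', 'page3'),
    'page3': ('third page', None),
}

def fetch(url):
    """Hypothetical fetcher; a recipe would open the url and parse it here."""
    return PAGES[url]

def append_page(url, collected, max_pages=10):
    """Append this page's text, then recurse into the next page if there is one."""
    text, next_url = fetch(url)
    collected.append(text)
    if next_url is not None and len(collected) < max_pages:
        append_page(next_url, collected, max_pages)
    return collected

combined = append_page('page1', [])
```

Note that nothing needs to be returned for the recursion to work: each call adds its page to the same container, which is why the recipe versions can simply insert into the tag they were handed.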

Quote:

Edit:
It seems like br.follow_link does not actually open a page in the browser. It gets the response, but I don't know how to make br hold the new page. Is there a way to open the link, or read the response somehow as a web page in the browser?

Look at any recipe that uses obfuscated feed links to print pages to see how it's usually done.

marbs · 11-03-2010, 04:30 PM

Quote:

Originally Posted by Starson17

Congratulations!

append_page does nothing until it is used in preprocess_html as:
self.append_page(soup, soup.body, 3)

Look at any recipe that uses obfuscated feed links to print pages to see how it's usually done.

Are you sure it needs to be used in preprocess_html, and exactly as self.append_page(soup, soup.body, 3)?

Starson17 · 11-03-2010, 04:56 PM

Quote:

Originally Posted by marbs

are you sure it needs to be used in preprocess_html

I don't think I said it did. You can use it anywhere you have a soup.

Quote:

and exactly as self.append_page(soup, soup.body, 3)?

You can change the starting parameters. They change automatically as it recurses to put page 2 directly below the end of page 1.

marbs · 11-06-2010, 04:10 PM

Do you know of any way to run JavaScript in Python?

Do you think Kovid would be willing to build in a tool like that?

Edit:
I ran a short search; maybe a tool like this?

Starson17 · 11-06-2010, 07:47 PM

Quote:

Originally Posted by marbs

Do you know of any way to run JavaScript in Python?

I haven't done it. Mechanize is used by the recipe code, and I know it won't do it. I was trying to do it at one time, and looked at python-spidermonkey a bit, but decided I could just emulate what the js was doing.
https://github.com/davisp/python-spidermonkey
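
As one concrete case of emulating what the js does, the _method field in the POST body logged earlier in this thread appears to be URL-encoded twice, and plain Python can undo and redo that. This is a sketch under assumptions: the string below is copied from that log with the stray spaces removed (assumed to be forum line-wrapping), and the reading of the result as "the Next button was clicked" is a guess from the parameter names, not something confirmed against the site. Python 3 is used here; in the Python 2 of the recipe the same functions live in urllib.

```python
from urllib.parse import unquote, quote, parse_qs, urlsplit

# _method value as it appeared on the wire in the logged POST body.
logged = ('%252Fbursa%252Findex.asp%253F_method%253D_EM__onclientevent'
          '%2526pcount%253D2%2526p0%253DBTNNEXT%2526p1%253Donclick')

# Undo both layers of encoding to see what the script was building.
decoded = unquote(unquote(logged))
params = parse_qs(urlsplit(decoded).query)

# Apply both layers again, as the recipe would have to when emulating the script.
reencoded = quote(quote(decoded, safe=''), safe='')
```

Here decoded is a relative url with an ordinary query string, and params shows its pieces (p0 is BTNNEXT and p1 is onclick), which is the sort of thing that can be generated in the recipe without a JavaScript interpreter.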

Quote:

do you think Kovid would be willing to build in a tool like that?

If you hand him the code, probably yes. I suspect it wouldn't be high on the todo list, as I have yet to see a case that really requires it.

marbs · 11-09-2010, 04:09 PM

I am not sure where to post a question to Kovid, so I hope you see this.

I wanted to know what the chances are of getting JS support for recipes. Maybe this? I haven't really read it, but I am sure Python can support JS.

And what are the chances of getting support for PDF articles? I know you said that it is a printed book and not a book, but without understanding anything about it, I feel it might be possible for PDF articles going to PDF output to skip the conversion in the middle and just be included in the final news feed somehow.

kovidgoyal · 11-09-2010, 05:42 PM

Adding a javascript interpreter to python is harder than it sounds, or I would have done it already. And using QtWebKit is out of the question as it requires an X server to run, which means that the news download system would no longer work on headless servers.

As for PDF, no that's not possible.

KRorschachZ · 11-10-2010, 02:23 PM

Quote:

Originally Posted by kovidgoyal

As for PDF, ...

Hi, I have a general question about PDF. It's a little off-topic here, but I don't see where else to post it:

Would it be possible in the future to implement this option?

When calibre is running a recipe, if a link points to a PDF file that meets the size limits and other conditions listed in the recipe code, the file is downloaded as a separate book, outside the recipe's ebook, and the recipe output shows a note such as: PDF name: "abcd.pdf", file sent to the reader ...

Code:

rustic algorithm example: ;-)
--
read "html rss"
  (Text in RSS: "bla bla bla "...)

  if link is a pdf file
    if pdf file is < 500 kb
      download it to the calibre library and prepare to send it to the reader (in my case a Kindle DX)
    end if
  end if
--
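
The test in that pseudocode could look roughly like this in Python. It is a sketch of the proposed filter only, not a calibre feature or API: should_download is a hypothetical helper, the 500 kb limit is the one from the example above, and the urls and sizes are made up. How the file size would be obtained (for instance from a Content-Length header) is left out.

```python
import os
from urllib.parse import urlsplit

def should_download(url, size_bytes, max_kb=500, extensions=('.pdf',)):
    """Return True if the link is an allowed file type under the size limit."""
    path = urlsplit(url).path                 # ignore any query string
    ext = os.path.splitext(path)[1].lower()   # '.pdf', '.html', ...
    return ext in extensions and size_bytes < max_kb * 1024

# Made-up links with their sizes in bytes.
links = [('http://example.com/report.pdf', 300 * 1024),
         ('http://example.com/big.pdf', 900 * 1024),
         ('http://example.com/page.html', 10 * 1024)]
to_send = [url for url, size in links if should_download(url, size)]
```

Passing a longer extensions tuple would cover the other formats mentioned below in the same way.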

Afterwards, when I read the recipe ebook on the ereader and see that note, I can leave it and open the PDF that was previously downloaded to the ereader.

----
(The other option:

Opening the PDF and inserting its contents into the recipe's output may be impractical. It would be a lot of data to process and the recipe would run very slowly...

Actually, doing that would amount to building as many books as there are PDFs found.

The only practical way I see to do something would be to open a PDF (a PDF link in the RSS) and extract only the first section to include in the recipe, but maybe that is also somewhat slow and makes the recipe large...

Then calibre would convert the PDFs to ebooks, but in a separate process, I fear.)

Keep in mind that some ereaders have native PDF support (albeit a bit rustic and uncouth, it at least serves to read some info).

In the first example, the result could be a recipe ebook of RSS news plus, externally, several PDFs that are sent to the reader at the same time ...

... those that meet some requirements: size, name mask, etc. (Since there is no need to open them, they are just downloaded in certain cases.)

Quote:

Indeed, this procedure could be extended to other extensions (mobi, prc, epub, etc.). A web page could have RSS and notes linking to mobi or prc books, and a calibre recipe could download the "new" books automatically as files external to the recipe, since the contents of each book do not have to be processed (and only download them if they meet the date, size, name mask, etc.).

Likewise, extensions we are not interested in downloading could be discarded, depending on the output ereader model, by size, file type, etc.

(Sorry for the English and the disordered wording; it took considerable effort to try to explain this a little.

I had to split the sentences into loose paragraphs; let's see if you can tell what I want to express.

By the way, congratulations again on such a program.)

Best regards from Spain.

kovidgoyal · 11-10-2010, 02:59 PM

I personally am never going to implement that feature (it's way too much work as it involves monkeying with lots of the internals of how conversion works in calibre), but if you are sufficiently motivated, I'll be happy to get you started implementing it.

KRorschachZ · 11-10-2010, 03:22 PM

Quote:

Originally Posted by kovidgoyal

I personally am never going to implement that feature (it's way too much work as it involves monkeying with lots of the internals of how conversion works in calibre), but if you are sufficiently motivated, I'll be happy to get you started implementing it.

OK, thanks for reading and answering.

I only want to point out that in the first case no conversion of the PDF files found (or other interesting extensions, such as mobi, epub, etc.) is necessary, only downloading them according to the conditions in the recipe. In any case, depending on how calibre is set up, this may be complicated ... I was curious about whether a recipe, which should be a separate process, can communicate with the main program while it runs, and whether the recipe language has commands to implement something like this ("download file", "size analysis", "save to send to library", etc.).

(Obviously the second option, converting PDFs on the fly to integrate part of them into the recipe output, is quite expensive computationally (CPU time) and slow ... it would be like merging several small books ...)

(Even so, a maximum number of output characters could be set, limiting the size of the converted files, as a way to get "part" of the information from the files. This would be like an RSS of a PDF, mobi, epub ... in a recipe. Still, I think this part would add considerable time to the recipe ...)

(There are more and more pages with reviews of books in electronic format, with links; those books could be downloaded automatically by calibre recipes ...)

So I thought that maybe giving the possibility of downloading ... without having to analyze the files internally ...

OK.

(We'll see what the community thinks. Maybe I should create a thread for this issue?)

Best regards from Spain.

10-28-2010, 02:50 PM	#61
marbs
Spoiler:

send: u'POST /bursa/index.asp?view=search&company_group=3000&arg_comp=&srh_comp_lb=&srh_from=2010-01-01&srh_until=2010-10-28&srh_anaf=-1&srh_event=9999&is_urgent=0&srh_company_press= HTTP/1.1\r\nAccept-Encoding: identity\r\nContent-Length: 933\r\nHost: maya.tase.co.il\r\nContent-Type: application/x-www-form-urlencoded\r\nConnection: close\r\nUser-Agent: Mozilla/5.0 (X11; U; i686 Linux; en_US; rv:1.8.0.4) Gecko/20060508 Firefox/1.5.0.4\r\n\r\n'
send: 'view=search&arg_comp=&srh_company_group=3000&srh_company_press=&srh_comp_lb=&srh_free_text_opt=1&cmbHavarottext=&cmbHavarothidden=&srh_txt=&optionFTSearch=1&srh_from=2010-01-01&srh_from_yr=2010&srh_from_mon=1&srh_from_day=1&srh_until=2010-10-28&srh_until_yr=2010&srh_until_mon=10&srh_until_day=28&srh_event=9999&srh_min_day=2010-01-01&srh_max_day=2010-10-28&rsSearchRes_pgNo=1&rsSearchRes_Count=64572&repTotal=30&ToPage=1&_method=%252Fbursa%252Findex.asp%253F_method%253D_EM__onclientevent%2526pcount%253D2%2526p0%253DBTNNEXT%2526p1%253Donclick&_BTNNEXT_state=_nStyle%253D1%2526value%253D%2526src%253Dimg%252Fkadima.gif%2526alt%253D%2525u05DC%2525u05D3%2525u05E3%252520%2525u05D4%2525u05D1%2525u05D0&_BTNPREV_state=_nStyle%253D1%2526value%253D%2526src%253Dimg%252Fahora.gif%2526alt%253D%2525u05DC%2525u05D3%2525u05E3%252520%2525u05D4%2525u05E7%2525u05D5%2525u05D3%2525u05DD&_thisPage_state=pb_rsSearchRes%253D0%2526pb_rsComByNm%253D0'
reply: 'HTTP/1.1 302 Moved Temporarily\r\n'
header: Location: http://maya.tase.co.il/bursa/mayaincorrect.htm
header: Cache-Control: no-cache
header: Pragma: no-cache
header: Expires: 0
header: Content-Length: 0
header: Date: Thu, 28 Oct 2010 18:44:55 GMT
header: Connection: close
send: u'GET /bursa/mayaincorrect.htm HTTP/1.1\r\nAccept-Encoding: identity\r\nHost: maya.tase.co.il\r\nConnection: close\r\nUser-Agent: Mozilla/5.0 (X11; U; i686 Linux; en_US; rv:1.8.0.4) Gecko/20060508 Firefox/1.5.0.4\r\n\r\n'
reply: 'HTTP/1.1 200 OK\r\n'
header: Server: Microsoft-IIS/6.0
header: X-Maya: 1
header: X-Powered-By: ASP.NET
header: Content-Type: text/html
header: Cache-Control: private
header: Vary: Accept-Encoding
header: Date: Thu, 28 Oct 2010 18:44:56 GMT
header: Content-Length: 4058
header: Connection: close
header: Set-Cookie: LBMaya=1; path=/bursa
header: Set-Cookie: ASPSESSIONIDAACQDSTB=JKDFAGGAGKDKGMGHDCKPGLFA; path=/
<response_seek_wrapper at 0x4ff6878 whose wrapped object = <closeable_response at 0x4ff65f8 whose fp = <socket._fileobject object at 0x04D87DB0>>>
this is ok
Python function terminated unexpectedly
expected string or buffer (Error Code: 1)
Traceback (most recent call last):
  File "site.py", line 103, in main
  File "site.py", line 85, in run_entry_point
  File "site-packages\calibre\utils\ipc\worker.py", line 99, in main
  File "site-packages\calibre\gui2\convert\gui_conversion.py", line 24, in gui_convert
  File "site-packages\calibre\ebooks\conversion\plumber.py", line 832, in run
  File "site-packages\calibre\customize\conversion.py", line 211, in __call__
  File "site-packages\calibre\web\feeds\input.py", line 105, in convert
  File "site-packages\calibre\web\feeds\news.py", line 710, in download
  File "site-packages\calibre\web\feeds\news.py", line 835, in build_index
  File "c:\users\berkow~1\appdata\local\temp\calibre_0.7.20_tmp_9v_nmw\calibre_0.7.20_qc4llz_recipes\recipe0.py", line 98, in parse_index
    articles = self.make_links(url)
  File "c:\users\berkow~1\appdata\local\temp\calibre_0.7.20_tmp_9v_nmw\calibre_0.7.20_qc4llz_recipes\recipe0.py", line 118, in make_links
    self.append_page(url,soup, soup.body, report3,1)
  File "c:\users\berkow~1\appdata\local\temp\calibre_0.7.20_tmp_9v_nmw\calibre_0.7.20_qc4llz_recipes\recipe0.py", line 77, in append_page
    soup2 = self.index_to_soup(nexturl)
  File "site-packages\calibre\web\feeds\news.py", line 477, in index_to_soup
  File "re.py", line 137, in match
TypeError: expected string or buffer
