Old 10-26-2010, 04:01 PM   #46
marbs
Zealot
marbs began at the beginning.
 
Posts: 122
Karma: 10
Join Date: Jul 2010
Device: nook
i know you must be busy, but i would really appreciate if you could take a look at the site and the code when you have the time.

i got tamper data and as far as i can see, the only parameter that makes a difference is rsSearchRes_pgNo. i just don't really know what i am doing with this and feel a little lost.
Old 10-26-2010, 04:24 PM   #47
Starson17
Wizard
Starson17 can program the VCR without an owner's manual.
 
Posts: 4,004
Karma: 177841
Join Date: Dec 2009
Device: WinMo: IPAQ; Android: HTC HD2, Archos 7o; Java:Gravity T
Quote:
Originally Posted by marbs View Post
i know you must be busy, but i would really appreciate if you could take a look at the site and the code when you have the time.

i got tamper data and as far as i can see, the only parameter that makes a difference is rsSearchRes_pgNo. i just don't really know what i am doing with this and feel a little lost.
I'll see what I can do when I get access to a machine that has Calibre on it. I may need your help; IIRC, your site is not in English. I'm only good in English, opcodes for x86 and a few microcontrollers, and Pig Latin.

If I'm going to look at it, I need you to tell me how rsSearchRes_pgNo is used? Part of the URL? Part of a Header? Part of a Cookie? I think you said you see it in TamperData - what field? What format?
Edit: I looked at your page (I think I'm at the right one), but I'm not sure how to get the next 30 you want. Tell me in detail what to press or change on that page to get the next group of data. Where? Tell me how to know when I've got all the data: what stops appearing, or appears, or whatever.

Last edited by Starson17; 10-26-2010 at 04:28 PM.
Old 10-26-2010, 05:31 PM   #48
marbs
this is my practice page. it has the same structure as the more important pages, but it is almost static.

1. i have the total number of articles. you get it in the variable "report3" in the function make_links.

2. every page has 30, so you know how many you have already.

3. my example page has 67 articles. (you can print report3 to check that).

4. 67 articles means 3 pages. if you go to the bottom of the page you can read the numbers. it will say "1 of 3" (in the wrong language).

5. if you want to go to the next page you can replace the "1" with a "2" and hit enter or you can click on the gray arrows.

6. on this page you have a lot of pages to play with. 2138 pages with 30 articles each.
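
The page arithmetic in the steps above can be sketched as follows. This is only a minimal Python 3 illustration: the names page_count and page_numbers are made up for the example, and the 30-per-page figure is taken from the description above.

```python
import math

ARTICLES_PER_PAGE = 30  # from the description above: every page holds 30 articles


def page_count(total_articles, per_page=ARTICLES_PER_PAGE):
    """Return how many result pages are needed to show total_articles."""
    return int(math.ceil(total_articles / float(per_page)))


def page_numbers(total_articles, per_page=ARTICLES_PER_PAGE):
    """Return the page numbers to request, starting at 1."""
    return list(range(1, page_count(total_articles, per_page) + 1))


print(page_count(67))    # 3, matching the "1 of 3" on the practice page
print(page_numbers(67))  # [1, 2, 3]
```

Looping over page_numbers(report3) would give one request per page, with no need to track how many articles are left.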

as for this:
If I'm going to look at it, I need you to tell me how rsSearchRes_pgNo is used? Part of the URL? Part of a Header? Part of a Cookie? I think you said you see it in TamperData - what field? What format?

i have no idea. i never did anything higher than C language. i didn't know what RSS was a month and a half ago. give me a microcontroller, on the other hand, and things will start blinking, moving, beeping and doing all sorts of cool stuff.
Old 10-26-2010, 06:15 PM   #49
Starson17
Quote:
Originally Posted by marbs View Post
as for this:
If I'm going to look at it, I need you to tell me how rsSearchRes_pgNo is used?
i have no idea.
So a more basic question, then: Where did the string "rsSearchRes_pgNo" come from? Is it in the page source? Did you see it in some output? In TamperData? I just need something to start on.
Old 10-27-2010, 03:15 AM   #50
marbs
in tamperdata

on the right hand side near the bottom.
you have to scroll down a bit to see it.

as far as i can see, you can leave all the fields on the right hand side except "rsSearchRes_pgNo" blank and you will still get your next page.

Last edited by marbs; 10-27-2010 at 04:10 AM.
Old 10-27-2010, 02:34 PM   #51
Starson17
Quote:
Originally Posted by marbs View Post
on the right hand side near the bottom.
you have to scroll down a bit to see it.

as far as i can see, you can leave all the fields on the right hand side except "rsSearchRes_pgNo" blank and you will still get your next page.
I found it immediately after I posted. As time permits, I'll look it over.
Old 10-27-2010, 04:04 PM   #52
marbs
even if you don't get around to it, thank you very much.
it really is a great help. especially the fact that you dont just give the answers, you send me out looking (in a confined area) for it. i really am learning a lot.
Old 10-27-2010, 04:36 PM   #53
Starson17
Quote:
Originally Posted by marbs View Post
hint?
I didn't have much time, but briefly, what's happening is that some JavaScript runs when you enter a next-page number at the bottom of your page. That number is added to a POST which goes to your URL. You saw the POST in TamperData. You need to simulate it, or at least simulate the important parts, like rsSearchRes_pgNo.

It's done as follows:

Code:
        data = urllib.urlencode({'rsSearchRes_pgNo': '2'})  # form field that selects the page
        url = 'http://whatever'  # placeholder for the URL the form posts to
        br.open(url, data)
That just sets one value. I don't know how much testing you did with TamperData, but you can empty the values or delete them entirely. You may need to test with the various parameters actually deleted from the tampered POST in TamperData to make sure that's the only item needed. If you need more, you can add as many parameters to the data = line as you want sent by the POST.
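
As a rough sketch of building that POST body with more than one parameter: the following uses Python 3's urllib.parse.urlencode (in the Python 2 used by the recipes in this thread, the same function is urllib.urlencode). The helper name build_post_data is made up, and the extra fields are only examples taken from field names that appear in this thread; which ones the site actually needs has to be found by testing.

```python
from urllib.parse import urlencode


def build_post_data(page_number, extra=None):
    """Encode the form fields for a request for one results page.

    page_number goes into rsSearchRes_pgNo; 'extra' is an optional list of
    (name, value) pairs for any other fields the site turns out to need.
    """
    fields = [('rsSearchRes_pgNo', str(page_number))]
    fields.extend(extra or [])
    return urlencode(fields)


# Only the page number:
print(build_post_data(2))  # rsSearchRes_pgNo=2

# Page number plus further fields (illustrative values):
body = build_post_data(2, [('view', 'search'), ('repTotal', '30')])
print(body)  # rsSearchRes_pgNo=2&view=search&repTotal=30
```

The resulting string is what would be passed as the second argument of br.open(url, data).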

The data that you send in the POST can be seen with:
Code:
        # Print HTTP headers.
        br.set_debug_http(True)
Old 10-27-2010, 04:43 PM   #54
Starson17
Quote:
Originally Posted by marbs View Post
you dont just give the answers, you send me out looking (in a confined area) for it. i really am learning a lot.
You can look at the built-in greader recipe to see an example of multiple items used in the data of an HTTP POST.

You can look at the Mechanize docs for more info:
http://wwwsearch.sourceforge.net/mec...-added-headers
Old 10-27-2010, 05:42 PM   #55
marbs
Quote:
Originally Posted by Starson17 View Post
I didn't have much time, but briefly, what's happening is that some JavaScript runs when you enter a next-page number at the bottom of your page. That number is added to a POST which goes to your URL. You saw the POST in TamperData. You need to simulate it, or at least simulate the important parts, like rsSearchRes_pgNo.

It's done as follows:

Code:
        data = urllib.urlencode({'rsSearchRes_pgNo': '2'})  # form field that selects the page
        url = 'http://whatever'  # placeholder for the URL the form posts to
        br.open(url, data)
That just sets one value. I don't know how much testing you did with TamperData, but you can empty the values or delete them entirely. You may need to test with the various parameters actually deleted from the tampered POST in TamperData to make sure that's the only item needed. If you need more, you can add as many parameters to the data = line as you want sent by the POST.

The data that you send in the POST can be seen with:
Code:
        # Print HTTP headers.
        br.set_debug_http(True)
i thought this is what i did. if you take a look at msg #44 of this thread, you'll see it.
i copied from google reader. i'll take another look at google and a look at greader.
Old 10-27-2010, 05:58 PM   #56
Starson17
Quote:
Originally Posted by marbs View Post
i thought this is what i did. if you take a look at msg #44 of this thread, you'll see it.
i copied from google reader. i'll take another look at google and a look at greader.
If that's what you did, then you need to track down whether you're getting the page back that you expect to get. If not, then you need to track down if you are sending what you think you need to send in the POST data. It's just a matter of sending the right data in the POST, checking the results, etc. If you send the right data, you should get back the right page. Have you gotten back page 2 yet?
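
One way to check whether page 2 really came back is to compare the article links in the two responses. The following is a minimal Python 3 sketch using only the standard library. LinkCollector, extract_links and same_links are hypothetical names, and the sample HTML is invented for the illustration, not taken from the site.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect the href of every <a> tag in a page."""

    def __init__(self):
        HTMLParser.__init__(self)
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == 'a':
            for name, value in attrs:
                if name == 'href' and value:
                    self.links.append(value)


def extract_links(html):
    """Return all link targets found in the given HTML text."""
    parser = LinkCollector()
    parser.feed(html)
    parser.close()
    return parser.links


def same_links(html_a, html_b):
    """True if both pages list exactly the same links (probably the same page)."""
    return sorted(extract_links(html_a)) == sorted(extract_links(html_b))


# Invented sample pages, standing in for the two responses.
page1 = '<a class="A3" href="a.asp?id=1">one</a><a class="A3" href="a.asp?id=2">two</a>'
page2 = '<a class="A3" href="a.asp?id=31">thirty-one</a>'

print(same_links(page1, page1))  # True: the "second" request returned page 1 again
print(same_links(page1, page2))  # False: a different page came back
```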
Old 10-28-2010, 05:21 AM   #57
marbs
i recreated the post perfectly

it still does not work.
i also do not know how to deal with the difference between a request on this page and this page. or how to deal with dates. i think i will wait for when you have time and energy to lead the way. i am dreaming of POST and GET in tamper data and it is time to step it down a notch. at least for a day or two.

here is the code:
Code:
from calibre.web.feeds.news import BasicNewsRecipe
import re
import urllib

class AlisonB(BasicNewsRecipe):
    title      ='Maya'
    __author__ = 'marbs'
    description = 'blah'
    language = 'en'
    no_stylesheets = True
    publisher           = 'marbs'
#    simultaneous_downloads = 1
#    delay                  = 25   
    category            = 'column'
    extra_css='body{direction: rtl;} .article_description{direction: rtl; } a.article{direction: rtl; } .calibre_feed_description{direction: rtl; }'
#    no_stylesheets = True
    use_embedded_content= False
    remove_attributes = ['width','height']
    no_stylesheets      = True
    oldest_article      = 24
    remove_javascript   = True
    remove_empty_feeds  = True
    rec_index = 0
    max_articles_per_feed = 5000
    INDEX = 'http://maya.tase.co.il/'

    def append_page(self, url, soup, appendtag, articles_left, position):
   #     print url, 'the soup is:' ,soup, 'appendtag is',appendtag,  articles_left, position
        articles_left = articles_left - 30
        articlenum= articles_left + (30*position)
        print  articles_left, articlenum
        br = BasicNewsRecipe.get_browser(self)
        if articles_left > 0:  # more articles remain, so request the next page
           print 'do i get this far?'
           # Print HTTP headers.
           br.set_debug_http(True)
           request = urllib.urlencode([
                                                  ('view','search'),
                                                  ('arg_comp',''),
                                                  ('srh_company_group','3000'),
                                                  ('srh_company_press',''),
                                                  ('srh_comp_lb',''),
                                                  ('srh_free_text_opt','1'),
                                                  ('cmbHavarottext',''),
                                                  ('cmbHavarothidden',''),
                                                  ('srh_txt',''),
                                                  ('optionFTSearch','1'),
                                                  ('srh_from','2010-01-01'),
                                                  ('srh_from_yr','2010'),
                                                  ('srh_from_mon','1'),
                                                  ('srh_from_day','1'),
                                                  ('srh_until','2010-10-28'),
                                                  ('srh_until_yr','2010'),
                                                  ('srh_until_mon','10'),
                                                  ('srh_until_day','28'),
                                                  ('srh_event','9999'),
                                                  ('srh_min_day','2010-01-01'),
                                                  ('srh_max_day','2010-10-28'),
                                                  ('rsSearchRes_pgNo',position),
                                                  ('rsSearchRes_Count',articlenum),
                                                  ('repTotal','30'),
                                                  ('ToPage',position),
                                                  ('_method','%2Fbursa%2Findex.asp%3F_method%3D_EM__onclientevent%26pcount%3D2%26p0%3DBTNNEXT%26p1%3Donclick'),
                                                  ('_BTNNEXT_state','_nStyle%3D1%26value%3D%26src%3Dimg%2Fkadima.gif%26alt%3D%25u05DC%25u05D3%25u05E3%2520%25u05D4%25u05D1%25u05D0'),
                                                  ('_BTNPREV_state','_nStyle%3D1%26value%3D%26src%3Dimg%2Fahora.gif%26alt%3D%25u05DC%25u05D3%25u05E3%2520%25u05D4%25u05E7%25u05D5%25u05D3%25u05DD'),
                                                  ('_thisPage_state','pb_rsSearchRes%3D0%26pb_rsComByNm%3D0')
                                                   ])
        #   print 'lalala', request
           # Print HTTP headers.
           br.set_debug_http(True)
           response = br.open(url, request)
           print response.geturl(), 'this is ok'
           # index_to_soup takes a URL or raw HTML, not a response object
           soup2 = self.index_to_soup(response.read())
           print 'this is my real tesst', soup2

#           texttag = soup2.find('div', attrs={'class':'bodytext'})
#           for it in texttag.findAll(style=True):
#               del it['style']
#           newpos = len(texttag.contents)
#           self.append_page(soup2,texttag,newpos)
#           texttag.extract()
#           appendtag.insert(position,texttag)

    

    def parse_index(self):
        feeds = []
        for title, url in [
                            (u"too long",u"http://maya.tase.co.il/bursa/index.asp?view=search&company_group=3000&arg_comp=&srh_comp_lb=&srh_from=2010-01-01&srh_until=2010-10-28&srh_anaf=-1&srh_event=9999&is_urgent=0&srh_company_press="),
                        #(u"Feed", u"http://maya.tase.co.il/bursa/index.asp?view=search&company_group=3000&arg_comp=&srh_comp_lb=1007&srh_from=2010-01-01&srh_until=2010-09-28&srh_anaf=-1&srh_event=9999&is_urgent=0&srh_company_press="),
                            (u"הודעות מאתמול", u"http://maya.tase.co.il/bursa/index.asp?view=yesterday"),                            
                            
                             ]:
            articles = self.make_links(url)
            if articles:
                feeds.append((title, articles))
        return feeds
        
    def make_links(self, url):
        title = 'Temp'
        current_articles = []
        soup = self.index_to_soup(url)
        print 'url is', url

        print 'The soup is: ', soup
        stop = soup.find('td',attrs={'height':'19'})
        print 'the stop is', stop
        report1 = stop.contents[1].contents 
        print report1
        report2 = report1[0]
        print report2
        report3=int(report2.encode('ascii'))
        print report3
        self.append_page(url,soup, soup.body, report3,1)
        for item in soup.findAll('a',attrs={'class':'A3'}):
            print 'item is: ',item
            itemcomp= item.findPrevious('a',attrs={'id':'CompNmHref'})
            #link = item.find('a')
            #titlecheck = self.tag_to_string(link)
            #url_test = re.search('javascript', item['href'])
           
            if not re.search('javascript', item['href']):
              temp2= self.INDEX + 'bursa/' + item['href']
        #      temp2=[temp3]
              print 'url1 is', temp2
              soup1 = self.index_to_soup(temp2)
  #            print 'the new soup is', soup1
              print '6714' 
              for item1 in soup1.findAll('iframe'):
                 print 'item1 is:' , item1
 #                print soup.item.previous.previousSibling
                 txt= item1['src']
                 print 'FOUND GOOD URL'
                 re1='.*?'	# Non-greedy match on filler
                 re2='(mayafiles)'	# Variable Name 1
                 re3='(.)'	# Any Single Character 1
                 re4='.*?'	# Non-greedy match on filler
                 re5='htm'	# Uninteresting: word
                 re6='.*?'	# Non-greedy match on filler
                 re7='(htm)'	# Word 1
                 
                 rg = re.compile(re1+re2+re3+re4+re5+re6+re7,re.IGNORECASE|re.DOTALL)
                 m = rg.search(txt)
                 if m:
                     var1=m.group(1)
                     c1=m.group(2)
                     word1=m.group(3)
                     print "("+var1+")"+"("+c1+")"+"("+word1+")"+"\n"
                     url = item1['src']
                 else:
                     url = 'http://www.pdfdownload.org/pdf2html/pdf2html.php?url=' + item1['src'] + '&images=yes'

                 print 'url is: ', url

                 title       = self.tag_to_string(itemcomp)+ ' - ' + self.tag_to_string(item)
                 print 'title is: ', title
                 current_articles.append({'title': title, 'url': url, 'description':'', 'date':''}) # append all this
            
           
        return current_articles
Old 10-28-2010, 11:21 AM   #58
Starson17
Quote:
Originally Posted by marbs View Post
i recreated the post perfectly
it still does not work.
I'm not sure what this means. Your code has append_page code and other stuff that looks to me like it's just getting in the way of the simple problem you need to solve. You want to be able to retrieve page 1 and page 2. Page 1 should come by default. Page 2 should come when you do the post with the right parameters. If it doesn't, perhaps there is other protection, such as cookies or referer, etc.
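
To illustrate what "cookies or referer" would mean for the request, here is a minimal Python 3 sketch that uses the standard library's urllib.request rather than the mechanize browser the recipe uses. The URL, the function name build_page_request, and the idea that this particular site checks the Referer header are all assumptions made for the example. Nothing is sent over the network.

```python
from http.cookiejar import CookieJar
from urllib.parse import urlencode
from urllib.request import HTTPCookieProcessor, Request, build_opener


def build_page_request(url, page_number, referer):
    """Build (but do not send) a POST request for one results page."""
    data = urlencode({'rsSearchRes_pgNo': str(page_number)}).encode('ascii')
    request = Request(url, data=data)
    # Some sites reject a POST that does not appear to come from their own page.
    request.add_header('Referer', referer)
    return request


# An opener with a cookie jar keeps any session cookie set by the page 1
# response and sends it back with the page 2 request.
cookies = CookieJar()
opener = build_opener(HTTPCookieProcessor(cookies))

url = 'http://example.com/index.asp'  # placeholder, not the real site
request = build_page_request(url, 2, referer=url)
print(request.get_method())           # POST, because data is attached
print(request.get_header('Referer'))  # http://example.com/index.asp
# opener.open(url) followed by opener.open(request) would fetch page 1, then page 2.
```

If the site does depend on a session cookie, requesting page 1 and page 2 through the same browser or opener object matters more than the exact parameter list.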

I'm not sure if this: "i recreated the post perfectly" means that you downloaded page 1 or page 2, but until you get both pages properly retrieved, there's not much point in using append_page and trying to put them together as a multipage recipe does.
Old 10-28-2010, 12:18 PM   #59
marbs
what i meant was that i copied all the parameters of the post request.

i cant get the 2nd page so there is nothing to append.
Old 10-28-2010, 02:08 PM   #60
Starson17
Quote:
Originally Posted by marbs View Post
what i meant was that i copied all the parameters of the post request.

i cant get the 2nd page so there is nothing to append.
OK, and I presume you've monitored what you sent and the response, and you can see that the first time you request, it sends the request for page 1, and the second time you request it sends the request for page 2? What do you get from the page 2 request? Is it page 1?