following a javascript link and table editing

marbs · 09-27-2010, 11:11 AM

i think this one should be easy, but the documentation on following a javascript link is only relevant for a form.

i will explain what i am trying to do with the English sites so people here can understand what i am talking about, but i will change it to Hebrew if it gets done.

on this page:
http://www.tase.co.il/TASEEng/Market...=5&IndexID=168
i want to press on additional columns. AKA this:

Spoiler:

as you see, this link holds none of the attributes that http://bugs.calibre-ebook.com/wiki/recipeGuide_advanced talks about. my google search did not get me any closer.

the page that opens in the popup is mainly a table. i want that table to be the recipe. if i could remove the calibre feed index, that would also be good.
the problem that i see in the future (i haven't gotten that far yet) is that the table will be too wide for the output. but 1st i want to focus on clicking on the javascript link and downloading the table to a file. i think i can do the clean up myself.

this is as far as i got.

Spoiler:

this gives me 225 pages of HTML code from http://www.tase.co.il/TASEEng/Market...=5&IndexID=168. any thoughts?

Starson17 · 09-27-2010, 12:27 PM

Quote:

Originally Posted by marbs

i think this one should be easy, but the documentation on following a javascript link is only relevant for a form.... any thoughts?

First thought. It's far from easy. It's advanced recipe writing. There are several approaches. The easiest depends on the specifics of your site. Exploring the source for your page, trying to see where the data can be obtained, etc. is crucial. For example:

Case 1: I wanted slideshow pics from a javascript slideshow for the Olympics. It turned out that the javascript code included a non-displayed, non-clickable buried URL and IIRC, that URL had data that contained multiple links to the pics in different sizes. I believe I scraped page 1 to get the URL to data page 2 (don't forget to turn on scripts so they aren't stripped as in most recipes), then converted that page to a soup, scraped out the links I needed and assembled the page.

Case 2: You can do something similar to the login step in recipes where you supply login data for a form, then submit the form. Calibre uses Mechanize for that type of work. You can have your recipe set up an internal browser, then tell it to click on any links on a page. If that's the only way to find the data you want, then you go this route. I'm not sure how much support the Mechanize browser has for the various advanced features found in browsers today. Sometimes you have to figure out how to tell the site your browser doesn't have advanced features (User-Agent header) and hope the site will send you the data you want without too much fuss.

Mechanize is powerful, but a bit hard to get a handle on.
http://wwwsearch.sourceforge.net/mechanize/doc.html

Good luck.

marbs · 09-27-2010, 01:40 PM

1. i am happy it is not easy. i have been trying for a few days now.
2. i read the link you gave me. i learned a few things, but there is nothing there about javascript. am i missing something?
3. could you upload the cases you were talking about? reading them might help me understand what i need to do.

kovidgoyal · 09-27-2010, 01:45 PM

There's no way to follow a javascript link directly. Instead, what you have to do is grab the request the javascript sends using Tamper Data in Firefox and duplicate that in calibre using mechanize.Request.

Alternatively, use regexps to parse the javascript and figure out the request URL from that.
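
The regexp route can be sketched roughly as follows. This is only an illustration: the sample markup, the assumption that the link calls window.open(), and the names SAMPLE_HTML, WINDOW_OPEN and find_popup_urls are all hypothetical and not taken from the TASE page.

```python
import re
from html import unescape

# Hypothetical markup: the real page may build its pop-up differently.
SAMPLE_HTML = """
<a href="javascript:void(0)"
   onclick="window.open('/Example/PopUpGrid.htm?tbl=0&amp;ds=demo', 'grid')">
   Additional Columns</a>
"""

# Capture the first quoted argument of window.open(...).
WINDOW_OPEN = re.compile(r"""window\.open\(\s*(['"])(?P<url>.+?)\1""")


def find_popup_urls(page_source):
    """Return every URL passed to window.open() in the given page source."""
    urls = []
    for match in WINDOW_OPEN.finditer(page_source):
        # Attribute values are HTML-escaped, so undo &amp; and friends.
        urls.append(unescape(match.group('url')))
    return urls


popup_urls = find_popup_urls(SAMPLE_HTML)
```

A relative URL found this way would still need to be resolved against the page address (for example with urllib.parse.urljoin) before the recipe's browser opens it.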

Starson17 · 09-27-2010, 02:06 PM

Quote:

Originally Posted by marbs

1. i am happy it is not easy. i have been trying for a few days now.
2. i read the link you gave me. i learned a few things, but there is nothing there about javascript. am i missing something?

AFAICT, you're not missing anything. Mechanize sets up a browser session inside your recipe. I have no idea what the Mechanize browser is capable of beyond basic HTML (forms, etc.). All that I've needed it to do, it has done. (Some quick research says that Mechanize couldn't do javascript in 2008. Perhaps you can set up a POST to the server to do what is needed.)

Quote:

3. could you upload the cases you were talking about? reading them might help me understand what i need to do.

The Winter Olympics 2010 is builtin (or was last I looked.) If it's not, let me know and I'll hunt it up tonight. The other was just generic discussion of using Mechanize to do things with its browser. You should focus on the site and your page source. If there's more data to be obtained, the page has to give a link to that data somehow, or the data has to be on the page. Maybe there's a link in the page source. Maybe it's in a referenced file. Sometimes you can figure it out by watching the process with a sniffer. Sometimes you can calculate it from the page you're on.

Starson17 · 09-27-2010, 02:26 PM

Quote:

Originally Posted by Starson17

If there's more data to be obtained, the page has to give a link to that data somehow

To help you on your way, I turned on Live HTTP Headers and grabbed this link when clicking the "additional columns" button.

http://www.tase.co.il/TASEEng/Manage...s+TA+Composite

It also did a GET and passed some cookies. You should be able to replicate what it does with Mechanize, without javascript, to pull the data you want.

Edit: I see Kovid popped in to say basically the same thing.

marbs · 09-27-2010, 03:36 PM

thanks for the guest lecture in my master class (i am the student, if anyone missed that).

let me see if i understand. and excuse me if my lingo is not right. i am just thinking out loud.

i am trying to get to here.
the only problem is that i need to show up with something in my hand. the usual way to get that something is to stop here and get it.

now. am i trying to fake it? i think i read something about a header somewhere else. something about tricking it into thinking there is an actual link (i think i used it in the recipe that i published here at the top). i'll go over it again and get back to you guys.
thanks

ps
AFAICT stands for "As Far As I Can Tell". it is not some firefox add-on

Starson17 · 09-27-2010, 03:49 PM

Quote:

Originally Posted by marbs

i am trying to get to here.
the only problem is that i need to show up with something in my hand. the usual way to get that something is to stop here and get it.

Have you studied the behavior of the site? AFAICT, the first link above works fine each time you try to go there. It's possible you may need some cookie set first by visiting the second link, but in my brief tests, it doesn't look like it.

I find it's often the case that the site expects you to go to page 1, click a javascript link or send a form, etc., but you can bypass all of that and just go to the final link to get what you want.

Quote:

ps
AFAICT stands for "As Far As I Can Tell". it is not some firefox add-on

If it doesn't, the above makes no sense

Bottom line - keep studying the behavior of the site. If it turns out you need special cookies, or referer headers, it will show up in your careful tests. It's possible to get those with Mechanize, if needed.

kovidgoyal · 09-27-2010, 04:06 PM

HTTP is a stateless protocol. What that means is that any URL of the type http:// will always work the same no matter in what sequence you visit URLs.

However, since sites like to have sessions and keep track of what users are doing, they send what are called cookies to the user's browser. The user's browser stores these cookies and sends them back to the site on demand. Some links in a site will not work without the appropriate cookie.

mechanize handles cookies transparently. If you think you need to visit URLs in sequence, do so in the calibre recipe and the cookies will work seamlessly.
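
Visiting URLs in sequence can be sketched as below. The helper fetch_after_visiting and the stand-in class CookieGatedSite are made up for illustration; in a recipe the browser argument would be the object returned by self.get_browser(), whose open() call and response read() are the only methods this sketch relies on.

```python
import io


def fetch_after_visiting(browser, warmup_urls, target_url):
    """Open each warm-up URL in order, then return the target page's body.

    `browser` is anything with a mechanize-style open(url) method that
    returns a response offering read().
    """
    for url in warmup_urls:
        # The responses are discarded; the visit only matters for its cookies.
        browser.open(url)
    return browser.open(target_url).read()


class CookieGatedSite:
    """Offline stand-in: serves the target only after the gate page was seen."""

    def __init__(self, gate_url):
        self.gate_url = gate_url
        self.has_cookie = False

    def open(self, url):
        if url == self.gate_url:
            self.has_cookie = True
            return io.BytesIO(b'index page')
        return io.BytesIO(b'table' if self.has_cookie else b'denied')


site = CookieGatedSite('http://example.invalid/index')
body = fetch_after_visiting(site,
                            ['http://example.invalid/index'],
                            'http://example.invalid/popup')
```

The stand-in only mimics the cookie behaviour described above so the sketch can run without network access; a real browser object keeps the cookies itself.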

marbs · 09-27-2010, 04:44 PM

so if i do br.open (the 1st site) and then
br.open (the second site), that should work, as far as i can tell.
what is happening is that just visiting the 1st site is enough to let me into the second. i think i know how to do that.

now i have some pythoning to do (but i will have the same type of trouble with my final recipe).

this is what i am thinking of doing:

Spoiler:

i need to dig a bit here

Starson17 · 09-27-2010, 05:57 PM

Quote:

Originally Posted by marbs

so if i do br.open (the 1st site) and then
br.open (the second site), that should work, as far as i can tell.
what is happening is that just visiting the 1st site is enough to let me into the second.

I did a quick test with cookies cleared, and that seems correct.

As long as we're covering all the gory details, note that although cookies are handled transparently, the Referer header is not (that's the correct spelling for the referrer header). You have to deal with that manually, if it's needed. (In this case, it seems to not be important.) You can also handle cookies manually, should you need to (I never have) and sometimes you may need to add other headers that are not added by default (Accept headers are sometimes needed to satisfy the Bad Behavior blog plugin).

Finally, ignoring robots.txt is turned on in Calibre by default when it uses Mechanize.

There's no substitute for a careful analysis of how the site responds and what it needs to give you what you're looking for.

It looks like you're on the right road!
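
Setting headers such as Referer or Accept by hand, as mentioned above, could look something like this. It is a sketch only: merge_headers is a hypothetical helper and every header value is a placeholder. The one mechanize feature it leans on is the browser's addheaders list of (name, value) tuples.

```python
def merge_headers(current, extra):
    """Return a new (name, value) list with `extra` replacing same-named entries.

    Header names are compared case-insensitively, as HTTP requires.
    """
    replaced = {name.lower() for name in extra}
    kept = [(name, value) for name, value in current
            if name.lower() not in replaced]
    return kept + list(extra.items())


# Placeholder values, not taken from the thread.
defaults = [('User-agent', 'ExampleAgent/1.0')]
headers = merge_headers(defaults, {
    'Referer': 'http://example.invalid/index',
    'Accept': 'text/html,application/xhtml+xml',
})

# In a recipe the merged list would be assigned back to the browser, e.g.
#     br = self.get_browser()
#     br.addheaders = merge_headers(br.addheaders, {'Referer': first_page_url})
# where first_page_url is whatever page the site expects you to come from.
```

Headers set this way are sent with every request the browser makes afterwards, so it is worth adding only the ones your tests show the site actually needs.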

marbs · 09-28-2010, 05:21 AM

i have been working on this for a few hours. i have my table, and i am very happy with it. it fits into the page in some magical way.
i gave it a fake feed to parse, and just had it return the address that i wanted. i am not sure why it works, but it does.

now i want to remove the "feeds" menu that calibre creates (page 2 in any other recipe) and the section menu (page 3 in any other recipe). is there a way to do that?

Spoiler:

Code:

from calibre.ptempfile import PersistentTemporaryFile
from calibre.web.feeds.news import BasicNewsRecipe
import mechanize

class AdvancedUserRecipe1282101454(BasicNewsRecipe):
    title          = u'TA stock table'
    oldest_article = 1
    baseURL='http://www.tase.co.il/TASE/MarketData/Indices/MarketCap/IndexMainDataMarket.htm?Action=5&IndexID=168'
    __author__            = 'marbs'
    max_articles_per_feed = 1
    #no_stylesheets = True
    #extra_css = ' body{font-family: Arial,Helvetica,sans-serif } '
    cover_url      = 'http://money-talks.co.il/wp-content/uploads/2008/02/glasses_on_newspaper.jpg'
    feeds          = [(u'maya', u'http://maya.tase.co.il/bursa/rss/maya.xml')]
    temp_files = []
    articles_are_obfuscated = True
    keep_only_tags    = [ dict(name='table',attrs={'id':'NiaROGrid1_DataGrid1'})]
                                    #style':['float: right;', 'float: left;'
    def get_obfuscated_article(self, url):
        br = self.get_browser()
        br.open('http://www.tase.co.il/TASE/MarketData/Indices/MarketCap/IndexMainDataMarket.htm?Action=5&IndexID=168')

        response = br.open('http://www.tase.co.il/TASE/Management/GeneralPages/PopUpGrid.htm?tbl=0&Columns=he-IL_AddColColumns&Titles=he-IL_AddColTitles&ds=he-IL_ds&enumTblType=SharesByIndex&sess=he-IL_&gridName=%D7%A0%D7%AA%D7%95%D7%A0%D7%99+%D7%9E%D7%A1%D7%97%D7%A8+-+%D7%9E%D7%A0%D7%99%D7%95%D7%AA+%D7%AA%22%D7%90+%D7%9B%D7%9C%D7%9C%D7%99')
        html = response.read()

        self.temp_files.append(PersistentTemporaryFile('_fa.html'))
        self.temp_files[-1].write(html)
        self.temp_files[-1].close()

        return self.temp_files[-1].name


#    def get_obfuscated_article(self, url):
#        br = BasicNewsRecipe.get_browser()
#        br.open('http://www.tase.co.il/TASE/MarketData/Indices/MarketCap/IndexMainDataMarket.htm?Action=5&IndexID=168')
#        br.open('http://www.tase.co.il/TASE/Management/GeneralPages/PopUpGrid.htm?tbl=0&Columns=he-IL_AddColColumns&Titles=he-IL_AddColTitles&ds=he-IL_ds&enumTblType=SharesByIndex&sess=he-IL_&gridName=%D7%A0%D7%AA%D7%95%D7%A0%D7%99+%D7%9E%D7%A1%D7%97%D7%A8+-+%D7%9E%D7%A0%D7%99%D7%95%D7%AA+%D7%AA%22%D7%90+%D7%9B%D7%9C%D7%9C%D7%99')
#        print_url = 'http://tase.co.il/TASEEng/Management/GeneralPages/PopUpGrid.htm?tbl=0&Columns=en-US_AddColColumns&Titles=en-US_AddColTitles&ds=en-US_ds&enumTblType=SharesByIndex&sess=en-US_&gridName=Market+Data+-+Shares+General'
#        response = br.follow_link(mechanize.Link(base_url = '', url = print_url, text = '', tag = '', attrs = []))
#        
#        html = response.read()
#
#        self.temp_files.append(PersistentTemporaryFile('_fa.html'))
#        self.temp_files[-1].write(html)
#        self.temp_files[-1].close()
#
#        return self.temp_files[-1].name
 
    def get_article_url(self, article):
        return 'http://www.tase.co.il/TASE/Management/GeneralPages/PopUpGrid.htm?tbl=0&Columns=he-IL_AddColColumns&Titles=he-IL_AddColTitles&ds=he-IL_ds&enumTblType=SharesByIndex&sess=he-IL_&gridName=%D7%A0%D7%AA%D7%95%D7%A0%D7%99+%D7%9E%D7%A1%D7%97%D7%A8+-+%D7%9E%D7%A0%D7%99%D7%95%D7%AA+%D7%AA%22%D7%90+%D7%9B%D7%9C%D7%9C%D7%99'
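
The English and Hebrew pop-up URLs in the recipe above appear to differ only in the site path, the locale prefix (en-US or he-IL) and the grid name. Assuming that pattern holds, which is a guess from these two URLs and not documented behaviour of the site, the address could be built from its parts. The helper name popup_grid_url is made up for this sketch.

```python
from urllib.parse import urlencode


def popup_grid_url(base, locale, grid_name):
    """Build a PopUpGrid.htm URL from its locale-dependent parts."""
    params = [
        ('tbl', '0'),
        ('Columns', locale + '_AddColColumns'),
        ('Titles', locale + '_AddColTitles'),
        ('ds', locale + '_ds'),
        ('enumTblType', 'SharesByIndex'),
        ('sess', locale + '_'),
        ('gridName', grid_name),
    ]
    # urlencode percent-encodes non-ASCII text as UTF-8 and turns spaces into '+'.
    return base + '/Management/GeneralPages/PopUpGrid.htm?' + urlencode(params)


english = popup_grid_url('http://tase.co.il/TASEEng', 'en-US',
                         'Market Data - Shares General')
```

For the Hebrew table the same call would take 'http://www.tase.co.il/TASE', 'he-IL' and the Hebrew grid name as a unicode string.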

so i got a little greedy. is there an easy way to break the table in half?
i can think of 3 things that might work (i just dont know how to do them)
the 1st is to remove some less relevant columns.
the 2nd is to cut every row in half. and have :
1st row right half
1st row left half
2nd row right half
and so on.
the 3rd is to cut the whole table in half and add the rightmost column to the 2nd half too
1st row right half
2nd row right half
.
.
.
top right cell + 1st row left half
2nd from the top right cell + 2nd row left half
.
.
.

possible?

Starson17 · 09-28-2010, 11:10 AM

Quote:

Originally Posted by marbs

now i want to remove the "feeds" menu that calibre creates (page 2 in any other recipe) and the section menu (page 3 in any other recipe). is there a way to do that?

That's part of the default structure that a recipe builds. I suspect you might be able to override some portion of the recipe system to do that, but 1) I've never seen a recipe that does that, 2) if it's possible, you'd probably need Kovid to tell you how, or you'd need to dig into the recipe code.

Quote:

is there an easy way to break the table in half?
i can think of 3 things that might work (i just dont know how to do them)

If you want to do the work, yes it's possible. This is just a matter of modifying the html. You can use Beautiful Soup to change the table tags, or use regular expressions to find the tags that need to be changed. It's going to take some effort, but the concept is fundamentally pretty simple if you know the html you start with and the html you want to end up with.
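
As a rough sketch of the third option (cut the whole table in half and repeat the key column in the second half), the row shuffling itself is plain list work once the cells have been pulled out of the table. The function split_wide_table and the sample data are invented for illustration. Getting the cells out of the page and writing the new table tags back with Beautiful Soup is not shown, and for a right-to-left Hebrew table the "rightmost" column is assumed to be the first cell of each row in the HTML.

```python
def split_wide_table(rows, key_column=0):
    """Split each row in two, repeating the key column in the second half.

    `rows` is a list of equal-length lists of cell values. Returns
    (first_half, second_half); every row of second_half starts with the
    key cell so it can still be read on its own.
    """
    if not rows:
        return [], []
    width = len(rows[0])
    middle = (width + 1) // 2  # the first half takes the extra column if odd
    first = [row[:middle] for row in rows]
    second = [[row[key_column]] + row[middle:] for row in rows]
    return first, second


# Made-up sample data standing in for the stock table.
table = [
    ['Name', 'Price', 'Change', 'Volume', 'Cap'],
    ['AAA', '10', '+1', '500', '9000'],
    ['BBB', '20', '-2', '300', '7000'],
]
first, second = split_wide_table(table)
```

Each half can then be written out as its own table, one after the other, which keeps every row readable on a narrow screen.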

kovidgoyal · 09-28-2010, 11:28 AM

You can replace the feed menu by a blank page using extra_css, but if you actually want it to not be created at all, you will have to reimplement various functions in BasicNewsRecipe.

marbs · 09-28-2010, 04:04 PM

i tried playing around with some table code.
added this:

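Spoiler:

Code:

#    def preprocess_html(self, soup):
#        rows = table.findAll('tr')
#        cols = rows.findAll('td')
#        soup1 = cols[0].string
#        return soup1

and lost the whole table. i think i will leave it at that for now.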

Kovid,
did you mean adding extra_css = '' to the code?

in any case, i am very happy with what i have achieved. i really appreciate the point (or push) in the right direction. i am getting a lot out of the advice instead of just answers.
