10-23-2010, 03:14 PM   #31
marbs
Zealot
marbs began at the beginning.
 
Posts: 122
Karma: 10
Join Date: Jul 2010
Device: nook
Thank you.

Sometimes, when I get caught up in something, I forget to explain myself.

If you take a look here you will see a list of 30 articles. To see the rest of the articles, you need to go to the bottom of the page and press the "next" button. Then you get another page with 30 articles, and so on.

I used Tamper Data to find out what that button does. It submits a long request with a lot of parameters, but as far as I can see, the only one that matters is "rsSearchRes_pgNo", and you give it the page number you want.

How can I incorporate that in my code?

Also, when I am done with this, I want to turn this recipe into real Python code so I can deal with the PDF articles. How hard do you think that might be? (I tried to get started on that, ran into some trouble and posted a question on Stack Overflow. All I got was a nasty response that didn't answer the question.)
10-23-2010, 06:46 PM   #32
Starson17
Wizard
Starson17 can program the VCR without an owner's manual.
 
Posts: 4,004
Karma: 177841
Join Date: Dec 2009
Device: WinMo: IPAQ; Android: HTC HD2, Archos 7o; Java:Gravity T
Quote:
Originally Posted by marbs
Sometimes, when I get caught up in something, I forget to explain myself.
Have you tried treating each page of 30 as a separate feed?
10-24-2010, 01:28 AM   #33
marbs
All the pages have the same URL,
and I have no way to know how many pages there are.

Last edited by marbs; 10-24-2010 at 01:31 AM.
10-24-2010, 08:20 AM   #34
Starson17
Quote:
Originally Posted by marbs
All the pages have the same URL,
and I have no way to know how many pages there are.
Then this sounds like a typical multipage recipe situation. You build an internal browser, grab the first page, have it press the button, grab as much of the second page as you need, and continue recursively until you have all the pages, then feed the result to your parser. There are several multipage examples. I usually recommend looking at Adventure Gamer or just searching for "multipage" in the big sticky. Any recipe that uses "append_page" will be of interest to you. You can search the builtins (or the big sticky) to see more examples.
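As a rough illustration of "keep fetching until there are no more pages" when the page count is unknown, here is a minimal sketch. It is written as a plain loop rather than recursion, in Python 3, and it does not touch calibre or the real site: fetch_page, fake_fetch, FAKE_ARTICLES and max_pages are all hypothetical names invented for the example, and "an empty page means the end" is an assumption about how the site behaves.

```python
def collect_all_pages(fetch_page, max_pages=200):
    """Gather items from numbered result pages until one comes back empty.

    fetch_page is a hypothetical callable: page number -> list of items.
    max_pages is a safety cap so a misbehaving site cannot loop forever.
    """
    items = []
    for page_no in range(1, max_pages + 1):
        page_items = fetch_page(page_no)
        if not page_items:          # assumed end marker: an empty page
            break
        items.extend(page_items)
    return items


# Stand-in for the real site: 65 articles served 30 per page.
FAKE_ARTICLES = ['article %d' % n for n in range(1, 66)]


def fake_fetch(page_no, per_page=30):
    start = (page_no - 1) * per_page
    return FAKE_ARTICLES[start:start + per_page]


all_items = collect_all_pages(fake_fetch)
print(len(all_items))
```

In a real recipe, fetch_page would be the part that posts the page number and parses the returned HTML into article entries.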
10-24-2010, 02:28 PM   #35
marbs
I read a few examples and I think I can write the function itself.
I am not sure I know how to use it, or how to call it.

I got this far:
Code:
    def append_page(self, soup, appendtag, position):
        # Draft only: 'url' (the address of the results page) still has to be passed in.
        self.rec_index += 1                       # page number to request
        stop = soup.find('td', attrs={'height': '19'})
        reportnum = stop.findNext('b')
        num = int(self.tag_to_string(reportnum))  # number of reports, as an int
        articles_left = num - 30                  # 30 reports per page
        if articles_left > 0:
            br = self.get_browser()
            request = urllib.urlencode([('rsSearchRes_pgNo', self.rec_index)])
            raw = br.open(url, request).read()    # sending data makes this a POST
            soup2 = self.index_to_soup(raw)
            texttag = soup2.find('div', attrs={'class': 'bodytext'})
            for it in texttag.findAll(style=True):
                del it['style']
            newpos = len(texttag.contents)
            self.append_page(soup2, texttag, newpos)
            texttag.extract()
            appendtag.insert(position, texttag)


But now I am lost. I don't know where I am going with this. Can someone focus me again?

Last edited by marbs; 10-24-2010 at 04:12 PM.
10-25-2010, 10:43 AM   #36
Starson17
Quote:
Originally Posted by marbs
I read a few examples and I think I can write the function itself.
I am not sure I know how to use it, or how to call it.
But now I am lost. I don't know where I am going with this. Can someone focus me again?
I don't have much time, so I haven't looked at your function, but normally, it's used this way:
Code:
    def preprocess_html(self, soup):
        self.append_page(soup, soup.body, 3)
        return soup
This takes the article page before it's processed (at the preprocess_html stage) and uses append_page to stick the modified article page into the body of the soup. The "modified page" is the first article page, plus the content of all the subsequent pages obtained by pressing the next page button, which have been tacked onto the bottom of the first page. You will note that append_page is recursive and runs until there are no more next page buttons.

The result will be that the recipe will see a single page article with all the content from all the multiple pages before it begins to process that article.
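To make the recursion easier to follow, here is a toy sketch of the same idea with plain Python lists standing in for the soup, written in Python 3. It is not the Adventure Gamer code and not calibre's API: SITE, has_next and the pages/page_no parameters are hypothetical stand-ins for "fetch the next page" and "is there a next button".

```python
def append_page(pages, page_no, body, position):
    """Recursively splice the content of every following page into body.

    pages    -- hypothetical stand-in for the site:
                {page number: {'content': [...], 'has_next': bool}}
    body     -- list being built up (plays the role of soup.body)
    position -- index in body at which the next page's content is inserted
    """
    if not pages[page_no]['has_next']:      # no "next" button: recursion stops
        return
    next_content = list(pages[page_no + 1]['content'])
    # Recurse first, so next_content already carries all the later pages.
    append_page(pages, page_no + 1, next_content, len(next_content))
    body[position:position] = next_content


SITE = {
    1: {'content': ['p1-a', 'p1-b'], 'has_next': True},
    2: {'content': ['p2-a'], 'has_next': True},
    3: {'content': ['p3-a', 'p3-b'], 'has_next': False},
}

body = list(SITE[1]['content'])
append_page(SITE, 1, body, len(body))
print(body)
```

The call at the bottom plays the role of preprocess_html: it hands over the first page once and gets back a single body holding all three pages in order.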

Does that help?

Last edited by Starson17; 10-25-2010 at 10:49 AM.
10-25-2010, 10:55 AM   #37
marbs
I think it does help.
I'll work on the code and see what comes out.

I asked this before, but I think it was missed with all the other stuff going on. How hard would it be to make this script run in Python (not in calibre)? I still want to get the PDF files...
10-25-2010, 11:08 AM   #38
Starson17
Quote:
Originally Posted by marbs
I asked this before, but I think it was missed with all the other stuff going on. How hard would it be to make this script run in Python (not in calibre)? I still want to get the PDF files...
It wasn't missed - I ignored it

Calibre is mostly a superset of Python. I'm not sure what you're asking. You can easily run any recipe outside of the GUI with ebook-convert. You can easily import anything you need from Python. You can easily subclass any of the provided classes to override or modify program behavior. You can easily run a .py file outside the GUI with calibre-debug -e. If I understand it, I think you want to do a GET of a PDF file, perhaps run some conversion on it, etc.? I suspect it's possible, but I've never seen it done. I'd just start doing it and solve the problems as they appear. Nothing jumps out at me as impossible, provided you're willing to put in the effort, but I don't know of any stock code that will do all that you might want done (whatever that is).
10-25-2010, 01:19 PM   #39
marbs
What I meant is that calibre does 90% of the work for you (I think).
How hard would it be to rebuild most of the components that are needed to get this thing up and running?
10-25-2010, 01:32 PM   #40
Starson17
Quote:
Originally Posted by marbs
What I meant is that calibre does 90% of the work for you (I think).
How hard would it be to rebuild most of the components that are needed to get this thing up and running?
The source is available. You can easily run Calibre from source and modify it as desired. If you want to build it fresh from source, that's been done, too. I'm not sure why you'd want to, but you can. I can't tell you how many of the multiple libraries Calibre uses that you'd need to run this particular recipe.
10-25-2010, 01:47 PM   #41
marbs
OK. Then after I finish this multipage issue I would like to do that.

While I am working on this, I have another recipe. The articles have pictures. If there is a picture that is wider than the output file, the text goes over the edge too. Is there a way to shrink the picture to fit the output file, or at least to stop the text from expanding?
10-25-2010, 01:53 PM   #42
Starson17
Quote:
Originally Posted by marbs
OK. Then after I finish this multipage issue I would like to do that.

While I am working on this, I have another recipe. The articles have pictures. If there is a picture that is wider than the output file, the text goes over the edge too. Is there a way to shrink the picture to fit the output file, or at least to stop the text from expanding?
extra_css gives control. It's also controlled by the output device set in Calibre.

For comics, I often specify:
img {max-width:100%; min-width:100%;}
This works well in the viewer when read on a wide screen.
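For reference, a minimal sketch of where that rule would sit in a recipe. The class name ComicsSketch and its title are made up for the example, and the try/except around the import is only there so the snippet can also be loaded outside calibre; how a given reader device honors the CSS is not something this sketch can promise.

```python
try:
    from calibre.web.feeds.news import BasicNewsRecipe
except ImportError:
    # Outside calibre the import fails; a bare stand-in keeps the sketch loadable.
    BasicNewsRecipe = object


class ComicsSketch(BasicNewsRecipe):   # hypothetical recipe name
    title = 'Comics sketch'
    # Force every image to exactly the width of the page or viewer.
    extra_css = 'img {max-width: 100%; min-width: 100%;}'
```

With min-width included, narrow images are stretched as well as wide ones shrunk. Keeping only max-width: 100% should shrink oversized pictures and leave small ones at their natural size, which may be closer to what was asked for here.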
10-25-2010, 02:45 PM   #43
marbs
I'll give it a try.

How do I convert unicode to int? Do I need struct? Is it included in calibre?
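For what it's worth, struct should not be needed for this: the built-in int() accepts text directly (unicode objects in Python 2, str in Python 3) and ignores surrounding whitespace. A small sketch in Python 3, where to_int and its default parameter are names invented for the example:

```python
def to_int(text, default=None):
    """Convert text such as u'65' or u' 65 ' to an int.

    Returns default when the text is not a whole number.
    """
    try:
        return int(text)
    except (TypeError, ValueError):
        return default


print(to_int(u'65'))
print(to_int(u'not a number'))
```

Returning a default instead of raising is a design choice for scraping, where the tag you expected to hold a number sometimes holds something else.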
10-25-2010, 07:01 PM   #44
marbs
This is as far as I got. I don't think I am posting my request correctly, or I may not be reading the response correctly.

Just before I call append_page in the main program, I find the number of articles. I know there are 30 articles per page, which is where the 30 in append_page comes from.
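Once the total number of articles is known, the number of result pages follows from a ceiling division by 30. A minimal sketch in Python 3; pages_needed is a name invented for the example and is not part of the recipe:

```python
def pages_needed(total_articles, per_page=30):
    """Number of result pages needed to show total_articles (ceiling division)."""
    return -(-total_articles // per_page)


for total in (30, 31, 65):
    print(total, pages_needed(total))
```

Knowing the page count up front would allow a simple loop over page numbers 2..N instead of testing "articles left" on every call.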

hint?

Code:
from calibre.web.feeds.news import BasicNewsRecipe
from calibre.ebooks.BeautifulSoup import BeautifulSoup
import re, urllib, mechanize
from calibre import __appname__

class AlisonB(BasicNewsRecipe):
    title      ='Maya'
    __author__ = 'marbs'
    description = 'blah'
    language = 'en'
    no_stylesheets = True
    publisher           = 'marbs'
#    simultaneous_downloads = 1
#    delay                  = 25   
    category            = 'column'
    extra_css='body{direction: rtl;} .article_description{direction: rtl; } a.article{direction: rtl; } .calibre_feed_description{direction: rtl; }'
#    no_stylesheets = True
    use_embedded_content= False
    remove_attributes = ['width','height']
    no_stylesheets      = True
    oldest_article      = 24
    remove_javascript   = True
    remove_empty_feeds  = True
    rec_index = 0
    max_articles_per_feed = 5000
    INDEX = 'http://maya.tase.co.il/'

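    def append_page(self, url, soup, appendtag, report, position):
        print url, 'the soup is:', soup, 'appendtag is', appendtag, report, position
        articles_left = report - 30                  # 30 articles per page
        print articles_left
        br = self.get_browser()
        # Compare with the int 0: in Python 2 an int never compares greater than
        # the string '0', so the original test was always False.
        if articles_left > 0:
            request = urllib.urlencode([('rsSearchRes_pgNo', position)])   #the problem is here somewhere
            print request
            response = br.open(url, request)         # sending data makes this a POST
            raw = response.read()
            print raw
            soup2 = self.index_to_soup(raw)          # index_to_soup takes a URL or raw HTML, not a response object
            print 'this is my real test', soup2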

#           texttag = soup2.find('div', attrs={'class':'bodytext'})
#           for it in texttag.findAll(style=True):
#               del it['style']
#           newpos = len(texttag.contents)
#           self.append_page(soup2,texttag,newpos)
#           texttag.extract()
#           appendtag.insert(position,texttag)

    

    def parse_index(self):
        feeds = []
        for title, url in [
                            #(u"Feed", u"http://maya.tase.co.il/bursa/index.asp?view=search&company_group=3000&arg_comp=&srh_comp_lb=1007&srh_from=2010-01-01&srh_until=2010-09-28&srh_anaf=-1&srh_event=9999&is_urgent=0&srh_company_press="),
                            (u"הודעות מאתמול", u"http://maya.tase.co.il/bursa/index.asp?view=yesterday"),                            
                            
                             ]:
            articles = self.make_links(url)
            if articles:
                feeds.append((title, articles))
        return feeds
        
    def make_links(self, url):
        title = 'Temp'
        current_articles = []
        soup = self.index_to_soup(url)
        print 'url is', url

        print 'The soup is: ', soup
        stop = soup.find('td',attrs={'height':'19'})
        print 'the stop is', stop
        report1 = stop.contents[1].contents 
        print report1
        report2 = report1[0]
        print report2
        report3=int(report2.encode('ascii'))
        print report3
        self.append_page(url,soup, soup.body, report3,1)
        for item in soup.findAll('a',attrs={'class':'A3'}):
            print 'item is: ',item
            itemcomp= item.findPrevious('a',attrs={'id':'CompNmHref'})
            #link = item.find('a')
            #titlecheck = self.tag_to_string(link)
            #url_test = re.search('javascript', item['href'])
           
            if not re.search('javascript', item['href']):
              temp2= self.INDEX + 'bursa/' + item['href']
        #      temp2=[temp3]
              print 'url1 is', temp2
              soup1 = self.index_to_soup(temp2)
  #            print 'the new soup is', soup1
              print '6714' 
              for item1 in soup1.findAll('iframe'):
                 print 'item1 is:' , item1
 #                print soup.item.previous.previousSibling
                 txt= item1['src']
                 print 'FOUND GOOD URL'
                 re1='.*?'	# Non-greedy match on filler
                 re2='(mayafiles)'	# Variable Name 1
                 re3='(.)'	# Any Single Character 1
                 re4='.*?'	# Non-greedy match on filler
                 re5='htm'	# Uninteresting: word
                 re6='.*?'	# Non-greedy match on filler
                 re7='(htm)'	# Word 1
                 
                 rg = re.compile(re1+re2+re3+re4+re5+re6+re7,re.IGNORECASE|re.DOTALL)
                 m = rg.search(txt)
                 if m:
                     var1=m.group(1)
                     c1=m.group(2)
                     word1=m.group(3)
                     print "("+var1+")"+"("+c1+")"+"("+word1+")"+"\n"
                     url = item1['src']
                 else:
                     url = 'http://www.pdfdownload.org/pdf2html/pdf2html.php?url=' + item1['src'] + '&images=yes'

                 print 'url is: ', url

                 title       = self.tag_to_string(itemcomp)+ ' - ' + self.tag_to_string(item)
                 print 'title is: ', title
                 current_articles.append({'title': title, 'url': url, 'description':'', 'date':''}) # append all this
            
           
        return current_articles

Last edited by marbs; 10-26-2010 at 01:24 AM.
10-26-2010, 09:44 AM   #45
Starson17
Quote:
Originally Posted by marbs
This is as far as I got. I don't think I am posting my request correctly, or I may not be reading the response correctly. ... hint?
Without studying your page and code, I can't help a lot. However, you need to match what happens in a browser that works (typically Firefox). In the browser, you can watch the HTTP traffic with the Tamper Data or Live HTTP Headers plugin. You can recreate that traffic with mechanize in the recipe, but you need something that lets you verify what you are sending out and how the site is responding. I use these options:
Code:
# Log information about HTTP redirects and Refreshes.
br.set_debug_redirects(True)
# Log HTTP response bodies (ie. the HTML, most of the time).
br.set_debug_responses(True)
# Print HTTP headers.
br.set_debug_http(True)
Comparing what you see in the recipe to what you see in Tamper Data should get you to the "next 30" in your recipe; then you just need to make sure it's being assembled into a single page correctly.
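One way to check what would be sent before involving the site at all is to build the request and inspect it. The sketch below uses the Python 3 standard library rather than the recipe's mechanize browser and Python 2 urllib, and build_page_request is a name invented for the example. It also assumes, as reported earlier in the thread but not verified, that rsSearchRes_pgNo is the only parameter that matters.

```python
from urllib.parse import urlencode
from urllib.request import Request


def build_page_request(url, page_no):
    """Build (but do not send) the POST that asks for results page page_no."""
    body = urlencode([('rsSearchRes_pgNo', page_no)]).encode('ascii')
    return Request(url, data=body)   # supplying data turns the request into a POST


req = build_page_request('http://maya.tase.co.il/bursa/index.asp?view=yesterday', 2)
print(req.get_method())
print(req.data)
```

The printed method and body can be compared by eye with what Tamper Data shows for the "next" button; if the site turns out to need more of the original parameters, they would be added to the list passed to urlencode.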