Old 08-14-2012, 06:56 AM   #1
Steven630
Groupie
Steven630 began at the beginning.
 
Posts: 154
Karma: 10
Join Date: May 2012
Device: Kindle Paperwhite2
Fetching multi-page articles (solved)

UPDATE: Problem solved thanks to kiklop74.

Also many thanks to lrui (who also spent a lot of time looking into the issue) and kovidgoyal.



I've read previous threads about multi-page fetching, but they didn't solve the problem I have right now.
I've been trying to fetch articles from a website. If an article has only one page, all is well. If, however, an article has more than one page:

1. There is a clickable "previous page" button on every page, even on the first page (in this case, clicking the button takes you to the page you are already on).

2. Likewise, there is a clickable "next page" button even on the last page (clicking it when you are already on the last page simply returns you to the last page).

3. There's no option for "single page".

Points 1 and 2 make it difficult to fetch multi-page articles using append_page.

Here's what the page buttons look like (on the first page of a four-page article):
Code:
<div id="pages" class="text-c">
<a class="a1" href="original link">previous page</a> <span>1</span>
<a href="original link + &page=2">2</a>
<a href="original link + &page=3">3</a>
<a href="original link + &page=4">4</a>
<a class="a1" href="original link + &page=2">next page</a></div>


(To make it clearer, I replaced the actual article link with "original link". "original link + &page=2" is actually something like http://.......&id=2352&page=2)


Therefore it's something like:
previous page 1 2 3 4 next page

That makes 6 buttons on every page.


Can anyone tell me how I should revise the recipe to fetch all pages?
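
One way to sidestep the misleading "previous page" and "next page" buttons is to ignore them and read the page numbers straight from the numbered links. The following is a minimal sketch under assumptions, not necessarily the fix that eventually resolved this thread: the page_urls helper, the example.com address and the sample markup are made up for illustration, and it assumes that every page of the article is listed in the pager and that page=N is the last query parameter of each link.

```python
import re


def page_urls(pager_html):
    """Return the hrefs of the numbered pages (2..N) found in the pager.

    Hypothetical helper: it only looks at links whose URL ends in page=N,
    so the "previous page" link on page 1 (which has no page parameter)
    is skipped, and the duplicate "next page" link collapses into page 2.
    """
    found = {}
    for href, num in re.findall(r'href="([^"]*[?&]page=(\d+))"', pager_html):
        found[int(num)] = href
    return [found[n] for n in sorted(found)]


# Sample pager modelled on the markup quoted above (the URL is an assumption).
BASE = "http://example.com/article?id=2352"
PAGER = f'''<div id="pages" class="text-c">
<a class="a1" href="{BASE}">previous page</a> <span>1</span>
<a href="{BASE}&page=2">2</a>
<a href="{BASE}&page=3">3</a>
<a href="{BASE}&page=4">4</a>
<a class="a1" href="{BASE}&page=2">next page</a></div>'''

for url in page_urls(PAGER):
    print(url)
```

In a recipe, each returned URL could then be fetched once with index_to_soup and its body appended to the first page, so no "next page" link ever needs to be followed. Pagers that abbreviate long page lists would need a different approach.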

Last edited by Steven630; 08-22-2012 at 06:59 AM. Reason: Problem solved thanks to everyone's help.
Old 08-14-2012, 09:45 AM   #2
lrui
Enthusiast
Posts: 49
Karma: 475062
Join Date: Aug 2012
Device: nook simple touch
Quote:
Originally Posted by Steven630
(The article in question is: http://www.ittime.com.cn/index.php?m...tid=29&id=2352. It's in Chinese; I've translated "上一页" and "下一页" as "previous page" and "next page" in the code above.)

Can anyone tell me how I should revise the recipe to fetch all pages?

Here is a good example I found in the AdventureGamers recipe, which can serve as a reference for you.

Code:
    def append_page(self, soup, appendtag, position):
        # Look for the pagination block on the current page
        pager = soup.find('div', attrs={'class':'pagination_big'})
        if pager:
            nextpage = soup.find('a', attrs={'class':'next-page'})
            if nextpage:
                # Fetch the next page and pull out its article body
                nexturl = nextpage['href']
                soup2 = self.index_to_soup(nexturl)
                texttag = soup2.find('div', attrs={'class':'bodytext'})
                # Drop inline styles from the fetched body
                for it in texttag.findAll(style=True):
                    del it['style']
                newpos = len(texttag.contents)
                # Recurse so that any further pages are appended to this body
                self.append_page(soup2,texttag,newpos)
                texttag.extract()
                pager.extract()
                # Insert the collected body into the original article
                appendtag.insert(position,texttag)
You can use Firebug in Firefox to locate the corresponding tags on your site and substitute them for the ones above.

Last edited by lrui; 08-14-2012 at 09:27 PM.
Old 08-14-2012, 10:15 PM   #3
lrui
Code:
    def append_page(self, soup, appendtag, position):
        pager = soup.find('div', attrs={'class':'text-c'})
        if pager:
            nextpage = soup.find('a', attrs={'class':'a1'})
            if nextpage:
                nexturl = nextpage['href']
                soup2 = self.index_to_soup(nexturl)
                texttag = soup2.find('div', attrs={'class':'content_left_5'})
                for it in texttag.findAll(style=True):
                    del it['style']
                newpos = len(texttag.contents)
                self.append_page(soup2,texttag,newpos)
                texttag.extract()
                pager.extract()
                appendtag.insert(position,texttag)
I wonder if it works?

Last edited by lrui; 08-14-2012 at 10:22 PM.
Old 08-15-2012, 02:25 AM   #4
Steven630
Thank you, lrui. I did try to make a recipe by imitating that of AdventureGamers before starting the thread, but it didn't work. And your recipe failed as well. I think this is because:

a. Since find... only finds the first match and stops, it will only find the "original link", not the link to the next page. (That's because the "previous page" button exists even on the first page, an anomaly that other websites don't have.)

b. Because the "next page" button appears even on the last page, a "nexturl" will always be found. (The recipe assumes that the "next page" button does not appear on the last page, or is unclickable. But here there's no way to tell Calibre that it has already fetched all pages, so it would just loop, fetching the last page over and over.)

To get around "a" and "b", I've tried something like this:

Code:
    def append_page(self, soup, appendtag, position):
        pager = soup.find('a', attrs={'class':'a1'})
        if pager:
            pt = pager.findNextSibling('a')
            nexturl = pt['href']
            soup2 = self.index_to_soup(nexturl)
            texttag = soup2.find('div', attrs={'class':'content_left_5'})
            newpos = len(texttag.contents)
            self.append_page(soup2,texttag,newpos)
            texttag.extract()
            appendtag.insert(position,texttag)

    def preprocess_html(self, soup):
        self.append_page(soup, soup.body, 3)
        pager = soup.find('div', attrs={'class':'text-c'})
        if pager:
            pager.extract()
        return self.adeify_images(soup)
Anyway, this method should in theory at least fetch the second page. But when I tried it out, I found no sign whatsoever of it making a difference. The log and the downloaded file all look exactly the same as if the code had not been applied at all.

Which makes me wonder whether the approach is applicable in the first place. The AdventureGamers recipe is based on RSS, while my recipe is based on index parsing. This may explain the failure of the method.

All previous discussions on multi-page fetching appear to mention the AdventureGamers recipe somehow. But nobody seems to have succeeded. Given how unusual the article I'm trying to fetch is, I don't think the method is going to work here even if it works on other websites.
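
Point "b" above (a "next page" link that never disappears) can be illustrated without Calibre. The sketch below is a simulation under assumptions: collect_pages, the SITE dictionary and its URLs are hypothetical stand-ins for real fetching, and the stop rule (quit as soon as a URL repeats) is just one possible guard, not something taken from an existing recipe.

```python
def collect_pages(start_url, fetch, max_pages=50):
    """Follow "next" links until a URL repeats, returning page bodies in order.

    fetch(url) must return a (body, next_url) pair. max_pages is an extra
    safety cap in case the site keeps producing new URLs forever.
    """
    bodies, seen, url = [], set(), start_url
    while url not in seen and len(bodies) < max_pages:
        seen.add(url)
        body, next_url = fetch(url)
        bodies.append(body)
        url = next_url
    return bodies


# Hypothetical four-page article: the last page's "next" link points to itself,
# which is exactly what would make an unguarded recursion loop forever.
SITE = {
    "art": ("page 1", "art&page=2"),
    "art&page=2": ("page 2", "art&page=3"),
    "art&page=3": ("page 3", "art&page=4"),
    "art&page=4": ("page 4", "art&page=4"),
}

pages = collect_pages("art", SITE.__getitem__)
print(pages)
```

The same idea would carry over to append_page by remembering which URLs have already been fetched and returning early when the next URL is one of them.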

Last edited by Steven630; 08-15-2012 at 02:59 AM.
Old 08-15-2012, 04:23 AM   #5
lrui
As you can see in the picture below, there are two class="a1" links, but AdventureGamers has only one class="next-page" link.

So there is an issue with both your code and mine. I think you can use match_regexps to match the next-page links and set recursions to some number.

# Only one of BasicNewsRecipe.match_regexps or BasicNewsRecipe.filter_regexps should be defined.
match_regexps = [r'&page=[0-9]+']

This would match links such as:

Code:
original-link&page=4
original-link&page=3
original-link&page=2
original-link&page=1

# You should set recursions = 1. recursions = n means that links are followed up to depth n.
recursions = 1

http://manual.calibre-ebook.com/news....match_regexps
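
To check which links a pattern like the one above would select, it can be tried outside Calibre first. This is only a sketch: is_wanted is a hypothetical helper, the example.com URLs are made up, and it assumes the patterns are applied with "search" semantics (a match anywhere in the URL), which should be verified against the manual page linked above.

```python
import re

# Same pattern as suggested above for the recipe.
match_regexps = [r'&page=[0-9]+']


def is_wanted(url, patterns=match_regexps):
    """Return True if any pattern matches somewhere in the URL."""
    return any(re.search(p, url) for p in patterns)


candidates = [
    "http://example.com/article?id=2352",          # first page, no page parameter
    "http://example.com/article?id=2352&page=2",
    "http://example.com/article?id=2352&page=4",
    "http://example.com/about",
]
for url in candidates:
    print(url, is_wanted(url))
```

Note that the first page has no page parameter in its URL, so it is not matched by this pattern; only the follow-up pages are.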

Attachment 90791: iwb2R8b5RfoMG.png (22.7 KB)

Last edited by lrui; 08-15-2012 at 05:59 AM. Reason: attached oversized image
Old 08-15-2012, 06:00 AM   #6
lrui
Here is another approach you can try.

Beautiful Soup documentation:
http://www.crummy.com/software/Beaut...%20**kwargs%29

Attachment 90791
Code:
    def append_page(self, soup, appendtag, position):
        pager = soup.find('div', attrs={'class':'text-c'})
        if pager:
            pagenum = soup.find('span')
            nextpage = pagenum.findNextSiblings('a', attrs={'class':'a1'})
            if nextpage:
                nexturl = nextpage['href']
                soup2 = self.index_to_soup(nexturl)
                texttag = soup2.find('div', attrs={'class':'content_left_5'})
                for it in texttag.findAll(style=True):
                    del it['style']
                newpos = len(texttag.contents)
                self.append_page(soup2,texttag,newpos)
                texttag.extract()
                pager.extract()
                appendtag.insert(position,texttag)
I don't know whether it works. Please tell me.
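
The idea of starting from the current-page span can be tried on sample markup with only the Python standard library. This is a sketch under assumptions, using html.parser instead of the BeautifulSoup calls in the recipe: NextPageFinder and next_page_url are hypothetical names, the example.com URL is made up, and the markup of the last page (LAST) is a guess based on the description in the first post. Unlike the code above, it takes the first numbered link after the span rather than the class="a1" button, so it returns nothing on the last page.

```python
from html.parser import HTMLParser


class NextPageFinder(HTMLParser):
    """Find the first numbered link after the current-page <span> in div#pages."""

    def __init__(self):
        super().__init__()
        self.in_pager = False
        self.after_span = False
        self.next_url = None

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "div" and attrs.get("id") == "pages":
            self.in_pager = True
        elif not self.in_pager:
            return
        elif tag == "span":
            # The current page is shown as a <span>, not as a link.
            self.after_span = True
        elif (tag == "a" and self.after_span and self.next_url is None
              and attrs.get("class") != "a1"):
            # Skip the class="a1" buttons; they are present on every page.
            self.next_url = attrs.get("href")

    def handle_endtag(self, tag):
        # Simplification: any closing </div> ends the pager.
        if tag == "div":
            self.in_pager = False


def next_page_url(html):
    """Return the URL of the following page, or None on the last page."""
    finder = NextPageFinder()
    finder.feed(html)
    finder.close()
    return finder.next_url


BASE = "http://example.com/article?id=2352"
FIRST = f'''<div id="pages" class="text-c">
<a class="a1" href="{BASE}">previous page</a> <span>1</span>
<a href="{BASE}&page=2">2</a> <a href="{BASE}&page=3">3</a>
<a href="{BASE}&page=4">4</a>
<a class="a1" href="{BASE}&page=2">next page</a></div>'''
LAST = f'''<div id="pages" class="text-c">
<a class="a1" href="{BASE}&page=3">previous page</a>
<a href="{BASE}">1</a> <a href="{BASE}&page=2">2</a>
<a href="{BASE}&page=3">3</a> <span>4</span>
<a class="a1" href="{BASE}&page=4">next page</a></div>'''

print(next_page_url(FIRST))
print(next_page_url(LAST))
```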

Last edited by lrui; 08-15-2012 at 06:09 AM.
Old 08-15-2012, 09:16 AM   #7
Steven630
Thanks. Yet again, it failed.

After I started downloading, nothing indicated that Calibre had found "span" or "div" etc. I suspect this method won't work however hard we try. That is, it's not the two class="a1" links or other mistakes that led to the failure, but the method itself. (Yes, there are two class="a1" links, but what counts when you use find... in BeautifulSoup is the first one, so the second class="a1" is ignored once the first one is found.) In theory at least, your method of finding "span" and so on should work, but it didn't. What do you think?

As for match_regexps, that didn't work either, although I'm not sure whether simply adding "match_regexps" and "recursions" to the recipe is enough. Wait, it seems that match_regexps is not meant for multi-page articles in the first place...
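
The difference between "first match" and "all matches" lookups mentioned above can be seen in a small standalone example. It uses the standard library's xml.etree.ElementTree purely as a stand-in, because its find and findall have the same split as BeautifulSoup's find (one element or None) and findAll / findNextSiblings (a list); the markup and the p1/p2 hrefs are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Simplified pager with two class="a1" links, as on the site discussed here.
PAGER = (
    '<div id="pages">'
    '<a class="a1" href="p1">previous page</a><span>1</span>'
    '<a href="p2">2</a>'
    '<a class="a1" href="p2">next page</a>'
    '</div>'
)
root = ET.fromstring(PAGER)

first = root.find("a[@class='a1']")      # a single element: the first a1 link
every = root.findall("a[@class='a1']")   # a list containing both a1 links

print(first.text, first.get("href"))
print([(a.text, a.get("href")) for a in every])
```

A list result has to be indexed (for example every[-1]) before an attribute such as href can be read from it; treating the list itself as a tag fails.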
Old 08-15-2012, 09:48 AM   #8
lrui
Change soup.find to soup.findAll:

pagenum = soup.findAll('span')

Can you try it again?
Old 08-15-2012, 11:17 AM   #9
Steven630
I will try again tomorrow, but I don't think this will solve the problem. findAll won't work that way. By the way, are you Chinese?
Old 08-15-2012, 11:27 AM   #10
lrui
Yes. My English is rather shaky.
Old 08-15-2012, 11:34 AM   #11
Steven630
No, it's pretty good. Anyway, I'm Chinese too. If nobody else chips in, I guess we can talk in our mother tongue when we find it hard to express ourselves in English. That'll save us a lot of trouble. Nice to see you here.
Old 08-16-2012, 05:57 AM   #12
Steven630
As expected, that didn't work.
Old 08-16-2012, 12:07 PM   #13
lrui
Weird. I suggest you look at the other built-in recipes, which you can unpack from the resources in the calibre2 directory, and search them:

grep append

Or ask Kovid for help.
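
For anyone without grep at hand, the same search can be done in a few lines of Python. This is a sketch under assumptions: recipes_mentioning is a hypothetical helper, and it assumes the built-in recipes have already been unpacked into a folder of .recipe files whose location you pass in.

```python
from pathlib import Path


def recipes_mentioning(folder, needle="append_page"):
    """Return the names of .recipe files in folder that contain needle."""
    hits = []
    for path in sorted(Path(folder).glob("*.recipe")):
        # errors="replace" keeps the scan going if a file has odd encoding.
        text = path.read_text(encoding="utf-8", errors="replace")
        if needle in text:
            hits.append(path.name)
    return hits
```

Calling recipes_mentioning with the folder of unpacked recipes lists every recipe that defines or calls append_page, which gives several worked examples to compare against.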

Last edited by lrui; 08-16-2012 at 12:09 PM.
Old 08-16-2012, 11:04 PM   #14
Steven630
Thank you. I will take a look at the recipes you mentioned. Kovid may be too busy to help me out.
Old 08-17-2012, 06:55 AM   #15
Steven630
What exactly is the name of the recipe? grep append?