Fetching multi-page articles

Steven630 · 08-14-2012, 06:56 AM

UPDATE: Problem solved thanks to kiklop74.

Also many thanks to lrui (who also spent a lot of time looking into the issue) and kovidgoyal.

I've read previous threads about multi-page fetching, but they didn't solve the problem I have right now.
I've been trying to fetch articles from a website. If an article has only one page, all is well. If ,however, there is more than one page in an article:

1. There is a clickable "previous page" button on very page, even on the first page (in this case, clicking this button takes you to the same link you are browsing)

2. Likewise, there is a clickable "next page" button even on the last page (when you click "next page" when you are already on the last page, it simply returns you to the last page)

3. There's no option for "single page".

1 and 2 makes it difficult to fetch multi-page articles using append_page.

Here's how the page buttons look like (on the first page of a four-page article):

Code:

<div id="pages" class="text-c">
<a class="a1" href="original link">previous page</a> <span>1</span>
<a href="original link + &page=2">2</a>
<a href="original link + &page=3">3</a>
<a href="original link + &page=4">4</a>
<a class="a1" href="original link + &page=2">next page</a></div>

(To make it clearer, I replaced the actual article link with "original link". original link + &page=2 is actually something like http://.......&id=2352&page=2)

Therefore it's something like:
previous page 1 2 3 4 next page

6 buttons on every page

Anyone can tell me how I should revise the recipe to fetch all pages?

lrui · 08-14-2012, 09:45 AM

Quote:

Originally Posted by Steven630

I've read previous threads about multi-page fetching, but they didn't solve the problem I have right now.
I've been trying to fetch articles from a website. If an article has only one page, all is well. If ,however, there is more than one page in an article:

1. There is a clickable "previous page" button on very page, even on the first page (in this case, clicking this button takes you to the same link you are browsing)

2. Likewise, there is a clickable "next page" button even on the last page (when you click "next page" when you are already on the last page, it simply returns you to the last page)

3. There's no option for "single page".

1 and 2 makes it difficult to fetch multi-page articles using append_page.

Here's how the page buttons look like (on the first page of a four-page article):

Code:

<div id="pages" class="text-c">
<a class="a1" href="original link">previous page</a> <span>1</span>
<a href="original link + &page=2">2</a>
<a href="original link + &page=3">3</a>
<a href="original link + &page=4">4</a>
<a class="a1" href="original link + &page=2">next page</a></div>

(To make it clearer, I replaced the actual article link with "original link". original link + &page=2 is actually something like http://.......&id=2352&page=2)

Therefore it's something like:
previous page 1 2 3 4 next page

6 buttons on every page

(The article in question is : http://www.ittime.com.cn/index.php?m...tid=29&id=2352) It's in Chinese, I've translated "ä¸Šä¸€é¡µ" and "ä¸‹ä¸€é¡µ" into "previous page" and "next page" in the previous codes.

Anyone can tell me how I should revise the recipe to fetch all pages?

Here is a good example i find in the recipe of AdventureGamers which can serve as reference material for you.

Code:

    def append_page(self, soup, appendtag, position):
        pager = soup.find('div', attrs={'class':'pagination_big'})
        if pager:
           nextpage = soup.find('a', attrs={'class':'next-page'})
           if nextpage:
               nexturl = nextpage['href']
               soup2 = self.index_to_soup(nexturl)
               texttag = soup2.find('div', attrs={'class':'bodytext'})
               for it in texttag.findAll(style=True):
                   del it['style']
               newpos = len(texttag.contents)
               self.append_page(soup2,texttag,newpos)
               texttag.extract()
               pager.extract()
               appendtag.insert(position,texttag)

You could use Firebug to locate above the corresponding label in Firefox，and replace by yours.

lrui · 08-14-2012, 10:15 PM

Code:

    def append_page(self, soup, appendtag, position):
        pager = soup.find('div', attrs={'class':'text-c'})
        if pager:
           nextpage = soup.find('a', attrs={'class':'a1'})
           if nextpage:
               nexturl = nextpage['href']
               soup2 = self.index_to_soup(nexturl)
               texttag = soup2.find('div', attrs={'class':'content_left_5'})
               for it in texttag.findAll(style=True):
                   del it['style']
               newpos = len(texttag.contents)
               self.append_page(soup2,texttag,newpos)
               texttag.extract()
               pager.extract()
               appendtag.insert(position,texttag)

I wonder if it works？

Steven630 · 08-15-2012, 02:25 AM

Thank you, lrui. I did try to make a recipe by imitating that of AdventureGamers before starting the thread, but it didn't work. And your recipe failed as well. I think this is because:

a. Since find... only finds the first object and stops, it will only find the "original link", not the link of next page. (That's the result of the fact that "previous page" button even exists on the first page, an anomaly that other websites don't have.)

b. The fact that "next page" button appears even on the last page would mean that a "nexturl" would always be found. (The recipe assumes that the "next page" button would not appear on the last page, or is unclikable. But the here there's no way to tell Calibre that it has already fetched all pages, and it would just go in loops, fetching the last page all the time).

In order to get around "a" and "b". I've tried something like this:

Code:

    def append_page(self, soup, appendtag, position):
        pager = soup.find('a', attrs={'class':'a1'})
        if pager:
           pt = pager.findNextSibling('a')
           nexturl = pt['href']
           soup2 = self.index_to_soup(nexturl)
           texttag = soup2.find('div', attrs={'class':'content_left_5'})
           newpos = len(texttag.contents)
           self.append_page(soup2,texttag,newpos)
           texttag.extract()
           appendtag.insert(position,texttag)

    def preprocess_html(self, soup):
        self.append_page(soup, soup.body, 3)
        pager = soup.find('div', attrs={'class':'text-c'})
        if pager:
           pager.extract()
        return self.adeify_images(soup)

Anyway, this method would in theory at least fetch the second page. But while trying it out, I found no sign whatsoever of it making a difference. The log, the downloaded file—all seems extactly the same as if the code were not applied at all.

Which makes me wonder whether the recipe is applicable in the first place. The recipe of AdventureGamers is based on rss, while my recipe is based on index-parsing. This may explain the failure of the method.

All previous discussions on multi-page fetching appears to mention AdventureGamers recipe somehow. But nobody seemed to have succeeded. Given the unusualness of the specific article I'm trying to fetch, I don't think the method is going to work even it works on other websites.

lrui · 08-15-2012, 04:23 AM

Quote:

Originally Posted by Steven630

Thank you, lrui. I did try to make a recipe by imitating that of AdventureGamers before starting the thread, but it didn't work. And your recipe failed as well. I think this is because:

a. Since find... only finds the first object and stops, it will only find the "original link", not the link of next page. (That's the result of the fact that "previous page" button even exists on the first page, an anomaly that other websites don't have.)

b. The fact that "next page" button appears even on the last page would mean that a "nexturl" would always be found. (The recipe assumes that the "next page" button would not appear on the last page, or is unclikable. But the here there's no way to tell Calibre that it has already fetched all pages, and it would just go in loops, fetching the last page all the time).

In order to get around "a" and "b". I've tried something like this:

Code:

    def append_page(self, soup, appendtag, position):
        pager = soup.find('a', attrs={'class':'a1'})
        if pager:
           pt = pager.findNextSibling('a')
           nexturl = pt['href']
           soup2 = self.index_to_soup(nexturl)
           texttag = soup2.find('div', attrs={'class':'content_left_5'})
           newpos = len(texttag.contents)
           self.append_page(soup2,texttag,newpos)
           texttag.extract()
           appendtag.insert(position,texttag)
 
    def preprocess_html(self, soup):
        self.append_page(soup, soup.body, 3)
        pager = soup.find('div', attrs={'class':'text-c'})
        if pager:
           pager.extract()
        return self.adeify_images(soup)

Anyway, this method would in theory at least fetch the second page. But while trying it out, I found no sign whatsoever of it making a difference. The log, the downloaded file—all seems extactly the same as if the code were not applied at all.

Which makes me wonder whether the recipe is applicable in the first place. The recipe of AdventureGamers is based on rss, while my recipe is based on index-parsing. This may explain the failure of the method.

All previous discussions on multi-page fetching appears to mention AdventureGamers recipe somehow. But nobody seemed to have succeeded. Given the unusualness of the specific article I'm trying to fetch, I don't think the method is going to work even it works on other websites.

as you can see blow the pictres,there are two class="a1",but the AdventureGamers only has one class="nextpage"

so there is some issue with your code and mine. i think you can use match_regexps to match the next page link and set recursions to some number.

#Only one of BasicNewsRecipe.match_regexps or BasicNewsRecipe.filter_regexps should be defined.
match_regexps = [r'&page=[0-9]+']

Code:

original-link&page=4
original-link&page=3
original-link&page=2
original-link&page=1

#You should set recursions = 1. recursion= n means that links are followed upto depth n
recursions = 1

http://manual.calibre-ebook.com/news....match_regexps

Click image for larger version

Name: iwb2R8b5RfoMG.png
Views: 309
Size: 22.7 KB
ID: 90791

lrui · 08-15-2012, 06:00 AM

this is another way, you can try it

BEAUTIFUL SOUP DOCUMENTAION
http://www.crummy.com/software/Beaut...%20**kwargs%29

Attachment 90791

Code:

 def append_page(self, soup, appendtag, position):
        pager = soup.find('div', attrs={'class':'text-c'})
        if pager:
           pagenum = soup.find('span')
           nextpage = pagenum.findNextSiblings('a', attrs={'class':'a1'})
           if nextpage:
               nexturl = nextpage['href']
               soup2 = self.index_to_soup(nexturl)
               texttag = soup2.find('div', attrs={'class':'content_left_5'})
               for it in texttag.findAll(style=True):
                   del it['style']
               newpos = len(texttag.contents)
               self.append_page(soup2,texttag,newpos)
               texttag.extract()
               pager.extract()
               appendtag.insert(position,texttag)

i don‘t know whether it works，please tell me.

Steven630 · 08-15-2012, 09:16 AM

Quote:

Originally Posted by lrui

this is another way, you can try it

BEAUTIFUL SOUP DOCUMENTAION
http://www.crummy.com/software/Beaut...%20**kwargs%29

Attachment 90791

Code:

 def append_page(self, soup, appendtag, position):
        pager = soup.find('div', attrs={'class':'text-c'})
        if pager:
           pagenum = soup.find('span')
           nextpage = pagenum.findNextSiblings('a', attrs={'class':'a1'})
           if nextpage:
               nexturl = nextpage['href']
               soup2 = self.index_to_soup(nexturl)
               texttag = soup2.find('div', attrs={'class':'content_left_5'})
               for it in texttag.findAll(style=True):
                   del it['style']
               newpos = len(texttag.contents)
               self.append_page(soup2,texttag,newpos)
               texttag.extract()
               pager.extract()
               appendtag.insert(position,texttag)

i don‘t know whether it works，please tell me.

Thanks. Yet again, it failed.

After I started downloading, nothing indicated that Calibre had found "span" or "div" etc. I suspect this method won't work however hard we try. That is, it's not two class="a1" or other mistakes that led to the failure, but the method in the first place. (Yes, there are two class="a1", but what counts when you use find... in beautifulsoup is the first one. So the second class="a1" would be ignored when the first one is found.) And in theory at least, your method to find "span" and so on should work, but didn't. What do you think?

As for match_regexps, that didn't work either, although I'm not sure if simply adding "match_regexps" and "recursion" to the recipe is enough. Wait, seems that match_regexps is not for multi-page articles in the first place...

lrui · 08-15-2012, 09:48 AM

Quote:

Originally Posted by Steven630

Thanks. Yet again, it failed.

After I started downloading, nothing indicated that Calibre had found "span" or "div" etc. I suspect this method won't work however hard we try. That is, it's not two class="a1" or other mistakes that led to the failure, but the method in the first place. (Yes, there are two class="a1", but what counts when you use find... in beautifulsoup is the first one. So the second class="a1" would be ignored when the first one is found.) And in theory at least, your method to find "span" and so on should work, but didn't. What do you think?

As for match_regexps, that didn't work either, although I'm not sure if simply adding "match_regexps" and "recursion" to the recipe is enough. Wait, seems that match_regexps is not for multi-page articles in the first place...

pagenum = soup.findAll('span')

change soup.find into soup.findAll

try it again?

Steven630 · 08-15-2012, 11:17 AM

I will try again tomorrow, but I don't think this will solve the problem. Findall won't work in this way.By the way, are you Chinese?

lrui · 08-15-2012, 11:27 AM

Quote:

Originally Posted by Steven630

I will try again tomorrow, but I don't think this will solve the problem. Findall won't work in this way.By the way, are you Chinese?

yes，My English is rather shaky.

Steven630 · 08-15-2012, 11:34 AM

No, it's pretty good. Anyway, I'm Chinese too. If nobody else chips in, I guess we can talk in our mother tongue when we find it hard to express ourselves in English. That'll save us a lot of trouble. Nice to see you here.

Steven630 · 08-16-2012, 05:57 AM

As expected, that didn't work.

lrui · 08-16-2012, 12:07 PM

Quote:

Originally Posted by Steven630

As expected, that didn't work.

weird，I suggest that you refer to other buildin recipes，uncompresse from the resource in calibre2 directory

grep append

or ask help for kovid

Steven630 · 08-16-2012, 11:04 PM

Thank you. I will take a look at the recipe you mentioned. Kvoid may be too busy to help me out.

Steven630 · 08-17-2012, 06:55 AM

Quote:

Originally Posted by lrui

weird，I suggest that you refer to other buildin recipes，uncompresse from the resource in calibre2 directory

grep append

or ask help for kovid

What exactly is the name of the recipe? grep append?

08-14-2012, 06:56 AM	#1
Steven630 Groupie Posts: 154 Karma: 10 Join Date: May 2012 Device: Kindle Paperwhite2	Fetching multi-page articles (solved) UPDATE: Problem solved thanks to kiklop74. Also many thanks to lrui (who also spent a lot of time looking into the issue) and kovidgoyal. I've read previous threads about multi-page fetching, but they didn't solve the problem I have right now. I've been trying to fetch articles from a website. If an article has only one page, all is well. If ,however, there is more than one page in an article: 1. There is a clickable "previous page" button on very page, even on the first page (in this case, clicking this button takes you to the same link you are browsing) 2. Likewise, there is a clickable "next page" button even on the last page (when you click "next page" when you are already on the last page, it simply returns you to the last page) 3. There's no option for "single page". 1 and 2 makes it difficult to fetch multi-page articles using append_page. Here's how the page buttons look like (on the first page of a four-page article): Code: <div id="pages" class="text-c"> <a class="a1" href="original link">previous page</a> <span>1</span> <a href="original link + &page=2">2</a> <a href="original link + &page=3">3</a> <a href="original link + &page=4">4</a> <a class="a1" href="original link + &page=2">next page</a></div> (To make it clearer, I replaced the actual article link with "original link". original link + &page=2 is actually something like http://.......&id=2352&page=2) Therefore it's something like: previous page 1 2 3 4 next page 6 buttons on every page Anyone can tell me how I should revise the recipe to fetch all pages? Last edited by Steven630; 08-22-2012 at 06:59 AM. Reason: Problem solved thanks to everyone's help.

08-15-2012, 02:25 AM	#4
Steven630 Groupie Posts: 154 Karma: 10 Join Date: May 2012 Device: Kindle Paperwhite2	Thank you, lrui. I did try to make a recipe by imitating that of AdventureGamers before starting the thread, but it didn't work. And your recipe failed as well. I think this is because: a. Since find... only finds the first object and stops, it will only find the "original link", not the link of next page. (That's the result of the fact that "previous page" button even exists on the first page, an anomaly that other websites don't have.) b. The fact that "next page" button appears even on the last page would mean that a "nexturl" would always be found. (The recipe assumes that the "next page" button would not appear on the last page, or is unclikable. But the here there's no way to tell Calibre that it has already fetched all pages, and it would just go in loops, fetching the last page all the time). In order to get around "a" and "b". I've tried something like this: Code: def append_page(self, soup, appendtag, position): pager = soup.find('a', attrs={'class':'a1'}) if pager: pt = pager.findNextSibling('a') nexturl = pt['href'] soup2 = self.index_to_soup(nexturl) texttag = soup2.find('div', attrs={'class':'content_left_5'}) newpos = len(texttag.contents) self.append_page(soup2,texttag,newpos) texttag.extract() appendtag.insert(position,texttag) def preprocess_html(self, soup): self.append_page(soup, soup.body, 3) pager = soup.find('div', attrs={'class':'text-c'}) if pager: pager.extract() return self.adeify_images(soup) Anyway, this method would in theory at least fetch the second page. But while trying it out, I found no sign whatsoever of it making a difference. The log, the downloaded file—all seems extactly the same as if the code were not applied at all. Which makes me wonder whether the recipe is applicable in the first place. The recipe of AdventureGamers is based on rss, while my recipe is based on index-parsing. This may explain the failure of the method. All previous discussions on multi-page fetching appears to mention AdventureGamers recipe somehow. But nobody seemed to have succeeded. Given the unusualness of the specific article I'm trying to fetch, I don't think the method is going to work even it works on other websites. Last edited by Steven630; 08-15-2012 at 02:59 AM.

Similar Threads
Thread	Thread Starter	Forum	Replies	Last Post
Problem: Recipe for Foreign Affairs not fetching premium articles	besianm	Recipes	1	03-07-2012 04:41 AM
Calibre fetching the web page	dbip	Calibre	1	02-01-2012 04:13 PM
Multi page possible?	ProDigit	Sigil	11	12-30-2011 12:13 AM
Problem with Multi-file News Articles	rozen	Recipes	1	10-14-2011 12:05 PM
Multi-column articles in PDF	tdido	OpenInkpot	7	06-30-2009 11:13 AM

08-15-2012, 11:17 AM	#9
Steven630 Groupie Posts: 154 Karma: 10 Join Date: May 2012 Device: Kindle Paperwhite2	I will try again tomorrow, but I don't think this will solve the problem. Findall won't work in this way.By the way, are you Chinese?

08-15-2012, 11:34 AM	#11
Steven630 Groupie Posts: 154 Karma: 10 Join Date: May 2012 Device: Kindle Paperwhite2	No, it's pretty good. Anyway, I'm Chinese too. If nobody else chips in, I guess we can talk in our mother tongue when we find it hard to express ourselves in English. That'll save us a lot of trouble. Nice to see you here.

08-16-2012, 05:57 AM	#12
Steven630 Groupie Posts: 154 Karma: 10 Join Date: May 2012 Device: Kindle Paperwhite2	As expected, that didn't work.

08-16-2012, 11:04 PM	#14
Steven630 Groupie Posts: 154 Karma: 10 Join Date: May 2012 Device: Kindle Paperwhite2	Thank you. I will take a look at the recipe you mentioned. Kvoid may be too busy to help me out.

Advert

Advert