08-14-2012, 06:56 AM | #1 |
Groupie
Posts: 154
Karma: 10
Join Date: May 2012
Device: Kindle Paperwhite2
|
Fetching multi-page articles (solved)
UPDATE: Problem solved thanks to kiklop74.
Also many thanks to lrui (who also spent a lot of time looking into the issue) and kovidgoyal.

I've read previous threads about multi-page fetching, but they didn't solve the problem I have right now. I've been trying to fetch articles from a website. If an article has only one page, all is well. If, however, an article has more than one page:

1. There is a clickable "previous page" button on every page, even the first (in this case, clicking the button takes you to the same link you are browsing).
2. Likewise, there is a clickable "next page" button even on the last page (when you click "next page" while already on the last page, it simply returns you to the last page).
3. There's no "single page" option.

Points 1 and 2 make it difficult to fetch multi-page articles using append_page. Here's what the page buttons look like (on the first page of a four-page article): Code:
<div id="pages" class="text-c">
  <a class="a1" href="original link">previous page</a>
  <span>1</span>
  <a href="original link + &page=2">2</a>
  <a href="original link + &page=3">3</a>
  <a href="original link + &page=4">4</a>
  <a class="a1" href="original link + &page=2">next page</a>
</div>
(To make it clearer, I replaced the actual article link with "original link". "original link + &page=2" is actually something like http://.......&id=2352&page=2)

So every page has six buttons: previous page 1 2 3 4 next page.

Can anyone tell me how I should revise the recipe to fetch all the pages?

Last edited by Steven630; 08-22-2012 at 06:59 AM. Reason: Problem solved thanks to everyone's help. |
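For reference, the quirk above can be handled by treating the *last* `<a class="a1">` in the pager as the next-page link, and stopping when its href points back at the page being read. A standalone sketch, written against the modern bs4 API rather than the older BeautifulSoup bundled with calibre; the URLs and HTML in the test are placeholders, not the real site:

```python
from bs4 import BeautifulSoup

def find_next_page_url(html, current_url):
    """Return the href of the real next page, or None on the last page.

    Works around two quirks of this site's pager:
      * a "previous page" button exists even on page 1, so
        find('a', attrs={'class': 'a1'}) would wrongly return it;
      * a "next page" button exists even on the last page, where it
        just points back at the current page.
    """
    soup = BeautifulSoup(html, 'html.parser')
    pager = soup.find('div', attrs={'class': 'text-c'})
    if pager is None:
        return None
    # the "next page" link is the LAST <a class="a1"> in the pager
    links = pager.find_all('a', attrs={'class': 'a1'})
    if not links:
        return None
    nexturl = links[-1].get('href')
    # on the last page, "next page" links to the page we are already on
    if nexturl == current_url:
        return None
    return nexturl
```

In a recipe, append_page could call a helper like this instead of soup.find('a', attrs={'class':'a1'}), which always hits the "previous page" button first.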
08-14-2012, 09:45 AM | #2 | |
Enthusiast
Posts: 49
Karma: 475062
Join Date: Aug 2012
Device: nook simple touch
|
Quote:
Here is a good example I found in the AdventureGamers recipe, which can serve as reference material for you. Code:
def append_page(self, soup, appendtag, position):
    pager = soup.find('div', attrs={'class':'pagination_big'})
    if pager:
        nextpage = soup.find('a', attrs={'class':'next-page'})
        if nextpage:
            nexturl = nextpage['href']
            soup2 = self.index_to_soup(nexturl)
            texttag = soup2.find('div', attrs={'class':'bodytext'})
            for it in texttag.findAll(style=True):
                del it['style']
            newpos = len(texttag.contents)
            self.append_page(soup2, texttag, newpos)
            texttag.extract()
            pager.extract()
            appendtag.insert(position, texttag)

Last edited by lrui; 08-14-2012 at 09:27 PM. |
08-14-2012, 10:15 PM | #3 |
Enthusiast
Posts: 49
Karma: 475062
Join Date: Aug 2012
Device: nook simple touch
|
Code:
def append_page(self, soup, appendtag, position):
    pager = soup.find('div', attrs={'class':'text-c'})
    if pager:
        nextpage = soup.find('a', attrs={'class':'a1'})
        if nextpage:
            nexturl = nextpage['href']
            soup2 = self.index_to_soup(nexturl)
            texttag = soup2.find('div', attrs={'class':'content_left_5'})
            for it in texttag.findAll(style=True):
                del it['style']
            newpos = len(texttag.contents)
            self.append_page(soup2, texttag, newpos)
            texttag.extract()
            pager.extract()
            appendtag.insert(position, texttag)

Last edited by lrui; 08-14-2012 at 10:22 PM. |
08-15-2012, 02:25 AM | #4 |
Groupie
Posts: 154
Karma: 10
Join Date: May 2012
Device: Kindle Paperwhite2
|
Thank you, lrui. I did try to make a recipe by imitating that of AdventureGamers before starting the thread, but it didn't work. And your recipe failed as well. I think this is because:
a. Since find... only finds the first match and then stops, it will only find the "original link", not the link to the next page. (That's a consequence of the "previous page" button existing even on the first page, an anomaly other websites don't have.)

b. The fact that the "next page" button appears even on the last page means a "nexturl" will always be found. (The recipe assumes that the "next page" button does not appear on the last page, or is unclickable. But here there's no way to tell calibre that it has already fetched all the pages, so it just loops, fetching the last page over and over.)

To get around "a" and "b", I've tried something like this: Code:
def append_page(self, soup, appendtag, position):
    pager = soup.find('a', attrs={'class':'a1'})
    if pager:
        pt = pager.findNextSibling('a')
        nexturl = pt['href']
        soup2 = self.index_to_soup(nexturl)
        texttag = soup2.find('div', attrs={'class':'content_left_5'})
        newpos = len(texttag.contents)
        self.append_page(soup2, texttag, newpos)
        texttag.extract()
        appendtag.insert(position, texttag)

def preprocess_html(self, soup):
    self.append_page(soup, soup.body, 3)
    pager = soup.find('div', attrs={'class':'text-c'})
    if pager:
        pager.extract()
    return self.adeify_images(soup)

This makes me wonder whether the method is applicable here in the first place. The AdventureGamers recipe is RSS-based, while mine is based on index parsing. That may explain the failure. All previous discussions of multi-page fetching appear to mention the AdventureGamers recipe in some way, but nobody seems to have succeeded. Given how unusual the pager on this site is, I don't think the method will work here even if it works on other websites.

Last edited by Steven630; 08-15-2012 at 02:59 AM. |
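Whatever selector ends up finding the link, the infinite loop described in (b) can also be prevented defensively by remembering which URLs have already been fetched. A hypothetical, recipe-independent sketch; the helper name and structure are illustrative, not from any existing recipe:

```python
def collect_page_urls(start_url, get_next_url, max_pages=50):
    """Follow next-page links, stopping when a link repeats or runs out.

    get_next_url is any callable mapping a page URL to the URL of its
    "next page" button (or None).  The seen-set guard guarantees
    termination even on sites whose last page links back to itself.
    """
    seen = set()
    urls = []
    url = start_url
    while url and url not in seen and len(urls) < max_pages:
        seen.add(url)
        urls.append(url)
        url = get_next_url(url)
    return urls
```

In append_page, the same idea would mean keeping a set of fetched URLs on self and returning early when nexturl has been seen before.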
08-15-2012, 04:23 AM | #5 | |
Enthusiast
Posts: 49
Karma: 475062
Join Date: Aug 2012
Device: nook simple touch
|
Quote:
So there is some issue with both your code and mine. I think you can use match_regexps to match the next-page links and set recursions to some number.

# Only one of BasicNewsRecipe.match_regexps or BasicNewsRecipe.filter_regexps should be defined.
match_regexps = [r'&page=[0-9]+']

Code:
original-link&page=4
original-link&page=3
original-link&page=2
original-link&page=1

recursions = 1

http://manual.calibre-ebook.com/news....match_regexps

Last edited by lrui; 08-15-2012 at 05:59 AM. Reason: attached oversized image |
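As a quick sanity check, the suggested pattern can be tested against sample links before going into the recipe (the URLs below are placeholders). Note that the first page of an article has no &page= suffix, so only the follow-up links would match:

```python
import re

# pattern suggested above for BasicNewsRecipe.match_regexps
pattern = re.compile(r'&page=[0-9]+')

links = [
    'http://example.com/article?id=2352',          # page 1: no &page= suffix
    'http://example.com/article?id=2352&page=2',
    'http://example.com/article?id=2352&page=3',
    'http://example.com/article?id=2352&page=4',
]
# re.search scans anywhere in the string, so the query suffix is found
matched = [u for u in links if pattern.search(u)]
```

With recursions set, calibre follows links in fetched pages that match any of these expressions, which is a different mechanism from append_page.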
08-15-2012, 06:00 AM | #6 |
Enthusiast
Posts: 49
Karma: 475062
Join Date: Aug 2012
Device: nook simple touch
|
This is another way; you can try it.

BEAUTIFUL SOUP DOCUMENTATION http://www.crummy.com/software/Beaut...%20**kwargs%29

Code:
def append_page(self, soup, appendtag, position):
    pager = soup.find('div', attrs={'class':'text-c'})
    if pager:
        pagenum = soup.find('span')
        # findNextSibling (singular) returns a single tag; the plural
        # findNextSiblings would return a list, which has no ['href']
        nextpage = pagenum.findNextSibling('a', attrs={'class':'a1'})
        if nextpage:
            nexturl = nextpage['href']
            soup2 = self.index_to_soup(nexturl)
            texttag = soup2.find('div', attrs={'class':'content_left_5'})
            for it in texttag.findAll(style=True):
                del it['style']
            newpos = len(texttag.contents)
            self.append_page(soup2, texttag, newpos)
            texttag.extract()
            pager.extract()
            appendtag.insert(position, texttag)
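One detail worth double-checking here: findNextSibling (singular) returns a single tag whose ['href'] can be read directly, while findNextSiblings (plural) returns a list that must be indexed first. A standalone check, written against the modern bs4 API where the same methods are spelled find_next_sibling / find_next_siblings (the HTML is a placeholder mimicking the pager above):

```python
from bs4 import BeautifulSoup

html = ('<div class="text-c">'
        '<a class="a1" href="/a">previous page</a>'
        '<span>1</span>'
        '<a href="/a?page=2">2</a>'
        '<a class="a1" href="/a?page=2">next page</a>'
        '</div>')
soup = BeautifulSoup(html, 'html.parser')
pagenum = soup.find('span')

# singular: the first matching later sibling Tag (or None); ['href'] works
one = pagenum.find_next_sibling('a', attrs={'class': 'a1'})
href = one['href']

# plural: a list of matching later sibling Tags; index before subscripting
many = pagenum.find_next_siblings('a', attrs={'class': 'a1'})
first_href = many[0]['href']
```

Starting from the span rather than the pager div neatly skips the "previous page" button, since sibling searches only look forward.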
Last edited by lrui; 08-15-2012 at 06:09 AM. |
08-15-2012, 09:16 AM | #7 | |
Groupie
Posts: 154
Karma: 10
Join Date: May 2012
Device: Kindle Paperwhite2
|
Quote:
After I started downloading, nothing indicated that calibre had found the "span" or "div" etc. I suspect this method won't work however hard we try. That is, it's not the two class="a1" anchors or other mistakes that led to the failure, but the method itself. (Yes, there are two class="a1" anchors, but what counts when you use find... in BeautifulSoup is the first one, so the second is ignored once the first is found.) And in theory at least, your method of finding the "span" and so on should work, but it didn't. What do you think?

As for match_regexps, that didn't work either, although I'm not sure whether simply adding "match_regexps" and "recursions" to the recipe is enough. Wait, it seems match_regexps isn't meant for multi-page articles in the first place...
|
08-15-2012, 09:48 AM | #8 | |
Enthusiast
Posts: 49
Karma: 475062
Join Date: Aug 2012
Device: nook simple touch
|
Quote:
pagenum = soup.findAll('span')

Change soup.find into soup.findAll and try it again? |
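For what it's worth, findAll is not a drop-in replacement for find: it returns a list-like ResultSet, and the recipe's very next line would then fail because a list has no sibling-searching methods. A minimal illustration with the modern bs4 spelling (find_all); the HTML is a placeholder:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup(
    '<div><span>1</span><a class="a1" href="/x">next page</a></div>',
    'html.parser')

spans = soup.find_all('span')       # a list-like ResultSet, not a single Tag
try:
    spans.find_next_siblings('a')   # AttributeError: lists lack this method
    failed = False
except AttributeError:
    failed = True

# the element must be indexed out of the list before searching siblings
link = spans[0].find_next_sibling('a')
```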
|
08-15-2012, 11:17 AM | #9 |
Groupie
Posts: 154
Karma: 10
Join Date: May 2012
Device: Kindle Paperwhite2
|
I will try again tomorrow, but I don't think this will solve the problem. findAll won't work that way. By the way, are you Chinese?
|
08-15-2012, 11:27 AM | #10 |
Enthusiast
Posts: 49
Karma: 475062
Join Date: Aug 2012
Device: nook simple touch
|
|
08-15-2012, 11:34 AM | #11 |
Groupie
Posts: 154
Karma: 10
Join Date: May 2012
Device: Kindle Paperwhite2
|
No, it's pretty good. Anyway, I'm Chinese too. If nobody else chips in, I guess we can talk in our mother tongue when we find it hard to express ourselves in English. That'll save us a lot of trouble. Nice to see you here.
|
08-16-2012, 05:57 AM | #12 |
Groupie
Posts: 154
Karma: 10
Join Date: May 2012
Device: Kindle Paperwhite2
|
As expected, that didn't work.
|
08-16-2012, 12:07 PM | #13 |
Enthusiast
Posts: 49
Karma: 475062
Join Date: Aug 2012
Device: nook simple touch
|
Weird. I suggest you refer to other built-in recipes: uncompress them from the resources in the calibre2 directory and grep for append, or ask Kovid for help.

Last edited by lrui; 08-16-2012 at 12:09 PM. |
08-16-2012, 11:04 PM | #14 |
Groupie
Posts: 154
Karma: 10
Join Date: May 2012
Device: Kindle Paperwhite2
|
Thank you. I will take a look at the recipes you mentioned. Kovid may be too busy to help me out.
|
08-17-2012, 06:55 AM | #15 |
Groupie
Posts: 154
Karma: 10
Join Date: May 2012
Device: Kindle Paperwhite2
|
|
|
Similar Threads | ||||
Thread | Thread Starter | Forum | Replies | Last Post |
Problem: Recipe for Foreign Affairs not fetching premium articles | besianm | Recipes | 1 | 03-07-2012 04:41 AM |
Calibre fetching the web page | dbip | Calibre | 1 | 02-01-2012 04:13 PM |
Multi page possible? | ProDigit | Sigil | 11 | 12-30-2011 12:13 AM |
Problem with Multi-file News Articles | rozen | Recipes | 1 | 10-14-2011 12:05 PM |
Multi-column articles in PDF | tdido | OpenInkpot | 7 | 06-30-2009 11:13 AM |