Fetching multi-page articles - Page 2

lrui · 08-17-2012, 09:13 AM

Quote:

Originally Posted by Steven630

What exactly is the name of the recipe? grep append?

Sorry, let me put it more plainly: do a global search. I use EmEditor, so I run "Find in Files" with the keyword "appen".

Steven630 · 08-17-2012, 10:00 AM

Got it. Thanks.

Steven630 · 08-18-2012, 03:02 AM

Does anyone know of a recipe that uses both index parsing (as opposed to RSS) and multi-page fetching?

kovidgoyal · 08-18-2012, 03:33 AM

index parsing has no bearing on multipage. What method you use to create the index does not affect multipage in any way.

Steven630 · 08-18-2012, 03:38 AM

Quote:

Originally Posted by kovidgoyal

index parsing has no bearing on multipage. What method you use to create the index does not affect multipage in any way.

Thank you. That at least keeps me working on the method. Then how do I know whether the pager is found or not? (With or without the code related to multi-page fetching, the log and the file produced look exactly the same.)

kovidgoyal · 08-18-2012, 03:41 AM

Use print statements in your recipe.

Steven630 · 08-18-2012, 03:44 AM

Quote:

Originally Posted by kovidgoyal

Use print statements in your recipe.

Thanks. Since there's a "next page" button even on the last page, is there any way I can let Calibre know that it's actually the last page? (For example, by comparing the contents to see if they are the same.)
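The content comparison suggested here could look roughly like the sketch below. It runs outside calibre: `PAGES`, `fetch` and `collect_unique_pages` are hypothetical names, and the in-memory dict only stands in for a site whose "next page" link on the last page leads back to the same content.

```python
import hashlib

# Hypothetical site: every page has a "next" link, and following it from
# the last page simply returns the last page again.
PAGES = {1: "intro", 2: "middle", 3: "end"}


def fetch(n):
    """Stand-in for downloading page n; past the end, the last page repeats."""
    return PAGES[min(n, max(PAGES))]


def collect_unique_pages(fetch_page):
    """Follow "next" until a page's content has already been seen."""
    seen = set()
    parts = []
    n = 1
    while True:
        body = fetch_page(n)
        digest = hashlib.sha1(body.encode("utf-8")).hexdigest()
        if digest in seen:  # same content as an earlier page: stop
            break
        seen.add(digest)
        parts.append(body)
        n += 1
    return parts


print(collect_unique_pages(fetch))
```

Comparing digests costs one extra download at the end of each article, so comparing URLs is cheaper whenever the redundant link is recognisable by its address.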

kovidgoyal · 08-18-2012, 03:57 AM

Use the source, Luke: http://bazaar.launchpad.net/~kovid/c.../feeds/news.py in particular look at the is_link_wanted() function.

Steven630 · 08-18-2012, 04:11 AM

Quote:

Originally Posted by kovidgoyal

Use the source, Luke: http://bazaar.launchpad.net/~kovid/c.../feeds/news.py in particular look at the is_link_wanted() function.

Thanks. But it seems that the next-page link on the last page cannot simply be filtered out, since it's identical to the previous links (it's just redundant).

Speaking of 'print', will something like this do?

Code:

    def append_page(self, soup, appendtag, position):
        pager = ...
        if pager:
            self.log('Found pager')
...

Still, it failed to find the pager in the first place.

Could it be that the "soup" in index parsing and the "soup" in append_page are being confused? (So it's looking for the pager in the index page rather than the article page.)

Steven630 · 08-21-2012, 12:07 PM

Help!

kiklop74 · 08-21-2012, 02:57 PM

Try with these changes:

Code:

    def append_page(self, soup, appendtag, position, surl):
        # surl is the URL of the page being processed ('' for the first page).
        pager = soup.find('div', attrs={'id':'pages'})
        if pager:
            nextpages = soup.findAll('a', attrs={'class':'a1'})
            # The second 'a1' link is assumed to be the "next page" link.
            nextpage = nextpages[1] if len(nextpages) > 1 else None
            # A link back to the current page means this is the last page.
            if nextpage and (nextpage['href'] != surl):
                nexturl = nextpage['href']
                soup2 = self.index_to_soup(nexturl)
                texttag = soup2.find('div', attrs={'class':'content_left_5'})
                if texttag is None:
                    return
                for it in texttag.findAll(style=True):
                    del it['style']
                newpos = len(texttag.contents)
                # Recurse first so later pages end up inside this page's text.
                self.append_page(soup2, texttag, newpos, nexturl)
                texttag.extract()
                pager.extract()
                appendtag.insert(position, texttag)

    def preprocess_html(self, soup):
        self.append_page(soup, soup.body, 3, '')
        pager = soup.find('div', attrs={'id':'pages'})
        if pager:
            pager.extract()
        return self.adeify_images(soup)

Notice that I changed append_page to take a new parameter, which should be used to pass the current page URL. You use it later to check whether the URL of the page that called the method is the same as the one in the pager. If it is the same, the recursion stops.
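The stop condition can be tried outside calibre with a small sketch. Everything here is hypothetical: the `PAGES` dict stands in for the site (and for `index_to_soup`), and `collect_text` is only an illustration of the same URL comparison, not recipe code.

```python
# Hypothetical stand-in for the site:
# URL -> (article text, href of the "next page" link).
# The last page links to itself, as described earlier in the thread.
PAGES = {
    "/a?p=1": ("part one", "/a?p=2"),
    "/a?p=2": ("part two", "/a?p=3"),
    "/a?p=3": ("part three", "/a?p=3"),
}


def collect_text(url, pages):
    """Return the text of `url` followed by the text of every later page."""
    text, next_url = pages[url]
    # Same idea as `nextpage['href'] != surl`: a link back to the current
    # page means there is nothing further to fetch.
    if next_url == url:
        return [text]
    return [text] + collect_text(next_url, pages)


print(collect_text("/a?p=1", PAGES))
```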

Also verify the pager tag and the texttag that I search for; you can experiment with those accordingly.
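One way to check those selectors before running the whole recipe is to probe a saved copy of an article page. The sketch below uses only the Python standard library; `PagerProbe` and the `SAMPLE` markup are made up for illustration, and the real page may well use different ids and classes.

```python
from html.parser import HTMLParser


class PagerProbe(HTMLParser):
    """Record whether <div id="pages"> exists and collect <a class="a1"> hrefs."""

    def __init__(self):
        super().__init__()
        self.pager_found = False
        self.links = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "div" and attrs.get("id") == "pages":
            self.pager_found = True
        # Exact match on the class attribute only; enough for a quick check.
        if tag == "a" and attrs.get("class") == "a1" and "href" in attrs:
            self.links.append(attrs["href"])


# Hypothetical markup; replace it with the saved source of a real article page.
SAMPLE = (
    '<html><body><div id="pages">'
    '<a class="a1" href="p1.html">Prev</a>'
    '<a class="a1" href="p2.html">Next</a>'
    '</div></body></html>'
)

probe = PagerProbe()
probe.feed(SAMPLE)
print("pager found:", probe.pager_found)
print("a1 links:", probe.links)
```

If the pager is not found, or the second link is not the "next page" one, the selectors in append_page need adjusting.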

Steven630 · 08-21-2012, 10:06 PM

Quote:

Originally Posted by kiklop74

Try with these changes:
...

Notice that I changed append_page to take a new parameter, which should be used to pass the current page URL. You use it later to check whether the URL of the page that called the method is the same as the one in the pager. If it is the same, the recursion stops.

Also verify the pager tag and the texttag that I search for; you can experiment with those accordingly.

I can't thank you enough. It works!

(And I feel like such an idiot after so many false starts involving a silly mistake by me.)

UPDATE: Problem solved thanks to kiklop74.

Also many thanks to lrui (who also spent a lot of time looking into the issue) and kovidgoyal.

lrui · 08-21-2012, 11:04 PM

Post your recipe.
