Help! How to handle a multi-page topic

zhixiangpan · 08-30-2011, 04:44 AM

Hi,

Some RSS items refer to a topic that is split across multiple pages. Can Calibre handle that? If yes, how do I write the recipe?

thanks!

zxpan

Starson17 · 08-30-2011, 11:50 AM

Quote:

Originally Posted by zhixiangpan

Some RSS items refer to a topic that is split across multiple pages. Can Calibre handle that? If yes, how do I write the recipe?

Yes, Calibre can handle multipage sites. Search this forum for "multipage", search here and in the built-in recipes for "append_page", and see the AdventureGamers recipe.

zhixiangpan · 08-31-2011, 01:06 AM

Dear Starson17:

Thank you for the help. I have checked some posts about the append_page code, but I don't know how to write the code that fetches the hyperlink. The link in my page looks like this:

<center>
<table border="0" align="center">
<tbody>
<tr>
<td>
<a href="/GB/14562/15549575.html">
<img src="/img/next_b.gif" border="0"/>
</a>
</td>
</tr>
</tbody>
</table>
</center>

Can you help me?

thx

Starson17 · 08-31-2011, 02:55 PM


Without looking closely at your page, I can't be sure, but something like this may work:

Code:

        for pager in soup.findAll('a'):
            if pager.img is not None and pager.img.get('src') == "/img/next_b.gif":
                nexturl = self.INDEX + pager['href']
                break

Look through the <a> tags, see if one of them has an <img> tag that points to the "next" image (whatever that is), and if so, grab the href and append it to the INDEX.

If you don't know what pager is, see the various recipes that use append_page.
I hate posting code without testing it, so that part is up to you.
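
The check described above (find the link that wraps the "next" image, then build a full URL from its href) can be tried outside Calibre before it goes into a recipe. The following is only a minimal sketch using the Python standard library rather than the soup object a recipe receives. The names `NextLinkFinder` and `find_next_url` are made up for this example, and the page URL on example.com is a stand-in, since the post does not give one.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

NEXT_IMG = "/img/next_b.gif"  # image the site uses for its "next page" button


class NextLinkFinder(HTMLParser):
    """Remember the href of the first <a> that wraps the "next" image."""

    def __init__(self):
        super().__init__()
        self.open_href = None  # href of the <a> we are currently inside
        self.next_href = None  # href of the first <a> wrapping NEXT_IMG

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "a":
            self.open_href = attrs.get("href")
        elif tag == "img" and attrs.get("src") == NEXT_IMG:
            if self.next_href is None:
                self.next_href = self.open_href

    def handle_endtag(self, tag):
        if tag == "a":
            self.open_href = None


def find_next_url(page_url, html):
    """Return the absolute URL of the next page, or None on the last page."""
    finder = NextLinkFinder()
    finder.feed(html)
    if finder.next_href is None:
        return None
    # urljoin resolves the relative href against the current page's URL.
    return urljoin(page_url, finder.next_href)


# Markup copied from the post; PAGE_URL is a hypothetical stand-in.
SAMPLE_HTML = """
<td>
<a href="/GB/14562/15549575.html">
<img src="/img/next_b.gif" border="0"/>
</a>
</td>
"""
PAGE_URL = "http://example.com/GB/14562/15549574.html"

print(find_next_url(PAGE_URL, SAMPLE_HTML))
```

Matching on the image rather than on the first `<a>` in the page matters here because a real article page usually has many links before the pager.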

zhixiangpan · 08-31-2011, 09:46 PM

Hi, Starson17:

Thanks, but I still need help.

This is my code:

class peoplenetrecipe(BasicNewsRecipe):
    title = '人民网'
    __author__ = 'me'
    oldest_article = 3
    max_articles_per_feed = 25

    feeds = [
        ('china', 'http://www.people.com.cn/rss/politics.xml'),
        ('world', 'http://www.people.com.cn/rss/world.xml'),
        ('finance', 'http://www.people.com.cn/rss/finance.xml'),
        ('sport', 'http://www.people.com.cn/rss/sports.xml'),
    ]

    no_stylesheets = True
    # remove_javascript = True
    # encoding = 'UTF-8'

    keep_only_tags = [
        dict(name='div', attrs={'class':'c_l fl'}),
    ]
    remove_tags = [
        dict(name='div', attrs={'class':'tools'}),
        dict(name='div', attrs={'class':'box'}),
    ]
    remove_tags_after = [
        dict(name='div', attrs={'class':'show_text'}),
    ]

    def append_page(self, soup, appendtag, position):

        pager = soup.find('a')
        if pager.img['src'] == "/img/next_b.gif":
            nexturl = self.INDEX + pager.a['href']

            # pager = soup.find('a',attrs={'class':'nextPage greyButton'}) # here is pager
            # if pager:
            #     nexturl = self.INDEX + pager.a['href']
            soup2 = self.index_to_soup(nexturl)
            texttag = soup2.find('div', attrs={'class':'c_l fl'}) # here is text
            for it in texttag.findAll(style=True):
                del it['style']
            newpos = len(texttag.contents)
            self.append_page(soup2,texttag,newpos)
            texttag.extract()
            appendtag.insert(position,texttag)

It does not seem to work. The page

http://politics.people.com.cn/GB/1024/15556053.html

is in Chinese. At the bottom there is a link to the next page; its code is:

<a href="/GB/1024/15556054.html">
<img src="/img/next_b.gif" border="0"/>

I don't know how to debug the recipe, so would you please help me check it?

Thanks

BR
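
A few things in the recipe as posted look likely to stop it from working: `soup.find('a')` returns only the first link on the page, `pager.a` looks for an `<a>` inside a tag that is already the `<a>`, `self.INDEX` is never defined, and nothing calls `append_page`. As far as I know, `append_page` is not a hook Calibre calls on its own; the recipes that use it call it from `preprocess_html`. The version below is a sketch under those assumptions and is untested against the live site. The `INDEX` host is taken from the sample article URL in the post, so feeds served from other hosts would need their own prefix. The class name `PeopleNetRecipe`, the `NEXT_IMG` attribute and the import fallback are additions for this example, and only one feed is listed to keep it short.

```python
try:
    from calibre.web.feeds.news import BasicNewsRecipe
except ImportError:
    # Outside Calibre: a stand-in so the file can still be imported and read.
    BasicNewsRecipe = object


class PeopleNetRecipe(BasicNewsRecipe):
    title = '人民网'
    oldest_article = 3
    max_articles_per_feed = 25
    no_stylesheets = True

    # Assumption: host of the sample article URL given in the post.
    INDEX = 'http://politics.people.com.cn'
    NEXT_IMG = '/img/next_b.gif'

    feeds = [('china', 'http://www.people.com.cn/rss/politics.xml')]

    def append_page(self, soup, appendtag, position):
        # Find the "next" image itself, then step up to the <a> around it.
        img = soup.find('img', attrs={'src': self.NEXT_IMG})
        if img is None or img.parent is None or img.parent.name != 'a':
            return  # no next link: this is the last page
        nexturl = self.INDEX + img.parent['href']
        soup2 = self.index_to_soup(nexturl)
        texttag = soup2.find('div', attrs={'class': 'c_l fl'})
        if texttag is None:
            return  # next page has no article container; stop here
        for it in texttag.findAll(style=True):
            del it['style']
        # Recurse first, so any later pages end up inside this page's text.
        self.append_page(soup2, texttag, len(texttag.contents))
        texttag.extract()
        appendtag.insert(position, texttag)

    def preprocess_html(self, soup):
        # Start the chain: append the following pages at the end of <body>.
        self.append_page(soup, soup.body, len(soup.body.contents))
        return soup
```

The next-page link has to still be present in the soup when `append_page` runs, so if the tag filters in the original recipe strip the pager, they may need adjusting before this has any effect.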
