10-28-2013, 05:44 PM | #1 |
Member
Posts: 18
Karma: 10
Join Date: Oct 2013
Device: Kindle
|
Golem.de (German tech news) multipage article
Hi,
I tried to add a method that fetches multipage articles (like this one: Golem.de article, RSS feed 'Hardware'), following the approach in the 'Adventure Gamers' recipe. I adapted the code to fit the Golem site like this: Spoiler:
The problem is, I don't know what I have to insert here: Code:
...
    del item['style']
for item in soup.findAll('div', attrs={'class':'floatright'}):
    item.extract()
self.append_page(soup, soup.body, 3)
...
In this form it does nothing but add the page numbers to the end of the article... It is probably pretty simple, but I don't know how to fix it. Can anybody help me? Thanks! |
10-28-2013, 11:27 PM | #2 |
creator of calibre
Posts: 43,778
Karma: 22666666
Join Date: Oct 2006
Location: Mumbai, India
Device: Various
|
That's the hard way of doing things. Instead, just implement the is_link_wanted function in your recipe to return True for links calibre should follow and False otherwise.
|
10-29-2013, 03:10 AM | #3 |
Member
Posts: 18
Karma: 10
Join Date: Oct 2013
Device: Kindle
|
Oh okay, I didn't know that.
I will try this out later. Thanks! |
10-29-2013, 06:28 AM | #4 |
Member
Posts: 18
Karma: 10
Join Date: Oct 2013
Device: Kindle
|
Okay, I just tried adding the function, but I'm not sure how to pass arguments to it or what exactly to pass.
I tried it like this: Code:
def is_link_wanted(self, url, tag):
    tag = dict(name='a', attrs={'id':'jtoc-next'})
    if tag:
        return True
    else:
        return False
Spoiler:
The bold element is the button for the next page, so it is the important part. Sadly, it doesn't work. What should it look like? Do I also have to insert the URL? It will be different for every article... I'm a bit uncertain now... But thanks anyway |
10-29-2013, 08:55 AM | #5 |
creator of calibre
Posts: 43,778
Karma: 22666666
Join Date: Oct 2006
Location: Mumbai, India
Device: Various
|
You need to check the value of url: if it is a URL you want followed, return True; otherwise return False.
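A minimal sketch of such a URL check, written as a standalone function so it can be tried outside calibre. The pattern is an assumption, not something confirmed in this thread: it supposes that follow-up pages end in "-<4 digits>-<article id>-<page number>.html". In a recipe this would be a method taking self, and, as far as I understand calibre, the recipe's recursions attribute also has to be greater than 0 before any links are followed at all.

```python
import re

# Hypothetical pattern: assumes page 2 and later end in
# "-<4 digits>-<article id>-<page number>.html". Verify against real URLs.
NEXT_PAGE_RE = re.compile(r'-\d{4}-\d+-(\d{1,2})\.html$')


def is_link_wanted(url, tag=None):
    """Return True only for URLs that look like page 2 or later of an article."""
    match = NEXT_PAGE_RE.search(url)
    # No match means the URL is not a follow-up page under the assumed scheme.
    return match is not None and int(match.group(1)) >= 2
```

With this pattern, a first-page URL or a feed URL ending in "-rss.html" is rejected, because neither ends in a short numeric page suffix.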
|
10-30-2013, 08:13 AM | #6 |
Member
Posts: 18
Karma: 10
Join Date: Oct 2013
Device: Kindle
|
Yeah, that's clear to me, but how should I identify the URLs I want? They will be different for every article. The only thing that is always the same is the element they are in.
I don't get it, sorry. I looked into other recipes that use this method, but they were all different and I didn't really know how to adapt them :-S And should I return 'True' or another value? Most of those recipes return something different:
Ciekawostki Historyczne: Code:
def is_link_wanted(self, url, tag):
    return 'ciekawostkihistoryczne' in url and url[-2] in {'2', '3', '4', '5', '6'}
Code:
def is_link_wanted(self, url, tag):
    ans = re.match(r'http://.*/[2-9]/', url) is not None
    if ans:
        self.log('Following multipage link: %s'%url)
    return ans
Code:
def is_link_wanted(self, url, tag):
    if url.endswith('.pdf'):
        return False
    return True
Code:
def is_link_wanted(self, url, tag):
    return tag['class'] == 'next'
|
10-30-2013, 09:16 AM | #7 |
creator of calibre
Posts: 43,778
Karma: 22666666
Join Date: Oct 2006
Location: Mumbai, India
Device: Various
|
Those are all returning True or False. The various expressions evaluate to True or False.
If you want to check the element instead of the URL, use the tag parameter; that is the element the URL comes from. |
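A small sketch of an element-based check, again as a standalone function. It assumes the tag object offers a dict-like get(), as BeautifulSoup tags do; a plain dict stands in for the tag in the usage lines. Both id spellings that appear in this thread are accepted, since which one the site actually uses is not established here.

```python
# Both spellings appear in this thread; which one the page really uses is unconfirmed.
CANDIDATE_IDS = {'jtoc-next', 'jtoc_next'}


def is_link_wanted(url, tag):
    """Follow a link only if its element carries one of the candidate ids."""
    # .get() returns None for links without an id attribute,
    # whereas tag['id'] would raise KeyError on such links.
    return tag.get('id') in CANDIDATE_IDS


# Plain dicts used as stand-ins for real tag objects.
print(is_link_wanted('http://example.com/a', {'id': 'jtoc_next'}))   # True
print(is_link_wanted('http://example.com/b', {'href': '/b'}))        # False
```

Using get() matters because most links on a page have no id at all, so an indexing lookup would fail on the first such link.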
10-30-2013, 12:04 PM | #8 |
Member
Posts: 18
Karma: 10
Join Date: Oct 2013
Device: Kindle
|
Okay, I see. Then why does this simple one not work?
Code:
def is_link_wanted(self, url, tag):
    return tag['id'] == 'jtoc_next'
What am I doing wrong there? Can't I use the id or something like that? |
10-30-2013, 12:05 PM | #9 |
creator of calibre
Posts: 43,778
Karma: 22666666
Join Date: Oct 2006
Location: Mumbai, India
Device: Various
|
I have no idea; you will need to debug it yourself to see why it doesn't work. Put a print(tag) in is_link_wanted to see what the tag is.
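One possible way to make that debugging output easier to scan is to print only the URL and the attributes of interest instead of the whole tag. The helper name describe_link is hypothetical, and it assumes the tag has a dict-like get(), as BeautifulSoup tags do.

```python
def describe_link(url, tag):
    """Build a one-line summary of a link for debugging output."""
    link_id = tag.get('id')        # None when the link has no id attribute
    link_class = tag.get('class')  # None when the link has no class attribute
    return 'url=%s id=%r class=%r' % (url, link_id, link_class)


# A plain dict stands in for a real tag object here.
print(describe_link('http://example.com/2', {'id': 'jtoc_next'}))
```

Inside is_link_wanted, the returned string could be passed to print() or to the recipe's self.log().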
|
10-30-2013, 05:47 PM | #10 |
Member
Posts: 18
Karma: 10
Join Date: Oct 2013
Device: Kindle
|
Okay, this was crap. I made a few mistakes. I will look into it tomorrow...
Sorry.
Last edited by lucis_lupinum; 10-30-2013 at 06:15 PM. Reason: ... |
10-31-2013, 06:50 AM | #11 |
Member
Posts: 18
Karma: 10
Join Date: Oct 2013
Device: Kindle
|
Okay, I tried a few things and removed everything that could possibly strip these links.
At one point it even worked, but I wasn't aware of changing anything (it seems I was wrong), and now it doesn't work anymore... -.-
The output of print(tag) shows many links, but not one with the id 'jtoc_next' that I tried to use. How can this tag just not be in there? Does remove_tags prevent the links/tags from showing up? Shouldn't it be in there at least when I don't use remove_tags at all?
Last edited by lucis_lupinum; 10-31-2013 at 07:26 AM. |
11-03-2013, 10:40 AM | #12 |
Member
Posts: 18
Karma: 10
Join Date: Oct 2013
Device: Kindle
|
workaround: print version
So I don't know what didn't work here, but I found a workaround.
Instead of trying over and over, I now use the print version of each article. It contains no pictures, but that's not really a problem. A big advantage is that the downloaded content is much smaller and I have to remove fewer tags from it, which simplifies the recipe's source code a lot. If anyone is interested: Code:
def print_version(self, url):
    artid = url.rsplit('-')[-2]
    return u'http://www.golem.de/print.php?a=' + artid
Thanks for your patience, Kovid |
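To show what the split picks out, here is the same logic as a standalone function applied to a sample URL. The sample URL is made up for illustration; it assumes feed links end in "-<article id>-rss.html", which is what the [-2] index relies on.

```python
def print_version(url):
    """Map an article URL to its print-view URL using the article id."""
    # Same logic as in the post: the id is the second-to-last hyphen-separated piece.
    artid = url.rsplit('-')[-2]
    return 'http://www.golem.de/print.php?a=' + artid


# Hypothetical feed URL, used only to show the split.
sample = 'http://www.golem.de/news/example-title-1310-102345-rss.html'
print(print_version(sample))  # http://www.golem.de/print.php?a=102345
```

If a URL lacks the trailing "-rss.html" piece, the [-2] index would select a different segment, so the function is only safe for URLs of the assumed shape.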