Old 10-28-2013, 06:44 PM   #1
lucis_lupinum
Member
lucis_lupinum began at the beginning.
 
Posts: 18
Karma: 10
Join Date: Oct 2013
Device: Kindle
Golem.de (German tech news) multipage article

Hi,

I tried to add a method to fetch multipage articles (like this one: Golem.de article, RSS feed 'Hardware'), as in the 'Adventure Gamers' recipe.

I modified the code like this to fit the Golem site:

Code:
def append_page(self, soup, appendtag, position):
    # <ol> that contains the links to the other pages of the article
    pager = soup.find('ol', attrs={'class': 'list_pages'})
    if pager:
        # "next page" link
        nextpage = soup.find('a', attrs={'class': 'icon-rsaquo'})
        if nextpage:
            nexturl = nextpage['href']
            soup2 = self.index_to_soup(nexturl)
            # the article text is in this <div>
            texttag = soup2.find('div', attrs={'class': 'formatted'})
            for it in texttag.findAll(style=True):
                del it['style']
            newpos = len(texttag.contents)
            self.append_page(soup2, texttag, newpos)
            texttag.extract()
            pager.extract()
            appendtag.insert(position, texttag)


def preprocess_html(self, soup):
    for item in soup.findAll(style=True):
        del item['style']
    for item in soup.findAll('div', attrs={'class': 'floatright'}):
        item.extract()
    self.append_page(soup, soup.body, 3)
    pager = soup.find('ol', attrs={'class': 'list_pages'})
    if pager:
        pager.extract()
    return self.adeify_images(soup)


The problem is, I don't know what I have to insert here:
Code:
...
        del item['style']
    for item in soup.findAll('div', attrs={'class': 'floatright'}):
        item.extract()
    self.append_page(soup, soup.body, 3)
...
and whether I need to change anything else.

In this form it does nothing except add the page numbers to the end of the article.
It is probably pretty simple, but I don't know how to fix it.


Can anybody help me?
Thanks!
Old 10-29-2013, 12:27 AM   #2
kovidgoyal
creator of calibre
kovidgoyal ought to be getting tired of karma fortunes by now.
 
Posts: 26,433
Karma: 5383257
Join Date: Oct 2006
Location: Mumbai, India
Device: Various
That's the hard way of doing things. Instead, just implement the is_link_wanted function in your recipe to return True for links calibre should follow and False otherwise.
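
A minimal sketch of that approach, assuming the recipe also sets `recursions` (the BasicNewsRecipe attribute that controls how many levels of links are followed; it defaults to 0). The class name and the URL rule are placeholders for illustration, not the actual Golem.de recipe, and the fallback base class exists only so the sketch can be imported outside calibre:

```python
try:
    from calibre.web.feeds.news import BasicNewsRecipe
except ImportError:
    # Stand-in so the sketch can be imported outside calibre
    class BasicNewsRecipe(object):
        pass


class MultipageSketch(BasicNewsRecipe):  # hypothetical recipe name
    title = 'Multipage sketch'
    recursions = 1  # follow links one level deep from each article page

    def is_link_wanted(self, url, tag):
        # Placeholder rule: follow only links whose URL ends in '-2.html'
        return url.endswith('-2.html')
```

The real rule for deciding which links to follow depends on how the site builds its page URLs.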
Old 10-29-2013, 04:10 AM   #3
lucis_lupinum
Oh okay, I didn't know that.

I will try this out later.
Thanks!
Old 10-29-2013, 07:28 AM   #4
lucis_lupinum
Okay, I just tried to add the function, but I am not really sure how to pass the arguments to it, or exactly what to pass.

I tried it like this:
Code:
def is_link_wanted(self, url, tag):
    tag = dict(name='a', attrs={'id': 'jtoc-next'})
    if tag:
        return True
    else:
        return False
The code on the website looks like this:
Code:
<ol id="list-jtoc" class="list-pages" style="display: block;">
<li><strong>1</strong></li>
<li><a id="jtoc_2" href="http://www.golem.de/news/playjams-gamestick-im-test-die-android-konsole-fuer-zwischendurch-1310-102324-2.html">2</a></li>
...
<li><a id="jtoc_next" class="icon-rsaquo" href="http://www.golem.de/news/playjams-gamestick-im-test-die-android-konsole-fuer-zwischendurch-1310-102324-2.html">&nbsp;</a></li>
</ol>

The last list item (the link with the id jtoc_next) is the button for the next page, so that is the important part.
Sadly, it doesn't work. So what should it look like? Do I also have to insert the URL? It will be different for every article...
I'm a bit uncertain now...

But thanks anyway
Old 10-29-2013, 09:55 AM   #5
kovidgoyal
You need to check the value of url. If it is a URL you want followed, return True; otherwise return False.
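
One way to check the URL, sketched as a standalone function. The page-2 link quoted earlier in this thread ends in `-1310-102324-2.html`, so the sketch assumes that follow-up pages end in three hyphen-separated digit groups while the first page ends in only two. Both the pattern and the helper name `is_followup_page` are assumptions drawn from that single example, not a verified rule for all of Golem.de:

```python
import re

# Assumed shape of a follow-up page URL: ...-<digits>-<article id>-<page>.html
FOLLOWUP_PAGE = re.compile(r'-\d+-\d+-\d+\.html$')


def is_followup_page(url):
    """Return True if url looks like page 2 or later of a multipage article."""
    return FOLLOWUP_PAGE.search(url) is not None
```

Inside a recipe, `is_link_wanted(self, url, tag)` could then simply return `is_followup_page(url)`.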
Old 10-30-2013, 09:13 AM   #6
lucis_lupinum
Yeah, that's clear to me, but how should I identify the URLs I want? They will be different for every article. The one thing that is always the same is the element they are in.
I don't get it, sorry.
I looked into other recipes where this method is used, but they were all different and I didn't really know how to adapt them :-S
Also, should I return True or another value? Most of the recipes I looked at seem to return something different:

Ciekawostki Historyczne:
Code:
def is_link_wanted(self, url, tag):
        return 'ciekawostkihistoryczne' in url and url[-2] in {'2', '3', '4', '5', '6'}
Forbes:
Code:
    def is_link_wanted(self, url, tag):
        ans = re.match(r'http://.*/[2-9]/', url) is not None
        if ans:
            self.log('Following multipage link: %s'%url)
        return ans
hackernews:
Code:
    def is_link_wanted(self, url, tag):
        if url.endswith('.pdf'):
            return False
        return True
and the one for Kopalnia Wiedzy:
Code:
    def is_link_wanted(self, url, tag):
        return tag['class'] == 'next'
Old 10-30-2013, 10:16 AM   #7
kovidgoyal
Those are all returning True or False. The various expressions evaluate to True or False.

If you want to check the element instead of the url, use the tag parameter; that is the element from which the url comes.
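
A sketch of a tag-based check. It reads the attribute with `tag.get('id')` rather than `tag['id']`, because indexing a BeautifulSoup tag for an attribute it does not have raises `KeyError`, and most links on a page have no `id`. The `FakeTag` class is a hypothetical stand-in that keeps the example runnable without calibre; the id and class values come from the page source quoted earlier in this thread:

```python
def is_next_page_link(tag):
    """Return True if tag is the 'next page' anchor (id="jtoc_next")."""
    return tag is not None and tag.get('id') == 'jtoc_next'


class FakeTag(dict):
    """Hypothetical stand-in for a BeautifulSoup Tag: attribute lookup only."""


next_link = FakeTag({'id': 'jtoc_next', 'class': 'icon-rsaquo'})
plain_link = FakeTag({'href': '#'})  # a link without an id attribute
```

Here `is_next_page_link(next_link)` is true, while `is_next_page_link(plain_link)` is false instead of raising an exception.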
Old 10-30-2013, 01:04 PM   #8
lucis_lupinum
Okay, I see. Then why does this simple one not work?

Code:
def is_link_wanted(self, url, tag):
        return tag['id'] == 'jtoc_next'
I want to address the element by its id, which is jtoc_next.

What am I doing wrong there? Can't I use the id?
Old 10-30-2013, 01:05 PM   #9
kovidgoyal
I have no idea; you will need to debug it yourself to see why it doesn't work. Put a print(tag) in is_link_wanted to see what the tag is.
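
Such a debugging aid might look like the sketch below, which prints every candidate link before deciding. It is written as a standalone function, and a plain dict can stand in for the tag when trying it outside calibre (both are assumptions made for runnability); in a recipe the same print would go at the top of `is_link_wanted`:

```python
def is_link_wanted_debug(url, tag):
    """Print each candidate link, then apply the id-based check."""
    # Show exactly what is passed in, so a missing or misspelled id is visible
    print('candidate link: %s %r' % (url, tag))
    return tag.get('id') == 'jtoc_next'
```

If no printed tag ever shows the expected id, the link is probably not reaching this function in the form seen in the browser.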
Old 10-30-2013, 06:47 PM   #10
lucis_lupinum
Okay, this was crap. I made a few mistakes. I will look into it tomorrow.
Sorry.

Old 10-31-2013, 07:50 AM   #11
lucis_lupinum
Okay, I tried a few things and removed everything that could possibly strip these links.
At one point it even worked, but then it stopped working again, although I was not aware of having changed anything (apparently I did).

The output of print(tag) shows many links, but none with the id 'jtoc_next' that I tried to use. How can it be that this tag is just not in there?
Does remove_tags prevent the links/tags from showing up? Shouldn't the tag be there at least when I don't use remove_tags at all?

Old 11-03-2013, 11:40 AM   #12
lucis_lupinum
Workaround: print version

So I don't know what went wrong here, but I found a workaround.
Instead of trying over and over, I now use the print version of each article. It has no pictures, but that's not really a problem. A big advantage is that the downloaded content is much smaller and I have to remove fewer tags, which simplifies the recipe's source code a lot.

If anyone is interested:
Code:
def print_version(self, url):
    # The article ID is the second-to-last hyphen-separated field of the URL
    artid = url.rsplit('-', 2)[-2]
    return u'http://www.golem.de/print.php?a=' + artid
I had to extract the article ID from the URL (it is the last group of digits) and append it to the print URL.
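
A slightly more defensive sketch of the same idea, written as a standalone function. It uses a regular expression instead of splitting on hyphens and returns the URL unchanged when no ID is found. The pattern and the fallback are assumptions, checked here only against the page-2 URL quoted earlier in this thread:

```python
import re

PRINT_URL = 'http://www.golem.de/print.php?a='  # print URL prefix from the post above
# Assumed shape: ...-<article id>-<suffix>.html, where the suffix has no hyphen
ARTICLE_ID = re.compile(r'-(\d+)-[^-]*\.html$')


def print_version(url):
    """Map an article URL to its print URL; return url unchanged if no ID is found."""
    match = ARTICLE_ID.search(url)
    if match is None:
        return url
    return PRINT_URL + match.group(1)
```

Falling back to the original URL means an article with an unexpected URL shape is still downloaded, just not in its print layout.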

Thanks for your patience, Kovid
All times are GMT -4. The time now is 11:55 AM.

