Old 12-17-2010, 10:09 AM   #1
hiperlink
Enthusiast
hiperlink began at the beginning.
 
Posts: 37
Karma: 10
Join Date: Dec 2010
Device: Kindle 3 Wifi only
Can't extract article title in parse_index

Hi All!

First of all, please note that I'm not an expert (Python) programmer and have only just started with Calibre recipes.

I want to create a recipe for a static site.

The site does not provide RSS, so as far as I could figure out, I have to use parse_index.

I was able to extract the article links from:

Code:
<!-- section title -->
<a href="/publicisztika" class="rovat">Publicisztika</a>

          <div class="separator"></div>
          <ul>
<li>KOVÁCS ZOLTÁN: <a href="/2010-12-08_vallunkra-helyezi-josagos-tenyeret">Vállunkra helyezi jóságos tenyerét</a></li>

<li> Megyesi Gusztáv: <a href="/2010-12-08_vissza-a-partpenzt">Vissza a pártpénzt</a></li>

<li>FALUSY ZSIGMOND: <a href="/2010-12-08_rgek">Ürgék</a></li>
<!-- some more li cut out -->
</ul>
Via this code:

Code:
            for post in section.findAll('li'):
                h = post.find('li')
                title = self.tag_to_string(h)
                self.log('\t * TITLE IS: ', title)
                a = post.find('a', href=True)
                url = a['href']
But for some reason the title is never set.
What I expect (or would like to get) is:
  • title: "KOVÁCS ZOLTÁN: Vállunkra helyezi jóságos tenyerét"
  • a: "/2010-12-08_vallunkra-helyezi-josagos-tenyeret"
  • title: "Megyesi Gusztáv: Vissza a pártpénzt"
  • a: "/2010-12-08_vissza-a-partpenzt"

Could someone please help me figure out how to do that?

Thanks in advance!
Old 12-17-2010, 10:57 AM   #2
Starson17
Wizard
Starson17 can program the VCR without an owner's manual.
 
Posts: 4,004
Karma: 177841
Join Date: Dec 2009
Device: WinMo: IPAQ; Android: HTC HD2, Archos 7o; Java:Gravity T
Quote:
Originally Posted by hiperlink View Post
Could someone please help me figure out how to do that?
Your code first finds all <li> tags with findAll, then tries to find an <li> tag inside each <li> tag found. That's two <li> tags deep, but your sample seems to show only one level of <li> tags.

It looks to me like you want to find the <a> tag inside the li tag, then concatenate the text of the <li> tag with the text of the <a> tag inside it.
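
A minimal sketch of that suggestion, assuming the markup from the first post. It uses the standard library's `xml.etree.ElementTree` as a stand-in parser so that it runs outside calibre; the fragment and the helper name `articles_from_list` are illustrative only. Inside a recipe, the same idea would be to call `self.tag_to_string(post)` on each `<li>` itself (not on a nested `<li>`) and to take the link from `post.find('a', href=True)`.

```python
import xml.etree.ElementTree as ET

# Sample fragment based on the markup in the first post. It is well-formed,
# so ElementTree can parse it; real pages usually need a lenient HTML parser.
FRAGMENT = """
<ul>
<li>KOVÁCS ZOLTÁN: <a href="/2010-12-08_vallunkra-helyezi-josagos-tenyeret">Vállunkra helyezi jóságos tenyerét</a></li>
<li> Megyesi Gusztáv: <a href="/2010-12-08_vissza-a-partpenzt">Vissza a pártpénzt</a></li>
</ul>
"""


def articles_from_list(fragment):
    """Return (title, url) pairs, one per <li> that contains a link."""
    articles = []
    for li in ET.fromstring(fragment.strip()).iter('li'):
        a = li.find('a')
        if a is None or a.get('href') is None:
            continue  # no usable link in this item
        # itertext() yields the <li> text and the nested <a> text in order,
        # which gives "AUTHOR: Article title" once joined and normalized.
        title = ' '.join(''.join(li.itertext()).split())
        articles.append((title, a.get('href')))
    return articles


for title, url in articles_from_list(FRAGMENT):
    print(title, '->', url)
```

The key point is that the title comes from the whole `<li>`, so the author prefix and the link text end up in one string.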
Old 12-18-2010, 11:18 AM   #3
hiperlink
Thanks for your answer. I'll give it a try on Monday (on my work machine).

Update: it works now. Thanks again!

Last edited by hiperlink; 12-20-2010 at 09:54 AM. Reason: updated info
Old 12-20-2010, 12:04 PM   #4
hiperlink
Yet another problem with changed section types

So here goes my second issue:

The main site has sections like this:
Code:
div class='fpdocument'
     div class='section'
          ul
              li -> a -> article1
              li -> a ->article2
...
which I was able to extract via
Code:
for section in soup.findAll('div', attrs={'class':'fpdocument'}):
    # processing of section_title stripped; then find the articles
    articles = []
    for post in section.findAll('li'):
        # processing of articles stripped (but it just works(tm))
        pass
But now I've noticed that some sections have only one article, and in that case the structure is:
Code:
div class='fpdocument'
          a class='section'
          a -> article1
end div
How can I extract those articles?
Thanks in advance!
Old 12-20-2010, 03:23 PM   #5
Starson17
Quote:
Originally Posted by hiperlink View Post
How can I extract those articles?
I'd need to see the html.
Old 12-20-2010, 04:05 PM   #6
hiperlink
Here you go: http://www.es.hu/ .

My specific part:
Code:
<div class="fpdocumentholder">
  <div class="fpdocument" style="text-align:left;">
    <a href="/publicisztika" class="rovat">Publicisztika</a>
      <div class="separator"></div>
        <ul>
<li>RAJNAI ATTILA : <a href="/2010-12-15_foltok-a-munderon">Foltok a mundéron</a></li>
<li>BODOKY TAMÁS : <a href="/2010-12-15_a-grupo-milton-spanyol-modszere">A Grupo Milton spanyol módszere</a></li>
<li>TIMOTHY GARTON ASH: <a href="/2010-12-15_kovetsegi-taviratok-titokparade">Követségi táviratok: titokparádé</a></li>
<li>Kovács Zoltán: <a href="/2010-12-15_meg-mi-kene">Még mi kéne?</a></li>
<li>UNGVÁRY RUDOLF: <a href="/2010-12-15_nem-magyar-magyarkent">Nem magyar magyarként</a></li>
<li>LOSONCZ MIKLÓS: <a href="/2010-12-15_leminosites-">Leminősítés </a></li>
<li>MEGYESI GUSZTÁV: <a href="/2010-12-15_nullfaktor">Nullfaktor</a></li>
          </ul>
        </div>
       <div class="fpdocument">
       <!-- <a href="?view=heading;4" class="rovat">Interjú</a> -->
          <a href="/interju" class="rovat">Interjú</a> 
          <div class="separator"></div>
          <a href="/2010-12-15_zsidokrol" class="title">Zsidókról</a>
                    <div class="author"></div>
                    <div class="tovabb"><a href="/2010-12-15_zsidokrol">tovább</a> <img src="images/icon_tovabb.gif" align="absmiddle"/></div>
                    <div class="clearfloat"></div>
        </div>
....
What I need is the li items (and I already got them), and from the next fpdocument, the section is defined by class="rovat", and the only article is defined by class="title" in the a tag.
Old 12-20-2010, 04:32 PM   #7
Starson17
Quote:
Originally Posted by hiperlink View Post
Here you go:
So what's the question?

Find the div tag that has class="fpdocument".

Find the a tag within the above div that has class="title"

etc.
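
The steps above can be sketched as follows. This is only an illustration under stated assumptions: the page is a simplified, well-formed version of the HTML from post #6, the standard library's `xml.etree.ElementTree` stands in for calibre's BeautifulSoup so the example runs on its own, and `parse_sections` and `text_of` are hypothetical helper names. It handles both layouts: a `<ul>` of articles, and a single `<a class="title">`.

```python
import xml.etree.ElementTree as ET

# Simplified, well-formed version of the two section layouts from post #6.
PAGE = """
<div class="fpdocumentholder">
  <div class="fpdocument">
    <a href="/publicisztika" class="rovat">Publicisztika</a>
    <ul>
      <li>RAJNAI ATTILA : <a href="/2010-12-15_foltok-a-munderon">Foltok a mundéron</a></li>
      <li>MEGYESI GUSZTÁV: <a href="/2010-12-15_nullfaktor">Nullfaktor</a></li>
    </ul>
  </div>
  <div class="fpdocument">
    <a href="/interju" class="rovat">Interjú</a>
    <a href="/2010-12-15_zsidokrol" class="title">Zsidókról</a>
    <div class="tovabb"><a href="/2010-12-15_zsidokrol">tovább</a></div>
  </div>
</div>
"""


def text_of(element):
    """Collapse all text inside an element into one normalized string."""
    return ' '.join(''.join(element.itertext()).split())


def parse_sections(page):
    """Return a list of (section_title, [(title, url), ...]) tuples."""
    feeds = []
    root = ET.fromstring(page.strip())
    for div in root.iter('div'):
        if div.get('class') != 'fpdocument':
            continue
        rovat = div.find("a[@class='rovat']")
        if rovat is None:
            continue  # no section heading to file the articles under
        # Multi-article layout: one <li> per article.
        articles = [(text_of(li), li.find('a').get('href'))
                    for li in div.iter('li') if li.find('a') is not None]
        # Single-article layout: a lone <a class="title"> instead of a list.
        if not articles:
            articles = [(text_of(a), a.get('href'))
                        for a in div.findall("a[@class='title']")]
        if articles:
            feeds.append((text_of(rovat), articles))
    return feeds


for section, articles in parse_sections(PAGE):
    print(section, articles)
```

In an actual recipe, `parse_index` would return the same shape, except that each article is a dict with at least `title` and `url` keys rather than a tuple.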
Old 12-20-2010, 04:39 PM   #8
hiperlink
Oh, thank you. But will that work, if I want to get the section titles (class='rovat') as well?

Would you mind checking my other question too? It is at http://www.mobileread.com/forums/sho...d.php?t=112158
Old 12-20-2010, 04:49 PM   #9
Starson17
Quote:
Originally Posted by hiperlink View Post
Oh, thank you. But will that work, if I want to get the section titles (class='rovat') as well?
If you want the section title, I don't see any problem getting it as well. You seem to know how to extract stuff, so I'm not seeing where you are having trouble. Perhaps if you post the code you are using and the problem, it will be apparent. I'm not sure if you're having trouble with interaction between the code for part 1 and the code for part 2, or if it's just figuring out how to write code for part 2... or what? You should be using print statements to track down any problems you're having.

Quote:
Would you mind checking my other question too? It is at http://www.mobileread.com/forums/sho...d.php?t=112158
I read the question, but didn't understand it.
Old 12-20-2010, 05:08 PM   #10
hiperlink
Once again: thank you, I'll try tomorrow.

For my other question: my problem is, that the extract() method removes the contents of the p tag, but the p tag still remains, and I want to get rid of the empty paragraph too.
Old 12-21-2010, 06:08 AM   #11
hiperlink
So now I'm making progress!

Two more questions just came up:
  1. I posted the debug log for my recipe here, and one of the articles was not downloaded (Failed to download article: TIMOTHY GARTON ASH: Követségi táviratok: titokparádé from http://www.es.hu/2010-12-15_kovetseg...ok-titokparade), but I can't figure out why. Here is the recipe. (Yes, I know it's a mess and I'm going to refactor it, but don't forget that I'm not a programmer and am using Python for the first time...)
  2. I also looked into getting rid of the navbar in the MOBI output and found these threads: 1, 2. I set up the great Calibre as described there (output format: mobi, output profile: kindle), but had no success: the navbar is still there, and I don't want it. How can I remove it? I converted the recipe to MOBI via: ebook-convert nolcalibre.recipe es.mobi -vv | tee debug.log.
Thanks in advance!
Old 12-21-2010, 07:50 AM   #12
Starson17
Quote:
Originally Posted by hiperlink View Post
Once again: thank you, I'll try tomorrow.

For my other question: my problem is, that the extract() method removes the contents of the p tag, but the p tag still remains, and I want to get rid of the empty paragraph too.
extract() removes an entire tag. If you are removing only the contents, then you're probably removing a tag inside it, not the tag itself.
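
A small sketch of that distinction, under loud assumptions: the markup of the other thread is not shown here, so the `ARTICLE` string is hypothetical, and the standard library's `xml.etree.ElementTree` stands in for BeautifulSoup so the example runs on its own. Removing only the inner element leaves an empty `<p>` behind; removing the wrapper too gets rid of it. In BeautifulSoup the analogous step would be to call `extract()` on the `<p>` tag itself rather than on its child.

```python
import xml.etree.ElementTree as ET

# Hypothetical article markup: the unwanted element sits inside a <p> wrapper.
ARTICLE = '<div><p><span class="ad">Advertisement</span></p><p>Body text.</p></div>'


def remove_with_empty_wrapper(root, tag, css_class):
    """Remove matching elements, plus any parent that is left empty."""
    # ElementTree has no parent pointers, so build a child -> parent map first.
    parent_of = {child: parent for parent in root.iter() for child in parent}
    targets = [el for el in root.iter(tag)
               if el.get('class') == css_class and el is not root]
    for el in targets:
        parent = parent_of[el]
        # Removing only the child leaves an empty <p></p> behind.
        # (ElementTree also drops the child's tail text along with it.)
        parent.remove(el)
        is_empty = len(parent) == 0 and not (parent.text or '').strip()
        if is_empty and parent is not root:
            parent_of[parent].remove(parent)  # so remove the wrapper as well
    return root


cleaned = remove_with_empty_wrapper(ET.fromstring(ARTICLE), 'span', 'ad')
print(ET.tostring(cleaned, encoding='unicode'))
```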
Old 12-22-2010, 06:47 AM   #13
hiperlink
So I'm answering my own question now; it looks like I'm learning.

To get rid of the navbar, I should have specified --output-profile kindle on the command line when using ebook-convert (that is, it does not read the .config files when you use it from the CLI).

So my only remaining issue is that my recipe fails to download one article as can be seen in the debug.log here: https://gist.github.com/749781.

Recipe (which needs refactoring, I know): https://gist.github.com/749788

Missing article: http://www.es.hu/2010-12-15_kovetseg...ok-titokparade
(it's recognized so my recipe sees this article, but can't download it):

Code:
Could not fetch link http://www.es.hu/2010-12-15_kovetsegi-taviratok-titokparade
Traceback (most recent call last):
  File "/usr/lib/calibre/calibre/web/fetch/simple.py", line 428, in process_links
    soup = self.get_soup(dsrc)
  File "/usr/lib/calibre/calibre/web/fetch/simple.py", line 189, in get_soup
    return self.preprocess_html_ext(soup)
  File "/tmp/calibre_0.7.34_tmp_WzGqsn/calibre_0.7.34_8gsQ4J_recipes/recipe0.py", line 122, in preprocess_html
    url = links['href']
  File "/usr/lib/calibre/calibre/ebooks/BeautifulSoup.py", line 518, in __getitem__
    return self._getAttrMap()[key]
KeyError: 'href'

http://www.es.hu/2010-12-15_kovetsegi-taviratok-titokparade saved to 
Downloading
Fetching http://www.es.hu/2010-12-15_leminosites-
Failed to download article: TIMOTHY GARTON ASH: Követségi táviratok: titokparádé from http://www.es.hu/2010-12-15_kovetsegi-taviratok-titokparade
Traceback (most recent call last):
  File "/usr/lib/calibre/calibre/utils/threadpool.py", line 95, in run
    (request, request.callable(*request.args, **request.kwds))
  File "/usr/lib/calibre/calibre/web/feeds/news.py", line 838, in fetch_article
    return self._fetch_article(url, dir, f, a, num_of_feeds)
  File "/usr/lib/calibre/calibre/web/feeds/news.py", line 834, in _fetch_article
    raise Exception(_('Could not fetch article. Run with -vv to see the reason'))
Exception: Nem lehet a cikket letölteni. Futtassa a -vv paraméterrel a hibaüzenetek megjelenítéséhez



1% A következő cikk letöltése nem sikerült: u'TIMOTHY GARTON ASH: K\xf6vets\xe9gi t\xe1viratok: titokpar\xe1d\xe9'
And I need that article too.

I did try to get it via web2disk, and it was downloaded. So why can't my recipe get it?

Thanks for any help!
Old 01-03-2011, 08:06 AM   #14
hiperlink
Still download errors...

So I still can't get rid of the download errors. The recipe is the same as above, except that I added:

Code:
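class EletEsIrodalom(BasicNewsRecipe):

    encoding = 'utf-8'
    delay = 10                  # seconds to wait between consecutive downloads
    simultaneous_downloads = 1  # fetch one article at a time
    timeout = 30                # seconds before a fetch attempt gives up
But it did not help... now two articles are missing (from the debug.log):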

Code:
Could not fetch link http://www.es.hu/2010-12-21_a-tanacs...et-jelentese1-
Traceback (most recent call last):
  File "/usr/lib/calibre/calibre/web/fetch/simple.py", line 428, in process_links
    soup = self.get_soup(dsrc)
  File "/usr/lib/calibre/calibre/web/fetch/simple.py", line 189, in get_soup
    return self.preprocess_html_ext(soup)
  File "/tmp/calibre_0.7.35_tmp_kysmc6/calibre_0.7.35_IzD6Uk_recipes/recipe0.py", line 129, in preprocess_html
    url = links['href']
  File "/usr/lib/calibre/calibre/ebooks/BeautifulSoup.py", line 518, in __getitem__
    return self._getAttrMap()[key]
KeyError: 'href'

Failed to download article: A tanácsadó testület jelentése1 from http://www.es.hu/2010-12-21_a-tanacs...et-jelentese1-
Traceback (most recent call last):
  File "/usr/lib/calibre/calibre/utils/threadpool.py", line 95, in run
    (request, request.callable(*request.args, **request.kwds))
  File "/usr/lib/calibre/calibre/web/feeds/news.py", line 839, in fetch_article
    return self._fetch_article(url, dir, f, a, num_of_feeds)
  File "/usr/lib/calibre/calibre/web/feeds/news.py", line 835, in _fetch_article
    raise Exception(_('Could not fetch article. Run with -vv to see the reason'))
Exception: Nem lehet a cikket letölteni. Futtassa a -vv paraméterrel a hibaüzenetek megjelenítéséhez
...


Could not fetch link http://www.es.hu/2010-12-21_a-demokr...cio-feltetelei
Traceback (most recent call last):
  File "/usr/lib/calibre/calibre/web/fetch/simple.py", line 428, in process_links
    soup = self.get_soup(dsrc)
  File "/usr/lib/calibre/calibre/web/fetch/simple.py", line 189, in get_soup
    return self.preprocess_html_ext(soup)
  File "/tmp/calibre_0.7.35_tmp_kysmc6/calibre_0.7.35_IzD6Uk_recipes/recipe0.py", line 129, in preprocess_html
    url = links['href']
  File "/usr/lib/calibre/calibre/ebooks/BeautifulSoup.py", line 518, in __getitem__
    return self._getAttrMap()[key]
KeyError: 'href'

Failed to download article: MIKLÓSI ZOLTÁN  A demokratikus korrekció feltételei from http://www.es.hu/2010-12-21_a-demokr...cio-feltetelei
Traceback (most recent call last):
  File "/usr/lib/calibre/calibre/utils/threadpool.py", line 95, in run
    (request, request.callable(*request.args, **request.kwds))
  File "/usr/lib/calibre/calibre/web/feeds/news.py", line 839, in fetch_article
    return self._fetch_article(url, dir, f, a, num_of_feeds)
  File "/usr/lib/calibre/calibre/web/feeds/news.py", line 835, in _fetch_article
    raise Exception(_('Could not fetch article. Run with -vv to see the reason'))
Exception: Nem lehet a cikket letölteni. Futtassa a -vv paraméterrel a hibaüzenetek megjelenítéséhez
Thanks for any help!
Old 01-03-2011, 07:17 PM   #15
kovidgoyal
creator of calibre
kovidgoyal ought to be getting tired of karma fortunes by now.
Posts: 25,376
Karma: 4961459
Join Date: Oct 2006
Location: Mumbai, India
Device: Various
Look for the line

url = links['href']

in your recipe and surround it with a try: except: block to catch the error and ignore the article.
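
A minimal sketch of that fix. The surrounding loop in the recipe is not shown in the thread, so the function name `collect_hrefs` and the loop shape are assumptions; only the line `url = links['href']` and the `KeyError` come from the traceback. Plain dicts stand in for the parser's tag objects so the example runs without calibre.

```python
def collect_hrefs(links):
    """Return the href of every link that has one, skipping the rest."""
    urls = []
    for link in links:
        try:
            url = link['href']
        except KeyError:
            # Some <a> tags carry no href (for example named anchors),
            # which is what raised KeyError: 'href' in the debug log.
            continue
        urls.append(url)
    return urls


# Plain dicts stand in for tag objects here (an assumption for illustration).
sample = [{'href': '/2010-12-15_nullfaktor'}, {'name': 'footnote1'}]
print(collect_hrefs(sample))
```

Catching `KeyError` specifically, rather than using a bare `except:`, keeps unrelated bugs visible. An alternative that avoids the exception altogether is to select only tags that have the attribute, as the first post already does with `find('a', href=True)`.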
All times are GMT -4.