Can't extract article title in parse index - Page 2

hiperlink · 01-04-2011, 03:45 AM

Thanks for your answer, Kovid!

But what if I want to get the article? Why can't my recipe download it?

kovidgoyal · 01-04-2011, 11:20 AM

You have to look and see why the link element has no href on the website, and figure out an alternative.

hiperlink · 01-04-2011, 03:52 PM

I can't get it.

I mean, look at the debug.log (for the previous version of the scraped site): https://gist.github.com/749781

Here is one section with its articles, as shown in the log:

Quote:

Found section: Publicisztika
Found article: RAJNAI ATTILA : Foltok a mundéron at http://www.es.hu/2010-12-15_foltok-a-munderon
Found article: BODOKY TAMÁS : A Grupo Milton spanyol módszere at http://www.es.hu/2010-12-15_a-grupo-...anyol-modszere
Found article: TIMOTHY GARTON ASH: Követségi táviratok: titokparádé at http://www.es.hu/2010-12-15_kovetseg...ok-titokparade
Found article: Kovács Zoltán: Még mi kéne? at http://www.es.hu/2010-12-15_meg-mi-kene
Found article: UNGVÁRY RUDOLF: Nem magyar magyarként at http://www.es.hu/2010-12-15_nem-magyar-magyarkent
Found article: LOSONCZ MIKLÓS: Leminősítés at http://www.es.hu/2010-12-15_leminosites-
Found article: MEGYESI GUSZTÁV: Nullfaktor at http://www.es.hu/2010-12-15_nullfaktor

And later in the log:

Quote:

Could not fetch link http://www.es.hu/2010-12-15_kovetseg...ok-titokparade
Traceback (most recent call last):
  File "/usr/lib/calibre/calibre/web/fetch/simple.py", line 428, in process_links
    soup = self.get_soup(dsrc)
  File "/usr/lib/calibre/calibre/web/fetch/simple.py", line 189, in get_soup
    return self.preprocess_html_ext(soup)
  File "/tmp/calibre_0.7.34_tmp_WzGqsn/calibre_0.7.34_8gsQ4J_recipes/recipe0.py", line 122, in preprocess_html
    url = links['href']
  File "/usr/lib/calibre/calibre/ebooks/BeautifulSoup.py", line 518, in __getitem__
    return self._getAttrMap()[key]
KeyError: 'href'

http://www.es.hu/2010-12-15_kovetseg...ok-titokparade saved to
Downloading
Fetching http://www.es.hu/2010-12-15_leminosites-
Failed to download article: TIMOTHY GARTON ASH: Követségi táviratok: titokparádé from http://www.es.hu/2010-12-15_kovetseg...ok-titokparade
Traceback (most recent call last):
  File "/usr/lib/calibre/calibre/utils/threadpool.py", line 95, in run
    (request, request.callable(*request.args, **request.kwds))
  File "/usr/lib/calibre/calibre/web/feeds/news.py", line 838, in fetch_article
    return self._fetch_article(url, dir, f, a, num_of_feeds)
  File "/usr/lib/calibre/calibre/web/feeds/news.py", line 834, in _fetch_article
    raise Exception(_('Could not fetch article. Run with -vv to see the reason'))
Exception: Could not fetch article. Run with -vv to see the reason

Does that mean parse_index finds the article's href, but the download then fails in preprocess_html (since that function contains: url = links['href'])?
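Reading the traceback, the KeyError seems to be raised inside the recipe's own preprocess_html, after the article page has been fetched, when a link element on that page has no href. A minimal sketch of one possible alternative is to ask for the attribute with a dict-style get(), which BeautifulSoup tags provide, and skip the element when it is missing. The helper names safe_href and collect_hrefs are hypothetical, and plain dicts stand in for parsed tags so the sketch runs without calibre:

```python
def safe_href(tag, default=None):
    """Return the tag's href, or `default` when the attribute is missing.

    Works with anything exposing a dict-style .get(); plain dicts stand in
    for parsed tags in the demo below (an assumption for illustration).
    """
    return tag.get('href', default)


def collect_hrefs(tags):
    """Keep only the hrefs that are present, skipping tags without one."""
    hrefs = []
    for tag in tags:
        url = safe_href(tag)
        if url is None:
            continue  # link element without href: skip instead of raising
        hrefs.append(url)
    return hrefs


# Hypothetical stand-ins for the link elements found on an article page.
tags = [
    {'href': 'http://www.es.hu/2010-12-15_nullfaktor'},
    {'name': 'top'},  # no href: tag['href'] would raise KeyError here
]
print(collect_hrefs(tags))
```

Whether skipping is the right behaviour depends on what the recipe does with url afterwards; the point is only that a missing attribute no longer aborts the whole article.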

hiperlink · 01-18-2011, 08:31 AM

Hi All,

With my updated recipe (which still needs refactoring) at https://gist.github.com/749788, I still can't get some of the articles which were recognized by parse_index as valid feed items (though I can access them via my browser). Could someone tell me why?

Here is the debug.log:
https://gist.github.com/749781

The important part is:

Code:

Could not fetch link http://www.es.hu/2011-01-16_van-e-sajtoszabadsag-magyarorszagon
Traceback (most recent call last):
  File "/usr/lib/calibre/calibre/web/fetch/simple.py", line 428, in process_links
    soup = self.get_soup(dsrc)
  File "/usr/lib/calibre/calibre/web/fetch/simple.py", line 189, in get_soup
    return self.preprocess_html_ext(soup)
  File "/tmp/calibre_0.7.40_tmp_fNd0OI/calibre_0.7.40_CGdmix_recipes/recipe0.py", line 144, in preprocess_html
    url = links['href']
  File "/usr/lib/calibre/calibre/ebooks/BeautifulSoup.py", line 518, in __getitem__
    return self._getAttrMap()[key]
KeyError: 'href'

http://www.es.hu/2011-01-16_van-e-sajtoszabadsag-magyarorszagon saved to 
Downloading
Fetching http://www.es.hu/2011-01-16_esse-delendam
Failed to download article: KOLTAY ANDRÁS  Van-e sajtószabadság Magyarországon? from http://www.es.hu/2011-01-16_van-e-sajtoszabadsag-magyarorszagon
Traceback (most recent call last):
  File "/usr/lib/calibre/calibre/utils/threadpool.py", line 95, in run
    (request, request.callable(*request.args, **request.kwds))
  File "/usr/lib/calibre/calibre/web/feeds/news.py", line 846, in fetch_article
    return self._fetch_article(url, dir, f, a, num_of_feeds)
  File "/usr/lib/calibre/calibre/web/feeds/news.py", line 842, in _fetch_article
    raise Exception(_('Could not fetch article. Run with -vv to see the reason'))
Exception: Could not fetch article. Run with -vv to see the reason

Starson17 · 01-18-2011, 11:00 AM

Quote:

Originally Posted by hiperlink

I still can't get some of the articles which were recognized by parse_index as valid feed items (though I can access them via my browser). Could someone tell me why?

I can't, but I can steer you toward some debugging. When Calibre requests an article, its request differs from the one a browser makes. The trick is to make the browser look like Calibre, or vice versa. It can be a cookie issue or a header issue (referer, etc.). Use LiveHTTP Headers or TamperData in Firefox to control the browser. Use the browser and header commands in the recipe to see and modify the headers, cookies and referer your recipe sends. When they are the same, you will get the same results.
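One way to try this comparison outside the recipe is sketched below. It uses only Python's standard urllib rather than the browser object a recipe works with, so treat it as a rough debugging aid, not as how Calibre builds its requests. The helper names build_request and fetch are hypothetical, and the referer and user-agent values are placeholders to be replaced with whatever the browser tools show:

```python
import urllib.request


def build_request(url, referer=None, user_agent=None, cookie=None):
    """Build a request carrying the same headers the browser was seen sending."""
    headers = {}
    if referer is not None:
        headers['Referer'] = referer
    if user_agent is not None:
        headers['User-Agent'] = user_agent
    if cookie is not None:
        headers['Cookie'] = cookie
    return urllib.request.Request(url, headers=headers)


def fetch(request, timeout=30):
    """Send the request and return (status code, body bytes)."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.status, response.read()


# Placeholder header values (assumptions); nothing is sent until fetch() is called.
request = build_request(
    'http://www.es.hu/2011-01-16_van-e-sajtoszabadsag-magyarorszagon',
    referer='http://www.es.hu/',
    user_agent='Mozilla/5.0 (placeholder; copy the real string from your browser)',
)
print(sorted(request.header_items()))
```

Calling fetch(request) with and without each header, and comparing the returned pages, should show which header or cookie the site depends on.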
