#1 |
Enthusiast
Posts: 45
Karma: 10
Join Date: Dec 2010
Device: Kindle 3 Wifi only
|
Can't extract article title in parse index
Hi All!
First of all, please note that I'm not an expert (Python) programmer and have only just started with calibre recipes. I want to create a recipe for a static site. It does not provide RSS, so, as far as I could figure out, I have to use parse_index. I was able to extract the article links from: Code:
<!-- section title -->
<a href="/publicisztika" class="rovat">Publicisztika</a>
<div class="separator"></div>
<ul>
  <li>KOVÁCS ZOLTÁN: <a href="/2010-12-08_vallunkra-helyezi-josagos-tenyeret">Vállunkra helyezi jóságos tenyerét</a></li>
  <li> Megyesi Gusztáv: <a href="/2010-12-08_vissza-a-partpenzt">Vissza a pártpénzt</a></li>
  <li>FALUSY ZSIGMOND: <a href="/2010-12-08_rgek">Ürgék</a></li>
  <!-- some more li cut out -->
</ul>
Code:
for post in section.findAll('li'):
    h = post.find('li')
    title = self.tag_to_string(h)
    self.log('\t * TITLE IS: ', title)
    a = post.find('a', href=True)
    url = a['href']
What I expect (or would like to get) is:
Could someone please help me work out how to do that? Thanks in advance! |
#2 |
Wizard
Posts: 4,004
Karma: 177841
Join Date: Dec 2009
Device: WinMo: IPAQ; Android: HTC HD2, Archos 7o; Java:Gravity T
|
Your code first finds all <li> tags with findAll, then tries to find an <li> tag inside each <li> tag it found. That's two <li> tags deep, but your sample seems to show only one level of <li> tags.
It looks to me like you want to find the <a> tag inside the li tag, then concatenate the text of the <li> tag with the text of the <a> tag inside it. |
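A minimal sketch of that idea, using the standard-library html.parser so it runs outside calibre (an assumption made for illustration; the class and function names are hypothetical). It takes the whole text of each <li>, which already contains the text of the <a> inside it, as the title. Inside a recipe the same effect would presumably come from calling self.tag_to_string(post) on the <li> itself instead of on post.find('li').

```python
from html.parser import HTMLParser


class LiTitleParser(HTMLParser):
    """Collects one (title, url) pair per <li>, using the li's full text as title."""

    def __init__(self):
        super().__init__()
        self.articles = []   # finished (title, url) pairs
        self._text = None    # text pieces of the <li> being read, None outside one
        self._url = None     # href of the first <a> inside the current <li>

    def handle_starttag(self, tag, attrs):
        if tag == 'li':
            self._text, self._url = [], None
        elif tag == 'a' and self._text is not None and self._url is None:
            self._url = dict(attrs).get('href')

    def handle_data(self, data):
        if self._text is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == 'li' and self._text is not None:
            # Join author text and link text, collapsing stray whitespace.
            title = ' '.join(''.join(self._text).split())
            if self._url:
                self.articles.append((title, self._url))
            self._text = None


def li_titles(html):
    """Hypothetical helper: parse an HTML fragment and return its (title, url) pairs."""
    parser = LiTitleParser()
    parser.feed(html)
    parser.close()
    return parser.articles


SAMPLE = '''<ul>
<li>KOVÁCS ZOLTÁN: <a href="/2010-12-08_vallunkra-helyezi-josagos-tenyeret">Vállunkra helyezi jóságos tenyerét</a></li>
<li> Megyesi Gusztáv: <a href="/2010-12-08_vissza-a-partpenzt">Vissza a pártpénzt</a></li>
</ul>'''

articles = li_titles(SAMPLE)
for title, url in articles:
    print(title, '->', url)
```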
#3 |
Enthusiast
Posts: 45
Karma: 10
Join Date: Dec 2010
Device: Kindle 3 Wifi only
|
Thanks for your answer, I'll give it a try on Monday (on my work machine).
And now it works. Thanks again! Last edited by hiperlink; 12-20-2010 at 09:54 AM. Reason: updated info |
#4 |
Enthusiast
Posts: 45
Karma: 10
Join Date: Dec 2010
Device: Kindle 3 Wifi only
|
Yet another problem with changed section types
So here goes my second issue:
The main site has sections like this: Code:
div class='fpdocument'
    div class='section'
    ul
        li -> a -> article1
        li -> a -> article2
        ...
Code:
for section in soup.findAll('div', attrs={'class':'fpdocument'}):
    # processing section_title stripped, then finding articles
    articles = []
    for post in section.findAll('li'):
        # processing articles stripped (but it just works(tm))
Code:
div class='fpdocument'
    a class='section'
    a -> article1
end div
Thanks in advance! |
#5 |
Wizard
Posts: 4,004
Karma: 177841
Join Date: Dec 2009
Device: WinMo: IPAQ; Android: HTC HD2, Archos 7o; Java:Gravity T
|
|
#6 |
Enthusiast
Posts: 45
Karma: 10
Join Date: Dec 2010
Device: Kindle 3 Wifi only
|
Here you go: http://www.es.hu/ .
My specific part: Code:
<div class="fpdocumentholder">
  <div class="fpdocument" style="text-align:left;">
    <a href="/publicisztika" class="rovat">Publicisztika</a>
    <div class="separator"></div>
    <ul>
      <li>RAJNAI ATTILA : <a href="/2010-12-15_foltok-a-munderon">Foltok a mundéron</a></li>
      <li>BODOKY TAMÁS : <a href="/2010-12-15_a-grupo-milton-spanyol-modszere">A Grupo Milton spanyol módszere</a></li>
      <li>TIMOTHY GARTON ASH: <a href="/2010-12-15_kovetsegi-taviratok-titokparade">Követségi táviratok: titokparádé</a></li>
      <li>Kovács Zoltán: <a href="/2010-12-15_meg-mi-kene">Még mi kéne?</a></li>
      <li>UNGVÁRY RUDOLF: <a href="/2010-12-15_nem-magyar-magyarkent">Nem magyar magyarként</a></li>
      <li>LOSONCZ MIKLÓS: <a href="/2010-12-15_leminosites-">Leminősítés </a></li>
      <li>MEGYESI GUSZTÁV: <a href="/2010-12-15_nullfaktor">Nullfaktor</a></li>
    </ul>
  </div>
  <div class="fpdocument">
    <!-- <a href="?view=heading;4" class="rovat">Interjú</a> -->
    <a href="/interju" class="rovat">Interjú</a>
    <div class="separator"></div>
    <a href="/2010-12-15_zsidokrol" class="title">Zsidókról</a>
    <div class="author"></div>
    <div class="tovabb"><a href="/2010-12-15_zsidokrol">tovább</a> <img src="images/icon_tovabb.gif" align="absmiddle"/></div>
    <div class="clearfloat"></div>
  </div>
  ....
Code:
li Code:
fpdocument Code:
class="rovat" Code:
class="title" Code:
a |
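One way to cover both section layouts in a single pass is sketched here, under the assumption that every section is a div with class fpdocument, its title is the a with class="rovat", and articles are either li items or an a with class="title". It uses the standard-library html.parser so it runs outside calibre; the class and function names are hypothetical, and the result is only shaped like the list of (section title, article list) pairs that parse_index returns.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class SectionParser(HTMLParser):
    """Groups article links under the class="rovat" title of their fpdocument div."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.feeds = []        # [(section title, [article dicts])]
        self._section = None   # [title, articles] for the current fpdocument div
        self._capture = None   # 'rovat', 'title' or 'li' while text is collected
        self._text = []
        self._url = None

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        classes = (attrs.get('class') or '').split()
        if tag == 'div' and 'fpdocument' in classes:
            self.finish_section()
            self._section = ['', []]
        elif self._section is None:
            return
        elif tag == 'li':
            self._capture, self._text, self._url = 'li', [], None
        elif tag == 'a' and self._capture == 'li':
            if self._url is None:
                self._url = attrs.get('href')
        elif tag == 'a' and ('rovat' in classes or 'title' in classes):
            self._capture = 'rovat' if 'rovat' in classes else 'title'
            self._text, self._url = [], attrs.get('href')

    def handle_data(self, data):
        if self._capture:
            self._text.append(data)

    def handle_endtag(self, tag):
        text = ' '.join(''.join(self._text).split())
        if tag == 'a' and self._capture == 'rovat':
            self._section[0] = text
            self._capture = None
        elif (tag, self._capture) in (('a', 'title'), ('li', 'li')):
            if self._url:
                self._section[1].append(
                    {'title': text, 'url': urljoin(self.base_url, self._url)})
            self._capture = None

    def finish_section(self):
        # Keep only sections that actually produced articles.
        if self._section and self._section[1]:
            self.feeds.append((self._section[0], self._section[1]))
        self._section = None


def parse_sections(html, base_url='http://www.es.hu/'):
    parser = SectionParser(base_url)
    parser.feed(html)
    parser.close()
    parser.finish_section()
    return parser.feeds


SAMPLE = '''<div class="fpdocumentholder">
<div class="fpdocument" style="text-align:left;">
<a href="/publicisztika" class="rovat">Publicisztika</a>
<div class="separator"></div>
<ul><li>RAJNAI ATTILA : <a href="/2010-12-15_foltok-a-munderon">Foltok a mundéron</a></li>
<li>LOSONCZ MIKLÓS: <a href="/2010-12-15_leminosites-">Leminősítés </a></li></ul>
</div>
<div class="fpdocument">
<!-- <a href="?view=heading;4" class="rovat">Interjú</a> -->
<a href="/interju" class="rovat">Interjú</a>
<div class="separator"></div>
<a href="/2010-12-15_zsidokrol" class="title">Zsidókról</a>
<div class="tovabb"><a href="/2010-12-15_zsidokrol">tovább</a></div>
</div>'''

feeds = parse_sections(SAMPLE)
for section_title, section_articles in feeds:
    print(section_title, [a['title'] for a in section_articles])
```

The "tovább" link carries no class and sits outside any li, so the sketch skips it and the article is not listed twice.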
#7 |
Wizard
Posts: 4,004
Karma: 177841
Join Date: Dec 2009
Device: WinMo: IPAQ; Android: HTC HD2, Archos 7o; Java:Gravity T
|
|
#8 |
Enthusiast
Posts: 45
Karma: 10
Join Date: Dec 2010
Device: Kindle 3 Wifi only
|
Oh, thank you. But will that work if I want to get the section titles (class='rovat') as well?
Would you mind checking my other question too? It is at https://www.mobileread.com/forums/sho...d.php?t=112158 |
#9 |
Wizard
Posts: 4,004
Karma: 177841
Join Date: Dec 2009
Device: WinMo: IPAQ; Android: HTC HD2, Archos 7o; Java:Gravity T
|
Quote:
Quote:
|
||
#10 |
Enthusiast
Posts: 45
Karma: 10
Join Date: Dec 2010
Device: Kindle 3 Wifi only
|
Once again: thank you, I'll try it tomorrow.
As for my other question: my problem is that the extract() method removes the contents of the p tag, but the p tag itself remains, and I want to get rid of the empty paragraph too. |
#11 |
Enthusiast
Posts: 45
Karma: 10
Join Date: Dec 2010
Device: Kindle 3 Wifi only
|
So now I'm making progress!
Two more questions just came up:
|
#12 |
Wizard
Posts: 4,004
Karma: 177841
Join Date: Dec 2009
Device: WinMo: IPAQ; Android: HTC HD2, Archos 7o; Java:Gravity T
|
extract() removes an entire tag. If you are removing only the contents, then you're probably removing a tag inside it, not the tag itself.
|
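The difference can be illustrated with a small sketch. It uses the standard-library xml.etree.ElementTree as a stand-in for calibre's BeautifulSoup (an assumption, so that the example runs on its own), and the helper name is hypothetical. Removing the inner tag leaves an empty <p> behind, so the helper also removes the parent when nothing is left in it. In a recipe the equivalent would presumably be calling extract() on the enclosing <p> (for example through the tag's parent) rather than on the tag inside it.

```python
import xml.etree.ElementTree as ET


def remove_with_empty_parent(root, element):
    """Remove element; if its parent is then empty, remove the parent too."""
    # ElementTree has no parent pointers, so build a child -> parent map first.
    parents = {child: parent for parent in root.iter() for child in parent}
    parent = parents[element]
    parent.remove(element)
    parent_is_empty = len(parent) == 0 and not (parent.text or '').strip()
    if parent is not root and parent_is_empty:
        parents[parent].remove(parent)


doc = ET.fromstring(
    '<div><p><span class="ad">advert</span></p><p>Article text</p></div>')
span = doc.find('.//span')
remove_with_empty_parent(doc, span)
result = ET.tostring(doc, encoding='unicode')
print(result)
```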
#13 |
Enthusiast
Posts: 45
Karma: 10
Join Date: Dec 2010
Device: Kindle 3 Wifi only
|
So I'm answering my own question now; looks like I'm learning.
To get rid of the navbar, I should have specified --output-profile kindle on the command line when using ebook-convert (that is, it does not read the .config files when you run it from the CLI). So my only remaining issue is that my recipe fails to download one article, as can be seen in the debug.log here: https://gist.github.com/749781. Recipe (which needs refactoring, I know): https://gist.github.com/749788 Missing article: http://www.es.hu/2010-12-15_kovetseg...ok-titokparade (it's recognized, so my recipe sees this article but can't download it): Code:
Could not fetch link http://www.es.hu/2010-12-15_kovetsegi-taviratok-titokparade
Traceback (most recent call last):
  File "/usr/lib/calibre/calibre/web/fetch/simple.py", line 428, in process_links
    soup = self.get_soup(dsrc)
  File "/usr/lib/calibre/calibre/web/fetch/simple.py", line 189, in get_soup
    return self.preprocess_html_ext(soup)
  File "/tmp/calibre_0.7.34_tmp_WzGqsn/calibre_0.7.34_8gsQ4J_recipes/recipe0.py", line 122, in preprocess_html
    url = links['href']
  File "/usr/lib/calibre/calibre/ebooks/BeautifulSoup.py", line 518, in __getitem__
    return self._getAttrMap()[key]
KeyError: 'href'
http://www.es.hu/2010-12-15_kovetsegi-taviratok-titokparade saved to
Downloading
Fetching http://www.es.hu/2010-12-15_leminosites-
Failed to download article: TIMOTHY GARTON ASH: Követségi táviratok: titokparádé from http://www.es.hu/2010-12-15_kovetsegi-taviratok-titokparade
Traceback (most recent call last):
  File "/usr/lib/calibre/calibre/utils/threadpool.py", line 95, in run
    (request, request.callable(*request.args, **request.kwds))
  File "/usr/lib/calibre/calibre/web/feeds/news.py", line 838, in fetch_article
    return self._fetch_article(url, dir, f, a, num_of_feeds)
  File "/usr/lib/calibre/calibre/web/feeds/news.py", line 834, in _fetch_article
    raise Exception(_('Could not fetch article. Run with -vv to see the reason'))
Exception: Nem lehet a cikket letölteni. Futtassa a -vv paraméterrel a hibaüzenetek megjelenítéséhez
1% A következő cikk letöltése nem sikerült: u'TIMOTHY GARTON ASH: K\xf6vets\xe9gi t\xe1viratok: titokpar\xe1d\xe9'
I did try to get it via web2disk, and it was downloaded. So why can't my recipe get it? Thanks for any help! |
#14 |
Enthusiast
Posts: 45
Karma: 10
Join Date: Dec 2010
Device: Kindle 3 Wifi only
|
So I still can't get rid of the download errors. The recipe is the same as above, except I added:
Code:
class EletEsIrodalom(BasicNewsRecipe):
    encoding = 'utf-8'
    delay = 10
    simultaneous_downloads = 1
    timeout = 30
Code:
Could not fetch link http://www.es.hu/2010-12-21_a-tanacs...et-jelentese1-
Traceback (most recent call last):
  File "/usr/lib/calibre/calibre/web/fetch/simple.py", line 428, in process_links
    soup = self.get_soup(dsrc)
  File "/usr/lib/calibre/calibre/web/fetch/simple.py", line 189, in get_soup
    return self.preprocess_html_ext(soup)
  File "/tmp/calibre_0.7.35_tmp_kysmc6/calibre_0.7.35_IzD6Uk_recipes/recipe0.py", line 129, in preprocess_html
    url = links['href']
  File "/usr/lib/calibre/calibre/ebooks/BeautifulSoup.py", line 518, in __getitem__
    return self._getAttrMap()[key]
KeyError: 'href'
Failed to download article: A tanácsadó testület jelentése1 from http://www.es.hu/2010-12-21_a-tanacs...et-jelentese1-
Traceback (most recent call last):
  File "/usr/lib/calibre/calibre/utils/threadpool.py", line 95, in run
    (request, request.callable(*request.args, **request.kwds))
  File "/usr/lib/calibre/calibre/web/feeds/news.py", line 839, in fetch_article
    return self._fetch_article(url, dir, f, a, num_of_feeds)
  File "/usr/lib/calibre/calibre/web/feeds/news.py", line 835, in _fetch_article
    raise Exception(_('Could not fetch article. Run with -vv to see the reason'))
Exception: Nem lehet a cikket letölteni. Futtassa a -vv paraméterrel a hibaüzenetek megjelenítéséhez
...
Could not fetch link http://www.es.hu/2010-12-21_a-demokr...cio-feltetelei
Traceback (most recent call last):
  File "/usr/lib/calibre/calibre/web/fetch/simple.py", line 428, in process_links
    soup = self.get_soup(dsrc)
  File "/usr/lib/calibre/calibre/web/fetch/simple.py", line 189, in get_soup
    return self.preprocess_html_ext(soup)
  File "/tmp/calibre_0.7.35_tmp_kysmc6/calibre_0.7.35_IzD6Uk_recipes/recipe0.py", line 129, in preprocess_html
    url = links['href']
  File "/usr/lib/calibre/calibre/ebooks/BeautifulSoup.py", line 518, in __getitem__
    return self._getAttrMap()[key]
KeyError: 'href'
Failed to download article: MIKLÓSI ZOLTÁN A demokratikus korrekció feltételei from http://www.es.hu/2010-12-21_a-demokr...cio-feltetelei
Traceback (most recent call last):
  File "/usr/lib/calibre/calibre/utils/threadpool.py", line 95, in run
    (request, request.callable(*request.args, **request.kwds))
  File "/usr/lib/calibre/calibre/web/feeds/news.py", line 839, in fetch_article
    return self._fetch_article(url, dir, f, a, num_of_feeds)
  File "/usr/lib/calibre/calibre/web/feeds/news.py", line 835, in _fetch_article
    raise Exception(_('Could not fetch article. Run with -vv to see the reason'))
Exception: Nem lehet a cikket letölteni. Futtassa a -vv paraméterrel a hibaüzenetek megjelenítéséhez
|
#15 |
creator of calibre
Posts: 45,597
Karma: 28548962
Join Date: Oct 2006
Location: Mumbai, India
Device: Various
|
Look for the line
url = links['href'] in your recipe and surround it with a try: except: block to catch the error and ignore the article. |
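A minimal sketch of that guard follows. Plain dicts stand in for the BeautifulSoup tags here (an assumption so the example runs on its own; per the traceback, a tag without the attribute raises KeyError in the same way), and collect_hrefs is a hypothetical helper name, not part of calibre.

```python
def collect_hrefs(tags):
    """Return the href of every tag that has one, skipping the rest."""
    urls = []
    for links in tags:
        try:
            url = links['href']
        except KeyError:
            # No href attribute on this tag (e.g. a named anchor): skip it
            # instead of letting the exception abort the whole article.
            continue
        urls.append(url)
    return urls


# Stand-ins for <a> tags; the middle one has no href attribute.
tags = [{'href': '/2010-12-15_nullfaktor'}, {'name': 'top'}, {'href': '/interju'}]
urls = collect_hrefs(tags)
print(urls)
```

An alternative that avoids the exception altogether would be to select only tags that carry the attribute, as the first post already does with post.find('a', href=True).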