#1 |
Zealot
Posts: 119
Karma: 100
Join Date: Jan 2011
Location: Germany / NRW /Köln
Device: prs-650 / prs-350 /kindle 3
|
It works, but the TOC isn't right
Hi,
my problem today (I'm still learning this stuff): this recipe works, but the MOBI book does not show the right table of contents. Normally it is:
main1
--first article
--second article
etc.
main2
--first article
--second article
etc.
but what I get is:
unknown feed
first article
unknown feed
second article
After a few hours of testing and trying, I don't know how to fix it.
Code:

    class AdvancedUserRecipe(BasicNewsRecipe):

        title = 'National_Geo_test_6'
        description = '111beschreibung111'
        __author__ = 'irgendwer'
        publisher = 'jaja'
        language = 'de'

        oldest_article = 2
        max_articles_per_feed = 35
        no_stylesheets = True
        use_embedded_content = False
        remove_javascript = True

        INDEX = 'http://www.nationalgeographic.de/archive/2008-05'

        def parse_index(self):
            articles = []
            soup = self.index_to_soup(self.INDEX)
            feeds = []
            for section in soup.findAll('div', attrs={'class':'searchresult_text'}):
                section_title = self.tag_to_string(section.find('headline-middle_no_margin black'))
                articles = []
                for post in section.findAll('a', href=True):
                    url = post['href']
                    if url.startswith('/'):
                        url = 'http://www.nationalgeographic.de'+url
                    title = self.tag_to_string(post)
                    if str(post).find('class=') > 0:
                        klass = post['class']
                        if klass != "":
                            self.log()
                            self.log('--> post: ', post)
                            self.log('--> url: ', url)
                            self.log('--> title: ', title)
                            self.log('--> class: ', klass)
                            articles.append({'title':title, 'url':url, 'section':section, 'section_title':section_title})
                if articles:
                    feeds.append((section_title, articles))
            return feeds

        keep_only_tags = [dict(attrs={'class':['contentbox_no_top_border']})]

#2 |
creator of calibre
Posts: 45,188
Karma: 27110894
Join Date: Oct 2006
Location: Mumbai, India
Device: Various
|
Presumably section_title is not being set correctly in your parse_index method.
|
#3 | |
Wizard
Posts: 4,004
Karma: 177841
Join Date: Dec 2009
Device: WinMo: IPAQ; Android: HTC HD2, Archos 7o; Java:Gravity T
|
You are asking it to find a tag named 'headline-middle_no_margin black', when what you want is a div tag with the class 'headline-middle_no_margin black'. Look at your findAll on the line above the one defining section_title. Try this:
Code:

    section_title = self.tag_to_string(section.find('div', attrs={'class':'headline-middle_no_margin black'}))

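To make the difference between searching by tag name and searching by class concrete, here is a small standalone sketch. It is written for Python 3 and uses the standard library's `xml.etree.ElementTree` in place of calibre's BeautifulSoup, so it is only an analogy and not the recipe API. The markup, the section name 'Reise' and both helper functions are made up for the example.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup, loosely modelled on the class names used in the recipe.
SAMPLE = """
<div class="searchresult_text">
  <div class="headline-middle_no_margin black">Reise</div>
  <a href="/reise/some-article">Some article</a>
</div>
"""


def find_by_tag_name(root, name):
    """Return the first element whose *tag name* equals `name`, or None."""
    return next(root.iter(name), None)


def find_div_by_class(root, css_class):
    """Return the first <div> whose class attribute equals `css_class`, or None."""
    for element in root.iter('div'):
        if element.get('class') == css_class:
            return element
    return None


section = ET.fromstring(SAMPLE)

# Searching for the class string as if it were a tag name finds nothing.
wrong = find_by_tag_name(section, 'headline-middle_no_margin black')

# Searching for a div and filtering on its class finds the headline.
right = find_div_by_class(section, 'headline-middle_no_margin black')

print('by tag name:', wrong)
print('by class:', right.text)
```

When the search finds nothing there is no text to use as the section title, which would fit the "Unknown Feed" entries reported above.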
|
#4 |
Zealot
Posts: 119
Karma: 100
Join Date: Jan 2011
Location: Germany / NRW /Köln
Device: prs-650 / prs-350 /kindle 3
|
Hi,
that was not the problem. I now take the section_title out of the URL:
Code:

    for post in section.findAll('a', href=True):
        url = post['href']
        split_url = url.split("/")
        section_title = split_url[1]
        if url.startswith('/'):

and I don't know why it is like this.
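A note on taking the section from the URL: `url.split("/")[1]` gives the first path segment only when the href starts with '/'. For an absolute link the same index gives an empty string, because the piece between the two slashes after `http:` is empty. The sketch below (Python 3, standard library only) shows one way to handle both cases. The function name `section_from_url`, the fallback value and the example links are assumptions for illustration and are not taken from the real archive page.

```python
from urllib.parse import urlsplit


def section_from_url(url, default='Unknown'):
    """Return the first path segment of `url`, or `default` if there is none."""
    # urlsplit separates scheme and host from the path, so relative and
    # absolute links are treated the same way.
    segments = [part for part in urlsplit(url).path.split('/') if part]
    return segments[0] if segments else default


# Hypothetical links for the demonstration.
relative_link = '/reise/some-article'
absolute_link = 'http://www.nationalgeographic.de/reise/some-article'

print(section_from_url(relative_link))
print(section_from_url(absolute_link))

# The plain split from the post, applied to the absolute link:
print(repr(absolute_link.split('/')[1]))
```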
#5 |
Wizard
Posts: 4,004
Karma: 177841
Join Date: Dec 2009
Device: WinMo: IPAQ; Android: HTC HD2, Archos 7o; Java:Gravity T
|
I'm not sure what you are saying here. I tried your original code and my modified code. Yours is (was?) filled with "Unknown Feed" as you posted. My change had the correct titles. Perhaps you are saying you fixed this error in another way, but you are still having an error in the formatting. Not reading German, it's hard for me to be sure whether there is an error there. I'll take your word for it that there is.
#6 |
Zealot
Posts: 119
Karma: 100
Join Date: Jan 2011
Location: Germany / NRW /Köln
Device: prs-650 / prs-350 /kindle 3
|
Hi Starson,
OK, I will try another explanation. The problem is the formatting of the TOC. The normal way is:
main1
--first article
--second article
etc.
main2
--first article
--second article
etc.
but it is:
main1
first article
main1
second article
and so on. Every article always shows its section_title in front of it. The normal way, I think, is that all articles with the same section_title are listed together under it.
#7 | |
Wizard
Posts: 4,004
Karma: 177841
Join Date: Dec 2009
Device: WinMo: IPAQ; Android: HTC HD2, Archos 7o; Java:Gravity T
|
It looks like the TOC issue is that your code is not creating the correct feeds. Each feed should have multiple links, one to each article. Print the feeds you have created and check.
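As a rough illustration of that check, the sketch below collects articles under their section title, so that each feed ends up with several articles, and then prints the result. It is plain Python 3 with no calibre imports. It assumes the cause is that every feed currently holds a single article, which this thread does not confirm. The helper names `group_into_feeds` and `dump_feeds` and the example.com URLs are invented for the example; the `(section_title, articles)` shape is the one the recipe already appends to `feeds`.

```python
def group_into_feeds(posts):
    """Turn (section_title, article) pairs into [(section_title, [articles])]."""
    grouped = {}  # dicts keep insertion order in Python 3.7+
    for section_title, article in posts:
        grouped.setdefault(section_title, []).append(article)
    return list(grouped.items())


def dump_feeds(feeds):
    """Return the feeds as indented text lines, one feed title per group."""
    lines = []
    for section_title, articles in feeds:
        lines.append(section_title)
        for article in articles:
            lines.append('--' + article['title'])
    return lines


# Hypothetical data in the order it might be scraped from the archive page.
posts = [
    ('main1', {'title': 'first article', 'url': 'http://example.com/main1/a'}),
    ('main1', {'title': 'second article', 'url': 'http://example.com/main1/b'}),
    ('main2', {'title': 'first article', 'url': 'http://example.com/main2/a'}),
]

feeds = group_into_feeds(posts)
print('\n'.join(dump_feeds(feeds)))
```

In a recipe the same idea would mean building the grouping across all sections first and appending to `feeds` only once per section title, instead of once per `searchresult_text` block.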
|
#8 |
Zealot
Posts: 119
Karma: 100
Join Date: Jan 2011
Location: Germany / NRW /Köln
Device: prs-650 / prs-350 /kindle 3
|
Working on it, but no luck
The TOC problem isn't solved (I will do that later).
The new problem is that only the previous month's archive has the full content, so I have to take today's date and change it. At the moment I don't know the right way, because I get an error message when I run it. This is my attempt:
Code:

    import string, re
    from calibre import strftime
    from dateutil import relativedelta
    from calibre.web.feeds.recipes import BasicNewsRecipe
    from calibre.ebooks.BeautifulSoup import BeautifulSoup

    class AdvancedUserRecipe(BasicNewsRecipe):

        title = 'National Geo_test_username3'
        description = 'Magazin des NG '
        __author__ = 'schuster'
        publisher = 'Aus dem Online-Archiv des NG'
        language = 'de'
        cover_url = 'http://www.nationalgeographic.de/images/national-geographic-logo.jpg'
        masthead_url = 'http://www.nationalgeographic.de/images/national-geographic-logo.jpg'

        # Here I try to change the current month to the previous month, because only the previous month is fully in the archive.
        # Second, January needs special handling because there is no month before it: set the year one back and the month to 12.
        def date_manage():
            year_norm = strftime('%Y')
            month_norm = strftime('%m')
            year_min = unicode(int(strftime('%Y')) - 1)
            month_min = unicode(int(strftime('%m')[1]) - 2)
            if (strftime('%Y')) <= 1:
                year = year_min
                month = 12
                print '------------->beginning/end of year date' +year + month
                return year, month
            else:
                year = year_norm
                month = month_min
                print '------------> normaldate' +year +month
                return year, month

        # change the INDEX
        INDEX = 'http://www.nationalgeographic.de/archive/'+ year + '-' + month
        print INDEX

        # grab the content
        def parse_index(self):
            articles = []
            soup = self.index_to_soup(self.INDEX)
            feeds = []
            for section in soup.findAll('div', attrs={'class':'searchresult_text'}):
                section_title = self.tag_to_string(section.find('headline-middle_no_margin black'))
                articles = []
                for post in section.findAll('a', href=True):
                    url = post['href']
                    split_url = url.split("/")
                    section_title = split_url[1]
                    if url.startswith('/'):
                        url = 'http://www.nationalgeographic.de'+url
                    title = self.tag_to_string(post)
                    if str(post).find('class=') > 0:
                        klass = post['class']
                        if klass != "":
                            self.log()
                            self.log('--> post: ', post)
                            self.log('--> url: ', url)
                            self.log('--> title: ', title)
                            self.log('--> class: ', klass)
                            articles.append({'title':title, 'url':url})
                if articles:
                    feeds.append((section_title, articles))  # the way the TOC is built is incorrect (needs changing)
            return feeds

        keep_only_tags = [dict(attrs={'class':['contentbox_no_top_border']})]
        remove_tags = [dict(name='div', attrs={'class':'gallery'})]

#9 | |
Wizard
Posts: 4,004
Karma: 177841
Join Date: Dec 2009
Device: WinMo: IPAQ; Android: HTC HD2, Archos 7o; Java:Gravity T
|
Edit: I see it's probably the second. Can you get the current date?
Last edited by Starson17; 06-10-2011 at 10:47 AM.
|
#10 |
Zealot
Posts: 119
Karma: 100
Join Date: Jan 2011
Location: Germany / NRW /Köln
Device: prs-650 / prs-350 /kindle 3
|
Yes,
that is what I am trying to do. Getting the current date and going one month back is no problem. I believe the problem is that I have to make the return value of "def date_manage():" "global", because parse_index is looking for INDEX as a global variable. Please do not laugh at my pathetic attempt. This is all great for learning something. It's fun when you are always a little further along than yesterday, but for a complete beginner it is very hard. All I want to know at this stage is: is this the right way, or am I overcomplicating it?
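On the "global" question: in the attempt posted earlier, `date_manage` is defined but never called, and `year` and `month` exist only inside it, so the class-level `INDEX` line cannot see them. One common way around this, sketched below, is to keep the fixed part of the URL as a class attribute and build the full URL inside a method at run time. The class is a plain Python 3 stand-in that does not inherit from `BasicNewsRecipe`, and `index_url` and the fixed date values are hypothetical.

```python
class ArchiveRecipeSketch(object):
    """Stand-in for a recipe class, used only to show where values live."""

    # Class attributes are evaluated once, when the class body runs,
    # so only the fixed part of the URL belongs here.
    INDEX = 'http://www.nationalgeographic.de/archive/'

    def date_manage(self):
        # Hypothetical fixed values for the demo; a real recipe would
        # derive them from the current date.
        return '2011', '05'

    def index_url(self):
        # Runs when called, so it can combine the class attribute with
        # values computed by another method. No global variable is needed.
        year, month = self.date_manage()
        return self.INDEX + year + '-' + month


print(ArchiveRecipeSketch().index_url())
```

Inside `parse_index` the same pattern would be `self.index_to_soup(self.INDEX + year + '-' + month)`.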
#11 | |
Wizard
Posts: 4,004
Karma: 177841
Join Date: Dec 2009
Device: WinMo: IPAQ; Android: HTC HD2, Archos 7o; Java:Gravity T
|
Code:

    from calibre import strftime
    testdate = strftime(self.timefmt)
    print 'testdate is: ', testdate

|
#12 |
Zealot
Posts: 119
Karma: 100
Join Date: Jan 2011
Location: Germany / NRW /Köln
Device: prs-650 / prs-350 /kindle 3
|
Got it
After a fresh cup of coffee, building the URL works. You are right, there are many ways to solve a problem. This one IS solved right now. Here is how:
Code:

    import string, re
    from calibre import strftime

    class AdvancedUserRecipe(BasicNewsRecipe):

        title = '007_National Geo_run'
        description = 'Magazin des NG '
        __author__ = 'schuster'
        publisher = 'Aus dem Online-Archiv des NG'
        language = 'de'
        cover_url = 'http://www.nationalgeographic.de/images/national-geographic-logo.jpg'
        masthead_url = 'http://www.nationalgeographic.de/images/national-geographic-logo.jpg'

        INDEX = 'http://www.nationalgeographic.de/archive/'

        def parse_index(self):
            ##--------------------------------------------------------------------------------------
            year_norm = strftime('%Y')                       ## get the year as a string
            month_norm = strftime('%m')                      ## get the month as a string
            year_min = unicode(int(strftime('%Y')) - 1)      ## string to unicode
            month_min = unicode(int(strftime('%m')[1]) - 1)  ## string to unicode
            if (strftime('%m')) <= 1:                        ## if it is January
                year = year_min                              ## use year_min (this year minus one) to get last year
                month = 12                                   ## and set the month to December
            else:                                            ## otherwise
                year = year_norm                             ## use the current year
                month = month_min                            ## and the previous month, which has the whole content
            ##--------------------------------------------------------------------------------------
            articles = []
            soup = self.index_to_soup(self.INDEX+ year + '-' + month)
            feeds = []
            for section in soup.findAll('div', attrs={'class':'searchresult_text'}):
                section_title = self.tag_to_string(section.find('headline-middle_no_margin black'))
                articles = []
                for post in section.findAll('a', href=True):
                    url = post['href']
                    split_url = url.split("/")
                    section_title = split_url[1]
                    if url.startswith('/'):
                        url = 'http://www.nationalgeographic.de'+url
                    title = self.tag_to_string(post)
                    if str(post).find('class=') > 0:
                        klass = post['class']
                        if klass != "":
                            self.log()
                            self.log('--> post: ', post)
                            self.log('--> url: ', url)
                            self.log('--> title: ', title)
                            self.log('--> class: ', klass)
                            articles.append({'title':title, 'url':url})
                if articles:
                    feeds.append((section_title, articles))
            return feeds

        keep_only_tags = [dict(attrs={'class':['contentbox_no_top_border']})]
        remove_tags = [dict(name='div', attrs={'class':'gallery'})]

- - - - - - - - - - - - - - - - - -
The next problem is getting the TOC output right, let me see...
Last edited by schuster; 06-10-2011 at 02:27 PM.
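One caveat on the date handling in that recipe, as far as I can tell: `strftime('%m')[1]` keeps only the second digit of the month, so October to December would come out as 0, 1 and 2 before the subtraction. Also, the string returned by `strftime('%m')` is compared with the number 1, which does not match in Python 2, so the January branch is never taken. A sketch of an alternative using the standard `datetime` module follows. It is written for Python 3, the helper names `previous_month` and `archive_url` are made up, and the zero-padded 'YYYY-MM' form is an assumption based on the '2008-05' archive URL in the first post.

```python
import datetime

INDEX = 'http://www.nationalgeographic.de/archive/'


def previous_month(today=None):
    """Return the month before `today` as a zero-padded 'YYYY-MM' string."""
    if today is None:
        today = datetime.date.today()
    # The day before the first of this month always falls in the previous
    # month, which also covers the January to December rollover.
    last_day_of_previous = today.replace(day=1) - datetime.timedelta(days=1)
    return last_day_of_previous.strftime('%Y-%m')


def archive_url(today=None):
    """Build the archive URL for the month before `today`."""
    return INDEX + previous_month(today)


print(archive_url(datetime.date(2011, 6, 10)))
print(archive_url(datetime.date(2011, 1, 15)))
print(archive_url(datetime.date(2011, 12, 31)))
```

Passing the date in as a parameter makes the rollover cases easy to try without waiting for January.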
#13 |
Wizard
Posts: 4,004
Karma: 177841
Join Date: Dec 2009
Device: WinMo: IPAQ; Android: HTC HD2, Archos 7o; Java:Gravity T
|
|