It works, but the TOC isn't right

schuster · 06-06-2011, 02:01 PM

hi,
my problem today (I'm still learning this stuff):
this recipe works, but the TOC in the MOBI book doesn't show the right things.

normally it should be:

main1
--first article
--second article
etc.

main2
--first article
--second article
etc.

but it is:

unknown feed
first article

unknown feed
second article

after a few hours of testing and trying, I still can't find the cause.

Code:

class AdvancedUserRecipe(BasicNewsRecipe):

    title = 'National_Geo_test_6'
    description = '111beschreibung111'
    __author__ = 'irgendwer'
    publisher = 'jaja'
    language = 'de'
    oldest_article = 2
    max_articles_per_feed = 35
    no_stylesheets         = True
    use_embedded_content   = False
    remove_javascript      = True
    INDEX = 'http://www.nationalgeographic.de/archive/2008-05'
    def parse_index(self):
        articles = []
        soup = self.index_to_soup(self.INDEX)
        feeds = []
        for section in soup.findAll('div', attrs={'class':'searchresult_text'}):
            section_title = self.tag_to_string(section.find('headline-middle_no_margin black'))
            articles = []
            for post in section.findAll('a', href=True):
                url = post['href']
                if url.startswith('/'):
                  url = 'http://www.nationalgeographic.de'+url
                  title = self.tag_to_string(post)
                  if str(post).find('class=') > 0:
                    klass = post['class']
                    if klass != "":
                      self.log()
                      self.log('--> post:  ', post)
                      self.log('--> url:   ', url)
                      self.log('--> title: ', title)
                      self.log('--> class: ', klass)
                      articles.append({'title':title, 'url':url, 'section':section, 'section_title':section_title})
            if articles:
                feeds.append((section_title, articles))
        return feeds

    keep_only_tags = [dict(attrs={'class':['contentbox_no_top_border']})]

kovidgoyal · 06-06-2011, 02:28 PM

Presumably section_title is not being set correctly in your parse_index method.

Starson17 · 06-06-2011, 03:51 PM

Quote:

Originally Posted by schuster

hi,
my problem today (I'm still learning this stuff):
this recipe works, but the TOC in the MOBI book doesn't show the right things.
after a few hours of testing and trying, I still can't find the cause.

Spoiler:

Kovid is right.
You are asking it to find a tag named 'headline-middle_no_margin black', when what you want is a div tag whose class is 'headline-middle_no_margin black'. Compare it with your findAll on the line above the one defining section_title.
Try this:

Code:

section_title = self.tag_to_string(section.find('div', attrs={'class':'headline-middle_no_margin black'}))

schuster · 06-07-2011, 06:23 AM

hi,
that was not the problem. I now take the section_title from the URL instead:

Code:

           for post in section.findAll('a', href=True):
                url = post['href']
                split_url = url.split("/")
                section_title = split_url[1]
                if url.startswith('/'):

the problem is that the TOC looks like the attached picture,
and I don't know why.
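
For reference, a small sketch of what the `split("/")` approach yields. The hrefs used here are made-up examples, not taken from the site; the point is only that index 1 is the first path segment for a relative link, and an empty string for an absolute one.

```python
# Hypothetical hrefs for illustration; the real site paths may differ.
relative = '/reportagen/some-article'
absolute = 'http://www.nationalgeographic.de/reportagen/some-article'

# A leading '/' produces an empty first element, so index 1 is the first path segment.
relative_parts = relative.split('/')
print(relative_parts)            # ['', 'reportagen', 'some-article']
print(relative_parts[1])         # reportagen

# For an absolute URL, index 1 is the empty string between the two slashes of 'http://'.
absolute_parts = absolute.split('/')
print(repr(absolute_parts[1]))   # ''
```

So taking `split_url[1]` before checking `url.startswith('/')` only gives a usable section name for relative links.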

Starson17 · 06-07-2011, 09:11 AM

Quote:

Originally Posted by schuster

hi,
that was not the problem.

I'm not sure what you are saying here. I tried your original code and my modified code. Yours is (was?) filled with "Unknown Feed" as you posted. My change had the correct titles. Perhaps you are saying you fixed this error in another way, but you are still having an error in the formatting. Not reading German, it's hard for me to be sure whether there is an error there. I'll take your word for it that there is.

schuster · 06-07-2011, 10:00 AM

hi starson,
OK, I'll try to explain it another way.

the problem is the formatting of the TOC.

normally it should be:

main1
--first article
--second article
etc.

main2
--first article
--second article
etc.

but it is:

main1
first article

main1
second article

and so on.

every article is always preceded by its own section_title.

I think the normal behaviour is that all articles with the same section_title are grouped under it.

Starson17 · 06-07-2011, 11:04 AM

Quote:

Originally Posted by schuster

hi starson,
OK, I'll try to explain it another way.
the problem is the formatting of the TOC.

Yes, I understand. I was addressing the "Unknown Feed" problem you were having, as that was apparent from the code you posted after I read Kovid's comment. I didn't bother to run it and see if it also fixed the TOC issue.

The TOC issue looks like your code is not creating the correct feeds. Each feed should have multiple links, one to each article. Print the feeds you have created and check.
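
One way to check this, sketched below as plain Python with no calibre imports: collect the articles in a dictionary keyed by section title, and build the feeds list only after every link has been seen. The sample `found` list and the helper name `group_into_feeds` are made up for illustration. The sketch assumes that each `searchresult_text` div holds a single article, which would explain why every feed ends up with exactly one entry.

```python
# Made-up (section_title, title, url) tuples standing in for what the loop finds;
# the real values come from the index page.
found = [
    ('main1', 'first article', 'http://example.com/main1/a'),
    ('main1', 'second article', 'http://example.com/main1/b'),
    ('main2', 'first article', 'http://example.com/main2/a'),
]

def group_into_feeds(found):
    """Return [(section_title, [article, ...]), ...] with one entry per section."""
    order = []       # section titles in first-seen order
    by_section = {}  # section title -> list of article dicts
    for section_title, title, url in found:
        if section_title not in by_section:
            by_section[section_title] = []
            order.append(section_title)
        by_section[section_title].append({'title': title, 'url': url})
    return [(name, by_section[name]) for name in order]

feeds = group_into_feeds(found)
for section_title, articles in feeds:
    print(section_title)
    for article in articles:
        print('--' + article['title'])
```

In a recipe, the same grouping would happen inside `parse_index`, with the grouped list as its return value.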

schuster · 06-09-2011, 12:42 AM

the TOC problem isn't solved yet (I'll do that later).

the new problem is that only last month's archive has the full content,
so I have to take today's date and change it.
at the moment I don't know the right way, because I get an error message when I run it.
this is my attempt:

Code:

import string, re
from calibre import strftime
from dateutil import relativedelta
from calibre.web.feeds.recipes import BasicNewsRecipe
from calibre.ebooks.BeautifulSoup import BeautifulSoup
class AdvancedUserRecipe(BasicNewsRecipe):

    title = 'National Geo_test_username3'
    description = 'Magazin des NG '
    __author__ = 'schuster'
    publisher = 'Aus dem Online-Archiv des NG'
    language = 'de'
    cover_url = 'http://www.nationalgeographic.de/images/national-geographic-logo.jpg'
    masthead_url = 'http://www.nationalgeographic.de/images/national-geographic-logo.jpg'

# here I try to switch from the current month to the previous month, because only the previous month is fully in the archive
# second, handle January: there is no earlier month in the same year, so go back one year and set the month to 12
    def date_manage():
      year_norm = strftime('%Y')
      month_norm = strftime('%m')
      year_min = unicode(int(strftime('%Y')) - 1)
      month_min = unicode(int(strftime('%m')[1]) - 2)
      if (strftime('%Y')) <= 1:
          year = year_min
          month = 12
          print '------------->beginning/end of year date' +year + month
          return year, month

      else:
          year = year_norm
          month = month_min
      print '------------> normaldate' +year +month
      return year, month


#change the INDEX
      INDEX = 'http://www.nationalgeographic.de/archive/'+ year + '-' + month
      print INDEX

#grab the content
      def parse_index(self):
          articles = []
          soup = self.index_to_soup(self.INDEX)
          feeds = []
          for section in soup.findAll('div', attrs={'class':'searchresult_text'}):
              section_title = self.tag_to_string(section.find('headline-middle_no_margin black'))
              articles = []
              for post in section.findAll('a', href=True):
                  url = post['href']
                  split_url = url.split("/")
                  section_title = split_url[1]
                  if url.startswith('/'):                  url = 'http://www.nationalgeographic.de'+url
                  title = self.tag_to_string(post)
                  if str(post).find('class=') > 0:
                      klass = post['class']
                      if klass != "":
                        self.log()
                        self.log('--> post:  ', post)
                        self.log('--> url:   ', url)
                        self.log('--> title: ', title)
                        self.log('--> class: ', klass)
                        articles.append({'title':title, 'url':url})
              if articles:
                  feeds.append((section_title, articles)) #manage of build the toc is incorrect (need change)
          return feeds


    keep_only_tags = [dict(attrs={'class':['contentbox_no_top_border']})]
    remove_tags = [dict(name='div', attrs={'class':'gallery'})]

I need help with this attempt to change the date.

Starson17 · 06-10-2011, 10:45 AM

Quote:

Originally Posted by schuster

the new problem is that only last month's archive has the full content,
so I have to take today's date and change it.
at the moment I don't know the right way, because I get an error message when I run it.
I need help with this attempt to change the date.

Can you explain what you are trying to do (so I don't have to read the code)? Are you trying to send some date info to the site (cookie, header, etc.) so it sends you something different from the same URL? Are you trying to retrieve something from the site by calculating a date and using that date as part of the URL? Are you trying to change the date displayed by the recipe in the title? Are you trying to control what the recipe tries to fetch by adjusting the date relative to the article age parameter?

Edit: I see it's probably the second. Can you get the current date?

schuster · 06-10-2011, 01:17 PM

yes,
that is what I'm trying to do.
getting the current date and going one month back is no problem.
I believe the problem is that I have to make the return value of "def date_manage():" "global".
parse_index is looking for INDEX as a global variable.

please don't laugh at my pathetic attempt

all of this is great for learning something. It's fun when you're always a little further along than yesterday.

but for a complete beginner it is very hard.

all I want to know at this stage is:
is this the right way?
or am I thinking in too complicated a way?
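
On the `global` question, here is a sketch in plain Python. The class name `ArchiveRecipeSketch` and the method `index_url` are hypothetical, not part of calibre. It shows that a method can read a class attribute through `self` and build the final URL in a local variable, so nothing has to be global.

```python
class ArchiveRecipeSketch(object):
    # Class attribute: shared base URL, readable in any method as self.INDEX.
    INDEX = 'http://www.nationalgeographic.de/archive/'

    def index_url(self, year, month):
        # Local variable built from the class attribute; nothing here is global.
        url = self.INDEX + '%04d-%02d' % (year, month)
        return url

recipe = ArchiveRecipeSketch()
print(recipe.index_url(2008, 5))  # http://www.nationalgeographic.de/archive/2008-05
```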

Starson17 · 06-10-2011, 01:28 PM

Quote:

Originally Posted by schuster

all I want to know at this stage is:
is this the right way?
or am I thinking in too complicated a way?

If you are doing what I think you are doing, it looks like one "right way" to me. There are many right ways. First, you get the current date. You get that with

Code:

from calibre import strftime
testdate = strftime(self.timefmt)  # run this inside a recipe method, where self is the recipe
print 'testdate is: ', testdate

It looks like you are trying to fetch a URL using the month previous to the current month, so you need to figure out what that month is, then build the URL. You can do that manually, or use date handling functions.
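
The manual route might look like the sketch below. It uses Python's standard `datetime` module rather than calibre's `strftime`, and `previous_month` is a hypothetical helper name. It assumes the archive URL takes a zero-padded month, as in the `archive/2008-05` address from the first post.

```python
import datetime

def previous_month(today):
    """Return (year, month) strings for the month before `today`, month zero-padded."""
    year, month = today.year, today.month
    if month == 1:
        # January has no earlier month in the same year: use December of last year.
        year, month = year - 1, 12
    else:
        month -= 1
    return '%04d' % year, '%02d' % month

# Base URL as used earlier in the thread.
INDEX = 'http://www.nationalgeographic.de/archive/'

year, month = previous_month(datetime.date(2011, 6, 10))
print(INDEX + year + '-' + month)   # http://www.nationalgeographic.de/archive/2011-05

year, month = previous_month(datetime.date(2011, 1, 15))
print(INDEX + year + '-' + month)   # http://www.nationalgeographic.de/archive/2010-12
```

Working with the month as a whole number avoids two traps: picking a single digit out of the month string breaks for October to December, and comparing a string with a number does not test for January.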

schuster · 06-10-2011, 02:05 PM

after a fresh cup of coffee, building the URL works.

you are right, there are many ways to solve a problem.
this one IS solved now.

here is how:

Code:

import string, re
from calibre import strftime
from calibre.web.feeds.recipes import BasicNewsRecipe
class AdvancedUserRecipe(BasicNewsRecipe):

    title = '007_National Geo_run'
    description = 'Magazin des NG '
    __author__ = 'schuster'
    publisher = 'Aus dem Online-Archiv des NG'
    language = 'de'
    cover_url = 'http://www.nationalgeographic.de/images/national-geographic-logo.jpg'
    masthead_url = 'http://www.nationalgeographic.de/images/national-geographic-logo.jpg'
    INDEX = 'http://www.nationalgeographic.de/archive/'

    def parse_index(self):
##--------------------------------------------------------------------------------------
        year = int(strftime('%Y'))                      ## current year as a number
        month = int(strftime('%m'))                     ## current month as a number (1-12)
        if month == 1:                                  ## if it is January
            year = year - 1                             ## go back one year
            month = 12                                  ## and set the month to December
        else:                                           ## otherwise
            month = month - 1                           ## take last month, which has the whole content
        year = '%04d' % year                            ## back to strings for the URL
        month = '%02d' % month                          ## zero-padded, e.g. '05' as in archive/2008-05
##--------------------------------------------------------------------------------------
        articles = []
        soup = self.index_to_soup(self.INDEX+ year + '-' + month)
        feeds = []
        for section in soup.findAll('div', attrs={'class':'searchresult_text'}):
              section_title = self.tag_to_string(section.find('headline-middle_no_margin black'))
              articles = []
              for post in section.findAll('a', href=True):
                  url = post['href']
                  split_url = url.split("/")
                  section_title = split_url[1]
                  if url.startswith('/'):
                   url = 'http://www.nationalgeographic.de'+url
                   title = self.tag_to_string(post)
                   if str(post).find('class=') > 0:
                      klass = post['class']
                      if klass != "":
                        self.log()
                        self.log('--> post:  ', post)
                        self.log('--> url:   ', url)
                        self.log('--> title: ', title)
                        self.log('--> class: ', klass)
                        articles.append({'title':title, 'url':url})
              if articles:
                  feeds.append((section_title, articles))
        return feeds


    keep_only_tags = [dict(attrs={'class':['contentbox_no_top_border']})]
    remove_tags = [dict(name='div', attrs={'class':'gallery'})]

great feeling

- - - - - - - - - - - - - - - - - -

the next problem is getting the TOC output right, let me see...

Starson17 · 06-10-2011, 03:16 PM

Quote:

Originally Posted by schuster

great feeling

It is a great feeling, isn't it. Congratulations!
