06-06-2011, 03:01 PM   #1
schuster
Zealot
Posts: 116
Karma: 100
Join Date: Jan 2011
Location: Germany / NRW /Köln
Device: prs-650 / prs-350 /kindle 3
it works, but the TOC isn't right

hi,
my problem today (I'm still learning this stuff):
this recipe works, but the MOBI book doesn't show the right things.

the TOC should look like this:

main1
--first article
--second article
etc.

main2
--first article
--second article
etc.

but what I get is:

unknown feed
first article

unknown feed
second article

after a few hours of testing and trying, I still can't find the cause.



Code:
class AdvancedUserRecipe(BasicNewsRecipe):

    title = 'National_Geo_test_6'
    description = '111beschreibung111'
    __author__ = 'irgendwer'
    publisher = 'jaja'
    language = 'de'
    oldest_article = 2
    max_articles_per_feed = 35
    no_stylesheets         = True
    use_embedded_content   = False
    remove_javascript      = True
    INDEX = 'http://www.nationalgeographic.de/archive/2008-05'
    def parse_index(self):
        articles = []
        soup = self.index_to_soup(self.INDEX)
        feeds = []
        for section in soup.findAll('div', attrs={'class':'searchresult_text'}):
            section_title = self.tag_to_string(section.find('headline-middle_no_margin black'))
            articles = []
            for post in section.findAll('a', href=True):
                url = post['href']
                if url.startswith('/'):
                  url = 'http://www.nationalgeographic.de'+url
                  title = self.tag_to_string(post)
                  if str(post).find('class=') > 0:
                    klass = post['class']
                    if klass != "":
                      self.log()
                      self.log('--> post:  ', post)
                      self.log('--> url:   ', url)
                      self.log('--> title: ', title)
                      self.log('--> class: ', klass)
                      articles.append({'title':title, 'url':url, 'section':section, 'section_title':section_title})
            if articles:
                feeds.append((section_title, articles))
        return feeds

    keep_only_tags = [dict(attrs={'class':['contentbox_no_top_border']})]

06-06-2011, 03:28 PM   #2
kovidgoyal
creator of calibre
Posts: 26,432
Karma: 5383257
Join Date: Oct 2006
Location: Mumbai, India
Device: Various
Presumably section_title is not being set correctly in your parse_index method.

06-06-2011, 04:51 PM   #3
Starson17
Wizard
Posts: 4,004
Karma: 177841
Join Date: Dec 2009
Device: WinMo: IPAQ; Android: HTC HD2, Archos 7o; Java:Gravity T
Quote:
Originally Posted by schuster View Post
hi,
my problem today (I'm still learning this stuff):
this recipe works, but the MOBI book doesn't show the right things.
after a few hours of testing and trying, I still can't find the cause.
Kovid is right.
You are asking it to find a tag named 'headline-middle_no_margin black' when what you want is a tag named div with a class of 'headline-middle_no_margin black'. Compare the findAll call on the line above the one defining section_title, which passes the tag name and the class separately.
Try this:
Code:
section_title = self.tag_to_string(section.find('div', attrs={'class':'headline-middle_no_margin black'}))

06-07-2011, 07:23 AM   #4
schuster
hi,
that was not the problem. I now take the section_title from the URL instead:
Code:
           for post in section.findAll('a', href=True):
                url = post['href']
                split_url = url.split("/")
                section_title = split_url[1]
                if url.startswith('/'):
the problem is that the TOC looks like the attached picture,
and I don't know why.
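For reference, a minimal sketch of what that split does to a link. The article path below is made up for illustration (the real paths on the site may look different), and `section_from_url` is a hypothetical helper, not part of calibre. It also shows why the check for a leading '/' should come before the split: for an absolute URL, index 1 of the split is an empty string.

```python
def section_from_url(url):
    """Return the first path segment of a site-relative URL, or None."""
    # Only site-relative links such as '/natur/some-article' are handled;
    # an absolute 'http://...' URL would give '' at index 1 of the split.
    if not url.startswith('/'):
        return None
    parts = url.split('/')          # '/natur/x' -> ['', 'natur', 'x']
    return parts[1] or None         # a bare '/' has no section either


# Hypothetical example links, not taken from the real site
print(section_from_url('/natur/beispiel-artikel'))
print(section_from_url('http://www.nationalgeographic.de/natur/beispiel-artikel'))
```
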
Attached image: inhaltsverzeichniss.PNG

06-07-2011, 10:11 AM   #5
Starson17
Quote:
Originally Posted by schuster View Post
hi,
that was not the problem.
I'm not sure what you are saying here. I tried your original code and my modified code. Yours is (was?) filled with "Unknown Feed" as you posted. My change had the correct titles. Perhaps you are saying you fixed this error in another way, but you are still having an error in the formatting. Not reading German, it's hard for me to be sure whether there is an error there. I'll take your word for it that there is.

06-07-2011, 11:00 AM   #6
schuster
hi starson,
OK, I will try another explanation.

the problem is the formatting of the TOC.

it should look like this:

main1
--first article
--second article
etc.

main2
--first article
--second article
etc.

but what I get is:

main1
first article

main1
second article

and so on.

every article always shows its section_title right above it.

the normal way, I think, is that all articles with the same section_title are listed together under one heading.

06-07-2011, 12:04 PM   #7
Starson17
Quote:
Originally Posted by schuster View Post
hi starson,
OK, I will try another explanation.
the problem is the formatting of the TOC.
Yes, I understand. I was addressing the "Unknown Feed" problem you were having, as that was apparent from the code you posted after I read Kovid's comment. I didn't bother to run it and see if it also fixed the TOC issue.

The TOC issue looks like your code is not creating the correct feeds. Each feed should have multiple links, one to each article. Print the feeds you have created and check.
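
As a rough illustration of what correctly grouped feeds look like, here is a minimal sketch in plain Python. It assumes the page loop produces one (section_title, article) pair per link, which would explain getting one feed per article; the sample data and the helper name `group_into_feeds` are made up for the example and are not part of calibre.

```python
from collections import OrderedDict


def group_into_feeds(found):
    """Collect (section_title, article) pairs into one feed per section."""
    grouped = OrderedDict()   # keeps sections in the order they first appear
    for section_title, article in found:
        grouped.setdefault(section_title, []).append(article)
    # Same shape the recipe builds: a list of (title, [articles]) tuples
    return list(grouped.items())


# Hypothetical sample data standing in for what the page loop finds
found = [
    ('main1', {'title': 'first article', 'url': 'http://example.com/main1/a'}),
    ('main2', {'title': 'first article', 'url': 'http://example.com/main2/a'}),
    ('main1', {'title': 'second article', 'url': 'http://example.com/main1/b'}),
]

# Printing the feeds makes it easy to see how many articles each one holds
for title, articles in group_into_feeds(found):
    print(title, [a['title'] for a in articles])
```

If the printout shows every feed with exactly one article, the grouping is happening per link rather than per section.
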

06-09-2011, 01:42 AM   #8
schuster
still working on it, but no luck

the TOC problem isn't solved (I will deal with it later).

the new problem is that only last month's archive has the full content,
so I have to take today's date and step back one month.
at the moment I don't know the right way, because I get an error message when the recipe runs.
this is my attempt:

Code:
import string, re
from calibre import strftime
from dateutil import relativedelta
from calibre.web.feeds.recipes import BasicNewsRecipe
from calibre.ebooks.BeautifulSoup import BeautifulSoup
class AdvancedUserRecipe(BasicNewsRecipe):

    title = 'National Geo_test_username3'
    description = 'Magazin des NG '
    __author__ = 'schuster'
    publisher = 'Aus dem Online-Archiv des NG'
    language = 'de'
    cover_url = 'http://www.nationalgeographic.de/images/national-geographic-logo.jpg'
    masthead_url = 'http://www.nationalgeographic.de/images/national-geographic-logo.jpg'

# here I try to switch from the current month to the previous month, because only the previous month is complete in the archive
# the second part handles January: there is no earlier month in the same year, so go back one year and set the month to 12
    def date_manage():
      year_norm = strftime('%Y')
      month_norm = strftime('%m')
      year_min = unicode(int(strftime('%Y')) - 1)
      month_min = unicode(int(strftime('%m')[1]) - 2)
      if (strftime('%Y')) <= 1:
          year = year_min
          month = 12
          print '------------->beginning/end of year date' +year + month
          return year, month

      else:
          year = year_norm
          month = month_min
      print '------------> normaldate' +year +month
      return year, month


#change the INDEX
      INDEX = 'http://www.nationalgeographic.de/archive/'+ year + '-' + month
      print INDEX

#grab the content
      def parse_index(self):
          articles = []
          soup = self.index_to_soup(self.INDEX)
          feeds = []
          for section in soup.findAll('div', attrs={'class':'searchresult_text'}):
              section_title = self.tag_to_string(section.find('headline-middle_no_margin black'))
              articles = []
              for post in section.findAll('a', href=True):
                  url = post['href']
                  split_url = url.split("/")
                  section_title = split_url[1]
                  if url.startswith('/'):
                      url = 'http://www.nationalgeographic.de'+url
                  title = self.tag_to_string(post)
                  if str(post).find('class=') > 0:
                      klass = post['class']
                      if klass != "":
                        self.log()
                        self.log('--> post:  ', post)
                        self.log('--> url:   ', url)
                        self.log('--> title: ', title)
                        self.log('--> class: ', klass)
                        articles.append({'title':title, 'url':url})
              if articles:
                  feeds.append((section_title, articles)) #manage of build the toc is incorrect (need change)
          return feeds


    keep_only_tags = [dict(attrs={'class':['contentbox_no_top_border']})]
    remove_tags = [dict(name='div', attrs={'class':'gallery'})]
I need help with this attempt to change the date


06-10-2011, 11:45 AM   #9
Starson17
Quote:
Originally Posted by schuster View Post
the new problem is that only last month's archive has the full content,
so I have to take today's date and step back one month.
at the moment I don't know the right way, because I get an error message when the recipe runs.
I need help with this attempt to change the date
Can you explain what you are trying to do (so I don't have to read the code)? Are you trying to send some date info to the site (cookie, header, etc.) so it sends you something different from the same URL? Are you trying to retrieve something from the site by calculating a date and using that date as part of the URL? Are you trying to change the date displayed by the recipe in the title? Are you trying to control what the recipe tries to fetch by adjusting the date relative to the article age parameter?

Edit: I see it's probably the second. Can you get the current date?

Last edited by Starson17; 06-10-2011 at 11:47 AM.

06-10-2011, 02:17 PM   #10
schuster
yes,
that is what I am trying to do.
getting the current date and going back one month is no problem.
I believe the problem is that I have to make the values returned from "def date_manage():" global.
parse_index is looking for INDEX as a global variable.

please do not laugh at my pathetic attempt.
all of this is a great way to learn something. It's fun when you get a little further than yesterday.

but for a complete beginner it is very hard.

all I want to know at this stage is:
is this the right way?
or am I overcomplicating it?

06-10-2011, 02:28 PM   #11
Starson17
Quote:
Originally Posted by schuster View Post
all I want to know at this stage is:
is this the right way?
or am I overcomplicating it?
If you are doing what I think you are doing, it looks like one "right way" to me. There are many right ways. First, you get the current date. You get that with
Code:
from calibre import strftime
# inside a recipe method, where self is the recipe instance
testdate = strftime(self.timefmt)
print 'testdate is: ', testdate
It looks like you are trying to fetch a URL using the month previous to the current month, so you need to figure out what that month is, then build the URL. You can do that manually, or use date handling functions.
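
As a sketch of the "date handling functions" route, the snippet below uses only the standard library's datetime module instead of calibre's strftime. The helper name `previous_month_url` is made up for the example, and the zero-padded 'YYYY-MM' form is an assumption based on the '2008-05' archive URL posted earlier in the thread.

```python
import datetime

ARCHIVE = 'http://www.nationalgeographic.de/archive/'  # base URL from the recipe


def previous_month_url(today=None):
    """Build the archive URL for the month before `today`."""
    today = today or datetime.date.today()
    # Stepping back one day from the 1st always lands in the previous month,
    # which also covers January -> December of the previous year.
    last_month = today.replace(day=1) - datetime.timedelta(days=1)
    # Assumption: the site expects a zero-padded month, as in '2008-05'
    return '%s%04d-%02d' % (ARCHIVE, last_month.year, last_month.month)


print(previous_month_url(datetime.date(2011, 6, 10)))
print(previous_month_url(datetime.date(2012, 1, 15)))
```

Working with date objects avoids slicing the month string by hand, which is easy to get wrong for October through December.
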

06-10-2011, 03:05 PM   #12
schuster
got it

after a fresh cup of coffee, building the URL works.

you are right, there are many ways to solve a problem.
this one IS solved now.

here is how:

Code:
import string, re
from calibre import strftime
from calibre.web.feeds.recipes import BasicNewsRecipe
class AdvancedUserRecipe(BasicNewsRecipe):

    title = '007_National Geo_run'
    description = 'Magazin des NG '
    __author__ = 'schuster'
    publisher = 'Aus dem Online-Archiv des NG'
    language = 'de'
    cover_url = 'http://www.nationalgeographic.de/images/national-geographic-logo.jpg'
    masthead_url = 'http://www.nationalgeographic.de/images/national-geographic-logo.jpg'
    INDEX = 'http://www.nationalgeographic.de/archive/'

    def parse_index(self):
##--------------------------------------------------------------------------------------
        year = int(strftime('%Y'))                      ## current year as a number
        month = int(strftime('%m')) - 1                 ## last month; use the whole '%m' string, not only its second digit
        if month < 1:                                   ## if it is january
           year = year - 1                              ## go back one year
           month = 12                                   ## and set month to december
        year = '%d' % year                              ## back to a string for building the url
        month = '%02d' % month                          ## zero-padded, like the archive url '2008-05'
##--------------------------------------------------------------------------------------
        articles = []
        soup = self.index_to_soup(self.INDEX+ year + '-' + month)
        feeds = []
        for section in soup.findAll('div', attrs={'class':'searchresult_text'}):
              section_title = self.tag_to_string(section.find('headline-middle_no_margin black'))
              articles = []
              for post in section.findAll('a', href=True):
                  url = post['href']
                  split_url = url.split("/")
                  section_title = split_url[1]
                  if url.startswith('/'):
                   url = 'http://www.nationalgeographic.de'+url
                   title = self.tag_to_string(post)
                   if str(post).find('class=') > 0:
                      klass = post['class']
                      if klass != "":
                        self.log()
                        self.log('--> post:  ', post)
                        self.log('--> url:   ', url)
                        self.log('--> title: ', title)
                        self.log('--> class: ', klass)
                        articles.append({'title':title, 'url':url})
              if articles:
                  feeds.append((section_title, articles))
        return feeds


    keep_only_tags = [dict(attrs={'class':['contentbox_no_top_border']})]
    remove_tags = [dict(name='div', attrs={'class':'gallery'})]
great feeling

- - - - - - - - - - - - - - - - - -

the next problem is getting the TOC output right, let me see...

Last edited by schuster; 06-10-2011 at 03:27 PM.

06-10-2011, 04:16 PM   #13
Starson17
Quote:
Originally Posted by schuster View Post
great feeling
It is a great feeling, isn't it? Congratulations!