Old 09-16-2010, 04:11 PM   #2731
Starson17
Quote:
Originally Posted by TonytheBookworm
Starson17,
I simply am not understanding it yet even after reading the documentation and some of the code you have posted. Basically I'm wondering why this will not work...
Code:
for credit_tag in soup.findAll('span', attrs={'class':['imageCredit rightFloat']}):
    p = Tag(soup, 'p')
    span.replaceWith(p)
I get no soup and the article is blank
1) You should be doing credit_tag.replaceWith(p) instead of span.replaceWith(p).
2) It's not sufficient to just replace your found credit_tag (the span tags) with your new Tag p. You created p from whole cloth: it has no contents, so when you did the replaceWith (or tried to) you lost everything inside the span. A Tag object is the entire tag, from <span> to </span> or <p> to </p>, with all content in between. Don't expect replaceWith to do a search and replace of the text string "span" with the text "p". (See the sketch below for a version that keeps the contents.)

If you want that kind of wholesale renaming, similar code is often used to replace all <table>, <tr> and <td> tags with <div>. I believe that code is posted in this thread and used in lots of (older) recipes; it's basically what linearize_tables does. Let me know if you need me to find and post it.
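Here's one way to do the span-to-p swap without losing anything (an untested sketch): build the new tag, move the span's children into it, then do the replace.
Code:
    for credit_tag in soup.findAll('span', attrs={'class':['imageCredit rightFloat']}):
        p = Tag(soup, 'p')
        # Move the span's children into the new <p> so no content is lost
        for item in credit_tag.contents[:]:
            item.extract()
            p.append(item)
        credit_tag.replaceWith(p)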

Old 09-16-2010, 04:14 PM   #2732
TonytheBookworm
Quote:
Originally Posted by marbs
Thanks for the quick reply. So I fixed the code (it is indented now).
Still a bit lost on how to test the thing.
I opened CMD, navigated to the calibre folder, and ran "ebook-convert --list-recipes"; I got a list of 588 recipes. I found the recipe folder and added my recipe (the_marker.recipe) to it, but I still have a list of only 588. What's more, when I try "ebook-convert TheSun.recipe TheSun --test -vv > TheSun.txt" (an example from the thread, page 117 I think) I get "Access is denied".
I'm sorry, but I need a little more guidance.
thanks again
Read this:


To test your recipe, let's say it is called marbs.recipe:

At the command line (CMD), when you are in the calibre directory (for example c:/program files/calibre2),

type mkdir output_dir
This will make the output directory for you; it's a one-time deal.

Then
type ebook-convert marbs.recipe output_dir --test -vv > myrecipe.txt

This will put some HTML files in the output_dir you created. Just click on index.html to see what your output will look like.

Then, if you notice issues, you can use Notepad or whatever to open the myrecipe.txt file and look for all the errors and other messages that were generated by the -vv flag.
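Putting it all together, the whole test cycle looks like this (Windows CMD; the recipe name and install path are just examples):
Code:
    cd "c:\program files\calibre2"
    mkdir output_dir
    ebook-convert marbs.recipe output_dir --test -vv > myrecipe.txt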

Old 09-16-2010, 04:17 PM   #2733
TonytheBookworm
Quote:
Originally Posted by Starson17
1) You should be doing credit_tag.replaceWith(p) instead of span.replaceWith(p).
2) It's not sufficient to just replace your found credit_tag (the span tags) with your new Tag p. You created p from whole cloth: it has no contents, so when you did the replaceWith (or tried to) you lost everything inside the span. A Tag object is the entire tag, from <span> to </span> or <p> to </p>, with all content in between. Don't expect replaceWith to do a search and replace of the text string "span" with the text "p".

If you want that kind of wholesale renaming, similar code is often used to replace all <table>, <tr> and <td> tags with <div>. I believe that code is posted in this thread and used in lots of (older) recipes; it's basically what linearize_tables does. Let me know if you need me to find and post it.
I'll look for it and see what I come up with. Thanks for the reply. It at least gives me an idea of what I'm doing wrong and where to go. I think I have learned a lot from you and others, but there is still so much to learn, which, even though frustrating at times, is fun all the same because I like the challenge.
Again, thanks.
Old 09-16-2010, 04:22 PM   #2734
Starson17
Quote:
Originally Posted by marbs
Thanks for the quick reply. So I fixed the code (it is indented now).
Still a bit lost on how to test the thing.
I opened CMD, navigated to the calibre folder, and ran "ebook-convert --list-recipes"; I got a list of 588 recipes. I found the recipe folder and added my recipe (the_marker.recipe) to it, but I still have a list of only 588. What's more, when I try "ebook-convert TheSun.recipe TheSun --test -vv > TheSun.txt" (an example from the thread, page 117 I think) I get "Access is denied".
I'm sorry, but I need a little more guidance.
thanks again
That's OK, no apology required. ebook-convert should run from anywhere. The normal way to do this is to put your recipe, the_marker.recipe, into any folder, go there, then run:
Code:
ebook-convert the_marker.recipe the_marker --test -vv > the_marker.txt
The folder with 588 recipes holds the built-in recipes; you can't add any just by putting a new one in there. You add custom recipes inside the GUI, but don't do that until you have the recipe working with the above command.
Old 09-16-2010, 04:27 PM   #2735
TonytheBookworm
Okay, Starson17, I fold.
I did a find in the recipes and I can't see anything with replaceAll. I see a lot of replaceWith, but that is exactly what I'm wanting to learn how to do: find either tags or text and replace it with whatever I like.
Because let's say I come across a blog that has good content, but they cuss a lot. The cussing doesn't bother me so much, but I know people it would bother. So what I would eventually like to learn is how to use re.search(.....) to find words like boobuckle and replace them with censored or something to that nature.

Thanks
Old 09-16-2010, 04:49 PM   #2736
Starson17
Quote:
Originally Posted by TonytheBookworm
Okay, Starson17, I fold.
You just change the name of the tag after you find it:

Code:
    def postprocess_html(self, soup, first):
        # Renaming the tag keeps its contents; only the tag name changes
        for tag in soup.findAll(name=['table', 'tr', 'td']):
            tag.name = 'div'
Old 09-16-2010, 04:55 PM   #2737
Starson17
Quote:
Originally Posted by TonytheBookworm
What I would eventually like to learn is how to use re.search(.....) to find words like boobuckle and replace them with censored or something to that nature.
Like this:
Code:
    preprocess_regexps = [
        # Replace every case-insensitive match of 'boobuckle' with 'censored'
        (re.compile(r'boobuckle', re.DOTALL|re.IGNORECASE), lambda match: 'censored')
    ]
Old 09-16-2010, 05:16 PM   #2738
marbs
Yup. I found that message with Google, ran the test, then ran a full download (without --test). I have the output file; come to think of it, I will attach them both.
In the full download, the first link of the first feed is not there; in other words, the article is missing.

what now?
Attached Files
output files.zip (496.8 KB)

Old 09-16-2010, 05:37 PM   #2739
TonytheBookworm
Quote:
Originally Posted by Starson17
Like this:
Code:
    preprocess_regexps = [
        # Replace every case-insensitive match of 'boobuckle' with 'censored'
        (re.compile(r'boobuckle', re.DOTALL|re.IGNORECASE), lambda match: 'censored')
    ]
Wow!!! Thanks. I have been trying to figure that one out for the longest time. Also, I can't believe that simply changing the tag name, as in your previous post, worked so simply. Again, I appreciate your time.
Old 09-16-2010, 05:42 PM   #2740
Starson17
Quote:
Originally Posted by TonytheBookworm
Wow!!! Thanks. I have been trying to figure that one out for the longest time. Also, I can't believe that simply changing the tag name, as in your previous post, worked so simply. Again, I appreciate your time.
BTW, the re.DOTALL|re.IGNORECASE options are ..... optional. re.DOTALL lets '.' match across line breaks, so the pattern can search past the end of the line (pretty much required for HTML); re.IGNORECASE makes the match case-insensitive.
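To see what re.DOTALL changes, here's a quick illustrative snippet (not from any recipe):
Code:
    import re
    html = '<p><em>Disclosure:\nSlate</em></p>'
    print re.search(r'Disclosure:.*Slate', html)             # None: '.' stops at the newline
    print re.search(r'Disclosure:.*Slate', html, re.DOTALL)  # matches across the line break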
Old 09-16-2010, 07:02 PM   #2741
Consti
Recipe for Brand Eins Wirtschaftsmagazin

That's a monthly German magazine, pretty expensive (around $10 per issue), BUT they keep all their articles on their (free) archive page. I've spent the last couple of hours making my first recipe for calibre:

If you just want the recipe, get it here: http://github.com/consti/BrandEins-R...andeins.recipe

The repository is here: http://github.com/consti/BrandEins-Recipe

Let me know what you think!

Some screenshots are in my posterous post:
http://consti.de/brand-eins-my-first...ipe-for-my-kin

@kovidgoyal thanks for making calibre! I would be happy if you could add my recipe to the next release!

--
Consti
Old 09-16-2010, 07:06 PM   #2742
kovidgoyal
@Consti: done
Old 09-16-2010, 08:24 PM   #2743
nickredding
Slate formatting

Quote:
Originally Posted by sdow1
Not sure if this is the right thread, but since it's where most of the content for Slate seems to be, I thought I'd try here first.

For the past few days, every time I download the calibre feed for Slate, I just get the cover and the table of contents (such as it is, since it just says "all articles"). No content whatsoever. I waited through the weekend in case Slate itself is relatively "dead" on weekends, but today it was the same thing. I've also tried downloading at different times of day, in case that was the problem, but I get the same result: cover, but no content.

Help!!
Slate has made a minor formatting change that causes postprocess_html to trap on every article, hence no content. The code that traps is actually redundant, so I removed it to solve the problem. Here is my version of the recipe; it has some modest improvements over the original. Note that you can make the recipe fetch all content (slate_complete=True) or just the previous 7 days' worth (slate_complete=False).
Code:
#!/usr/bin/env  python
__copyright__ = '2008, Kovid Goyal <kovid at kovidgoyal.net>'

__license__   = 'GPL v3'

'''
calibre recipe for slate.com
'''

import re
from calibre.web.feeds.recipes import BasicNewsRecipe
from calibre.ebooks.BeautifulSoup import BeautifulSoup, NavigableString, CData, Comment, Tag

class Slate(BasicNewsRecipe):
    # Method variables for customizing downloads
    description             = 'A general-interest publication offering analysis and commentary about politics, news and culture.'
    __author__              = 'GRiker and Sujata Raman / enhanced by Nick Redding'
    max_articles_per_feed   = 100
    oldest_article          = 14
    recursions              = 0
    delay                   = 0
    simultaneous_downloads  = 5
    timeout                 = 120.0
    timefmt                 = ''
    feeds                   = None
    no_stylesheets          = True
    encoding                = None
    language = 'en'

    slate_complete = True
    if slate_complete:
        title = 'Slate (complete)'
    else:
        title = 'Slate (weekly)'

    # Method variables for customizing feed parsing
    summary_length          = 250
    use_embedded_content    = None

    # Method variables for pre/post processing of HTML
    preprocess_regexps = [ (re.compile(r'<p><em>Disclosure: <strong>Slate</strong> is owned by the Washington Post.*</p>',
                                        re.DOTALL|re.IGNORECASE),
                                        lambda match: ''),
                           (re.compile(r'<p><strong><em>Join the discussion about this story on.*</p>',
                                        re.DOTALL|re.IGNORECASE),
                                        lambda match: '')   ]

    match_regexps           = []

    # The second entry is for 'Big Money', which comes from a different site, uses different markup
    keep_only_tags          = [dict(attrs={   'id':['article_top', 'article_body']}),
                               dict(attrs={   'id':['content']})  ]

    # The second entry is for 'Big Money', which comes from a different site, uses different markup
    remove_tags             = [dict(attrs={   'id':['toolbox','recommend_tab','insider_ad_wrapper',
                                                    'article_bottom_tools_cntr','fray_article_discussion','fray_article_links','bottom_sponsored_links','author_bio',
                                                    'bizbox_links_bottom','ris_links_wrapper','BOXXLE',
                                                    'comments_button','add_comments_button','comments-to-fray','marriott_ad',
                                                    'article_bottom_tools','recommend_tab2','fbog_article_bottom_cntr']}),
                               dict(attrs={    'id':['content-top','service-links-bottom','hed']})   ]

    excludedDescriptionKeywords =   ['Slate V','Twitter feed','podcast']
    excludedTitleKeywords =         ['Gabfest','Slate V','on Twitter']
    excludedAuthorKeywords =        []
    excludedContentKeywords =       ['http://twitter.com/Slate']

    extra_css = '''
                  .h1_subhead{font-family:Arial; font-size:small; }
                   h1{font-family:Verdana; font-size:large; }
                 .byline        {font-family:Georgia;   margin-bottom: 0px; }
                 .dateline      {font-family:Arial;  font-size: smaller; height: 0pt;}
                 .imagewrapper  {font-family:Verdana;font-size:x-small; }
                 .source        {font-family:Verdana; font-size:x-small;}
                 .credit        {font-family:Verdana; font-size:     smaller;}
                 #article_body  {font-family:Verdana; }
                 #content  {font-family:Arial; }
                 .caption{font-family:Verdana;font-style:italic; font-size:x-small;}
                 h3{font-family:Arial; font-size:small}
                  '''

    # Local variables to extend class
    baseURL = 'http://slate.com'
    section_dates = []

    # class extension methods
    def tag_to_strings(self, tag):
        if not tag:
            return ''
        if isinstance(tag, basestring):
            return tag
        strings = []
        for item in tag.contents:
            if isinstance(item, (NavigableString, CData)):
                strings.append(item.string)
            elif isinstance(item, Tag):
                res = self.tag_to_string(item,use_alt=False)
                if res:
                    strings.append(res)
        return strings

    def extract_named_sections(self):
        soup = self.index_to_soup( self.baseURL )
        soup_nav_bar = soup.find(True, attrs={'id':'nav'})
        briefing_nav = soup.find('li')
        briefing_url = briefing_nav.a['href']
        for section_nav in soup_nav_bar.findAll('li'):
            section_name = self.tag_to_string(section_nav,use_alt=False)
            self.section_dates.append(section_name)
            
        soup = self.index_to_soup(briefing_url)

        self.log("Briefing url = %s " % briefing_url)
        section_lists = soup.findAll('ul','view_links_list')

        sections = []
        for section in section_lists :
            sections.append(section)
        return sections


    def extract_dated_sections(self):
        soup = self.index_to_soup( self.baseURL )
        soup_top_stories = soup.find(True, attrs={'id':'tap3_cntr'})
        if soup_top_stories:
            self.section_dates.append("Top Stories")
            self.log("SELECTION TOP STORIES %s" % "Top Stories")
    
        soup = soup.find(True, attrs={'id':'toc_links_container'})

        todays_section = soup.find(True, attrs={'class':'todaydateline'})
        self.section_dates.append(self.tag_to_string(todays_section,use_alt=False))
        self.log("SELECTION DATE %s" % self.tag_to_string(todays_section,use_alt=False))

        older_section_dates = soup.findAll(True, attrs={'class':'maindateline'})
        for older_section in older_section_dates :
            self.section_dates.append(self.tag_to_string(older_section,use_alt=False))
            self.log("SELECTION DATE %s" % self.tag_to_string(older_section,use_alt=False))

        if soup_top_stories:
            headline_stories = soup_top_stories
            self.log("HAVE top_stories")
        else:
            headline_stories = None
            self.log("NO top_stories")
        section_lists = soup.findAll('ul')
        # Prepend the headlines to the first section
        if headline_stories:
            section_lists.insert(0,headline_stories)

        sections = []
        for section in section_lists :
            sections.append(section)
        return sections


    def extract_section_articles(self, sections_html) :
        # Find the containers with section content
        sections = sections_html

        articles = {}
        key = None
        ans = []

        for (i,section) in enumerate(sections) :

            # Get the section name
            if section.has_key('id') :
                self.log("PROCESSING SECTION id = %s" % section['id'])
                key = self.section_dates[i]
                if key.startswith("Pod"):
                    continue
                if key.startswith("Blog"):
                    continue
                articles[key] = []
                ans.append(key)
            elif self.slate_complete:
                key = self.section_dates[i]
                if key.startswith("Pod"):
                    continue
                if key.startswith("Blog"):
                    continue
                self.log("PROCESSING SECTION name = %s" % key)
                articles[key] = []
                ans.append(key)    
            else :
                self.log("SECTION %d HAS NO id" % i);
                continue

            # Get the section article_list
            article_list = section.findAll('li')

            # Extract the article attributes
            for article in article_list :
                bylines = self.tag_to_strings(article)
                url = article.a['href']
                title = bylines[0]
                full_title = self.tag_to_string(article,use_alt=False)
                #self.log("ARTICLE TITLE%s" % title)
                #self.log("ARTICLE FULL_TITLE%s" % full_title)
                #self.log("URL %s" % url)
                author = None
                description = None
                pubdate = None

                if len(bylines) == 2 and self.tag_to_string(article).find("Today's Papers") > 0 :
                    description = "A summary of what's in the major U.S. newspapers."

                if len(bylines) == 3 :
                    author = bylines[2].strip()
                    author = re.sub('[\r][\n][\t][\t\t]','', author)
                    author = re.sub(',','', author)
                    if bylines[1] is not None :
                        description = bylines[1]
                        full_byline = self.tag_to_string(article)
                        if full_byline.find('major U.S. newspapers') > 0 :
                            description = "A summary of what's in the major U.S. newspapers."

                if len(bylines) > 3  and author is not None:
                    author += " | "
                    for (i,substring) in enumerate(bylines[3:]) :
                        #print "substring: %s" % substring.encode('cp1252')
                        author += substring.strip()
                        if i < len(bylines[3:]) :
                            author += " | "

                # Skip articles whose descriptions contain excluded keywords
                if description is not None and len(self.excludedDescriptionKeywords):
                    excluded = re.compile('|'.join(self.excludedDescriptionKeywords))
                    found_excluded = excluded.search(description)
                    if found_excluded :
                        self.log("  >>> skipping %s (description keyword exclusion: %s) <<<\n" % (title, found_excluded.group(0)))
                        continue

                # Skip articles whose title contain excluded keywords
                if full_title is not None and len(self.excludedTitleKeywords):
                    excluded = re.compile('|'.join(self.excludedTitleKeywords))
                    #self.log("evaluating full_title: %s" % full_title)
                    found_excluded = excluded.search(full_title)
                    if found_excluded :
                        self.log("  >>> skipping %s (title keyword exclusion: %s) <<<\n" % (title, found_excluded.group(0)))
                        continue

                # Skip articles whose author contain excluded keywords
                if author is not None and len(self.excludedAuthorKeywords):
                    excluded = re.compile('|'.join(self.excludedAuthorKeywords))
                    found_excluded = excluded.search(author)
                    if found_excluded :
                        self.log("  >>> skipping %s (author keyword exclusion: %s) <<<\n" % (title, found_excluded.group(0)))
                        continue

                skip_this_article = False
                # Check to make sure we're not adding a duplicate
                for article in articles[key] :
                    if article['url'] == url :
                        skip_this_article = True
                        self.log("SKIPPING DUP %s" % url)
                        break

                if skip_this_article :
                    continue

                # Build the dictionary entry for this article
                feed = key
                if not articles.has_key(feed) :
                    articles[feed] = []
                articles[feed].append(dict(title=title, url=url, date=pubdate, description=description,
                                           author=author, content=''))
                #self.log("KEY %s" % feed)
                #self.log("APPENDED %s" % url)
            # Promote 'newspapers' to top
            for (i,article) in enumerate(articles[feed]) :
                if article['description'] is not None :
                    if article['description'].find('newspapers') > 0 :
                        articles[feed].insert(0,articles[feed].pop(i))


        ans = [(key, articles[key]) for key in ans if articles.has_key(key)]
        return ans

    def print_version(self, url) :
        return url + 'pagenum/all/'

    # Class methods
    def parse_index(self) :    
        if self.slate_complete:
            sections = self.extract_named_sections()
        else:
            sections = self.extract_dated_sections()
        section_list = self.extract_section_articles(sections)
        return section_list

    def get_browser(self) :
        return BasicNewsRecipe.get_browser()

    def get_masthead_url(self):
        masthead = 'http://img.slate.com/images/redesign2008/slate_logo.gif'
        br = BasicNewsRecipe.get_browser()
        try:
            br.open(masthead)
        except:
            self.log("\nMasthead unavailable")
            masthead = None
        return masthead

    def stripAnchors(self,soup):
        body = soup.find('div',attrs={'id':['article_body','content']})
        if body is not None:
            paras = body.findAll('p')
            if paras is not None:
                for para in paras:
                    aTags = para.findAll('a')
                    if aTags is not None:
                        for a in aTags:
                            if a.img is None:
                                #print repr(a.renderContents())
                                a.replaceWith(a.renderContents().decode('utf-8','replace'))
        return soup

    def preprocess_html(self, soup) :

        # Remove 'grayPlus4.png' images
        imgs = soup.findAll('img')
        if imgs is not None:
            for img in imgs:
                if re.search("grayPlus4.png",str(img)):
                    img.extract()

        # Delete article based upon content keywords
        if len(self.excludedDescriptionKeywords):
            excluded = re.compile('|'.join(self.excludedContentKeywords))
            found_excluded = excluded.search(str(soup))
            if found_excluded :
                print "No allowed content found, removing article"
                raise Exception('Rejected article')

        # Articles from www.thebigmoney.com use different tagging for byline, dateline and body
        head = soup.find('head')
        if head.link is not None and re.search('www\.thebigmoney\.com', str(head)):
            byline = soup.find('div',attrs={'id':'byline'})
            if byline is not None:
                byline['class'] = byline['id']

            dateline = soup.find('div',attrs={'id':'dateline'})
            if dateline is not None:
                dateline['class'] = dateline['id']

            body = soup.find('div',attrs={'id':'content'})
            if body is not None:
                body['class'] = 'article_body'

            # Synthesize a department kicker
            h3Tag = Tag(soup,'h3')
            emTag = Tag(soup,'em')
            emTag.insert(0,NavigableString("the big money: Today's business press"))
            h3Tag.insert(0,emTag)
            soup.body.insert(0,h3Tag)

        # Strip anchors from HTML
        return self.stripAnchors(soup)

    def postprocess_html(self, soup, first_fetch) :

        # Fix up dept_kicker as <h3><em>
        dept_kicker = soup.find('div', attrs={'class':'department_kicker'})
        if dept_kicker is not None :
            kicker_strings = self.tag_to_strings(dept_kicker)
            kicker = ''.join(kicker_strings[2:])
            kicker = re.sub('\.','',kicker)
            h3Tag = Tag(soup, "h3")
            emTag = Tag(soup, "em")
            emTag.insert(0,NavigableString(kicker))
            h3Tag.insert(0, emTag)
            dept_kicker.replaceWith(h3Tag)
        else:
            self.log("No kicker--return null")
            return None

        # Fix up the concatenated byline and dateline
        byline = soup.find(True,attrs={'class':'byline'})
        if byline is not None :
            bylineTag = Tag(soup,'div')
            bylineTag['class'] = 'byline'
            #bylineTag['height'] = '0em'
            bylineTag.insert(0,self.tag_to_string(byline))
            byline.replaceWith(bylineTag)

        dateline = soup.find(True, attrs={'class':'dateline'})
        if dateline is not None :
            datelineTag = Tag(soup, 'div')
            datelineTag['class'] = 'dateline'
            #datelineTag['margin-top'] = '0em'
            datelineTag.insert(0,self.tag_to_string(dateline))
            dateline.replaceWith(datelineTag)

        # Change captions to italic, add <hr>
        for caption in soup.findAll(True, {'class':'caption'}) :
            if caption is not None:
                emTag = Tag(soup, "em")
                emTag.insert(0, '<br />' + self.tag_to_string(caption))
                hrTag = Tag(soup, 'hr')
                emTag.insert(1, hrTag)
                caption.replaceWith(emTag)

        # Fix photos
        for photo in soup.findAll('span',attrs={'class':'imagewrapper'}):
            if photo.a is not None and photo.a.img is not None:
                divTag = Tag(soup,'div')
                divTag['class'] ='imagewrapper'
                divTag.insert(0,photo.a.img)
                photo.replaceWith(divTag)

        return soup

    def postprocess_book(self, oeb, opts, log) :

        def extract_byline(href) :
            soup = BeautifulSoup(str(oeb.manifest.hrefs[href]))
            byline = soup.find(True,attrs={'class':'byline'})
            if byline is not None:
                return self.tag_to_string(byline,use_alt=False)
            else :
                return None

        def extract_description(href) :
            soup = BeautifulSoup(str(oeb.manifest.hrefs[href]))
            paragraphs = soup.findAll('p')
            for p in paragraphs :
                if self.tag_to_string(p,use_alt=False).startswith('By ') or \
                   self.tag_to_string(p,use_alt=False).startswith('Posted '):
                    continue
                comment = p.find(text=lambda text:isinstance(text, Comment))
                if comment is not None:
                    continue
                else:
                    return self.tag_to_string(p,use_alt=False)[:self.summary_length] + '...'

            return None

        # Method entry point here
        # Single section toc looks different than multi-section tocs
        if oeb.toc.depth() == 2 :
            for article in oeb.toc :
                if article.author is None :
                    article.author = extract_byline(article.href)
                if article.description is None :
                    article.description = extract_description(article.href)
        elif oeb.toc.depth() == 3 :
            for section in oeb.toc :
                for article in section :
                    if article.author is None :
                        article.author = extract_byline(article.href)
                    if article.description is None :
                        article.description = extract_description(article.href)
Old 09-17-2010, 03:02 AM   #2744
burbank_atl
Python Env

Is there a simple way to include the Calibre packages in a full Python environment, such as Python for Windows?

I wanted to step through a recipe to try and better understand the process.
Old 09-17-2010, 07:55 AM   #2745
Starson17
Quote:
Originally Posted by burbank_atl
Is there a simple way to include the Calibre packages in a full Python environment, such as Python for Windows?

I wanted to step through a recipe to try and better understand the process.
Read here.