05-10-2013, 12:01 AM   #3
Camper65
Got it, or at least it's almost a perfect recipe. Right now it still shows the comments when there are any, but most of the top and bottom of the page has been eliminated. I'd love to figure out how to use remove_tags_before when the tag has no id or class: the header is marked up as a bare <header>, and I couldn't get it to work right, so I ended up using remove_tags for almost everything.

Here is the recipe as it stands.

Code:
import re  # needed for the re.compile() patterns in remove_tags

from calibre.web.feeds.news import BasicNewsRecipe
from calibre.ebooks.BeautifulSoup import BeautifulSoup, Tag, BeautifulStoneSoup


class dotnetMagazine (BasicNewsRecipe):
    __author__ = u'Bonni Salles - post in forum if questions for me'
    __version__ = '1.0'
    __license__   = 'GPL v3'
    __copyright__ = u'2013, Bonni Salles'
    title                 = '.net '
    oldest_article        = 7
    no_stylesheets        = True
    encoding              = 'utf8'
    use_embedded_content  = False
    language              = 'en'
    remove_empty_feeds    = True
    extra_css             = ' body{font-family: Arial,Helvetica,sans-serif } img{margin-bottom: 0.4em} '

    remove_tags_after = [
         dict(name='div', attrs={'class': 'footer-content'}),
          ]

    #remove_tags_before = [
    #     dict(name='div', attrs={'id': 'main-content'}),
    #     ]
          
    remove_tags = [
         dict(name='div', attrs={'class': 'item-list'}),
         dict(name=['header','footer']),
         # match "menu" as a whole class name; the doubled "||" made the anchors meaningless
         dict(attrs={'class': re.compile(r'(^| )menu($| )')}),
         dict(name='h4', attrs={'class': 'std-hdr'}),
         dict(name=['script', 'noscript']),
         dict(name='div', attrs={'id': 'comments-form'}),
         # any div whose id starts with "advertorial_block_"
         dict(name='div', attrs={'id': re.compile(r'^advertorial_block_')}),
         dict(name='div', attrs={'id': 'right-col'}),

         ]
    # remove_attributes = ['border', 'cellspacing', 'align', 'cellpadding', 'colspan', 'valign', 'vspace', 'hspace',
    #                      'alt', 'width', 'height', 'style']
    # extra_css = 'img {max-width: 100%; display: block; margin: auto;} .captioned-image div {text-align: center; font-style: italic;}'


    feeds = [
               (u'net', u'http://feeds.feedburner.com/net/topstories')
            ]

Here is the area of the article that I'm trying to work with:

Code:
</ul>					</nav>
                </div>

                <div id="main-content">
                  <div id="content" >
                     
                     
                                                                  
                     
                     
                     <article class="node node-news sticky" >

   <header>
                           <h1 class="title"><span>GAAD 2013 wants accessibility on web devs' minds</span></h1>
               
      <div class="submitted" >
                     By <span class="author-name">Craig Grannell</span> on <time datetime="2013-05-09T11:35:55+00:00" >May 09, 2013</time>                             <div class="item-list share-links" ><h3>Share this article</h3><ul><li class="twitter-button first"><a href="http://twitter.com/share?url=http://www.netmagazine.com/news/gaad-2013-wants-accessibility-web-devs-minds-132742&amp;text=GAAD%202013%20wants%20accessibility%20on%20web%20devs%27%20minds | .net" class="twitter-share-button" data-count="none">Tweet</a></li>
<li class="facebook-button"><iframe src="http://www.facebook.com/plugins/like.php?href=http://www.netmagazine.com/news/gaad-2013-wants-accessibility-web-devs-minds-132742&amp;send=false&amp;layout=button_count&amp;width=47&amp;show_faces=false&amp;action=like&amp;colorscheme=light&amp;font=arial&amp;height=21" scrolling="no" frameborder="0" style="border:none; overflow:hidden; width:47px; height:21px;" allowTransparency="true"></iframe></li>
<li class="googleplus-button"><g:plusone size="medium" count="false" ></g:plusone></li>
<li class="linkedin-button"><script type="in/share"  ></script> </li>
<li class="shorturl-button inactive last"><div class="shorturl-box"><input type="text" value="http://netm.ag/15MLtx1" /><div class="shorturl-close"></div></div><span class="shorturl-link">Short url</span></li>
</ul></div>              </div>
          
   </header>
   
   <div class="content">
Can someone please tell me how to get remove_tags_before to work here? There is also a <header id="header"> near the top of the page, which is not where I want the article to start.
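My working guess, which I have not confirmed against calibre itself, is that remove_tags_before cuts at the first tag in the page that matches the dict. If so, a bare dict(name='header') would land on <header id="header"> rather than the one inside the article. Below is a small stand-alone sketch of that idea using only the Python standard library. The SAMPLE markup is a trimmed stand-in for the real page, and first_match is a hypothetical helper of mine, not anything from calibre; it only does exact attribute comparison.

```python
from html.parser import HTMLParser

# Trimmed-down stand-in for the page; the real markup is much longer.
SAMPLE = """
<header id="header"><nav></nav></header>
<div id="main-content"><div id="content">
<article class="node node-news sticky">
  <header><h1 class="title"><span>Headline</span></h1></header>
  <div class="content">Body text</div>
</article>
</div></div>
"""


class TagCollector(HTMLParser):
    """Record every start tag, in document order, as (name, attrs dict)."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        self.tags.append((tag, dict(attrs)))


def first_match(spec, tags):
    """Hypothetical helper: first tag matching a remove_tags-style dict.

    Assumes "first match in document order" is how the real option picks
    its tag; only exact attribute values are compared here.
    """
    wanted = spec.get('attrs', {})
    for name, attrs in tags:
        name_ok = name == spec.get('name', name)
        attrs_ok = all(attrs.get(k) == v for k, v in wanted.items())
        if name_ok and attrs_ok:
            return name, attrs
    return None


collector = TagCollector()
collector.feed(SAMPLE)

# A bare tag name hits the site-wide header first, not the article's header.
print(first_match(dict(name='header'), collector.tags))

# Candidates that only occur where the article begins.
print(first_match(dict(name='article'), collector.tags))
print(first_match(dict(name='h1', attrs={'class': 'title'}), collector.tags))
```

If that guess is right, then remove_tags_before = dict(name='article') or dict(name='h1', attrs={'class': 'title'}) would be the things for me to try next in the recipe, since neither relies on the header having an id or class.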