Got it, or at least it's almost a perfect recipe. Right now it still shows the comments when they are there, but most of the top and bottom have been eliminated. I'd love to figure out how to use remove_tags_before when the tag has no id or class: the article header is marked up as a bare <header>, and I couldn't get it to work right, so I ended up using remove_tags for almost everything.
Here is the recipe at this point in time.
Code:
import re  # needed for the re.compile() patterns in remove_tags

from calibre.web.feeds.news import BasicNewsRecipe
from calibre.ebooks.BeautifulSoup import BeautifulSoup, Tag, BeautifulStoneSoup


class dotnetMagazine(BasicNewsRecipe):
    __author__ = u'Bonni Salles - post in forum if questions for me'
    __version__ = '1.0'
    __license__ = 'GPL v3'
    __copyright__ = u'2013, Bonni Salles'

    title = '.net '
    oldest_article = 7
    no_stylesheets = True
    encoding = 'utf8'
    use_embedded_content = False
    language = 'en'
    remove_empty_feeds = True
    extra_css = ' body{font-family: Arial,Helvetica,sans-serif } img{margin-bottom: 0.4em} '

    # Drop everything that follows the footer block.
    remove_tags_after = [
        dict(name='div', attrs={'class': 'footer-content'}),
    ]

    # Disabled: did not work as expected.
    # remove_tags_before = [
    #     dict(name='div', attrs={'id': 'main-content'}),
    # ]

    remove_tags = [
        dict(name='div', attrs={'class': 'item-list'}),
        # Matches every <header>, including the one inside the article.
        dict(name=['header', 'footer']),
        dict(attrs={'class': re.compile('(^|| )menu($|| )', re.DOTALL)}),
        dict(name='h4', attrs={'class': 'std-hdr'}),
        dict(name=['script', 'noscript']),
        dict(name='div', attrs={'id': 'comments-form'}),
        dict(name='div', attrs={'id': re.compile('advertorial_block_($|| )')}),
        dict(name='div', attrs={'id': 'right-col'}),
    ]

    # remove_attributes = ['border', 'cellspacing', 'align', 'cellpadding', 'colspan', 'valign', 'vspace', 'hspace',
    #                      'alt', 'width', 'height', 'style']
    # extra_css = 'img {max-width: 100%; display: block; margin: auto;} .captioned-image div {text-align: center;
    #              font-style: italic;}'

    feeds = [
        (u'net', u'http://feeds.feedburner.com/net/topstories'),
    ]
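A side note on the class pattern, in case it helps anyone reading the recipe: as far as I can tell, the doubled bar in '(^|| )menu($|| )' adds an empty alternative, so the pattern behaves like a plain substring search for "menu". The quick check below compares it with the single-bar form; the sample class strings are made up for illustration and are not taken from the site.

```python
import re

# Pattern as used in remove_tags: the doubled bar adds an empty alternative.
loose = re.compile('(^|| )menu($|| )')
# Single-bar form: "menu" must be a whole space-separated class name.
strict = re.compile('(^| )menu($| )')

# Hypothetical class attribute values, for illustration only.
samples = ['menu', 'main menu', 'submenu-item']

for cls in samples:
    print(cls, bool(loose.search(cls)), bool(strict.search(cls)))
```

If the looser match is what keeps the recipe clean, it is probably fine to leave it as it is.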
Here is the area of the article that I'm trying to work with:
Code:
</ul> </nav>
</div>
<div id="main-content">
<div id="content" >
<article class="node node-news sticky" >
<header>
<h1 class="title"><span>GAAD 2013 wants accessibility on web devs' minds</span></h1>
<div class="submitted" >
By <span class="author-name">Craig Grannell</span> on <time datetime="2013-05-09T11:35:55+00:00" >May 09, 2013</time> <div class="item-list share-links" ><h3>Share this article</h3><ul><li class="twitter-button first"><a href="http://twitter.com/share?url=http://www.netmagazine.com/news/gaad-2013-wants-accessibility-web-devs-minds-132742&text=GAAD%202013%20wants%20accessibility%20on%20web%20devs%27%20minds | .net" class="twitter-share-button" data-count="none">Tweet</a></li>
<li class="facebook-button"><iframe src="http://www.facebook.com/plugins/like.php?href=http://www.netmagazine.com/news/gaad-2013-wants-accessibility-web-devs-minds-132742&send=false&layout=button_count&width=47&show_faces=false&action=like&colorscheme=light&font=arial&height=21" scrolling="no" frameborder="0" style="border:none; overflow:hidden; width:47px; height:21px;" allowTransparency="true"></iframe></li>
<li class="googleplus-button"><g:plusone size="medium" count="false" ></g:plusone></li>
<li class="linkedin-button"><script type="in/share" ></script> </li>
<li class="shorturl-button inactive last"><div class="shorturl-box"><input type="text" value="http://netm.ag/15MLtx1" /><div class="shorturl-close"></div></div><span class="shorturl-link">Short url</span></li>
</ul></div> </div>
</header>
<div class="content">
Can someone please tell me what to do to get remove_tags_before to work? There is also a <header id="header"> near the beginning of the page, which is not where I want the article to start.
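For reference, here is an untested sketch of what I plan to try next. It assumes remove_tags_before accepts the same kind of dict as the entries in remove_tags, so it can target the <h1 class="title"> inside the article header instead of the bare <header> itself. It also assumes that my dict(name=['header','footer']) entry is part of the problem, since it would strip the article <header> along with the site one. The 'share-links' pattern is my guess at how to catch the share buttons; the other remove_tags entries from the recipe would stay as they are.

```python
import re

# Untested sketch: these would replace the matching attributes in the recipe class.

# Start at the article title, which has a class even though its <header> does not.
remove_tags_before = dict(name='h1', attrs={'class': 'title'})

remove_tags = [
    # Only the site banner; a bare name='header' would also match the
    # article <header> and take the <h1 class="title"> with it.
    dict(name='header', attrs={'id': 'header'}),
    dict(name='footer'),
    # Assumption: the share buttons can be matched on the 'share-links' class.
    dict(name='div', attrs={'class': re.compile('share-links')}),
]
```

If the title still goes missing with this, the next thing I would check is whether another remove_tags entry is matching something inside the article header.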