After weeks of tinkering with the Daily Mirror recipe, I went back to the start and found auto_cleanup was doing a really good job, with a couple of exceptions:
1) The article's byline and date text after the headline are erased.
2) The text "Advertisement >>" is left intact.
The article source for the date is
so I thought using
auto_cleanup_keep = '//a[@class="published"]'
or
auto_cleanup_keep = '//*[@class="published"]'
would mean the date got left in, but it didn't.
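One thing worth checking is whether the class attribute really is exactly "published". The sketch below is only an illustration: the markup in SAMPLE is hypothetical (it is not the Mirror's real HTML), and it uses the standard library's ElementTree rather than the lxml engine calibre uses, so the expressions start with `.//` instead of `//`. It shows that `@class="published"` is an exact string comparison, so an element with several classes would not match either expression.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup for illustration only; the real page may differ.
SAMPLE = """
<div class="article-attr">
  <a class="published" href="#">date in a link</a>
  <span class="published">date in a span</span>
  <p class="published date">date with two classes</p>
</div>
"""

root = ET.fromstring(SAMPLE)


def matched_tags(xpath):
    """Return the tag names of every element the expression selects."""
    return [el.tag for el in root.findall(xpath)]


# Only <a> elements whose class attribute is exactly "published".
print(matched_tags('.//a[@class="published"]'))
# Any element whose class attribute is exactly "published".
print(matched_tags('.//*[@class="published"]'))
```

If the real element carries more than one class, full XPath 1.0 (which lxml supports) offers `contains(@class, "published")`; I am assuming, not asserting, that auto_cleanup_keep would accept such an expression.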
I also tried
preprocess_regexps = [
    (re.compile(r'Advertisement >>', re.IGNORECASE | re.DOTALL), lambda match: '')]
to just delete "Advertisement >>" so even if a class was created by calibre it would be empty. Again no success.
Is the preprocess_regexps call being ignored because auto_cleanup is being used?
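To rule out the pattern itself, here is a quick standalone check that runs outside calibre. It rests on an assumption I have not verified against the Mirror's pages: that the raw HTML writes the arrows as the entities `&gt;&gt;` rather than as literal `>>`. Since preprocess_regexps works on the downloaded source, a pattern with literal `>>` would then never match. The names AD_PATTERN, strip_ad_label and raw are mine, for illustration only.

```python
import re

# The pattern as it stands in the recipe: literal ">>" only.
original = re.compile(r'Advertisement >>', re.IGNORECASE | re.DOTALL)

# A more tolerant variant: optional whitespace, then either literal ">>"
# or the same characters written as HTML entities.
AD_PATTERN = re.compile(r'Advertisement\s*(?:>>|&gt;&gt;)', re.IGNORECASE)


def strip_ad_label(html):
    """Remove every 'Advertisement >>' label from an HTML string."""
    return AD_PATTERN.sub('', html)


# Hypothetical raw source, assuming the site escapes the arrows.
raw = '<p>Advertisement &gt;&gt;</p><p>Story text</p>'

print(original.sub('', raw))   # unchanged: the literal pattern does not match
print(strip_ad_label(raw))     # the label is removed, the story text stays
```

In the recipe this would go in as `preprocess_regexps = [(AD_PATTERN, lambda match: '')]`. If the label still survives, the text may be split by markup or added later, which this sketch does not cover.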
It would be nice to fix this as the file created is smaller than my butchery and seems formatted in a cleaner way.
Here's the simplified code as it stands:
Code:
from calibre.web.feeds.news import BasicNewsRecipe
import re
from calibre.utils.magick import Image, PixelWand


class AdvancedUserRecipe1306061239(BasicNewsRecipe):
    title = u'The Daily Mirror 2'
    description = 'News as provided by The Daily Mirror -UK'
    __author__ = 'Dave Asbury'
    # last updated 30/10/11
    language = 'en_GB'
    cover_url = 'http://yookeo.com/screens/m/i/mirror.co.uk.jpg'
    masthead_url = 'http://www.nmauk.co.uk/nma/images/daily_mirror.gif'

    oldest_article = 2
    max_articles_per_feed = 3
    remove_empty_feeds = True
    remove_javascript = True
    no_stylesheets = True

    # delete the "Advertisement >>" label from the downloaded HTML
    preprocess_regexps = [
        (re.compile(r'Advertisement >>', re.IGNORECASE | re.DOTALL), lambda match: '')]

    extra_css = '''
    body{ text-align: justify; font-family:Arial,Helvetica,sans-serif; font-size:11px; font-size-adjust:none; font-stretch:normal; font-style:normal; font-variant:normal; font-weight:normal;}
    h1{ font-size:16px;}
    '''

    auto_cleanup = True
    #auto_cleanup_keep = '//div[@class="article-attr"]'
    auto_cleanup_keep = '//a[@class="published"]'

    feeds = [
        (u'News', u'http://www.mirror.co.uk/news/rss.xml')
        # example of commented out feed not needed ,(u'Travel','http://www.mirror.co.uk/advice/travel/rss.xml')
    ]

    def postprocess_html(self, soup, first):
        # process all the images, converting each one to grayscale
        for tag in soup.findAll(lambda tag: tag.name.lower()=='img' and tag.has_key('src')):
            iurl = tag['src']
            img = Image()
            img.open(iurl)
            if img < 0:
                raise RuntimeError('Out of memory')
            img.type = "GrayscaleType"
            img.save(iurl)
        return soup