MobileRead Forums > E-Book Software > Calibre > Recipes

Old 11-06-2011, 09:36 AM   #1
scissors
Addict
Posts: 241
Karma: 1001369
Join Date: Sep 2010
Device: prs300, kindle keyboard 3g
using auto_cleanup and manual clean up together

After weeks of tinkering with the Daily Mirror recipe, I went back to the start and found auto_cleanup was doing a really good job, with a couple of exceptions:

1) The byline and date text after the headline are removed.
2) The text "Advertisement >>" is left intact.

The article source for the date is

Spoiler:
<h1> UK's IMF contribution limit '&pound;40bn' </h1>
<div class="article-attr">

<div class="byline append-1">


<a class="published" href="http://www.mirror.co.uk/news/latest/2011/11/06/"
title="Find all articles published on 6/11/2011 to the Latest section">
6/11/2011
</a>

</div>


so I thought using

auto_cleanup_keep = '//a[@class="published"]'

or

auto_cleanup_keep = '//*[@class="published"]'

would keep the date in the output, but it didn't.
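
As a quick sanity check on the selectors themselves, here is a minimal sketch that runs both expressions against a trimmed, well-formed copy of the markup above. It uses Python's standard-library ElementTree as a stand-in. That is an assumption: calibre evaluates auto_cleanup_keep with its own XPath engine, and ElementTree only supports a subset of XPath, hence the leading `.//`. FRAGMENT and matched_text are hypothetical names made up for illustration.

```python
import xml.etree.ElementTree as ET

# Trimmed, well-formed copy of the article markup (headline omitted,
# closing </div> tags added so that it parses as XML).
FRAGMENT = """
<div class="article-attr">
  <div class="byline append-1">
    <a class="published" href="http://www.mirror.co.uk/news/latest/2011/11/06/"
       title="Find all articles published on 6/11/2011 to the Latest section">
    6/11/2011
    </a>
  </div>
</div>
"""

root = ET.fromstring(FRAGMENT)


def matched_text(xpath):
    """Return the stripped text of every element the expression selects."""
    return [el.text.strip() for el in root.findall(xpath)]


by_tag = matched_text(".//a[@class='published']")
by_any = matched_text(".//*[@class='published']")
print(by_tag, by_any)
```

Both expressions select the same <a> element here, so if the date still goes missing, the selector syntax is probably not the cause.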

I also tried

preprocess_regexps = [
(re.compile(r'Advertisement >>', re.IGNORECASE | re.DOTALL), lambda match: '')]

to simply delete "Advertisement >>", so that even if calibre kept the element containing it, that element would be empty. Again, no success.
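
One possible explanation, which is an assumption and not something confirmed in this thread: if the page source encodes the arrows as HTML entities (Advertisement &gt;&gt;), a pattern containing literal > characters will not match the raw HTML. The sketch below compares the original pattern with a more tolerant one. raw_html is hypothetical sample markup, not taken from the Daily Mirror site.

```python
import re

# The pattern from the recipe: matches only literal ">" characters.
literal = re.compile(r'Advertisement >>', re.IGNORECASE | re.DOTALL)

# A more tolerant pattern: accepts ">" or the entity "&gt;", twice.
tolerant = re.compile(r'Advertisement\s*(?:>|&gt;){2}', re.IGNORECASE)

# Hypothetical raw markup in which the arrows are entity-encoded.
raw_html = '<p>Advertisement &gt;&gt;</p>'

print(literal.sub('', raw_html))   # unchanged: the literal pattern does not match
print(tolerant.sub('', raw_html))  # the advertisement text is removed
```

In a recipe, the tolerant pattern could replace the original entry in preprocess_regexps, paired with the same lambda match: '' replacement.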

Is the call being ignored because auto_cleanup is being used?

It would be nice to fix this as the file created is smaller than my butchery and seems formatted in a cleaner way.

Here's the simplified code as it stands

Spoiler:
Code:
from calibre.web.feeds.news import BasicNewsRecipe
import re
from calibre.utils.magick import Image, PixelWand
class AdvancedUserRecipe1306061239(BasicNewsRecipe):
    title          = u'The Daily Mirror 2'
    description = 'News as provided by The Daily Mirror - UK'

    __author__ = 'Dave Asbury'
    # last updated 30/10/11
    language = 'en_GB'

    cover_url = 'http://yookeo.com/screens/m/i/mirror.co.uk.jpg'

    masthead_url = 'http://www.nmauk.co.uk/nma/images/daily_mirror.gif'


    oldest_article = 2
    max_articles_per_feed = 3
    remove_empty_feeds = True
    remove_javascript     = True
    no_stylesheets = True
    
    preprocess_regexps = [
        (re.compile(r'Advertisement >>', re.IGNORECASE | re.DOTALL), lambda match: ''),
    ]
    
    extra_css = '''
        body { text-align: justify; font-family: Arial, Helvetica, sans-serif; font-size: 11px; font-size-adjust: none; font-stretch: normal; font-style: normal; font-variant: normal; font-weight: normal; }
        h1 { font-size: 16px; }
    '''
    auto_cleanup = True
    #auto_cleanup_keep = '//div[@class="article-attr"]'
    auto_cleanup_keep = '//a[@class="published"]'
    
    
    
    feeds = [
        (u'News', u'http://www.mirror.co.uk/news/rss.xml'),
        # example of a commented-out feed that is not needed:
        # (u'Travel', u'http://www.mirror.co.uk/advice/travel/rss.xml'),
    ]

    def postprocess_html(self, soup, first):
        # Convert every image that has a src attribute to grayscale, in place.
        for tag in soup.findAll('img', src=True):
            iurl = tag['src']
            img = Image()
            img.open(iurl)
            if img < 0:
                raise RuntimeError('Out of memory')
            img.type = "GrayscaleType"
            img.save(iurl)
        return soup
Old 11-06-2011, 10:08 AM   #2
kovidgoyal
creator of calibre
Posts: 43,858
Karma: 22666666
Join Date: Oct 2006
Location: Mumbai, India
Device: Various
auto_cleanup_keep will typically fail if you put it on a low level element like an <a> tag. Instead, find the <div> that the <a> is in and try keeping that.
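
To illustrate the suggestion, here is a minimal sketch that walks up from the <a class="published"> link in a trimmed copy of the markup from the first post and builds one XPath per enclosing <div>. The results are joined with |, the XPath union operator, so both containers are named in a single auto_cleanup_keep value. ElementTree is used only to keep the example self-contained. FRAGMENT and enclosing_div_xpaths are hypothetical names, and whether auto_cleanup then retains the date is not guaranteed.

```python
import xml.etree.ElementTree as ET

# Trimmed, well-formed copy of the markup from the first post.
FRAGMENT = """
<div class="article-attr">
  <div class="byline append-1">
    <a class="published" href="http://www.mirror.co.uk/news/latest/2011/11/06/">6/11/2011</a>
  </div>
</div>
"""


def enclosing_div_xpaths(fragment, link_class):
    """Return one XPath per <div> that encloses the link, innermost first."""
    root = ET.fromstring(fragment)
    # ElementTree has no parent pointers, so build a child -> parent map.
    parent_of = {child: parent for parent in root.iter() for child in parent}
    link = root.find(".//a[@class='%s']" % link_class)
    xpaths = []
    node = parent_of.get(link)
    while node is not None:
        if node.tag == 'div' and node.get('class'):
            xpaths.append('//div[@class="%s"]' % node.get('class'))
        node = parent_of.get(node)
    return xpaths


# Join the candidates into a single union expression.
auto_cleanup_keep = '|'.join(enclosing_div_xpaths(FRAGMENT, 'published'))
print(auto_cleanup_keep)
```

Each candidate can also be tried on its own, which makes it easier to see which container, if any, the cleanup step respects.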
Old 11-06-2011, 11:06 AM   #3
scissors
Quote:
Originally posted by kovidgoyal:
auto_cleanup_keep will typically fail if you put it on a low level element like an <a> tag. Instead, find the <div> that the <a> is in and try keeping that.

Hi Kovid.

I tried the div above it, its parent, and both together. No good.

Also, I thought that using * as in

auto_cleanup_keep = '//*[@class="important"]'

meant that every matching element would be kept, regardless of which tag the class is attached to.

Also, is

preprocess_regexps = [
(re.compile(r'Advertisement >>', re.IGNORECASE | re.DOTALL), lambda match: '')]

not deleting instances of "Advertisement >>" because auto_cleanup overrides it?

Can you run auto_cleanup and then clean up manually any stray elements that get through?

+++++++++++++++++

BTW, the whole reason I've gone down this path is that I discovered the paragraph after the first image in an article is displayed to the right of the image (in the original Daily Mirror recipe). On my prs300 it ends up "displayed" off screen. I can't find a way to insert a line break (CRLF) after the image and before the image caption.
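
One way to force a line break after each image, sketched under assumptions: rewrite the raw HTML with a regular expression so that every <img> tag is followed by <br />. The IMG_TAG pattern, the break_after_images helper and the sample markup are all hypothetical, and this has not been tested against the Daily Mirror pages or in combination with auto_cleanup.

```python
import re

# Matches a complete <img ...> tag, including the self-closing form.
IMG_TAG = re.compile(r'(<img\b[^>]*>)', re.IGNORECASE)


def break_after_images(html):
    """Insert a <br /> immediately after every <img> tag."""
    return IMG_TAG.sub(r'\1<br />', html)


# Hypothetical sample markup: an image followed directly by its caption.
sample = '<p><img src="photo.jpg" alt="x">Caption text</p>'
print(break_after_images(sample))
```

In a recipe, the same pattern could be added to preprocess_regexps as (IMG_TAG, lambda m: m.group(1) + '<br />').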
Old 11-06-2011, 01:32 PM   #4
kovidgoyal
stick

img { display:block}

in the extra_css

and set

conversion_options = { 'linearize_tables' : True }
Old 11-06-2011, 02:35 PM   #5
scissors
Quote:
Originally posted by kovidgoyal:
stick

img { display:block}

in the extra_css

and set

conversion_options = { 'linearize_tables' : True }

No, still the same.
Old 11-06-2011, 09:13 PM   #6
kovidgoyal
Use the --debug-pipeline option and post one of the downloaded HTML files that displays this issue (also add no_stylesheets = True to your recipe).
All times are GMT -4.