#1351 | Guru
Posts: 800 | Karma: 194644 | Join Date: Dec 2007 | Location: Argentina | Device: Kindle Voyage

New recipe for Digital Spy UK:

#1352 | Member
Posts: 12 | Karma: 42 | Join Date: Jan 2010 | Device: Kindle

Code:
keep_only_tags = [dict(attrs={'class':['print-title','print-subtitle','print-author','print-date-issue','print-content']})]

I put this in the recipe and it worked very nicely. However, the author and date are not coming through. Do I need to add something else?

Denny

#1353 | Guru
Posts: 800 | Karma: 194644 | Join Date: Dec 2007 | Location: Argentina | Device: Kindle Voyage

OK, try this one:

Code:
keep_only_tags = [dict(attrs={'class':['print-title','print-subtitle','print-author','author','print-date','print-date-issue','print-content']})]
	
#1354 | Member
Posts: 12 | Karma: 42 | Join Date: Jan 2010 | Device: Kindle

Brilliant. That worked. Thank you.

BTW, what's the best method to capture the cover image when the URL changes each time? In this case the URL includes the volume number, issue number, and the date.

Denny
	
#1355 | US Navy, Retired
Posts: 9,897 | Karma: 13806776 | Join Date: Feb 2009 | Location: North Carolina | Device: Icarus Illumina XL HD, Kindle PaperWhite SE 11th Gen

Quote:

Code:
masthead_url = 'http://www.weeklystandard.com/sites/all/themes/weeklystandard/images/logo_red.png'
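
A masthead_url puts the logo at the top of the Kindle edition. For a cover whose URL changes with every issue, one option is to override get_cover_url() and scrape the current cover image from the site instead of rebuilding the volume/issue/date URL by hand. A minimal sketch; the img class name below is a guess, not the site's real markup:

Code:
from calibre.web.feeds.recipes import BasicNewsRecipe

class WeeklyStandardSketch(BasicNewsRecipe):
    title = 'Weekly Standard (sketch)'

    def get_cover_url(self):
        # Scrape the cover from the front page rather than hard-coding
        # a volume/issue/date URL; the 'cover' class is hypothetical.
        soup = self.index_to_soup('http://www.weeklystandard.com/')
        img = soup.find('img', attrs={'class': 'cover'})
        if img and img.get('src'):
            return img['src']
        return None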
	
#1356 | Member
Posts: 12 | Karma: 42 | Join Date: Jan 2010 | Device: Kindle

I had included "print-logo" in the recipe, which shows the logo at the beginning of each article, but that's a nice way to include it just once at the beginning on the Kindle.

Thanks, Denny
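
If the per-article logo is no longer wanted once the masthead carries it, a remove_tags entry can filter it out. A minimal sketch using the "print-logo" class name mentioned above:

Code:
# Drop the per-article logo block; masthead_url already supplies the logo.
remove_tags = [dict(attrs={'class': ['print-logo']})]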
	
#1357 | US Navy, Retired
Posts: 9,897 | Karma: 13806776 | Join Date: Feb 2009 | Location: North Carolina | Device: Icarus Illumina XL HD, Kindle PaperWhite SE 11th Gen

When you zip it up to send to this forum, include the icon in the zip. I've attached it for you.
	
#1358 | Connoisseur
Posts: 59 | Karma: 4212 | Join Date: Feb 2010 | Device: Sony

Topeka Capital Journal recipe

Hello,

I am totally new to the ebook world and trying to learn. I would like to have a recipe for the Topeka Capital Journal (http://cjonline.com/). I tried the "easy" way, but all I can get is garbage. Thank you for any help you can provide!

Gianfranco
	
#1359 | Guru
Posts: 800 | Karma: 194644 | Join Date: Dec 2007 | Location: Argentina | Device: Kindle Voyage

New recipe for the Topeka Capital Journal:
	
#1360 | Member
Posts: 12 | Karma: 42 | Join Date: Jan 2010 | Device: Kindle

Walt,

1. Why include the icon?
2. I'm having trouble copying my recipe from calibre to Notepad. The indents change and the recipe won't work when it's copied back to calibre.

Denny
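
On the indents (point 2 above): Python recipes are whitespace-sensitive, so an editor that converts tabs to spaces (or the reverse) can silently break one. A quick way to spot stray tabs in a saved recipe file; the file name is hypothetical:

Code:
# Flag any line containing a tab; mixed tabs and spaces are what
# usually break a recipe after a round-trip through an editor.
with open('my_recipe.recipe') as f:
    for num, line in enumerate(f, 1):
        if '\t' in line:
            print('tab on line %d' % num)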
	
#1361 | onlinenewsreader.net
Posts: 332 | Karma: 10143 | Join Date: Dec 2009 | Location: Phoenix, AZ & Victoria, BC | Device: Kindle 3, Kindle Fire, iPad 3, iPhone 4, Playbook, HTC Inspire

The Register (biting the hand that feeds IT)

Recipe for The Register -- a UK Information Technology news site.

Code:
#!/usr/bin/env python
__license__   = 'GPL v3'
__copyright__ = '2010, Nick Redding'
'''
www.theregister.co.uk
'''
import string, re
from calibre import strftime
from calibre.web.feeds.recipes import BasicNewsRecipe
from calibre.ebooks.BeautifulSoup import BeautifulSoup
from datetime import timedelta, datetime, date
class TheRegister(BasicNewsRecipe):
    title = u'The Register'
    language = 'en_GB'
    __author__ = 'Nick Redding'
    oldest_article = 2
    timefmt = '' # '[%b %d]'
    needs_subscription = False
    keep_only_tags = [dict(name='div', attrs={'id':'article'})]
    #remove_tags_before = []
    remove_tags = [
        {'id':['related-stories','ad-mpu1-spot']},
        {'class':['orig-url','article-nav','wptl btm','wptl top']}
        ]
    #remove_tags_after = []
    no_stylesheets = True
    extra_css = '''
                h2 {font-size: x-large; }
                h3 {font-size: large; font-weight: bold; }
                .byline {font-size: x-small; }
                .dateline {font-size: x-small; }
                '''
    def get_browser(self):
        br = BasicNewsRecipe.get_browser()
        return br
    def get_masthead_url(self):
        masthead = 'http://www.theregister.co.uk/Design/graphics/std/logo_414_80.png'
        br = BasicNewsRecipe.get_browser()
        try:
            br.open(masthead)
        except:
            self.log("\nMasthead unavailable")
            masthead = None
        return masthead
    def preprocess_html(self,soup):
        # this removes the explicit url after links
        for span_tag in soup.findAll('span','URL'):
            span_tag.previous.replaceWith(re.sub(r" \($","",self.tag_to_string(span_tag.previous)))
            span_tag.next.next.replaceWith(re.sub(r"^\)","",self.tag_to_string(span_tag.next.next)))
            span_tag.extract()
        return soup
                                   
    # Build the section lists by scraping each section's index page
    # directly instead of relying on RSS feeds.
    def parse_index(self):
        # Convert a dateline like '15 feb' into a date; the year is
        # assumed to be the current year.
        def decode_date(datestr):
            udate = datestr.strip().lower().split()
            m = ['jan','feb','mar','apr','may','jun','jul','aug','sep','oct','nov','dec'].index(udate[1])+1
            d = int(udate[0])
            y = date.today().year
            return date(y,m,d)
        articles = {}
        key = None
        ans = []
        def parse_index_page(page_name,page_title):
            def article_title(tag):
                atag = tag.find('a',href=True)
                return ''.join(atag.findAll(text=True, recursive=False)).strip()
            def article_date(tag):
                t = tag.find(True, {'class' : 'date'})
                if t:
                    return ''.join(t.findAll(text=True, recursive=False)).strip()
                return ''
            def article_summary(tag):
                t = tag.find(True, {'class' : 'standfirst'})
                if t:
                    return ''.join(t.findAll(text=True, recursive=False)).strip()
                return ''
            def article_url(tag):
                atag = tag.find('a',href=True)
                url = atag['href']
                return url
            mainurl = 'http://www.theregister.co.uk'
            soup = self.index_to_soup(mainurl+page_name)
            # Find each instance of class="section-headline", class="story", class="story headline"
            for div in soup.findAll('div',attrs={'class':re.compile('^story-ref')}):
                # div contains all article data
                # check if article is too old
                datetag = div.find('span','date')
                if datetag:
                    dateline_string = self.tag_to_string(datetag,False)
                    a_date = decode_date(dateline_string)
                    earliest_date = date.today() - timedelta(days=self.oldest_article)
                    if a_date < earliest_date:
                        self.log("Skipping article dated %s" % dateline_string)
                        continue
                url = article_url(div)
                if 'http' in url:
                    # skip absolute (off-site) links; relative article URLs
                    # are turned into print-version URLs below
                    continue
                url = mainurl + url + 'print.html'
                self.log("URL %s" % url)
                title = article_title(div)
                self.log("Title %s" % title)
                pubdate = article_date(div)
                self.log("Date %s" % pubdate)
                description = article_summary(div)
                self.log("Description %s" % description)
                author = ''
                if not articles.has_key(page_title):
                    articles[page_title] = []
                articles[page_title].append(dict(title=title,url=url,date=pubdate,description=description,author=author,content=''))
        parse_index_page('','Front Page')
        ans.append('Front Page')
        parse_index_page('/hardware','Hardware')
        ans.append('Hardware')
        parse_index_page('/software','Software')
        ans.append('Software')
        parse_index_page('/music_media','Music & Media')
        ans.append('Music & Media')
        parse_index_page('/networks','Networks')
        ans.append('Networks')
        parse_index_page('/security','Security')
        ans.append('Security')
        parse_index_page('/public_sector','Public Sector')
        ans.append('Public Sector')
        parse_index_page('/business','Business')
        ans.append('Business')
        parse_index_page('/science','Science')
        ans.append('Science')
        parse_index_page('/odds','Odds & Sods')
        ans.append('Odds & Sods')
        ans = [(key, articles[key]) for key in ans if articles.has_key(key)]
        return ans
 | 
	
#1362 | Connoisseur
Posts: 59 | Karma: 4212 | Join Date: Feb 2010 | Device: Sony

Wow! Thanks!
	
#1363 | US Navy, Retired
Posts: 9,897 | Karma: 13806776 | Join Date: Feb 2009 | Location: North Carolina | Device: Icarus Illumina XL HD, Kindle PaperWhite SE 11th Gen

Quote:

You can just paste the code in a post and wrap it in code tags (the # button in the post editor).
	
#1364 | Member
Posts: 21 | Karma: 10 | Join Date: Jul 2008 | Device: EZ Reader Pocket Pro

Thanks for the recipe; I was looking for one for this site. I tried to do it myself, but I don't know anything about programming. Just two questions: how do I change the default image? And is there a way to show the pictures of the snips saved on Read It Later (it retrieves only text)? Thank you.
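
On the default image: if that means the cover calibre assigns, a recipe can point the cover at any image by setting cover_url on the recipe class. A minimal sketch; the class name and URL are placeholders:

Code:
from calibre.web.feeds.recipes import BasicNewsRecipe

class ReadItLaterSketch(BasicNewsRecipe):
    title = 'Read It Later (sketch)'
    # Placeholder URL; point this at the image to use as the cover.
    cover_url = 'http://example.com/my_cover.png'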
	
#1365 | Junior Member
Posts: 3 | Karma: 10 | Join Date: Jan 2010 | Device: none

Thanks for the tip; it works 70% of the time. The problem is with RSS feeds. Occasionally I want to use an RSS feed from a blog or a discussion board, and my fetch may not repeat more than once. The Instapaper solution on an RSS feed will not work, as I cannot ask calibre to do a recursive get from the Instapaper recipe.
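
For following links beyond the articles themselves, BasicNewsRecipe has recursions (how many levels of links to follow from article pages) and match_regexps (which links qualify). A minimal sketch; the feed URL and pattern are placeholders:

Code:
from calibre.web.feeds.recipes import BasicNewsRecipe

class RecursiveFeedSketch(BasicNewsRecipe):
    title = 'Recursive Feed (sketch)'
    # Placeholder feed; any RSS URL works here.
    feeds = [('Blog', 'http://example.com/feed.rss')]
    recursions = 1  # follow links found on article pages, one level deep
    # Only follow links whose URL matches this placeholder pattern.
    match_regexps = [r'example\.com/\d+/comments']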