View Single Post
Old 01-30-2012, 10:01 AM   #2
a.peter
Enthusiast
a.peter began at the beginning.
 
Posts: 28
Karma: 10
Join Date: Sep 2011
Device: Sony PRS-350, Kindle Touch
Quote:
Originally Posted by sk1 View Post
I am not sure whether using a recipe (or calibre) is the best or correct way to go about this, and would love some guidance. Thanks much!
I've took a look at the web site using Firebug. I just hat to modify an old recipe for a newspaper to have it work with the Tennessee rules page.

Your idea is right. I've used the rule number an the optional text as a 'ressort' and each sub-rule as a 'article' inside the 'ressort'.

Here it is:

Spoiler:
Code:
from calibre.web.feeds.recipes import BasicNewsRecipe
import re

class TennesseeRulesCivilProcedure(BasicNewsRecipe) :
    __author__    = 'ape'
    __copyright__ = 'ape'
    __license__   = 'GPL v3'
    language      = 'de'
    description   = 'Tennessee State Courts: Rules of civil procedure'
    version       = 1
    title         = u'TN Courts: Rules of civil procedure'
    timefmt       = ' [%d.%m.%Y]' 

    no_stylesheets = True
    remove_javascript = True
    use_embedded_content = False
    publication_type = 'newspaper'
    
    keep_only_tags = [dict(name='div', attrs={'id':'main-content'})]
    
    INDEX = 'http://www.tncourts.gov/courts/supreme-court/rules/rules-civil-procedure/'
    
    def parse_index(self):
        base = 'http://www.tncourts.gov'
        ressorts = []
        articles = {}
        more = 1
        
        soup = self.index_to_soup(self.INDEX)
        
        # Get list of links to ressorts from index page
        rules = soup.findAll('table', attrs={'class': re.compile('views-table')})
        for rule in rules:
            caption = rule.findAll('caption')[0].string
            articles[caption] = []
            ressorts.append(caption)
            sub_rules = rule.findAll('td', attrs={'class': re.compile('views-field')})
            for sub_rule in sub_rules:
                article = {'title': sub_rule.contents[0].strip() + ' ' + sub_rule.a.string, 'date': u'', 'url': base + sub_rule.a['href'], 'description': sub_rule.a.string}
                articles[caption].append(article)
        answer = [(ressort, articles[ressort]) for ressort in ressorts if articles.has_key(ressort)]
        # answer structure:
        # [('genre1', [{'date': ..., 'url': ..., 'description': ..., 'title': ...},
        #              {'date': ..., 'url': ..., 'description': ..., 'title': ...}]),
        #  ('genre2', [{'date': ..., 'url': ..., 'description': ..., 'title': ...}])]
        # List[ Tuple( genre, liste[{artikel},...]), Tuple( genre, liste[{artikel},...])]
        return answer
        
    def get_masthead_url(self):
        return 'http://www.tncourts.gov/sites/all/themes/tncourts/assets/images/logo-new.png'


The file is here: TennesseeRulesCivilProcedure.recipe.txt.

I hope that you don't mind that i've done the whole recipe for you. Don't hesitate to ask any questions.
a.peter is offline   Reply With Quote