Quote:
Originally Posted by sk1
I am not sure whether using a recipe (or calibre) is the best or correct way to go about this, and would love some guidance. Thanks much!
|
I've took a look at the web site using Firebug. I just hat to modify an old recipe for a newspaper to have it work with the Tennessee rules page.
Your idea is right. I've used the rule number an the optional text as a 'ressort' and each sub-rule as a 'article' inside the 'ressort'.
Here it is:
Spoiler:
Code:
from calibre.web.feeds.recipes import BasicNewsRecipe
import re
class TennesseeRulesCivilProcedure(BasicNewsRecipe) :
__author__ = 'ape'
__copyright__ = 'ape'
__license__ = 'GPL v3'
language = 'de'
description = 'Tennessee State Courts: Rules of civil procedure'
version = 1
title = u'TN Courts: Rules of civil procedure'
timefmt = ' [%d.%m.%Y]'
no_stylesheets = True
remove_javascript = True
use_embedded_content = False
publication_type = 'newspaper'
keep_only_tags = [dict(name='div', attrs={'id':'main-content'})]
INDEX = 'http://www.tncourts.gov/courts/supreme-court/rules/rules-civil-procedure/'
def parse_index(self):
base = 'http://www.tncourts.gov'
ressorts = []
articles = {}
more = 1
soup = self.index_to_soup(self.INDEX)
# Get list of links to ressorts from index page
rules = soup.findAll('table', attrs={'class': re.compile('views-table')})
for rule in rules:
caption = rule.findAll('caption')[0].string
articles[caption] = []
ressorts.append(caption)
sub_rules = rule.findAll('td', attrs={'class': re.compile('views-field')})
for sub_rule in sub_rules:
article = {'title': sub_rule.contents[0].strip() + ' ' + sub_rule.a.string, 'date': u'', 'url': base + sub_rule.a['href'], 'description': sub_rule.a.string}
articles[caption].append(article)
answer = [(ressort, articles[ressort]) for ressort in ressorts if articles.has_key(ressort)]
# answer structure:
# [('genre1', [{'date': ..., 'url': ..., 'description': ..., 'title': ...},
# {'date': ..., 'url': ..., 'description': ..., 'title': ...}]),
# ('genre2', [{'date': ..., 'url': ..., 'description': ..., 'title': ...}])]
# List[ Tuple( genre, liste[{artikel},...]), Tuple( genre, liste[{artikel},...])]
return answer
def get_masthead_url(self):
return 'http://www.tncourts.gov/sites/all/themes/tncourts/assets/images/logo-new.png'
The file is here:
TennesseeRulesCivilProcedure.recipe.txt.
I hope that you don't mind that i've done the whole recipe for you. Don't hesitate to ask any questions.