#1
Zealot
Posts: 144 | Karma: 706 | Join Date: Oct 2011 | Device: Sony Reader PRS-T1
New recipe creation help
I would like to create a recipe for a site whose RSS feed does not contain the required content.

Basically, I would like a recipe that does the following: get the following links from the site for a 60-day period, starting from today, each time changing the date part at the end. Create an index page for each day containing links to the 7 pages. Finally, create a table-of-contents-style page for the days. I've never created a recipe before and am willing to learn. I would be grateful for any help.

Last edited by entodoays; 06-18-2014 at 01:17 PM. Reason: Added more info
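For illustration, a rough sketch of the date loop being described here, assuming the aelf.org office pages and the ?desktop=1&date_my=DD/MM/YYYY query format that appear later in this thread:

Code:
# Sketch only: for each of 60 days starting today, build the links to the
# daily office pages by changing the date parameter at the end of the URL.
import datetime

OFFICES = ['messe', 'laudes', 'lectures', 'tierce', 'sexte', 'none', 'vepres', 'complies']
today = datetime.date.today()

for offset in range(60):
    day = today + datetime.timedelta(days=offset)
    date_param = '%d/%d/%d' % (day.day, day.month, day.year)
    links = ['http://www.aelf.org/office-%s?desktop=1&date_my=%s' % (office, date_param)
             for office in OFFICES]
    # each day's index page would then link to these pages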
#2
Zealot
Posts: 144 | Karma: 706 | Join Date: Oct 2011 | Device: Sony Reader PRS-T1
A start
Since I'm new to recipe creation, I started off with an example in the calibre documentation and began modifying it. My first objective is to get a single page to parse correctly. Once I manage that, I'll try to add further steps. My recipe looks as follows:

Code:
import string, re
from calibre import strftime
from calibre.web.feeds.recipes import BasicNewsRecipe
from calibre.ebooks.BeautifulSoup import BeautifulSoup

class NYTimes(BasicNewsRecipe):

    title = 'Liturgie des Heures'
    __author__ = 'Chris Vella'
    description = 'La liturgie des heures'
    timefmt = ' [%a, %d %b, %Y]'
    needs_subscription = False

    remove_tags_before = dict(name='h1')
    remove_tags_after = [dict(id='print_only')]
    remove_tags = [dict(attrs={'class':['articleTools', 'post-tools', 'side_tool', 'nextArticleLink clearfix']}),
                   dict(id=['menuHorizontal', 'colonneDroite', 'niveau', 'don', 'font-resize', 'print_link']),
                   dict(name=['script', 'noscript', 'style'])]
    encoding = 'utf8'
    no_stylesheets = True
    extra_css = 'h1 {font: sans-serif large;}\n.byline {font:monospace;}'

    def parse_index(self):
        soup = self.index_to_soup('www.aelf.org/office-messe\?desktop=1&date_my=%d/%b/%Y')

        def feed_title(div):
            return ''.join(div.findAll(text=True, recursive=False)).strip()

        articles = {}
        key = None
        ans = []
        for div in soup.findAll(True, attrs={'class':['current']}):

            if div['class'] == 'current':
                key = string.capwords(feed_title(div))
                articles[key] = []
                ans.append(key)

            elif div['class'] in ['current']:
                a = div.find('a', href=True)
                if not a:
                    continue
                url = re.sub(r'\?.*', '', a['href'])
                url += '?pagewanted=all'
                title = self.tag_to_string(a, use_alt=True).strip()
                description = ''
                pubdate = strftime('%a, %d %b')
                summary = div.find(True, attrs={'class':'summary'})
                if summary:
                    description = self.tag_to_string(summary, use_alt=False)

                feed = key if key is not None else 'Uncategorized'
                if not articles.has_key(feed):
                    articles[feed] = []
                if not 'podcasts' in url:
                    articles[feed].append(
                        dict(title=title, url=url, date=pubdate,
                             description=description, content=''))

        ans = self.sort_index_by(ans, {'The Front Page':-1, 'Dining In, Dining Out':1, 'Obituaries':2})
        ans = [(key, articles[key]) for key in ans if articles.has_key(key)]
        return ans

    def preprocess_html(self, soup):
        refresh = soup.find('meta', {'http-equiv':'refresh'})
        if refresh is None:
            return soup
        content = refresh.get('content').partition('=')[2]
        raw = self.browser.open('http://www.nytimes.com'+content).read()
        return BeautifulSoup(raw.decode('utf8', 'replace'))
Thanks.
#3
Zealot
Posts: 144 | Karma: 706 | Join Date: Oct 2011 | Device: Sony Reader PRS-T1
Working Python script
I wrote a Python script which downloads the pages and places them in separate folders by date. It creates an index file for each day, then cleans the pages using BeautifulSoup. Can anyone help me transform it into a recipe? Here's the script:

Code:
#!/bin/python
import datetime, os, urllib, re
from urllib import urlopen
from bs4 import BeautifulSoup

now = datetime.datetime.now()       # Get today's date
os.chdir(os.environ['HOME'])        # Go to home folder
Base_folder = r'Breviaire_%s-%s-%s' % (now.day, now.month, now.year)  # All files will be stored in this date-stamped folder
if not os.path.exists(Base_folder):
    os.makedirs(Base_folder)        # Create a folder with today's date
os.chdir(Base_folder)               # Go to the freshly created folder

idx = (now.weekday() + 1) % 7                  # Get the day of the week
Base_date = now + datetime.timedelta(7-idx)    # Get this Sunday's date
next_date = Base_date

# Download the files for x days
for i in range(0, 4):
    next_folder = r'%s-%s-%s' % (next_date.year, next_date.month, next_date.day)
    if not os.path.exists(next_folder):
        os.makedirs(next_folder)
    os.chdir(next_folder)
    site_date = "%s/%s/%s" % (next_date.day, next_date.month, next_date.year)

    next_link = "http://www.aelf.org/office-messe?desktop=1&date_my=%s" % (site_date)
    urllib.urlretrieve(next_link, filename="0_Messe.html")
    laudes_link = "http://www.aelf.org/office-laudes?desktop=1&date_my=%s" % (site_date)
    urllib.urlretrieve(laudes_link, filename="1_Laudes.html")
    lectures_link = "http://www.aelf.org/office-lectures?desktop=1&date_my=%s" % (site_date)
    urllib.urlretrieve(lectures_link, filename="2_Lectures.html")
    tierce_link = "http://www.aelf.org/office-tierce?desktop=1&date_my=%s" % (site_date)
    urllib.urlretrieve(tierce_link, filename="3_Tierce.html")
    sexte_link = "http://www.aelf.org/office-sexte?desktop=1&date_my=%s" % (site_date)
    urllib.urlretrieve(sexte_link, filename="4_Sexte.html")
    none_link = "http://www.aelf.org/office-none?desktop=1&date_my=%s" % (site_date)
    urllib.urlretrieve(none_link, filename="5_None.html")
    vepres_link = "http://www.aelf.org/office-vepres?desktop=1&date_my=%s" % (site_date)
    urllib.urlretrieve(vepres_link, filename="6_Vepres.html")
    complies_link = "http://www.aelf.org/office-complies?desktop=1&date_my=%s" % (site_date)
    urllib.urlretrieve(complies_link, filename="7_Complies.html")

    # Extract ordo
    html_doc = urlopen(next_link).read()
    soup = BeautifulSoup(html_doc)
    ordo_text = soup.find("div", {"class": "bloc"})
    text_file = open("index.html", "w")
    for hidden in ordo_text.find_all(id='maBulle'):
        hidden.decompose()
    part1 = """
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
</head>
<body>
"""
    part3 = """
<div><a href="0_Messe.html">Messe</a> | <a href="1_Laudes.html">Laudes</a> | <a href="2_Lectures.html">Lectures</a> | <a href="3_Tierce.html">Tierce</a> | <a href="4_Sexte.html">Sexte</a> | <a href="5_None.html">None</a> | <a href="6_Vepres.html">Vepres</a> | <a href="7_Complies.html">Complies</a>
<br><br>
</div>
<div style="text-align: center;"><a href="../index.html">Retour</a></div></body>
</html>
"""
    joined = "%s<h2>%s</h2>%s%s" % (part1, site_date, ordo_text, part3)
    text_file.write(joined)
    text_file.close()

    # Clean pages
    for filename in os.listdir('.'):
        if re.match(r'\d.*', filename):
            messy = open(filename, "r")
            soup = BeautifulSoup(messy)
            messy.close()
            for remove in soup.find_all(attrs={'class':['clr', 'goTop', 'print_only', 'change_country', 'abonnement', 'current', 'bloc', 'degre', 'base']}):
                remove.decompose()
            for remove in soup.find_all(id=['copyright', 'bas', 'menuHorizontal', 'colonneDroite', 'colonneGauche', 'font-resize', 'print_link', 'titre']):
                remove.decompose()
            cleaned = str(soup)
            output_file = open(filename, "w")
            output_file.write(cleaned)

    # Go to parent folder and add 1 day
    os.chdir("..")
    next_date = Base_date + datetime.timedelta(days=i)
#4
creator of calibre
Posts: 45,342 | Karma: 27182818 | Join Date: Oct 2006 | Location: Mumbai, India | Device: Various
You need to return a list of sections and articles from parse_index(); the code you posted is not returning anything. Read the API docs for parse_index(): http://manual.calibre-ebook.com/news...pe.parse_index
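In other words, the method has to produce a list of (section title, list of article dictionaries) tuples. A schematic sketch of that shape (placeholder values, not code from the thread):

Code:
# Schematic only: parse_index() must return a list of
# (section title, list of article dictionaries) tuples.
sections = [
    ('Section title', [
        {'title': 'Article title', 'url': 'http://...', 'date': '',
         'description': '', 'content': ''},
    ]),
]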
#5
Zealot
Posts: 144 | Karma: 706 | Join Date: Oct 2011 | Device: Sony Reader PRS-T1
Thanks Kovid,

My Python script works independently of calibre for the time being. It:
- downloads the pages for each office and places them in separate folders by date
- creates an index file for each day linking to those pages
- cleans the pages using BeautifulSoup
- creates a general index for the whole period

Then I can import the general index into calibre and create an epub. To transform this script into a recipe I have to change the folder creation and file downloading bits. My question is: is it possible to avoid using the normal "Section menu" news structure and replace it with my custom index page structure? Please be patient; this is my first ever Python script and first ever recipe. Thanks. The attached epub is the intended result.

Last edited by entodoays; 06-25-2014 at 11:46 AM.
#6
creator of calibre
Posts: 45,342 | Karma: 27182818 | Join Date: Oct 2006 | Location: Mumbai, India | Device: Various
If you already have a script to create your epub, why do you want a recipe? Just run your script using cron and use the calibredb command to add the resulting epub to calibre.
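For illustration, one possible way to glue the pieces together from a scheduled job, assuming the download script from post #3 is saved as breviaire.py and the calibre command-line tools are on the PATH; the file names here are hypothetical placeholders, while ebook-convert and calibredb add are the real calibre CLI tools:

Code:
# Hypothetical cron glue (placeholder paths, not the poster's actual setup):
# 1. run the download/clean script
# 2. convert its top-level index.html into an EPUB with ebook-convert
# 3. add the EPUB to the calibre library with calibredb
import subprocess

subprocess.check_call(['python', 'breviaire.py'])
subprocess.check_call(['ebook-convert', 'index.html', 'Breviaire.epub',
                       '--title', 'Breviaire'])
subprocess.check_call(['calibredb', 'add', 'Breviaire.epub'])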
#7
Zealot
Posts: 144 | Karma: 706 | Join Date: Oct 2011 | Device: Sony Reader PRS-T1
I want the script to be OS independent. I'm working on Linux but would like to share it with others on Windows or Mac.
#8
creator of calibre
Posts: 45,342 | Karma: 27182818 | Join Date: Oct 2006 | Location: Mumbai, India | Device: Various
Then you will have to figure out how to convert it to a recipe; I'm afraid I don't have the time to help you do that. The recipe API is extensively documented.
#9
Zealot
Posts: 144 | Karma: 706 | Join Date: Oct 2011 | Device: Sony Reader PRS-T1
I'm new to Python and to recipe building, so I'll try to create a normal news recipe instead, since I'm not familiar enough with Python.

What I would need to do to my script is to

The rest should remain pretty much the same. Can you please give me a simple recipe that creates a section with one article from a single link (http://www.aelf.org/office-lectures?...my=22/6/2014)? I think I could build up from that.

Last edited by entodoays; 06-25-2014 at 01:23 PM. Reason: Asking for help
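For what it's worth, a minimal sketch of such a single-article recipe (not an answer given in the thread; the full URL is assumed to follow the ?desktop=1&date_my=DD/MM/YYYY pattern used elsewhere in this discussion):

Code:
from calibre.web.feeds.news import BasicNewsRecipe

class OfficeDuJour(BasicNewsRecipe):
    # Hypothetical minimal recipe: one section containing one article,
    # built from a single hard-coded aelf.org link.
    title = 'Office du jour'
    language = 'fr'
    no_stylesheets = True

    def parse_index(self):
        article = {
            'title': 'Lectures',
            'url': 'http://www.aelf.org/office-lectures?desktop=1&date_my=22/6/2014',
            'date': '22/6/2014',
            'description': 'Office des lectures',
            'content': '',
        }
        return [('Liturgie des Heures', [article])]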
#10
Zealot
Posts: 144 | Karma: 706 | Join Date: Oct 2011 | Device: Sony Reader PRS-T1
My first "working" recipe
I managed to create a recipe which downloads something. I started from the built-in recipe for "The Atlantic" and tried modifying it accordingly.
I'm trying to get the recipe to download all the links in http://www.aelf.org/office-laudes which are found in the following div: Code:
<div class="bloc" onMouseOver="mabulle.hide()"> <ul> <li class=""> > <a href="/office-messe">Lecture de la messe</a></li> <li class="current"> > Liturgie des heures <ul> <li class=""> > <a href="/office-lectures">Lectures</a></li> <li class="current"> > <a href="/office-laudes">Laudes</a></li> <li class=""> > <a href="/office-tierce">Tierce</a></li> <li class=""> > <a href="/office-sexte">Sexte</a></li> <li class=""> > <a href="/office-none">None</a></li> <li class=""> > <a href="/office-vepres">Vêpres</a></li> <li class=""> > <a href="/office-complies">Complies</a></li> </ul> </li> </ul> </div> The following is my recipe: Code:
#!/usr/bin/env python
__license__ = 'GPL v3'
__copyright__ = '2008, Kovid Goyal <kovid at kovidgoyal.net>'
'''
aelf.org
Liturgie des heures
'''
import re, datetime
from calibre.web.feeds.news import BasicNewsRecipe
from calibre.ebooks.BeautifulSoup import Tag, NavigableString

now = datetime.datetime.now()                  # Get today's date
idx = (now.weekday() + 1) % 7                  # Get the day of the week
Base_date = now + datetime.timedelta(7-idx)    # Get this Sunday's date
next_date = Base_date
site_date = "%s/%s/%s" % (next_date.day, next_date.month, next_date.year)

class AELF(BasicNewsRecipe):

    title = 'AELF'
    __author__ = 'Kovid Goyal and Sujata Raman'
    description = 'Liturgie des Heures'
    INDEX = "http://www.aelf.org/office-laudes?desktop=1&date_my=%s" % (site_date)

    language = 'fr'
    """keep_only_tags = [{'attrs':{'class':['bloc']}}]"""
    remove_tags = [dict(attrs={'class':['clr', 'goTop', 'print_only', 'change_country', 'abonnement', 'current', 'bloc', 'degre', 'bas']})]
    no_stylesheets = True

    def parse_index(self):
        articles = []

        soup = self.index_to_soup(self.INDEX)
        ts = soup.find('li')
        ds = self.tag_to_string(ts.find('h2')).split(':')[-1]
        self.timefmt = ' [%s]'%ds

        cover = soup.find('img', src=True, attrs={'alt':'logo de l\'association épiscopale liturgique pour les pays francophones'})
        if cover is not None:
            self.cover_url = 'http://www.aelf.org' + cover['src']
            self.log(self.cover_url)

        feeds = []
        seen_titles = set([])
        for section in soup.findAll('div', attrs={'id':'contenu'}):
            section_title = self.tag_to_string(section.find('li', attrs={'class':''}))
            self.log('Found section:', section_title)
            articles = []
            for post in section.findAll('li'):
                a = post.find('a', href=True)
                title = self.tag_to_string(a)
                """if title in seen_titles:
                    continue"""
                seen_titles.add(title)
                url = a['href']
                if url.startswith('/'):
                    url = 'http://www.aelf.org'+url+'?desktop=1&date_my=%s' % (site_date)
                p = post.parent.find('p', attrs={'class':'current'})
                desc = None
                self.log('\tFound article:', title, 'at', url)
                if p is not None:
                    desc = self.tag_to_string(p)
                    self.log('\t\t', desc)
                articles.append({'title':title, 'url':url, 'description':desc, 'date':''})
            if articles:
                feeds.append((section_title, articles))

        rightContent = soup.find('div', attrs={'class':['bloc']})
        for module in rightContent.findAll('li', attrs={'class':['']}):
            section_title = self.tag_to_string(INDEX.find('h1'))
            articles = []
            for post in module.findAll('div'):
                a = post.find('a', href=True)
                title = self.tag_to_string(a)
                if title in seen_titles:
                    continue
                seen_titles.add(title)
                url = a['href']
                if url.startswith('/'):
                    url = 'http://www.aelf.org'+url
                p = post.parent.find('p')
                desc = None
                self.log('\tFound article:', title, 'at', url)
                if p is not None:
                    desc = self.tag_to_string(p)
                    self.log('\t\t', desc)
                articles.append({'title':title, 'url':url, 'description':desc, 'date':''})
            if articles:
                feeds.append((section_title, articles))
        return feeds

    def postprocess_html(self, soup, first):
        for table in soup.findAll('table', align='right'):
            img = table.find('img')
            if img is not None:
                img.extract()
                caption = self.tag_to_string(table).strip()
                div = Tag(soup, 'div')
                div['style'] = 'text-align:center'
                div.insert(0, img)
                div.insert(1, Tag(soup, 'br'))
                if caption:
                    div.insert(2, NavigableString(caption))
                table.replaceWith(div)
        return soup
#11
Zealot
Posts: 144 | Karma: 706 | Join Date: Oct 2011 | Device: Sony Reader PRS-T1
I'm trying to use parse_index by giving a manual list of pages to include.
The documentation describes the article-dictionary structure that parse_index() should return, so I tried building the articles by hand:

Code:
# One article from one page
messe = dict()
messe['title'] = 'Lectures de la Messe'
messe['url'] = 'http://www.aelf.org/office-messe?desktop=1&date_my=%s' % (site_date)
messe['date'] = site_date
messe['description'] = 'LECTURES'

# One article from another page
laudes = dict()
laudes['title'] = 'Laudes'
laudes['url'] = 'http://www.aelf.org/office-laudes?desktop=1&date_my=%s' % (site_date)
laudes['date'] = site_date
laudes['description'] = 'LAUDES'

# Create list of two articles
list_of_articles = [messe, laudes]

# Give the list a feed title
feed_title = str(book_date)

# What to give to parse_index
self = [feed_title, list_of_articles]

As an alternative: my script does download files to disk and creates a series of index files. Can I simply pass the main index file as the index?
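A minimal sketch (not from the thread) of how dictionaries like the ones above would actually be handed to calibre: parse_index() is a method of the recipe class, and the list of (feed title, article list) tuples is its return value rather than something assigned to self:

Code:
    def parse_index(self):
        # messe, laudes and site_date built as in the snippet above; the
        # method must return a list of (feed title, article list) tuples.
        list_of_articles = [messe, laudes]
        feed_title = str(site_date)
        return [(feed_title, list_of_articles)]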
Tags: calibre, help needed, recipe