06-18-2014, 01:15 PM   #1   entodoays
New recipe creation help

I would like to create a recipe for a site whose RSS feed does not contain the required content.

Basically, I would like a recipe that does the following:
Fetch the following links from the site for a 60-day period, starting from today, changing the date portion at the end each time:
  1. http://www.aelf.org/office-messe?des...e_my=31/1/2014
  2. http://www.aelf.org/office-laudes?de...e_my=31/1/2014
  3. http://www.aelf.org/office-lectures?...e_my=31/1/2014
  4. http://www.aelf.org/office-tierce?de...e_my=31/1/2014
  5. http://www.aelf.org/office-none?desk...e_my=31/1/2014
  6. http://www.aelf.org/office-vepres?de...e_my=31/1/2014
  7. http://www.aelf.org/office-complies?...e_my=31/1/2014
Then clean the files, leaving only the content.
Create an index page for each day containing links to the seven pages.
Finally, create a table-of-contents style page for the days (a sketch of the date handling is below).

I've never created a recipe before, but I'm willing to learn. I would be grateful for any help.
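A sketch of the date handling, assuming the full ?desktop=1&date_my=d/m/yyyy query string that appears in complete URLs later in this thread:
Code:
import datetime

# The seven offices, in the order listed above
OFFICES = ['messe', 'laudes', 'lectures', 'tierce', 'none', 'vepres', 'complies']

def dated_links(days=60):
    # Build one URL per office per day, starting from today
    links = []
    start = datetime.date.today()
    for offset in range(days):
        day = start + datetime.timedelta(days=offset)
        date_my = '%d/%d/%d' % (day.day, day.month, day.year)
        for office in OFFICES:
            links.append('http://www.aelf.org/office-%s?desktop=1&date_my=%s'
                         % (office, date_my))
    return links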

06-19-2014, 08:38 AM   #2   entodoays
A start

Since I'm new to recipe creation, I started off with an example from the calibre documentation and began modifying it.

My first objective is to get a single page to parse correctly. Once I manage that, I'll try to add further steps.

My recipe looks as follows:
Code:
import string, re
from calibre import strftime
from calibre.web.feeds.recipes import BasicNewsRecipe
from calibre.ebooks.BeautifulSoup import BeautifulSoup

class NYTimes(BasicNewsRecipe):

    title       = 'Liturgie des Heures'
    __author__  = 'Chris Vella'
    description = 'La liturgie des heures'
    timefmt = ' [%a, %d %b, %Y]'
    needs_subscription = False
    remove_tags_before = dict(name='h1')
    remove_tags_after = [dict(id='print_only')]
    remove_tags = [dict(attrs={'class':['articleTools', 'post-tools', 'side_tool', 'nextArticleLink clearfix']}),
                dict(id=['menuHorizontal', 'colonneDroite', 'niveau', 'don', 'font-resize', 'print_link']),
                dict(name=['script', 'noscript', 'style'])]
    encoding = 'utf8'
    no_stylesheets = True
    extra_css = 'h1 {font: sans-serif large;}\n.byline {font:monospace;}'

    def parse_index(self):
		soup = self.index_to_soup('www.aelf.org/office-messe\?desktop=1&date_my=%d/%b/%Y')

    def feed_title(div):
        return ''.join(div.findAll(text=True, recursive=False)).strip()

        articles = {}
        key = None
        ans = []
        for div in soup.findAll(True,
             attrs={'class':['current']}):

             if div['class'] == 'current':
                 key = string.capwords(feed_title(div))
                 articles[key] = []
                 ans.append(key)

             elif div['class'] in ['current']:
                 a = div.find('a', href=True)
                 if not a:
                     continue
                 url = re.sub(r'\?.*', '', a['href'])
                 url += '?pagewanted=all'
                 title = self.tag_to_string(a, use_alt=True).strip()
                 description = ''
                 pubdate = strftime('%a, %d %b')
                 summary = div.find(True, attrs={'class':'summary'})
                 if summary:
                     description = self.tag_to_string(summary, use_alt=False)

                 feed = key if key is not None else 'Uncategorized'
                 if not articles.has_key(feed):
                     articles[feed] = []
                 if not 'podcasts' in url:
                     articles[feed].append(
                               dict(title=title, url=url, date=pubdate,
                                    description=description,
                                    content=''))
        ans = self.sort_index_by(ans, {'The Front Page':-1, 'Dining In, Dining Out':1, 'Obituaries':2})
        ans = [(key, articles[key]) for key in ans if articles.has_key(key)]
        return ans

    def preprocess_html(self, soup):
        refresh = soup.find('meta', {'http-equiv':'refresh'})
        if refresh is None:
            return soup
        content = refresh.get('content').partition('=')[2]
        raw = self.browser.open('http://www.nytimes.com'+content).read()
        return BeautifulSoup(raw.decode('utf8', 'replace'))
Unfortunately I'm getting this output:
Quote:
Resolved conversion options
calibre version: 1.40.0
{'asciiize': False,
'author_sort': None,
'authors': None,
'base_font_size': 0,
'book_producer': None,
'change_justification': 'original',
'chapter': None,
'chapter_mark': 'pagebreak',
'comments': None,
'cover': None,
'debug_pipeline': u'debug',
'dehyphenate': True,
'delete_blank_paragraphs': True,
'disable_font_rescaling': False,
'dont_download_recipe': False,
'dont_split_on_page_breaks': True,
'duplicate_links_in_toc': False,
'embed_all_fonts': False,
'embed_font_family': None,
'enable_heuristics': False,
'epub_flatten': False,
'epub_inline_toc': False,
'epub_toc_at_end': False,
'expand_css': False,
'extra_css': None,
'extract_to': None,
'filter_css': None,
'fix_indents': True,
'flow_size': 260,
'font_size_mapping': None,
'format_scene_breaks': True,
'html_unwrap_factor': 0.4,
'input_encoding': None,
'input_profile': <calibre.customize.profiles.InputProfile object at 0x1ad6cd0>,
'insert_blank_line': False,
'insert_blank_line_size': 0.5,
'insert_metadata': False,
'isbn': None,
'italicize_common_cases': True,
'keep_ligatures': False,
'language': None,
'level1_toc': None,
'level2_toc': None,
'level3_toc': None,
'line_height': 0,
'linearize_tables': False,
'lrf': False,
'margin_bottom': 5.0,
'margin_left': 5.0,
'margin_right': 5.0,
'margin_top': 5.0,
'markup_chapter_headings': True,
'max_toc_links': 50,
'minimum_line_height': 120.0,
'no_chapters_in_toc': False,
'no_default_epub_cover': False,
'no_inline_navbars': False,
'no_svg_cover': False,
'output_profile': <calibre.customize.profiles.OutputProfile object at 0x1acf0d0>,
'page_breaks_before': None,
'prefer_metadata_cover': False,
'preserve_cover_aspect_ratio': False,
'pretty_print': True,
'pubdate': None,
'publisher': None,
'rating': None,
'read_metadata_from_opf': None,
'remove_fake_margins': True,
'remove_first_image': False,
'remove_paragraph_spacing': False,
'remove_paragraph_spacing_indent_size': 1.5,
'renumber_headings': True,
'replace_scene_breaks': '',
'search_replace': None,
'series': None,
'series_index': None,
'smarten_punctuation': False,
'sr1_replace': '',
'sr1_search': '',
'sr2_replace': '',
'sr2_search': '',
'sr3_replace': '',
'sr3_search': '',
'start_reading_at': None,
'subset_embedded_fonts': False,
'tags': None,
'test': (2, 2),
'timestamp': None,
'title': None,
'title_sort': None,
'toc_filter': None,
'toc_threshold': 6,
'toc_title': None,
'unsmarten_punctuation': False,
'unwrap_lines': True,
'use_auto_toc': False,
'verbose': 2}
1% Converting input to HTML...
InputFormatPlugin: Recipe Input running
Using custom recipe
1% Fetching feeds...
Traceback (most recent call last):
File "site.py", line 58, in main
File "site-packages/calibre/ebooks/conversion/cli.py", line 359, in main
File "site-packages/calibre/ebooks/conversion/plumber.py", line 1040, in run
File "site-packages/calibre/customize/conversion.py", line 241, in __call__
File "site-packages/calibre/ebooks/conversion/plugins/recipe_input.py", line 117, in convert
File "site-packages/calibre/web/feeds/news.py", line 992, in download
File "site-packages/calibre/web/feeds/news.py", line 1159, in build_index
File "site-packages/calibre/web/feeds/__init__.py", line 353, in feeds_from_index
TypeError: 'NoneType' object is not iterable
Can anybody help, please? I'm not familiar with Python (yet).

Thanks.
06-21-2014, 11:19 AM   #3   entodoays
Working Python script

I wrote a Python script which downloads the pages and places them in separate folders by date. It creates an index file for each day, then cleans the pages using BeautifulSoup.

Can anyone help to transform it into a recipe?

Here's the script:

Code:
#!/bin/python
import datetime, os, urllib, re
from urllib import urlopen
from bs4 import BeautifulSoup
now = datetime.datetime.now() #Get today's date
os.chdir(os.environ['HOME']) #Go to home folder
Base_folder = r'Breviaire_%s-%s-%s' % (now.day, now.month, now.year) #All files will be stored in this date-stamped folder
if not os.path.exists(Base_folder): os.makedirs(Base_folder) #Create a folder with today's date
os.chdir(Base_folder) #Go to the freshly created folder
idx = (now.weekday() + 1) % 7 #Get the day of the week
Base_date = now + datetime.timedelta(7-idx) #Get this Sunday's date
next_date = Base_date
#Download the files for each day in the window (4 days in this test run)
for i in range(0, 4):
	next_folder = r'%s-%s-%s' % (next_date.year, next_date.month, next_date.day)
	if not os.path.exists(next_folder): os.makedirs(next_folder)
	os.chdir(next_folder)
	site_date = "%s/%s/%s" % (next_date.day, next_date.month, next_date.year)
	next_link = "http://www.aelf.org/office-messe?desktop=1&date_my=%s" % (site_date)
	urllib.urlretrieve(next_link, filename="0_Messe.html")
	laudes_link = "http://www.aelf.org/office-laudes?desktop=1&date_my=%s" % (site_date)
	urllib.urlretrieve(laudes_link, filename="1_Laudes.html")
	lectures_link = "http://www.aelf.org/office-lectures?desktop=1&date_my=%s" % (site_date)
	urllib.urlretrieve(lectures_link, filename="2_Lectures.html")
	tierce_link = "http://www.aelf.org/office-tierce?desktop=1&date_my=%s" % (site_date)
	urllib.urlretrieve(tierce_link, filename="3_Tierce.html")
	sexte_link = "http://www.aelf.org/office-sexte?desktop=1&date_my=%s" % (site_date)
	urllib.urlretrieve(sexte_link, filename="4_Sexte.html")
	none_link = "http://www.aelf.org/office-none?desktop=1&date_my=%s" % (site_date)
	urllib.urlretrieve(none_link, filename="5_None.html")
	vepres_link = "http://www.aelf.org/office-vepres?desktop=1&date_my=%s" % (site_date)
	urllib.urlretrieve(vepres_link, filename="6_Vepres.html")
	complies_link = "http://www.aelf.org/office-complies?desktop=1&date_my=%s" % (site_date)
	urllib.urlretrieve(complies_link, filename="7_Complies.html")
	html_doc = urlopen(next_link).read()
	#Extract ordo
	soup = BeautifulSoup(html_doc)
	ordo_text = soup.find("div", {"class": "bloc"})
	text_file = open("index.html", "w")
	for hidden in ordo_text.find_all(id='maBulle'):
		hidden.decompose()	
	part1 = """
	<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
	<html xmlns="http://www.w3.org/1999/xhtml">
	<head>
	<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
	</head>
	<body>
	"""
	part3 = """
	<div><a href="0_Messe.html">Messe</a>&nbsp;&nbsp;|&nbsp;&nbsp;
	<a href="1_Laudes.html">Laudes</a>&nbsp;&nbsp;|&nbsp;&nbsp;
	<a href="2_Lectures.html">Lectures</a>&nbsp;&nbsp;|&nbsp;&nbsp;
	<a href="3_Tierce.html">Tierce</a>&nbsp;&nbsp;|&nbsp;&nbsp;
	<a href="4_Sexte.html">Sexte</a>&nbsp;&nbsp;|&nbsp;&nbsp;
	<a href="5_None.html">None</a>&nbsp;&nbsp;|&nbsp;&nbsp;
	<a href="6_Vepres.html">Vepres</a>&nbsp;&nbsp;|&nbsp;&nbsp;
	<a href="7_Complies.html">Complies</a>
	<br><br>
	</div>
	<div style="text-align: center;"><a href="../index.html">Retour</a></div></body>
	</html>
	"""
	joined = "%s<h2>%s</h2>%s%s" % (part1, site_date, ordo_text, part3)
	text_file.write(joined)
	text_file.close()
	#Clean pages
	for filename in os.listdir('.'):
		if re.match(r'\d.*', filename):
			messy = open(filename, "r")
			soup = BeautifulSoup(messy)
			messy.close()
			for remove in soup.find_all(attrs={'class':['clr', 'goTop', 'print_only', 'change_country', 'abonnement', 'current', 'bloc', 'degre', 'base']}):
				remove.decompose()
			for remove in soup.find_all(id=['copyright', 'bas', 'menuHorizontal', 'colonneDroite', 'colonneGauche', 'font-resize', 'print_link', 'titre']):
				remove.decompose()
			cleaned = str(soup)
			output_file = open(filename, "w")
			output_file.write(cleaned)
			output_file.close()
	# Go back to the parent folder and advance to the next day
	os.chdir("..")
	next_date = Base_date + datetime.timedelta(days=i + 1)
06-21-2014, 01:25 PM   #4   kovidgoyal (creator of calibre)
You need to return a list of sections and articles from parse_index(); the code you posted is not returning anything. Read the API docs for parse_index(): http://manual.calibre-ebook.com/news...pe.parse_index
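For reference, the documented return shape for parse_index() is a list of (section title, list of article dicts) tuples, each dict carrying title, url, date, description and content keys; a minimal sketch, with the URL taken from the first post:
Code:
def parse_index(self):
    # One section containing one article; 'content' is obsolete and left empty
    articles = [{
        'title': 'Messe du 31/1/2014',
        'url': 'http://www.aelf.org/office-messe?desktop=1&date_my=31/1/2014',
        'date': '31/1/2014',
        'description': '',
        'content': '',
    }]
    return [('Liturgie des Heures', articles)]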
06-25-2014, 11:40 AM   #5   entodoays
Thanks, Kovid.

My Python script works independently of Calibre for the time being. It:
  1. Creates a folder on disk
  2. Creates subfolders for each "section" (dates)
  3. Downloads seven HTML pages into each folder and cleans them with BeautifulSoup
  4. Creates an index page in each folder
  5. Creates an index page in the base folder linking to all the daily index pages

Then I can import the general index in Calibre and create an epub.

To transform this script into a recipe, I have to change the folder creation and file downloading bits.

My question is: Is it possible to avoid using the normal "Section menu" news structure and replace it with my custom index page structure?

Please be patient; this is my first ever Python script and my first ever recipe.

Thanks.

The attached epub is the intended result.
Attached Files
File Type: epub Liturgie des Heures - AELF.epub (1.88 MB, 224 views)

06-25-2014, 11:47 AM   #6   kovidgoyal (creator of calibre)
If you already have a script to create your epub, why do you want a recipe? Just run your script using cron and use the calibredb command to add the resulting epub to calibre.
06-25-2014, 11:50 AM   #7   entodoays
I want the script to be OS independent. I'm working on Linux, but I would like to share it with others on Windows or Mac.
06-25-2014, 11:58 AM   #8   kovidgoyal (creator of calibre)
Then you will have to figure out how to convert it to a recipe; I'm afraid I don't have the time to help you do that. The recipe API is extensively documented.
06-25-2014, 12:39 PM   #9   entodoays
I'm new to Python and to recipe building, so I'll try to create a normal news recipe first, since I'm not yet familiar enough with Python.

What I would need to do to my script is:
  1. replace the folder creation with parse_index() sections
  2. replace the file downloads with article additions

The rest should remain pretty much the same.

Can you please give me a simple recipe that creates a section with one article from a single link (http://www.aelf.org/office-lectures?...my=22/6/2014)? I think I could build up from that.
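A minimal sketch of such a recipe, assuming the full ?desktop=1&date_my= query string used elsewhere in this thread; the class and section names are made up for illustration:
Code:
from calibre.web.feeds.news import BasicNewsRecipe

class AELFSinglePage(BasicNewsRecipe):
    title          = 'AELF Office des lectures'
    language       = 'fr'
    no_stylesheets = True

    def parse_index(self):
        # One section with a single article pointing at the dated page
        article = {'title': 'Lectures du 22/6/2014',
                   'url': 'http://www.aelf.org/office-lectures?desktop=1&date_my=22/6/2014',
                   'date': '22/6/2014',
                   'description': ''}
        return [('Office des lectures', [article])]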

06-25-2014, 04:04 PM   #10   entodoays
My first "working" recipe

I managed to create a recipe which downloads something. I started from the built-in recipe for "The Atlantic" and modified it accordingly.

I'm trying to get the recipe to download all the links in http://www.aelf.org/office-laudes that are found in the following div:
Code:
<div class="bloc" onMouseOver="mabulle.hide()">
    <ul>
        <li class=""> > <a href="/office-messe">Lecture de la messe</a></li>
        <li class="current"> > Liturgie des heures
            <ul>
                <li class=""> > <a href="/office-lectures">Lectures</a></li>
                <li class="current"> > <a href="/office-laudes">Laudes</a></li>
                <li class=""> > <a href="/office-tierce">Tierce</a></li>
                <li class=""> > <a href="/office-sexte">Sexte</a></li>
                <li class=""> > <a href="/office-none">None</a></li>
                <li class=""> > <a href="/office-vepres">Vêpres</a></li>
                <li class=""> > <a href="/office-complies">Complies</a></li>
            </ul>
        </li>
    </ul>
</div>
For some reason it only downloads the links to /office-lectures and /office-messe, and I cannot understand why (see the note after the recipe below).

The following is my recipe:
Code:
#!/usr/bin/env  python

__license__   = 'GPL v3'
__copyright__ = '2008, Kovid Goyal <kovid at kovidgoyal.net>'
'''
aelf.org Liturgie des heures
'''
import re, datetime

from calibre.web.feeds.news import BasicNewsRecipe
from calibre.ebooks.BeautifulSoup import Tag, NavigableString

now = datetime.datetime.now() #Get today's date
idx = (now.weekday() + 1) % 7 #Get the day of the week
Base_date = now + datetime.timedelta(7-idx) #Get this Sunday's date
next_date = Base_date
site_date = "%s/%s/%s" % (next_date.day, next_date.month, next_date.year)

class AELF(BasicNewsRecipe):

    title      = 'AELF'
    __author__ = 'Kovid Goyal and Sujata Raman'
    description = 'Liturgie des Heures'
    INDEX = "http://www.aelf.org/office-laudes?desktop=1&date_my=%s" % (site_date)
    language = 'fr'

    """keep_only_tags = [{'attrs':{'class':['bloc']}}]"""
    remove_tags    = [dict(attrs={'class':['clr', 'goTop', 'print_only', 'change_country', 'abonnement', 'current', 'bloc', 'degre', 'bas']})]
    no_stylesheets = True

    def parse_index(self):
        articles = []
        soup = self.index_to_soup(self.INDEX)
        ts = soup.find('li')
        ds = self.tag_to_string(ts.find('h2')).split(':')[-1]
        self.timefmt = ' [%s]'%ds

        cover = soup.find('img', src=True, attrs={'alt':'logo de l\'association épiscopale liturgique pour les pays francophones'})

        if cover is not None:
            self.cover_url = 'http://www.aelf.org' + cover['src']
            self.log(self.cover_url)

        feeds = []
        seen_titles = set([])
        for section in soup.findAll('div', attrs={'id':'contenu'}):
            section_title = self.tag_to_string(section.find('li', attrs={'class':''}))
            self.log('Found section:', section_title)
            articles = []
            for post in section.findAll('li'):
                a = post.find('a', href=True)
                title = self.tag_to_string(a)
                """if title in seen_titles:
                    continue"""
                seen_titles.add(title)
                url = a['href']
                if url.startswith('/'):
                    url = 'http://www.aelf.org'+url+'?desktop=1&date_my=%s' % (site_date)
                p = post.parent.find('p', attrs={'class':'current'})
                desc = None
                self.log('\tFound article:', title, 'at', url)
                if p is not None:
                    desc = self.tag_to_string(p)
                    self.log('\t\t', desc)
                articles.append({'title':title, 'url':url, 'description':desc,
                    'date':''})
            if articles:
                feeds.append((section_title, articles))

        rightContent=soup.find('div', attrs={'class':['bloc']})
        for module in rightContent.findAll('li', attrs={'class':['']}):
            section_title = self.tag_to_string(soup.find('h1'))
            articles = []
            for post in module.findAll('div'):
                a = post.find('a', href=True)
                title = self.tag_to_string(a)
                if title in seen_titles:
                    continue
                seen_titles.add(title)
                url = a['href']
                if url.startswith('/'):
                    url = 'http://www.aelf.org'+url
                p = post.parent.find('p')
                desc = None
                self.log('\tFound article:', title, 'at', url)
                if p is not None:
                    desc = self.tag_to_string(p)
                    self.log('\t\t', desc)
                articles.append({'title':title, 'url':url, 'description':desc, 'date':''})
            if articles:
                feeds.append((section_title, articles))

        return feeds

    def postprocess_html(self, soup, first):
        for table in soup.findAll('table', align='right'):
            img = table.find('img')
            if img is not None:
                img.extract()
                caption = self.tag_to_string(table).strip()
                div = Tag(soup, 'div')
                div['style'] = 'text-align:center'
                div.insert(0, img)
                div.insert(1, Tag(soup, 'br'))
                if caption:
                    div.insert(2, NavigableString(caption))
                table.replaceWith(div)

        return soup
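A hedged guess at the problem, judging from the menu HTML above: the second loop iterates module.findAll('div'), but the posted markup contains no <div> inside the <li> elements, so that loop never appends anything. A sketch that walks the menu's anchors directly, written as a helper method using the same variables (soup, site_date) as the recipe:
Code:
    def menu_articles(self, soup, site_date):
        # Collect one article per link in the 'bloc' menu
        articles = []
        menu = soup.find('div', attrs={'class': ['bloc']})
        if menu is None:
            return articles
        for a in menu.findAll('a', href=True):
            title = self.tag_to_string(a)
            url = a['href']
            if url.startswith('/'):
                url = 'http://www.aelf.org' + url + '?desktop=1&date_my=%s' % site_date
            articles.append({'title': title, 'url': url,
                             'description': None, 'date': ''})
        return articles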
06-26-2014, 05:48 AM   #11   entodoays
I'm trying to use parse_index() by giving it a manual list of pages to include.

The documentation says:
Quote:
The full article (can be an empty string). Obsolete
do not use, instead save the content to a temporary
file and pass a file:///path/to/temp/file.html as
the URL.
This is a code example, but I don't know how to actually implement it.
Code:
#One article from one page
messe = dict()
messe['title'] = 'Lectures de la Messe'
messe['url'] = 'http://www.aelf.org/office-messe?desktop=1&date_my=%s' % (site_date)
messe['date'] = site_date
messe['description'] = 'LECTURES'
#One article from another page
laudes = dict()
laudes['title'] = 'Laudes'
laudes['url'] = 'http://www.aelf.org/office-laudes?desktop=1&date_my=%s' % (site_date)
laudes['date'] = site_date
laudes['description'] = 'LAUDES'
#Create list of two articles
list_of_articles = [messe, laudes]
#Give the list a feed title
feed_title = str(book_date)
#What to give to parse_index
self = [feed_title, list_of_articles]
Any hint, please?

As an alternative:
My script already downloads the files to disk and creates a series of index files. Can I simply pass the main index file as the index?
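A hedged sketch of the temp-file approach the quoted docs describe: fetch the page with the recipe's browser, save it via calibre's PersistentTemporaryFile, and return a file:// URL from parse_index(). site_date is the variable from the snippets above; everything else follows the parse_index() contract shown earlier in the thread.
Code:
from calibre.ptempfile import PersistentTemporaryFile

# Inside the recipe class:
def parse_index(self):
    # Download the page ourselves and keep it in a persistent temp file
    url = 'http://www.aelf.org/office-messe?desktop=1&date_my=%s' % site_date
    raw = self.browser.open(url).read()
    tf = PersistentTemporaryFile(suffix='.html')
    tf.write(raw)
    tf.close()
    # Point the article at the local copy instead of the remote URL
    messe = {'title': 'Lectures de la Messe',
             'url': 'file://' + tf.name,
             'date': site_date,
             'description': 'LECTURES'}
    return [(site_date, [messe])]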