#1
Zealot
Posts: 144 | Karma: 706 | Join Date: Oct 2011 | Device: Sony Reader PRS-T1
New recipe creation help
I would like to create a recipe for a site whose RSS feed does not contain the required content.

Basically, I would like a recipe that does the following: get the following links from the site for a 60-day period, starting from today, each time changing the date part at the end. Create an index page for each day containing links to the 7 pages. Finally, create a table-of-contents-style page for the days. I've never created a recipe before and am willing to learn. I would be grateful for any help.

Last edited by entodoays; 06-18-2014 at 01:17 PM. Reason: Added more info
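For illustration, a rough sketch of the date loop being described here, assuming the aelf.org office pages and the ?desktop=1&date_my=DD/MM/YYYY query format that appear later in this thread:

Code:
# Sketch only: for each of 60 days starting today, build the links to the
# daily office pages by changing the date parameter at the end of the URL.
import datetime

OFFICES = ['messe', 'laudes', 'lectures', 'tierce', 'sexte', 'none', 'vepres', 'complies']
today = datetime.date.today()

for offset in range(60):
    day = today + datetime.timedelta(days=offset)
    date_param = '%d/%d/%d' % (day.day, day.month, day.year)
    links = ['http://www.aelf.org/office-%s?desktop=1&date_my=%s' % (office, date_param)
             for office in OFFICES]
    # each day's index page would then link to these pages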
#2
Zealot
Posts: 144 | Karma: 706 | Join Date: Oct 2011 | Device: Sony Reader PRS-T1
A start
Since I'm new to recipe creation, I started off with an example in the calibre documentation and began modifying it. My first objective is to get a single page to parse correctly. Once I manage that, I'll try to add further steps. My recipe looks as follows:

Code:
import string, re
from calibre import strftime
from calibre.web.feeds.recipes import BasicNewsRecipe
from calibre.ebooks.BeautifulSoup import BeautifulSoup

class NYTimes(BasicNewsRecipe):

    title = 'Liturgie des Heures'
    __author__ = 'Chris Vella'
    description = 'La liturgie des heures'
    timefmt = ' [%a, %d %b, %Y]'
    needs_subscription = False

    remove_tags_before = dict(name='h1')
    remove_tags_after = [dict(id='print_only')]
    remove_tags = [dict(attrs={'class':['articleTools', 'post-tools', 'side_tool', 'nextArticleLink clearfix']}),
                   dict(id=['menuHorizontal', 'colonneDroite', 'niveau', 'don', 'font-resize', 'print_link']),
                   dict(name=['script', 'noscript', 'style'])]
    encoding = 'utf8'
    no_stylesheets = True
    extra_css = 'h1 {font: sans-serif large;}\n.byline {font:monospace;}'

    def parse_index(self):
        soup = self.index_to_soup('www.aelf.org/office-messe\?desktop=1&date_my=%d/%b/%Y')

        def feed_title(div):
            return ''.join(div.findAll(text=True, recursive=False)).strip()

        articles = {}
        key = None
        ans = []
        for div in soup.findAll(True, attrs={'class':['current']}):

            if div['class'] == 'current':
                key = string.capwords(feed_title(div))
                articles[key] = []
                ans.append(key)

            elif div['class'] in ['current']:
                a = div.find('a', href=True)
                if not a:
                    continue
                url = re.sub(r'\?.*', '', a['href'])
                url += '?pagewanted=all'
                title = self.tag_to_string(a, use_alt=True).strip()
                description = ''
                pubdate = strftime('%a, %d %b')
                summary = div.find(True, attrs={'class':'summary'})
                if summary:
                    description = self.tag_to_string(summary, use_alt=False)

                feed = key if key is not None else 'Uncategorized'
                if not articles.has_key(feed):
                    articles[feed] = []
                if not 'podcasts' in url:
                    articles[feed].append(
                        dict(title=title, url=url, date=pubdate,
                             description=description, content=''))

        ans = self.sort_index_by(ans, {'The Front Page':-1, 'Dining In, Dining Out':1, 'Obituaries':2})
        ans = [(key, articles[key]) for key in ans if articles.has_key(key)]
        return ans

    def preprocess_html(self, soup):
        refresh = soup.find('meta', {'http-equiv':'refresh'})
        if refresh is None:
            return soup
        content = refresh.get('content').partition('=')[2]
        raw = self.browser.open('http://www.nytimes.com'+content).read()
        return BeautifulSoup(raw.decode('utf8', 'replace'))
Thanks.
#3
Zealot
Posts: 144 | Karma: 706 | Join Date: Oct 2011 | Device: Sony Reader PRS-T1
Working Python script
I wrote a Python script which downloads the pages and places them in separate folders by date. It creates an index file for each day, then cleans the pages using BeautifulSoup. Can anyone help me transform it into a recipe? Here's the script:

Code:
#!/bin/python
import datetime, os, urllib, re
from urllib import urlopen
from bs4 import BeautifulSoup

now = datetime.datetime.now()       # Get today's date
os.chdir(os.environ['HOME'])        # Go to home folder
Base_folder = r'Breviaire_%s-%s-%s' % (now.day, now.month, now.year)  # All files will be stored in this date-stamped folder
if not os.path.exists(Base_folder):
    os.makedirs(Base_folder)        # Create a folder with today's date
os.chdir(Base_folder)               # Go to the freshly created folder

idx = (now.weekday() + 1) % 7                  # Get the day of the week
Base_date = now + datetime.timedelta(7-idx)    # Get this Sunday's date
next_date = Base_date

# Download the files for x days
for i in range(0, 4):
    next_folder = r'%s-%s-%s' % (next_date.year, next_date.month, next_date.day)
    if not os.path.exists(next_folder):
        os.makedirs(next_folder)
    os.chdir(next_folder)
    site_date = "%s/%s/%s" % (next_date.day, next_date.month, next_date.year)

    next_link = "http://www.aelf.org/office-messe?desktop=1&date_my=%s" % (site_date)
    urllib.urlretrieve(next_link, filename="0_Messe.html")
    laudes_link = "http://www.aelf.org/office-laudes?desktop=1&date_my=%s" % (site_date)
    urllib.urlretrieve(laudes_link, filename="1_Laudes.html")
    lectures_link = "http://www.aelf.org/office-lectures?desktop=1&date_my=%s" % (site_date)
    urllib.urlretrieve(lectures_link, filename="2_Lectures.html")
    tierce_link = "http://www.aelf.org/office-tierce?desktop=1&date_my=%s" % (site_date)
    urllib.urlretrieve(tierce_link, filename="3_Tierce.html")
    sexte_link = "http://www.aelf.org/office-sexte?desktop=1&date_my=%s" % (site_date)
    urllib.urlretrieve(sexte_link, filename="4_Sexte.html")
    none_link = "http://www.aelf.org/office-none?desktop=1&date_my=%s" % (site_date)
    urllib.urlretrieve(none_link, filename="5_None.html")
    vepres_link = "http://www.aelf.org/office-vepres?desktop=1&date_my=%s" % (site_date)
    urllib.urlretrieve(vepres_link, filename="6_Vepres.html")
    complies_link = "http://www.aelf.org/office-complies?desktop=1&date_my=%s" % (site_date)
    urllib.urlretrieve(complies_link, filename="7_Complies.html")

    # Extract ordo
    html_doc = urlopen(next_link).read()
    soup = BeautifulSoup(html_doc)
    ordo_text = soup.find("div", {"class": "bloc"})
    text_file = open("index.html", "w")
    for hidden in ordo_text.find_all(id='maBulle'):
        hidden.decompose()
    part1 = """
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
</head>
<body>
"""
    part3 = """
<div><a href="0_Messe.html">Messe</a> | <a href="1_Laudes.html">Laudes</a> | <a href="2_Lectures.html">Lectures</a> | <a href="3_Tierce.html">Tierce</a> | <a href="4_Sexte.html">Sexte</a> | <a href="5_None.html">None</a> | <a href="6_Vepres.html">Vepres</a> | <a href="7_Complies.html">Complies</a>
<br><br>
</div>
<div style="text-align: center;"><a href="../index.html">Retour</a></div></body>
</html>
"""
    joined = "%s<h2>%s</h2>%s%s" % (part1, site_date, ordo_text, part3)
    text_file.write(joined)
    text_file.close()

    # Clean pages
    for filename in os.listdir('.'):
        if re.match(r'\d.*', filename):
            messy = open(filename, "r")
            soup = BeautifulSoup(messy)
            messy.close()
            for remove in soup.find_all(attrs={'class':['clr', 'goTop', 'print_only', 'change_country', 'abonnement', 'current', 'bloc', 'degre', 'base']}):
                remove.decompose()
            for remove in soup.find_all(id=['copyright', 'bas', 'menuHorizontal', 'colonneDroite', 'colonneGauche', 'font-resize', 'print_link', 'titre']):
                remove.decompose()
            cleaned = str(soup)
            output_file = open(filename, "w")
            output_file.write(cleaned)

    # Go to parent folder and add 1 day
    os.chdir("..")
    next_date = Base_date + datetime.timedelta(days=i)
#4
creator of calibre
Posts: 45,342 | Karma: 27182818 | Join Date: Oct 2006 | Location: Mumbai, India | Device: Various
You need to return a list of sections and articles from parse_index(); the code you posted is not returning anything. Read the API docs for parse_index(): http://manual.calibre-ebook.com/news...pe.parse_index
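In other words, the method has to produce a list of (section title, list of article dictionaries) tuples. A schematic sketch of that shape (placeholder values, not code from the thread):

Code:
# Schematic only: parse_index() must return a list of
# (section title, list of article dictionaries) tuples.
sections = [
    ('Section title', [
        {'title': 'Article title', 'url': 'http://...', 'date': '',
         'description': '', 'content': ''},
    ]),
]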
#5
Zealot
Posts: 144 | Karma: 706 | Join Date: Oct 2011 | Device: Sony Reader PRS-T1
Thanks Kovid,

My Python script works independently of calibre for the time being. It:
- downloads the pages for each office and places them in separate folders by date
- creates an index file for each day linking to those pages
- cleans the pages using BeautifulSoup
- creates a general index for the whole period

Then I can import the general index into calibre and create an epub. To transform this script into a recipe I have to change the folder creation and file downloading bits. My question is: is it possible to avoid using the normal "Section menu" news structure and replace it with my custom index page structure? Please be patient; this is my first ever Python script and first ever recipe. Thanks. The attached epub is the intended result.

Last edited by entodoays; 06-25-2014 at 11:46 AM.
#6
creator of calibre
Posts: 45,342 | Karma: 27182818 | Join Date: Oct 2006 | Location: Mumbai, India | Device: Various
If you already have a script to create your epub, why do you want a recipe? Just run your script using cron and use the calibredb command to add the resulting epub to calibre.
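For illustration, one possible way to glue the pieces together from a scheduled job, assuming the download script from post #3 is saved as breviaire.py and the calibre command-line tools are on the PATH; the file names here are hypothetical placeholders, while ebook-convert and calibredb add are the real calibre CLI tools:

Code:
# Hypothetical cron glue (placeholder paths, not the poster's actual setup):
# 1. run the download/clean script
# 2. convert its top-level index.html into an EPUB with ebook-convert
# 3. add the EPUB to the calibre library with calibredb
import subprocess

subprocess.check_call(['python', 'breviaire.py'])
subprocess.check_call(['ebook-convert', 'index.html', 'Breviaire.epub',
                       '--title', 'Breviaire'])
subprocess.check_call(['calibredb', 'add', 'Breviaire.epub'])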
#7
Zealot
Posts: 144 | Karma: 706 | Join Date: Oct 2011 | Device: Sony Reader PRS-T1
I want the script to be OS independent. I'm working on Linux but would like to share it with others on Windows or Mac.
#8
creator of calibre
Posts: 45,342 | Karma: 27182818 | Join Date: Oct 2006 | Location: Mumbai, India | Device: Various
Then you will have to figure out how to convert it to a recipe; I'm afraid I don't have the time to help you do that. The recipe API is extensively documented.
#9
Zealot
Posts: 144 | Karma: 706 | Join Date: Oct 2011 | Device: Sony Reader PRS-T1
I'm new to Python and to recipe building, so I'll try to create a normal news recipe instead, since I'm not familiar enough with Python.

What I would need to do to my script is to

The rest should remain pretty much the same. Can you please give me a simple recipe that creates a section with one article from a single link (http://www.aelf.org/office-lectures?...my=22/6/2014)? I think I could build up from that.

Last edited by entodoays; 06-25-2014 at 01:23 PM. Reason: Asking for help
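For what it's worth, a minimal sketch of such a single-article recipe (not an answer given in the thread; the full URL is assumed to follow the ?desktop=1&date_my=DD/MM/YYYY pattern used elsewhere in this discussion):

Code:
from calibre.web.feeds.news import BasicNewsRecipe

class OfficeDuJour(BasicNewsRecipe):
    # Hypothetical minimal recipe: one section containing one article,
    # built from a single hard-coded aelf.org link.
    title = 'Office du jour'
    language = 'fr'
    no_stylesheets = True

    def parse_index(self):
        article = {
            'title': 'Lectures',
            'url': 'http://www.aelf.org/office-lectures?desktop=1&date_my=22/6/2014',
            'date': '22/6/2014',
            'description': 'Office des lectures',
            'content': '',
        }
        return [('Liturgie des Heures', [article])]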
#10
Zealot
Posts: 144 | Karma: 706 | Join Date: Oct 2011 | Device: Sony Reader PRS-T1
My first "working" recipe
I managed to create a recipe which downloads something. I started from the built-in recipe for "The Atlantic" and tried modifying it accordingly.
I'm trying to get the recipe to download all the links in http://www.aelf.org/office-laudes which are found in the following div: Code:
<div class="bloc" onMouseOver="mabulle.hide()"> <ul> <li class=""> > <a href="/office-messe">Lecture de la messe</a></li> <li class="current"> > Liturgie des heures <ul> <li class=""> > <a href="/office-lectures">Lectures</a></li> <li class="current"> > <a href="/office-laudes">Laudes</a></li> <li class=""> > <a href="/office-tierce">Tierce</a></li> <li class=""> > <a href="/office-sexte">Sexte</a></li> <li class=""> > <a href="/office-none">None</a></li> <li class=""> > <a href="/office-vepres">Vêpres</a></li> <li class=""> > <a href="/office-complies">Complies</a></li> </ul> </li> </ul> </div> The following is my recipe: Code:
#!/usr/bin/env python
__license__ = 'GPL v3'
__copyright__ = '2008, Kovid Goyal <kovid at kovidgoyal.net>'
'''
aelf.org
Liturgie des heures
'''
import re, datetime
from calibre.web.feeds.news import BasicNewsRecipe
from calibre.ebooks.BeautifulSoup import Tag, NavigableString

now = datetime.datetime.now()                  # Get today's date
idx = (now.weekday() + 1) % 7                  # Get the day of the week
Base_date = now + datetime.timedelta(7-idx)    # Get this Sunday's date
next_date = Base_date
site_date = "%s/%s/%s" % (next_date.day, next_date.month, next_date.year)

class AELF(BasicNewsRecipe):

    title = 'AELF'
    __author__ = 'Kovid Goyal and Sujata Raman'
    description = 'Liturgie des Heures'
    INDEX = "http://www.aelf.org/office-laudes?desktop=1&date_my=%s" % (site_date)

    language = 'fr'
    """keep_only_tags = [{'attrs':{'class':['bloc']}}]"""
    remove_tags = [dict(attrs={'class':['clr', 'goTop', 'print_only', 'change_country', 'abonnement', 'current', 'bloc', 'degre', 'bas']})]
    no_stylesheets = True

    def parse_index(self):
        articles = []

        soup = self.index_to_soup(self.INDEX)
        ts = soup.find('li')
        ds = self.tag_to_string(ts.find('h2')).split(':')[-1]
        self.timefmt = ' [%s]'%ds

        cover = soup.find('img', src=True, attrs={'alt':'logo de l\'association épiscopale liturgique pour les pays francophones'})
        if cover is not None:
            self.cover_url = 'http://www.aelf.org' + cover['src']
            self.log(self.cover_url)

        feeds = []
        seen_titles = set([])
        for section in soup.findAll('div', attrs={'id':'contenu'}):
            section_title = self.tag_to_string(section.find('li', attrs={'class':''}))
            self.log('Found section:', section_title)
            articles = []
            for post in section.findAll('li'):
                a = post.find('a', href=True)
                title = self.tag_to_string(a)
                """if title in seen_titles:
                    continue"""
                seen_titles.add(title)
                url = a['href']
                if url.startswith('/'):
                    url = 'http://www.aelf.org'+url+'?desktop=1&date_my=%s' % (site_date)
                p = post.parent.find('p', attrs={'class':'current'})
                desc = None
                self.log('\tFound article:', title, 'at', url)
                if p is not None:
                    desc = self.tag_to_string(p)
                    self.log('\t\t', desc)
                articles.append({'title':title, 'url':url, 'description':desc, 'date':''})
            if articles:
                feeds.append((section_title, articles))

        rightContent = soup.find('div', attrs={'class':['bloc']})
        for module in rightContent.findAll('li', attrs={'class':['']}):
            section_title = self.tag_to_string(INDEX.find('h1'))
            articles = []
            for post in module.findAll('div'):
                a = post.find('a', href=True)
                title = self.tag_to_string(a)
                if title in seen_titles:
                    continue
                seen_titles.add(title)
                url = a['href']
                if url.startswith('/'):
                    url = 'http://www.aelf.org'+url
                p = post.parent.find('p')
                desc = None
                self.log('\tFound article:', title, 'at', url)
                if p is not None:
                    desc = self.tag_to_string(p)
                    self.log('\t\t', desc)
                articles.append({'title':title, 'url':url, 'description':desc, 'date':''})
            if articles:
                feeds.append((section_title, articles))
        return feeds

    def postprocess_html(self, soup, first):
        for table in soup.findAll('table', align='right'):
            img = table.find('img')
            if img is not None:
                img.extract()
                caption = self.tag_to_string(table).strip()
                div = Tag(soup, 'div')
                div['style'] = 'text-align:center'
                div.insert(0, img)
                div.insert(1, Tag(soup, 'br'))
                if caption:
                    div.insert(2, NavigableString(caption))
                table.replaceWith(div)
        return soup
#11
Zealot
Posts: 144 | Karma: 706 | Join Date: Oct 2011 | Device: Sony Reader PRS-T1
I'm trying to use parse_index by giving a manual list of pages to include.
The documentation describes the article-dictionary structure that parse_index() should return, so I tried building the articles by hand:

Code:
# One article from one page
messe = dict()
messe['title'] = 'Lectures de la Messe'
messe['url'] = 'http://www.aelf.org/office-messe?desktop=1&date_my=%s' % (site_date)
messe['date'] = site_date
messe['description'] = 'LECTURES'

# One article from another page
laudes = dict()
laudes['title'] = 'Laudes'
laudes['url'] = 'http://www.aelf.org/office-laudes?desktop=1&date_my=%s' % (site_date)
laudes['date'] = site_date
laudes['description'] = 'LAUDES'

# Create list of two articles
list_of_articles = [messe, laudes]

# Give the list a feed title
feed_title = str(book_date)

# What to give to parse_index
self = [feed_title, list_of_articles]

As an alternative: my script does download files to disk and creates a series of index files. Can I simply pass the main index file as the index?
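A minimal sketch (not from the thread) of how dictionaries like the ones above would actually be handed to calibre: parse_index() is a method of the recipe class, and the list of (feed title, article list) tuples is its return value rather than something assigned to self:

Code:
    def parse_index(self):
        # messe, laudes and site_date built as in the snippet above; the
        # method must return a list of (feed title, article list) tuples.
        list_of_articles = [messe, laudes]
        feed_title = str(site_date)
        return [(feed_title, list_of_articles)]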
Tags: calibre, help needed, recipe