Custom recipes (archive, read-only) - Page 48

GRiker · 09-07-2009, 01:39 PM

macsilber: It would be more helpful if you could post the recipe you're using.

dmendozadmd: a 'sticky' is a popular topic that stays pinned at the top of the forum's topic list, so it's easier to find. A recipe is a script that calibre uses to download the contents of a particular website and format it for your eReader.

G

Gomes · 09-07-2009, 03:50 PM

Quote:

Originally Posted by GRiker

Gomes,
There are RSS feeds in each section of philly.com. Follow the directions to create a custom feed, then ask for assistance if you get stuck. It's actually pretty simple.

G

I've been trying to get a clean copy for a couple of weeks with no success. Essentially, I am unable to get the print version of the stories. I've tried to go through the directions cited above, but that doesn't seem to help... What I end up with is the article with all the various menus, pictures, and comments, which makes it difficult to read at best and takes forever for calibre to fetch and convert. Can anyone help?

And yes, I realize I'm probably just missing something obvious...

cix3 · 09-07-2009, 04:23 PM

In a custom recipe, how do I remove multiple div classes?

For example, from this source page (http://www.tnr.com/print/article/pol...ocking-roberts), I want to remove these div classes: print-logo, print-site_name, img-left, and print-source_url.

Probably a simple syntax question, but I'm new to Python. I have tried...

Code:

    remove_tags = [dict(name='div', attrs={'class':'print-logo'})]
    remove_tags = [dict(name='div', attrs={'class':'print-site_name'})]
    remove_tags = [dict(name='div', attrs={'class':'img-left'})]
    remove_tags = [dict(name='div', attrs={'class':'print-source_url'})]

... which only removes the last div class listed (in this case, print-source_url).

This gives me a syntax error:

Code:

    remove_tags = [dict(name='div', attrs={'class':'print-logo', 'print-site_name', 'img-left', 'print-source_url'})]

What is the correct syntax?

Thanks

kovidgoyal · 09-07-2009, 04:27 PM

Code:

remove_tags = [dict(name='div', attrs={'class':['print-logo', 'print-site_name', ..]}]

cix3 · 09-07-2009, 04:38 PM

Thanks... I knew it must have been something simple like that.

Your snippet as written gave me a syntax error, but adding a ) as the second to last character fixed it.
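
For anyone following along, here is a minimal sketch of why the four separate assignments kept only the last class, next to the single assignment with the closing parenthesis in place. It is plain Python with no calibre imports, and the names `overwritten` and `combined` are only for illustration.

```python
# Each assignment to the same name replaces the previous value,
# so only the dict from the last assignment survives.
overwritten = [dict(name='div', attrs={'class': 'print-logo'})]
overwritten = [dict(name='div', attrs={'class': 'print-source_url'})]

# One dict whose 'class' value is a list covers all four classes.
# Note the ")" that closes dict( just before the final "]".
combined = [dict(name='div', attrs={'class': ['print-logo', 'print-site_name',
                                              'img-left', 'print-source_url']})]

print(overwritten[0]['attrs']['class'])
print(len(combined[0]['attrs']['class']))
```

In a recipe, the `combined` form is what gets assigned to `remove_tags`.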

GRiker · 09-07-2009, 06:12 PM

gomes: Post your recipe. You will probably need to use remove_tags as cix3 has learned to get rid of the stuff you don't want.

Basically, this involves going to a sample page, examining the HTML source, isolating the stuff you don't want, then specifying a remove_tags directive as Kovid has described in his post above this one.

If you post your recipe, folks here are better able to help you refine it.

G
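
The "examine the HTML source" step can be partly automated. The following is a rough sketch, using only the Python standard library rather than calibre, that lists the class names found on div tags so you can pick candidates for remove_tags. The `DivClassCounter` helper, the sample markup, and the `content` class are all made up for illustration.

```python
from collections import Counter
from html.parser import HTMLParser


class DivClassCounter(HTMLParser):
    """Count the class names used on <div> tags in a page."""

    def __init__(self):
        super().__init__()
        self.counts = Counter()

    def handle_starttag(self, tag, attrs):
        if tag != 'div':
            return
        # attrs is a list of (name, value) pairs; the value may be None.
        class_value = dict(attrs).get('class') or ''
        for class_name in class_value.split():
            self.counts[class_name] += 1


# Hypothetical page source; in practice, paste in the HTML of a sample article.
sample_html = '''
<div class="print-logo">logo</div>
<div class="print-site_name">site</div>
<div class="content"><p>Article text</p></div>
<div class="print-source_url">url</div>
'''

counter = DivClassCounter()
counter.feed(sample_html)
div_classes = sorted(counter.counts)
print(div_classes)
```

Anything in that list that is not article text is a candidate for the remove_tags list.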

cix3 · 09-07-2009, 06:18 PM

Hello,

Here's my first stab at a recipe for The New Republic (www.tnr.com). It aggregates all articles and blogs, minus the images. Enjoy!

Code:

from calibre.web.feeds.news import BasicNewsRecipe

class The_New_Republic(BasicNewsRecipe):
    title = 'The New Republic'
    __author__ = 'cix3'
    description = 'Intelligent, stimulating and rigorous examination of American politics, foreign policy and culture'
    timefmt = ' [%b %d, %Y]'

    oldest_article = 7
    max_articles_per_feed = 100

    remove_tags = [
        dict(name='div', attrs={'class':['print-logo', 'print-site_name', 'img-left', 'print-source_url']}),
        dict(name='hr', attrs={'class':'print-hr'}),
        dict(name='img')
    ]

    feeds = [
        ('Politics', 'http://www.tnr.com/rss/articles/Politics'),
        ('Books and Arts', 'http://www.tnr.com/rss/articles/Books-and-Arts'),
        ('Economy', 'http://www.tnr.com/rss/articles/Economy'),
        ('Environment and Energy', 'http://www.tnr.com/rss/articles/Environment-%2526-Energy'),
        ('Health Care', 'http://www.tnr.com/rss/articles/Health-Care'),
        ('Urban Policy', 'http://www.tnr.com/rss/articles/Urban-Policy'),
        ('World', 'http://www.tnr.com/rss/articles/World'),
        ('Film', 'http://www.tnr.com/rss/articles/Film'),
        ('Books', 'http://www.tnr.com/rss/articles/books'),
        ('The Plank', 'http://www.tnr.com/rss/blogs/The-Plank'),
        ('The Treatment', 'http://www.tnr.com/rss/blogs/The-Treatment'),
        ('The Spine', 'http://www.tnr.com/rss/blogs/The-Spine'),
        ('The Stash', 'http://www.tnr.com/rss/blogs/The-Stash'),
        ('The Vine', 'http://www.tnr.com/rss/blogs/The-Vine'),
        ('The Avenue', 'http://www.tnr.com/rss/blogs/The-Avenue'),
        ('William Galston', 'http://www.tnr.com/rss/blogs/William-Galston'),
        ('Simon Johnson', 'http://www.tnr.com/rss/blogs/Simon-Johnson'),
        ('Ed Kilgore', 'http://www.tnr.com/rss/blogs/Ed-Kilgore'),
        ('Damon Linker', 'http://www.tnr.com/rss/blogs/Damon-Linker'),
        ('John McWhorter', 'http://www.tnr.com/rss/blogs/John-McWhorter')
            ]

    def print_version(self, url):
        return url.replace('http://www.tnr.com/', 'http://www.tnr.com/print/')
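
To see what print_version does without running calibre, the same replacement can be tried as a standalone function. This is only a sketch; the article path below is invented for the demonstration.

```python
def print_version(url):
    # Same replacement as the recipe method, minus the `self` argument.
    return url.replace('http://www.tnr.com/', 'http://www.tnr.com/print/')


# Hypothetical article path, used only to show the rewrite.
article_url = 'http://www.tnr.com/article/politics/example-story'
print(print_version(article_url))
```

One thing to watch: a URL that already starts with http://www.tnr.com/print/ would get a second /print/ inserted, so this assumes the feeds only ever hand over regular article URLs.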

bhandarisaurabh · 09-09-2009, 10:20 PM

Can anyone help me with a recipe for Business Standard?
If the URL for the article is
http://www.business-standard.com/ind...?autono=369650
then the print URL is
http://www.business-standard.com/ind...ono=369650&tp=
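
Going only by the two URLs above, the print URL looks like the article URL with `&tp=` appended. If that guess holds (the middle of both URLs is truncated in the post, so it may not), a print_version along these lines might be enough. The example.com address is a placeholder, not the real path.

```python
def print_version(url):
    # Assumption drawn from the two URLs in the post: the print page is
    # the article URL with an empty "tp" query parameter appended.
    return url + '&tp='


# Placeholder host and path; the real path is truncated in the post.
article_url = 'http://example.com/article.php?autono=369650'
print(print_version(article_url))
```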

cutterjohn42 · 09-10-2009, 10:38 AM

It seems that the most recent version of the /. recipe in Calibre may have caused an auto-ban to be triggered for my IP address.

I noticed the last time that it seemed to be downloading more of the site than before, i.e. I had the article + comments, and I think that the way the site is set up leads to recursively downloading most of the site unless strictly limited. I used to have that problem with sitescooper and plucker, and had to be very careful about limiting how much of /. was spidered to create a document for offline reading.

(This would be the version included with 0.6.11 .)

kovidgoyal · 09-10-2009, 11:44 AM

Open a ticket about it, I'll look at it when I have a spare moment.

cix3 · 09-10-2009, 09:32 PM

Any idea how I can transform an article URL like this (http://www.motherjones.com/politics/...-job-van-jones) into the print URL (http://www.motherjones.com/print/27151) that I want to use for my recipe?

I'm hoping there's an easy way to find the corresponding print URLs (by that five-digit number) for articles, rather than removing all the unwanted HTML from the actual article...

Any ideas?

Edit: I should also note that the original article page actually splits the article into multiple pages (which I would want to combine into one article for my recipe). The print version lists the entire article.

kovidgoyal · 09-10-2009, 09:43 PM

Just fetch the HTML and parse it looking for the print link

cix3 · 09-10-2009, 09:49 PM

Quote:

Originally Posted by kovidgoyal

Just fetch the HTML and parse it looking for the print link

Can you give me an example of a built-in recipe that does this?

kovidgoyal · 09-10-2009, 10:21 PM

Can't think of one offhand, but basically it's something like this:

Code:

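# Requires "import re" at the top of the recipe.
def get_article_url(self, article):
    # Start from the URL the feed entry provides (default lookup;
    # adjust if your recipe gets the URL another way).
    url = BasicNewsRecipe.get_article_url(self, article)
    soup = self.index_to_soup(url)
    # Look in the soup for the link to the full article.
    a = soup.find(name='a', href=True, text=re.compile(r'Full\s*Article'))
    if a is not None:
        return a['href']
    return url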

Stick a few print statements in there to debug things

cix3 · 09-11-2009, 12:28 AM

Quote:

Originally Posted by kovidgoyal

Can't think of one offhand, but basically it's something like this:

Code:

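# Requires "import re" at the top of the recipe.
def get_article_url(self, article):
    # Start from the URL the feed entry provides (default lookup;
    # adjust if your recipe gets the URL another way).
    url = BasicNewsRecipe.get_article_url(self, article)
    soup = self.index_to_soup(url)
    # Look in the soup for the link to the full article.
    a = soup.find(name='a', href=True, text=re.compile(r'Full\s*Article'))
    if a is not None:
        return a['href']
    return url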

Stick a few print statements in there to debug things

Hmmm... that's beyond my level of expertise. I'm going to have to wait for someone else to recommend a pre-built recipe that I can copy from.

Thanks!
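
Until someone points to a built-in recipe, the "fetch the HTML and look for the print link" idea can be tried outside calibre first. Below is a rough, standard-library-only sketch that scans a page for the first link ending in /print/ followed by digits and falls back to the original URL otherwise. `PrintLinkFinder`, `find_print_url`, the sample markup, and the article URL are hypothetical; the real page may label or place the link differently.

```python
import re
from html.parser import HTMLParser


class PrintLinkFinder(HTMLParser):
    """Remember the first <a href> that ends in /print/<digits>."""

    PRINT_HREF = re.compile(r'/print/\d+$')

    def __init__(self):
        super().__init__()
        self.print_url = None

    def handle_starttag(self, tag, attrs):
        if tag != 'a' or self.print_url is not None:
            return
        href = dict(attrs).get('href') or ''
        if self.PRINT_HREF.search(href):
            self.print_url = href


def find_print_url(html, fallback):
    finder = PrintLinkFinder()
    finder.feed(html)
    # Fall back to the original article URL when no print link is found.
    return finder.print_url or fallback


# Hypothetical markup standing in for a downloaded article page.
sample_html = ('<p><a href="/about">About</a> '
               '<a href="http://www.motherjones.com/print/27151">Print</a></p>')
print(find_print_url(sample_html, 'http://www.motherjones.com/politics/example'))
```

Inside a recipe, the same search would go in get_article_url, working on the page fetched for each article instead of `sample_html`.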
