Custom recipes (archive, read-only) - Page 179

TonytheBookworm · 09-07-2010, 12:05 PM

Quote:

Originally Posted by Starson17

Everyone is so friendly here, I know it's a temptation to stray off topic, but you probably should go to another thread with this. The recipe thread is tough to wade through as it is with all the lengthy recipes. <GRIN>

Yeah, sorry about that. If a mod can move the post, that would be great. Again, sorry.

poloman · 09-07-2010, 12:31 PM

Fair point, well made - will PM you, Tony. Doh - I thought I'd be able to delete this post. Sorry about this; if someone could delete it, please do!

edit: bringing it back on topic - I (lazily) added a simple feed for slashdot (http://rss.slashdot.org/Slashdot/slashdot) as I didn't want all the comments - the prospect of getting banned using the built in recipe deterred me from using it, and it takes a long time to run.

However, when the simple feed result appears on the Kindle, it shows the article summary fine in the sections view (i.e. the article title and the beginning of the article), but when I click to read it, the article and header are not there - just the comments.

Is there a simple solution, or a recipe that solves this? I tried making one that keeps only the article section, but didn't have much luck: <annoyingly, I seem to have deleted it - but have this one which shows the general idea>

Spoiler:

TonytheBookworm · 09-09-2010, 11:44 AM

Quote:

Originally Posted by poloman

edit: bringing it back on topic - I (lazily) added a simple feed for slashdot (http://rss.slashdot.org/Slashdot/slashdot) as I didn't want all the comments - the prospect of getting banned using the built in recipe deterred me from using it, and it takes a long time to run.

However, when the simple feed result appears on the Kindle, it shows the article summary fine in the sections view (i.e. the article title and the beginning of the article), but when I click to read it, the article and header are not there - just the comments.

Is there a simple solution, or a recipe that solves this? I tried making one that keeps only the article section, but didn't have much luck: <annoyingly, I seem to have deleted it - but have this one which shows the general idea>

Spoiler:

Change it to look like this:

Code:

keep_only_tags = [
                    dict(name='a', attrs={'class':'datitle'}),
                    dict(name='span', attrs={'class':'date'}),
                    dict(name='div',attrs={'class':'body'})
                   ]

You will get the title and then the date next to each other. In that case you would probably wanna do a preprocess_html and insert a <br> somehow or another. I haven't mastered the inserting part yet.

Starson17 · 09-09-2010, 12:46 PM

Quote:

Originally Posted by TonytheBookworm

In that case you would probably wanna do a preprocess_html and insert a <br> somehow or another. I haven't mastered the inserting part yet.

Have you ever seen that puzzle with 3 posts and disks of increasing size? Playing with new tags, inserting tags, moving tags, etc. is like that puzzle.

You might want to review my mods to your Buckmaster recipe here:
https://www.mobileread.com/forums/sho...postcount=2651
It adds <p> tags surrounding images (instead of inserting a <br>).

Beautiful Soup lets you create tags easily. It lets you replace one tag with another (replaceWith). It lets you insert a tag into another tag, but it requires you to insert by location number. The problem is that you either have to calculate the correct location number, or you have to replace a tag with a newly created tag. The first is a pain. The second is tricky if you want to keep the contents of the tag being replaced.

One approach can be seen here:

Code:

        # Tag comes from calibre's bundled Beautiful Soup:
        # from calibre.ebooks.BeautifulSoup import Tag
        for img_tag in soup.findAll('img'):
            parent_tag = img_tag.parent
            if parent_tag.name == 'a':
                new_tag = Tag(soup,'p')
                new_tag.insert(0,img_tag)
                #at this point img_tag has been extracted from soup
                #and put into new_tag - parent_tag remains in soup
                parent_tag.replaceWith(new_tag)

TonytheBookworm · 09-09-2010, 11:40 PM

Quote:

Originally Posted by Starson17

Have you ever seen that puzzle with 3 posts and disks of increasing size? Playing with new tags, inserting tags, moving tags, etc. is like that puzzle.

You might want to review my mods to your Buckmaster recipe here:
https://www.mobileread.com/forums/sho...postcount=2651
It adds <p> tags surrounding images (instead of inserting a <br>).

Yeah, even though I understand what you did in this instance, the tag replacements are still something I'm a little confused about. It's just going to take time to fully understand it. As for the puzzle, I've never played that one, but I know what you're talking about. Anyway, take care man. I also wanna thank you for those links you posted; they are very informative.

Oh, one other thing. Is this the place we should post recipes (as in ones we consider working and complete)? Or do we submit them on the bug tracker, or where?

Starson17 · 09-10-2010, 07:41 AM

Quote:

Originally Posted by TonytheBookworm

Oh, one other thing. Is this the place we should post recipes (as in ones we consider working and complete)? Or do we submit them on the bug tracker, or where?

Kovid seems to watch both places, but I tend to think of the bug tracker as the best place to indicate you think the recipe is ready for inclusion. This is the best place to make it available for others.

somedayson · 09-10-2010, 01:11 PM

Thanks for all the previous help...got an awesome customized news reader on my new kindle thanks to Calibre and your help.

It's been great to go into the recipes and add and remove sections of rss feeds from newspapers....awesome!

One last one I can't seem to figure out...it's a hockey team's feeds. Maybe not of much interest to many, but perhaps posting the recipe could help someone else like it did for me as I read through all 176 pages at the time (crazy, right?)

Here's the feed if someone would be willing to take a shot or help point me in the right direction. When I click on the two RSS links, they bring up some weird stuff I haven't seen before--sometimes, and then other times I get the articles.

Web page: http://blackhawks.nhl.com/club/feedinfo.htm

RSS #1: http://blackhawks.nhl.com/rss/top-stories.xml
Rss#2: http://blackhawks.nhl.com/rss/news.xml

I really appreciate everyone's help in learning this system!

Thanks,
Matt

somedayson · 09-10-2010, 01:32 PM

Some more info about the above request.

I'm just not sure how to pull up the print portion. Here's the example:

http://blackhawks.nhl.com/club/news....rss-blackhawks
(regular page)

http://blackhawks.nhl.com/club/newsprint.htm?id=533848
(print version)

You can see that I need to insert "print" right after news in the first feed and drop everything after the "&" in the first feed

Here's the best rss to work from:

http://blackhawks.nhl.com/club/newsi...location=/news

Thanks again for any help anyone can provide,
Matt

Starson17 · 09-10-2010, 02:17 PM

Quote:

Originally Posted by somedayson

You can see that I need to insert "print" right after news in the first feed and drop everything after the "&" in the first feed

This part is easy:

Code:

    def print_version(self, url):
        main1, replace1, end1 = url.partition('news.htm?')
        url = main1 + 'newsprint.htm?' + end1
        main2, middle2, end2 = url.partition('&')
        return main2

The partition and rpartition string methods are handy for splitting up URLs. url.partition('news.htm?') splits the URL around the quoted section, the next line puts 'newsprint.htm?' in its place, and url.partition('&') splits off the front part, which gets returned.
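As a rough standalone sketch of the same idea (not part of the recipe itself), the function below can be run outside calibre to see what the two partition calls do. The name to_print_url, the guard for URLs that lack 'news.htm?', and the sample URL with its made-up query string are assumptions for illustration only; the real article URL is truncated in the post above.

```python
def to_print_url(url):
    """Rewrite a 'news.htm?' article URL to its 'newsprint.htm?' form."""
    # partition() returns (before, separator, after); the separator is
    # an empty string when the marker is not found.
    head, marker, tail = url.partition('news.htm?')
    if not marker:
        # Not an article URL of the expected shape: leave it untouched.
        return url
    url = head + 'newsprint.htm?' + tail
    # Keep only what precedes the first '&', dropping any extra parameters.
    return url.partition('&')[0]


# Hypothetical URL shaped like the ones described in the thread.
sample = 'http://example.com/club/news.htm?id=533848&source=rss'
print(to_print_url(sample))
```

The guard is a design choice that the original print_version does not make: without it, a URL that does not contain 'news.htm?' would get 'newsprint.htm?' appended to the end.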

somedayson · 09-10-2010, 03:17 PM

Thanks Starson...I'm now getting about 10 "pages" on my kindle of the headlines and all the links beyond those. I've got Firefox and Firebug, and am trying lots of "keep only" and "remove only" tags but I can't quite find what the article content is labeled.

I really appreciate the spirit of teaching and learning that happens here.

somedayson · 09-10-2010, 03:31 PM

Here's my latest attempt...still can't exclude the junk above and below the articles. Tried all the pages of web pages a few pages early on this, but don't quite have it.

Code:

class AdvancedUserRecipe1284145178(BasicNewsRecipe):
    title          = u'Blackhawks Headlines'
    oldest_article = 7
    max_articles_per_feed = 100

    feeds          = [(u'Blackhawks Recent Headlines', u'http://blackhawks.nhl.com/rss/news.xml')]

def print_version(self, url):
        main1, replace1, end1 = url.partition('news.htm?')
        url = main1 + 'newsprint.htm?' + end1
        main2, middle2, end2 = url.partition('&')
        return main2

        keep_only_tags [dict(name='div', attrs={'class':'newsBody'})]

After about three hours on this total, I'd just love the answer if someone is willing to throw me a bone. I know I'm close...

Starson17 · 09-10-2010, 04:08 PM

Quote:

Originally Posted by somedayson

Here's my latest attempt...still can't exclude the junk above and below the articles. Tried all the pages of web pages a few pages early on this, but don't quite have it.

Spoiler:

After about three hours on this total, I'd just love the answer if someone is willing to throw me a bone. I know I'm close...

Your print_version isn't running. It needs to be indented so that it sits inside the class. You don't need the keep_only_tags. Try this:

Spoiler:

It should be close. (I threw in some basic formatting.)
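The spoiler contents did not survive in this archive. As a hedged sketch only (this is not Starson17's actual recipe, and it leaves out the formatting he mentions), the posted attempt with print_version indented into the class and keep_only_tags dropped would look roughly like this. The try/except stand-in class is an addition so the sketch can be checked outside calibre; inside calibre the real BasicNewsRecipe is used.

```python
try:
    from calibre.web.feeds.news import BasicNewsRecipe
except ImportError:
    # Stand-in (assumption) so the sketch can be loaded outside calibre.
    class BasicNewsRecipe(object):
        pass


class AdvancedUserRecipe1284145178(BasicNewsRecipe):
    title = u'Blackhawks Headlines'
    oldest_article = 7
    max_articles_per_feed = 100

    feeds = [(u'Blackhawks Recent Headlines',
              u'http://blackhawks.nhl.com/rss/news.xml')]

    # Indented one level, so it is a method of the recipe class and
    # is applied to each article URL taken from the feed.
    def print_version(self, url):
        main1, replace1, end1 = url.partition('news.htm?')
        url = main1 + 'newsprint.htm?' + end1
        main2, middle2, end2 = url.partition('&')
        return main2
```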

bhandarisaurabh · 09-10-2010, 10:17 PM

Quote:

Originally Posted by TonytheBookworm

Print edition? As in subscribed? Or As in whats on the page as you see it? Or the rss link that is over on the right hand side?

If you're calling the "print edition" what you see currently on the screen when you go to that link, I don't see the point in doing it. Because each month/week that the issue changes, you are going to have to change the feed reference from 148 to Nth. Or am I missing your question completely?

Okay, got your point, but can you make a recipe for the print edition of the magazine
http://downtoearth.org.in/archives/
It is a humble request

TonytheBookworm · 09-11-2010, 01:04 AM

Starson17,
If I wanted an if statement that checked whether the parent was <div id='MainContent'>,
how would I go about doing it?
Would it be

Code:

mydaddy = item.parent
if mydaddy.name = 'MainContent'
  .......

I've seen examples of how you and others do a parent match for <a> and <p> and so forth, but not for an actual div id tag.

thanks
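For what it's worth, here is a minimal sketch of that check, assuming item is a Beautiful Soup tag. In Beautiful Soup, .name holds the tag type ('div'), while the id is an attribute read with .get('id'); the comparison also needs == rather than =, and the if line needs a colon. The helper name parent_has_id is made up for illustration.

```python
def parent_has_id(item, wanted_id, tag_name='div'):
    """Return True if item's direct parent is <tag_name id=wanted_id>."""
    parent = getattr(item, 'parent', None)
    if parent is None:
        # Nothing above this node, so there is no parent to match.
        return False
    # .name is the tag type; the id is stored among the tag's attributes.
    return parent.name == tag_name and parent.get('id') == wanted_id


# Intended use inside a recipe's preprocess_html (not run here):
#     for item in soup.findAll('p'):
#         if parent_has_id(item, 'MainContent'):
#             ...
```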

TonytheBookworm · 09-11-2010, 02:37 AM

Quote:

Originally Posted by bhandarisaurabh

Okay, got your point, but can you make a recipe for the print edition of the magazine
http://downtoearth.org.in/archives/
It is a humble request

Here you go. I've only done 2010. Each year appears to have different formatting, but a year's worth of stuff should be enough for now.

poloman · 09-07-2010, 12:31 PM (#2672), spoiler contents:

Code:

from calibre.web.feeds.news import BasicNewsRecipe

class SlashDotRecipe(BasicNewsRecipe):
    title = 'SlashDot' #v1
    language = 'en'
    __author__ = 'db'
    description = 'SlashDot Articles'
    publisher = 'Web'
    category = ''
    oldest_article = 7
    conversion_options = {'linearize_tables' : True}
    max_articles_per_feed = 100
    no_stylesheets = True
    #masthead_url = 'http://www.gtdtimes.com/images/GTDTimes_header.png'

    feeds = [
        ('SlashDot', 'http://rss.slashdot.org/Slashdot/slashdot'),
    ]
    no_stylesheets = True

    keep_only_tags = [
        dict(name='div',attrs={'class':'body'})
    ]

    remove_tags = [
        dict(name='div', attrs={'class':'article-foot'})
    ]

    def get_article_url(self, article):
        return article.get('feedburner_origlink', None)

Last edited by poloman; 09-09-2010 at 03:41 AM.
