06-02-2010, 07:59 AM | #2026 | |
Wizard
Posts: 4,004
Karma: 177841
Join Date: Dec 2009
Device: WinMo: IPAQ; Android: HTC HD2, Archos 7o; Java:Gravity T
Quote:
Now, you want to know how to do it - right? If I get some time, I'll think about it. I did something similar with some Olympics recipes, where I used regex matching to find URLs embedded inside a script. I'd probably start the way I always do: use preprocess_html and print the soup, then make sure that you are capturing the form and the multiple page links. Get the page links into a list. Then see if you can rewrite append_page to cycle through that list and build the new page, except you don't need to do it recursively, since you've already got all the links in the list you're processing. (That's just off the top of my head.)
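A rough sketch of the "get the page links into a list" step described above, in plain Python. The regex, URL pattern, and function name here are illustrative assumptions, not the actual recipe code:

```python
import re

# Illustrative pattern only: matches paginated-article links in raw HTML.
# A real recipe would derive this from the target site's actual markup.
PAGE_LINK_RE = re.compile(r'href="([^"]*[?&]page=\d+[^"]*)"')

def collect_page_links(html):
    """Return the multi-page URLs found in html, in document order and
    without duplicates, so a later loop can fetch and append each page
    one by one instead of recursing."""
    seen, links = set(), []
    for url in PAGE_LINK_RE.findall(html):
        if url not in seen:
            seen.add(url)
            links.append(url)
    return links
```

Once that list exists, a non-recursive append_page can simply iterate over it, fetching each URL and appending its article body to the first page's soup.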
06-02-2010, 11:00 AM | #2027 | |
Wizard
Posts: 4,004
Karma: 177841
Join Date: Dec 2009
Device: WinMo: IPAQ; Android: HTC HD2, Archos 7o; Java:Gravity T
06-02-2010, 02:09 PM | #2028 | |
Member
Posts: 16
Karma: 10
Join Date: May 2010
Location: Southern California
Device: JetBook-Lite
06-02-2010, 04:57 PM | #2029 | |
Wizard
Posts: 4,004
Karma: 177841
Join Date: Dec 2009
Device: WinMo: IPAQ; Android: HTC HD2, Archos 7o; Java:Gravity T
Quote:
http://feeds.feedburner.com/Tweaktow...s20?format=xml
Perhaps it's my security settings? Is this the right feed?
06-02-2010, 06:04 PM | #2030 |
Member
Posts: 16
Karma: 10
Join Date: May 2010
Location: Southern California
Device: JetBook-Lite
It's a good feed...
http://feeds.feedburner.com/Tweaktow...s20?format=xml
and so is the non-XML one:
http://feeds.feedburner.com/Tweaktow...AndGuidesRss20
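One quick way to sanity-check a feed like the ones above is to confirm it parses as RSS and actually contains items. A minimal stdlib sketch (the sample XML below is made up for illustration, standing in for a real feed download):

```python
import xml.etree.ElementTree as ET

def rss_item_titles(xml_text):
    """Parse an RSS 2.0 document and return its item titles.
    Raises ET.ParseError if the text is not well-formed XML."""
    root = ET.fromstring(xml_text)
    return [item.findtext('title') for item in root.iter('item')]

# Tiny hand-made sample feed for demonstration.
sample = """<rss version="2.0"><channel>
<title>Demo</title>
<item><title>First article</title></item>
<item><title>Second article</title></item>
</channel></rss>"""
```

If this returns an empty list or raises a parse error for a real feed URL's content, the feed itself (rather than the recipe) is the problem.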
06-02-2010, 09:29 PM | #2031 |
Junior Member
Posts: 7
Karma: 10
Join Date: Apr 2010
Device: sony
Townhall recipe
http://townhall.com/
I copied dwanthny's custom recipe for the American Thinker and replaced the American Thinker references with ones for Townhall. It downloads the titles of the articles but not the article bodies. There is no username/password needed to access the web pages. Any help would be greatly appreciated.
recipe:
Code:
__license__ = 'GPL v3'
__copyright__ = '2010, Firstname Lastname <emailaddress at domain.com>'
'''
http://townhall.com
'''

from calibre.web.feeds.news import BasicNewsRecipe

class Townhall(BasicNewsRecipe):
    title = u'Townhall'
    description = "Townhall is a daily internet publication devoted to the thoughtful exploration of issues of importance to Americans."
    __author__ = 'Walt Anthony'
    publisher = 'Thomas Lifson'
    category = 'news, politics, USA'
    oldest_article = 4  # days
    max_articles_per_feed = 50
    summary_length = 150
    language = 'en'
    remove_javascript = True
    no_stylesheets = True

    conversion_options = {
        'comment'          : description,
        'tags'             : category,
        'publisher'        : publisher,
        'language'         : language,
        'linearize_tables' : True
    }

    remove_tags = [
        dict(name=['table', 'iframe', 'embed', 'object'])
    ]

    remove_tags_after = dict(name='div', attrs={'class':'article_body'})

    feeds = [
        (u'http://rss.townhall.com/blogs/main'),
        (u'http://rss.townhall.com/columnists/all')
    ]

    def print_version(self, url):
        return url + '?page=full'
Last edited by zelda_pinwheel; 06-03-2010 at 08:57 AM. Reason: to remove personal information at request of member
06-03-2010, 05:34 AM | #2032 | |
US Navy, Retired
Posts: 9,864
Karma: 13806776
Join Date: Feb 2009
Location: North Carolina
Device: Icarus Illumina XL HD, Nexus 7
Quote:
Second, I looked in my working area and I had a recipe just about complete for the columnists but the blogs eluded me because they use java to print the blog entries. If you replace the above with the code below you will be in the ball park for the columnists feed. I lost interest in it so when you manage to get it working take credit and submit it for others to use. I attached the favicon for the site that you can add to the zip file when you upload it here. Good Luck. Code:
keep_only_tags = [
    dict(name='div', attrs={'class':'authorblock'}),
    dict(name='div', attrs={'id':'columnBody'})
]

remove_tags_after = dict(name='div', attrs={'id':'columnBody'})

remove_tags = [
    dict(name=['iframe', 'img', 'embed', 'object', 'center', 'script', 'form']),
    dict(name='div', attrs={'id':['ShareText', 'Externa', 'Toolbox',
                                  'ctl00_cphMain_cbComments_dlComments_ctl01_ctl00_Content',
                                  'ArticleContainer', 'shirttail', 'comments_container',
                                  'ctl00_cphMain_cbComments_dvReadAll', 'footer']})
]

feeds = [(u'TownHall Columnists', u'http://rss.townhall.com/columnists/all')]

def print_version(self, url):
    return url + '&page=full'
06-03-2010, 10:22 AM | #2033 |
Wizard
Posts: 4,004
Karma: 177841
Join Date: Dec 2009
Device: WinMo: IPAQ; Android: HTC HD2, Archos 7o; Java:Gravity T
06-03-2010, 11:04 AM | #2034 | |
Wizard
Posts: 4,004
Karma: 177841
Join Date: Dec 2009
Device: WinMo: IPAQ; Android: HTC HD2, Archos 7o; Java:Gravity T
06-03-2010, 01:19 PM | #2035 | |
Member
Posts: 16
Karma: 10
Join Date: May 2010
Location: Southern California
Device: JetBook-Lite
Quote:
Here is a multi-page article from the feed, an 8-page PC case review:
http://www.tweaktown.com/reviews/332...ent=FeedBurner
Look at the layout: there's an arrow button for the next page (the first target), and a navigation box that contains the links for all 8 pages. I think scraping the nav box would be better, because that would also work for pcper.com. Thanks.
06-03-2010, 02:54 PM | #2036 | ||
Wizard
Posts: 4,004
Karma: 177841
Join Date: Dec 2009
Device: WinMo: IPAQ; Android: HTC HD2, Archos 7o; Java:Gravity T
Quote:
Don't worry, you're in good company. Quote:
Code:
pager = soup.find('a', attrs={'class':'next'})
if pager:
    nexturl = pager.a['href']
Code:
pager = soup.find('div', attrs={'class':'toolbar_fat_next'})
if pager:
    nexturl = self.INDEX + pager.a['href']
You need to change nexturl = pager.a['href'] to:
Code:
nexturl = pager['href']
Yep - that does it. There's still lots of junk in my output, but it's definitely pulling multipages. My recipe may be slightly different from yours, but I think that should get you on your way.
Last edited by Starson17; 06-03-2010 at 03:51 PM.
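The distinction behind that fix: pager.a['href'] descends into a child <a> tag, while pager['href'] reads the attribute off the matched tag itself, so the right form depends on whether find() returned the link or a container around it. A stdlib illustration of the same two shapes, using ElementTree as a stand-in for BeautifulSoup:

```python
import xml.etree.ElementTree as ET

# Case 1: the matched element is a container; the href lives on a child
# <a>, so we descend one level (BeautifulSoup: pager.a['href']).
container = ET.fromstring(
    '<div class="toolbar_fat_next"><a href="/page2.html">next</a></div>')
href_via_child = container.find('a').get('href')

# Case 2: the matched element is the <a> itself; read the attribute
# directly (BeautifulSoup: pager['href']).
link = ET.fromstring('<a class="next" href="/page2.html">next</a>')
href_direct = link.get('href')
```

Using pager.a['href'] when pager is already the <a> fails, because the <a> has no child <a> to descend into.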
06-03-2010, 04:40 PM | #2037 | |
Member
Posts: 16
Karma: 10
Join Date: May 2010
Location: Southern California
Device: JetBook-Lite
Quote:
A question about the preprocess_html part: what does the "3" represent in this line? Code:
self.append_page(soup, soup.body, 3)
I need to apply this to the pcper.com site now; it's a little trickier, so it might need a different approach. Thanks again.
06-03-2010, 04:54 PM | #2038 | ||
Wizard
Posts: 4,004
Karma: 177841
Join Date: Dec 2009
Device: WinMo: IPAQ; Android: HTC HD2, Archos 7o; Java:Gravity T
Quote:
It's saying to insert the text at the 3rd tag position. You can reference locations in Soup by labels (most common) or by tag position number (as above).
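In other words, the third argument behaves like list.insert: the appended page content becomes the child at that index of soup.body's contents. A plain-list analogy:

```python
# soup.body.insert(3, new_content) places new_content at index 3 of
# body's children, exactly like list.insert on an ordinary list.
children = ['h1', 'p0', 'p1', 'p2']
children.insert(3, 'appended_page')
# children is now ['h1', 'p0', 'p1', 'appended_page', 'p2']
```

The exact position to use depends on the page's markup; 3 simply happened to be the right slot in body for that recipe.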
06-03-2010, 10:23 PM | #2039 |
Junior Member
Posts: 5
Karma: 10
Join Date: Mar 2010
Device: Kindle DX
Washington Times....
Once again I bow to the gurus! I could use some help on the Washington Times recipe. I cobbled this one together below and it worked for quite some time, but now the Washington Times has changed the format of their pages. Any assistance would be greatly appreciated.
Code:
__license__ = 'GPL v3'
'''
washingtontimes.com
'''

from calibre.web.feeds.news import BasicNewsRecipe

class WashingtonTimes(BasicNewsRecipe):
    title = 'Washington Times'
    __author__ = 'Kos Semonski'
    description = 'Daily newspaper'
    publisher = 'News World Communications, Inc.'
    category = 'news, politics, USA'
    oldest_article = 2
    max_articles_per_feed = 15
    no_stylesheets = True
    encoding = 'utf8'
    use_embedded_content = False
    language = 'en'
    masthead_url = 'http://media.washingtontimes.com/media/img/TWTlogo.gif'
    extra_css = ' body{font-family: Arial,Helvetica,sans-serif } img{margin-bottom: 0.4em} '

    conversion_options = {
        'comment'   : description,
        'tags'      : category,
        'publisher' : publisher,
        'language'  : language
    }

    def get_feeds(self):
        return [
            (u'Headlines', u'http://www.washingtontimes.com/rss/headlines/news/headlines/'),
            (u'Editor Favs', u'http://www.washingtontimes.com/rss/headlines/news/editor-favorites/'),
            (u'Politics', u'http://www.washingtontimes.com/rss/headlines/news/politics/'),
            (u'National', u'http://www.washingtontimes.com/rss/headlines/news/national/'),
            (u'World', u'http://www.washingtontimes.com/rss/headlines/news/world/'),
            (u'Business', u'http://www.washingtontimes.com/rss/headlines/news/business/'),
            (u'Technology', u'http://www.washingtontimes.com/rss/headlines/news/technology/'),
            (u'Editorials', u'http://www.washingtontimes.com/rss/headlines/opinion/editorials/')
        ]

    def print_version(self, url):
        return url + '/print/'
06-03-2010, 11:05 PM | #2040 | |
Junior Member
Posts: 5
Karma: 10
Join Date: Apr 2010
Device: Kindle2 and Astak EZ Reader Pocket Pro
Request for recipe help
My original post seems to have gotten caught in the fray so I will repost this. I apologize if I missed any responses. Thanks!