New Fairbanks Daily News-miner News Recipe -- Need Date inclusion only

rogerx · 08-23-2011, 10:49 AM

Any suggestions pertaining to line #73... how do I just include only the DATE (story_item_date updated) of a span class and omit the rest of a div class?

I also need to figure out how to convert the titles of the news articles to a bold font style.

Once I polish this off, I figure I can then submit.

Code:

from calibre.web.feeds.news import BasicNewsRecipe

import re

class FairbanksDailyNewsminer(BasicNewsRecipe):
    title          = u'Fairbanks Daily News-miner'
    __author__ = 'Roger'
    oldest_article = 7
    max_articles_per_feed = 100

    description = '''The voice of interior Alaska since 1903'''
    publisher   = 'http://www.newsminer.com/'
    category    = 'news, Alaska, Fairbanks'
    language    = 'en'
    #extra_css   = '''
    #                p{font-weight: normal;text-align: justify}
    #              '''

    remove_javascript = True
    use_embedded_content = False
    no_stylesheets = True
    encoding = 'utf8'
    conversion_options = {'linearize_tables':True}
    masthead_url = 'http://d2uh5w9wm14i0w.cloudfront.net/sites/635/assets/top_masthead_-_menu_pic.jpg'

    # I just need "story_item_date updated", trash the rest of the line!
    # <div class="signature_line"><span title="2011-08-22T10:35:58Z" class="story_item_date updated">Aug 22, 2011</span>&nbsp;|&nbsp;1463&nbsp;views&nbsp;|&nbsp;19&nbsp;<a href="/pages/full_story/push?article........class="signature_email_message"></span></div>

    #preprocess_regexps = [(re.compile(r'<span[^>]*addthis_separator*>'), lambda match: '') ]
    #preprocess_regexps = [(re.compile(r'span class="addthis_separator">|</span>'), lambda match: '') ]
    
    #preprocess_regexps = [
    #           (re.compile(r'<start>.*?<end>', re.IGNORECASE | re.DOTALL), lambda match : ''),
    #               ]

    keep_only_tags = [
                        dict(name='div', attrs={'class':'hnews hentry item'}),
                        dict(name='div', attrs={'class':'story_item_headline entry-title'}),
                        dict(name='span', attrs={'class':'story_item_date updated'}),
                        dict(name='div', attrs={'class':'full_story'})
                     ]
    #remove_tags = [
    #                dict(name='div', attrs={'class':'story_tools'}),
    #                dict(name='p', attrs={'class':'ad_label'}),
    #              ]
    remove_tags = [
                    dict(name='div', attrs={'class':'signature_line'}),
                    dict(name='div', attrs={'class':'addthis_toolbox addthis_default_style'}),
                    dict(name='div', attrs={'class':['addthis_toolbox','addthis_default_style']}),
                    dict(name='span', attrs={'class':'addthis_separator'}),
                    dict(name='div', attrs={'class':'related_content'}),
                    dict(name='div', attrs={'class':'comments_container'}),
                    #dict(name='div', attrs={'class':'signature_line'}),
                    dict(name='div', attrs={'id':'comments_container'})
                  ]

    #remove_tags_after = [
    #                        dict(name='div', attrs={'class':'advertisement'}),
    #                    ]

    #extra_css # tweak the appearance (ie. Change titles to bold!)
    
    # Uncomment the following feeds once Dates are included and Titles are bold!
    feeds = [
        (u'Alaska News', u'http://newsminer.com/rss/rss_feeds/alaska_news?content_type=article&tags=alaska_news&page_name=rss_feeds&instance=alaska_news')
     #   (u'Alaska News', u'http://newsminer.com/rss/rss_feeds/alaska_news?content_type=article&tags=alaska_news&page_name=rss_feeds&instance=alaska_news'),
     #   (u'Local News', u'http://newsminer.com/rss/rss_feeds/local_news?content_type=article&tags=local_news&page_name=rss_feeds&offset=0&instance=local_news'),
     #   (u'Business', u'http://newsminer.com/rss/rss_feeds/business_news?content_type=article&tags=business_news&page_name=rss_feeds&instance=business_news'),
     #   (u'Politics', u'http://newsminer.com/rss/rss_feeds/politics_news?content_type=article&tags=politics_news&page_name=rss_feeds&instance=politics_news'),
     #   (u'Sports', u'http://newsminer.com/rss/rss_feeds/sports_news?content_type=article&tags=sports_news&page_name=rss_feeds&instance=sports_news'),
     #   (u'Latitude 65 feed', u'http://newsminer.com/rss/rss_feeds/latitude_65?content_type=article&tags=latitude_65&page_name=rss_feeds&offset=0&instance=latitude_65'),
     #   (u'Sundays', u'http://newsminer.com/rss/rss_feeds/Sundays?content_type=article&tags=alaska_science_forum+scott_mccrea+interior_gardening+in_the_bush+judy_ferguson+book_reviews+theresa_bakker+judith_kleinfeld+interior_scrapbook+nuggets_comics+freeze_frame&page_name=rss_feeds&tag_inclusion=or&instance=Sundays'),
     #   (u'Outdoors', u'http://newsminer.com/rss/rss_feeds/Outdoors?content_type=article&tags=outdoors&page_name=rss_feeds&instance=Outdoors'),
     #   (u'Fairbanks Grizzlies', u'http://newsminer.com/rss/rss_feeds/fairbanks_grizzlies?content_type=article&tags=fairbanks_grizzlies&page_name=rss_feeds&instance=fairbanks_grizzlies'),
     #   (u'Newsminer', u'http://newsminer.com/rss/rss_feeds/Newsminer?content_type=article&tags=ted_stevens_bullets+ted_stevens+sports_news+business_news+fairbanks_grizzlies+dermot_cole_column+outdoors+alaska_science_forum+scott_mccrea+interior_gardening+in_the_bush+judy_ferguson+book_reviews+theresa_bakker+judith_kleinfeld+interior_scrapbook+nuggets_comics+freeze_frame&page_name=rss_feeds&tag_inclusion=or&instance=Newsminer'),
     #   (u'Opinion', u'http://newsminer.com/rss/rss_feeds/Opinion?content_type=article&tags=editorials&page_name=rss_feeds&instance=Opinion'),
     #   (u'Youth', u'http://newsminer.com/rss/rss_feeds/Youth?content_type=article&tags=youth&page_name=rss_feeds&instance=Youth'),
     #   (u'Dermot Cole Blog', u'http://newsminer.com/rss/rss_feeds/dermot_cole_blog+rss?content_type=blog+entry&sort_by=posted_on&user_ids=3015275&page_name=blogs_dermot_cole&limit=10&instance=dermot_cole_blog+rss'),
     #   (u'Dermot Cole Column', u'http://newsminer.com/rss/rss_feeds/Dermot_Cole_column?content_type=article&tags=dermot_cole_column&page_name=rss_feeds&instance=Dermot_Cole_column'),
     #   (u'Sarah Palin', u'http://newsminer.com/rss/rss_feeds/sarah_palin?content_type=article&tags=palin_in_the_news+palin_on_the_issues&page_name=rss_feeds&tag_inclusion=or&instance=sarah_palin')
           ]

Starson17 · 08-23-2011, 11:19 AM

Quote:

Originally Posted by rogerx

how do I just include only the DATE (story_item_date updated) of a span class and omit the rest of a div class?

Something like this:

Code:

date = self.tag_to_string(soup.find('span', attrs={'class':'story_item_date updated'}))

rogerx · 08-23-2011, 07:54 PM

Ah, many thanks! Sorry, I'm a Bash junkie and not a Perl/Python person, but I'm patiently trying to learn.

After sleeping on this and comparing it to the Anchorage Daily News (Official Kindle Version feed), I realized I only needed a title page with the published date. However, since individual stories are constantly updated and this date is the actual updated publish/re-edit date, I should probably just use this.

Now I just need to google how to change fonts (<h2> etc.). Once I clean this file up, I'll publish it for other users. I think I'm going to leave many of the feeds commented out (but still in the file, in case others are interested), as the Anchorage Daily News (Official Kindle Version feed) only pushes the News, Opinions, Sports, Outdoors and Letters to the Editor sections. Personally, like most, I just want the news/facts.

rogerx · 08-23-2011, 09:23 PM

I've done some research, and it looks like the above snippet is leading me toward an undesirably complex recipe file.

Even though I get almost 100% good results with a basic news recipe, to just clip the date from one (1) undesirable line with this snippet looks to require manual rewriting all of the Calibre functions for the entire HTML fetching & rendering operations. (Similar to the New York Times recipe.)

As such, a simple regexp (i.e. the preprocess_regexps calibre setting) should be able to clip the date from the following line of HTML tags. (Note: the undesirable tags occur after the date, immediately following the </span> tag.)

Code:

# I just need "story_item_date updated", trash the rest of the line!
    # <div class="signature_line"><span title="2011-08-22T10:35:58Z" class="story_item_date updated">Aug 22, 2011</span>&nbsp;|&nbsp;1463&nbsp;views&nbsp;|&nbsp;19&nbsp;<a href="/pages/full_story/push?article........class="signature_email_message"></span></div>

    #preprocess_regexps

(... well, I need to read into more detailed regexp later.)
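For anyone following along, the regexp idea above can be sketched roughly as follows. This is only an illustration and has not been run inside calibre. It assumes the signature_line div contains no nested div, and apply_regexps is a hypothetical helper that mimics running each (pattern, function) pair over the page source. The sample string is the line quoted above, elision included.

```python
import re

# The signature line as quoted in the post (the "........" elision is kept as-is).
SAMPLE = (
    '<div class="signature_line">'
    '<span title="2011-08-22T10:35:58Z" class="story_item_date updated">'
    'Aug 22, 2011</span>&nbsp;|&nbsp;1463&nbsp;views&nbsp;|&nbsp;19&nbsp;'
    '<a href="/pages/full_story/push?article........'
    'class="signature_email_message"></span></div>'
)

# Capture the date span; match (and so discard) everything else up to the
# first closing </div>. Assumption: no nested <div> inside signature_line.
SIGNATURE_RE = re.compile(
    r'<div class="signature_line">\s*'
    r'(<span[^>]*class="story_item_date updated"[^>]*>.*?</span>)'
    r'.*?</div>',
    re.IGNORECASE | re.DOTALL,
)

# Same shape as the commented-out preprocess_regexps entries in the recipe:
# a list of (compiled pattern, replacement function) tuples.
preprocess_regexps = [
    (SIGNATURE_RE, lambda match: match.group(1)),
]


def apply_regexps(html, rules):
    """Hypothetical stand-in: apply each (pattern, function) pair in order."""
    for pattern, func in rules:
        html = pattern.sub(func, html)
    return html


cleaned = apply_regexps(SAMPLE, preprocess_regexps)
print(cleaned)
```

With a rule like this, the date span would no longer sit inside the signature_line div, so the existing remove_tags entry for that div should not take the date with it. Whether the pattern survives the site's real markup is untested.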

rogerx · 08-24-2011, 04:51 AM

I've just tested this recipe file on my Kindle DXG instead of FBReader and it looks really good! Better than I thought, as I had been viewing the resulting .mobi file through FBReader -- and it looks like FBReader was really screwing me up!

Viewing the .mobi file on my Kindle DXG, everything looks really good except for:

1) Article title should probably be bold font.

2) Newline bug looks to be really a FBReader bug. I don't see this on my Kindle Reader! ;-)

3) About the only anomaly, number of views and number of comments/posts along with pipe symbols persists.

4) The masthead_url image shows on my Kindle! (Another bug specific to FBReader. ;-)

5) I've got four feeds uncommented and thinking of uncommenting all or most of them. (I have to view wireless charges first.)

I'm posting this here, instead of under the newer post with the inline posting of this news recipe because it has yet to show up on the list for the past hour. :-(

Starson17 · 08-24-2011, 09:12 AM

I've read your posts, but can't tell what you want. The snippet of code I posted was to extract the date for you so you can do something with it. Presumably, you want to display it somewhere. You didn't say what you want to do with it. It certainly doesn't require "manual rewriting all of the Calibre functions for the entire HTML fetching & rendering operations."

As for bolding the title, you can use extra_css.
As for "views and number of comments/posts along with pipe symbol," I think you're asking how to remove that "junk", and the answer is you use remove_tags.
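A minimal sketch of the extra_css suggestion, for illustration only. The selectors are assumptions taken from the class names already listed in keep_only_tags (story_item_headline, story_item_date), and the exact font sizes are arbitrary. The string is meant to be pasted into the recipe class body as a class attribute.

```python
# Intended as a class attribute inside the recipe class (shown at module
# level here so the snippet runs on its own).
# Assumption: the headline div and date span keep the class names used in
# keep_only_tags; adjust the selectors if the site's markup differs.
extra_css = '''
    .story_item_headline { font-weight: bold; font-size: large; }
    .story_item_date     { font-style: italic; font-size: small; }
'''

print(extra_css)
```

How the bold headline actually renders on a given reader would still need checking on the device.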
