Old 08-23-2011, 10:49 AM   #1
rogerx
Enthusiast
Posts: 29
Karma: 244
Join Date: Aug 2011
Location: North Pole, Alaska
Device: Kindle DXG
New Fairbanks Daily News-miner News Recipe -- Need Date inclusion only

Any suggestions pertaining to line #73... how do I just include only the DATE (story_item_date updated) of a span class and omit the rest of a div class?


I also need to figure out how to convert the titles of the news articles to a bold font style.

Once I polish this off, I figure I can then submit.

Code:
from calibre.web.feeds.news import BasicNewsRecipe

import re

class FairbanksDailyNewsminer(BasicNewsRecipe):
    title          = u'Fairbanks Daily News-miner'
    __author__ = 'Roger'
    oldest_article = 7
    max_articles_per_feed = 100

    description = 'The voice of interior Alaska since 1903'
    publisher   = 'http://www.newsminer.com/'
    category    = 'news, Alaska, Fairbanks'
    language    = 'en'
    #extra_css   = '''
    #                p{font-weight: normal;text-align: justify}
    #              '''

    remove_javascript = True
    use_embedded_content = False
    no_stylesheets = True
    encoding = 'utf8'
    conversion_options = {'linearize_tables':True}
    masthead_url = 'http://d2uh5w9wm14i0w.cloudfront.net/sites/635/assets/top_masthead_-_menu_pic.jpg'

    # I just need "story_item_date updated", trash the rest of the line!
    # <div class="signature_line"><span title="2011-08-22T10:35:58Z" class="story_item_date updated">Aug 22, 2011</span>&nbsp;|&nbsp;1463&nbsp;views&nbsp;|&nbsp;19&nbsp;<a href="/pages/full_story/push?article........class="signature_email_message"></span></div>

    #preprocess_regexps = [(re.compile(r'<span[^>]*addthis_separator*>'), lambda match: '') ]
    #preprocess_regexps = [(re.compile(r'span class="addthis_separator">|</span>'), lambda match: '') ]
    
    #preprocess_regexps = [
    #           (re.compile(r'<start>.*?<end>', re.IGNORECASE | re.DOTALL), lambda match : ''),
    #               ]

    keep_only_tags = [
                        dict(name='div', attrs={'class':'hnews hentry item'}),
                        dict(name='div', attrs={'class':'story_item_headline entry-title'}),
                        dict(name='span', attrs={'class':'story_item_date updated'}),
                        dict(name='div', attrs={'class':'full_story'})
                     ]
    #remove_tags = [
    #                dict(name='div', attrs={'class':'story_tools'}),
    #                dict(name='p', attrs={'class':'ad_label'}),
    #              ]
    remove_tags = [
                    dict(name='div', attrs={'class':'signature_line'}),
                    dict(name='div', attrs={'class':'addthis_toolbox addthis_default_style'}),
                    dict(name='div', attrs={'class':['addthis_toolbox','addthis_default_style']}),
                    dict(name='span', attrs={'class':'addthis_separator'}),
                    dict(name='div', attrs={'class':'related_content'}),
                    dict(name='div', attrs={'class':'comments_container'}),
                    dict(name='div', attrs={'id':'comments_container'})
                  ]

    #remove_tags_after = [
    #                        dict(name='div', attrs={'class':'advertisement'}),
    #                    ]

    #extra_css # tweak the appearance (ie. Change titles to bold!)
    
    # Uncomment the following feeds once Dates are included and Titles are bold!
    feeds = [
        (u'Alaska News', u'http://newsminer.com/rss/rss_feeds/alaska_news?content_type=article&tags=alaska_news&page_name=rss_feeds&instance=alaska_news'),
     #   (u'Alaska News', u'http://newsminer.com/rss/rss_feeds/alaska_news?content_type=article&tags=alaska_news&page_name=rss_feeds&instance=alaska_news'),
     #   (u'Local News', u'http://newsminer.com/rss/rss_feeds/local_news?content_type=article&tags=local_news&page_name=rss_feeds&offset=0&instance=local_news'),
     #   (u'Business', u'http://newsminer.com/rss/rss_feeds/business_news?content_type=article&tags=business_news&page_name=rss_feeds&instance=business_news'),
     #   (u'Politics', u'http://newsminer.com/rss/rss_feeds/politics_news?content_type=article&tags=politics_news&page_name=rss_feeds&instance=politics_news'),
     #   (u'Sports', u'http://newsminer.com/rss/rss_feeds/sports_news?content_type=article&tags=sports_news&page_name=rss_feeds&instance=sports_news'),
     #   (u'Latitude 65 feed', u'http://newsminer.com/rss/rss_feeds/latitude_65?content_type=article&tags=latitude_65&page_name=rss_feeds&offset=0&instance=latitude_65'),
     #   (u'Sundays', u'http://newsminer.com/rss/rss_feeds/Sundays?content_type=article&tags=alaska_science_forum+scott_mccrea+interior_gardening+in_the_bush+judy_ferguson+book_reviews+theresa_bakker+judith_kleinfeld+interior_scrapbook+nuggets_comics+freeze_frame&page_name=rss_feeds&tag_inclusion=or&instance=Sundays'),
     #   (u'Outdoors', u'http://newsminer.com/rss/rss_feeds/Outdoors?content_type=article&tags=outdoors&page_name=rss_feeds&instance=Outdoors'),
     #   (u'Fairbanks Grizzlies', u'http://newsminer.com/rss/rss_feeds/fairbanks_grizzlies?content_type=article&tags=fairbanks_grizzlies&page_name=rss_feeds&instance=fairbanks_grizzlies'),
     #   (u'Newsminer', u'http://newsminer.com/rss/rss_feeds/Newsminer?content_type=article&tags=ted_stevens_bullets+ted_stevens+sports_news+business_news+fairbanks_grizzlies+dermot_cole_column+outdoors+alaska_science_forum+scott_mccrea+interior_gardening+in_the_bush+judy_ferguson+book_reviews+theresa_bakker+judith_kleinfeld+interior_scrapbook+nuggets_comics+freeze_frame&page_name=rss_feeds&tag_inclusion=or&instance=Newsminer'),
     #   (u'Opinion', u'http://newsminer.com/rss/rss_feeds/Opinion?content_type=article&tags=editorials&page_name=rss_feeds&instance=Opinion'),
     #   (u'Youth', u'http://newsminer.com/rss/rss_feeds/Youth?content_type=article&tags=youth&page_name=rss_feeds&instance=Youth'),
     #   (u'Dermot Cole Blog', u'http://newsminer.com/rss/rss_feeds/dermot_cole_blog+rss?content_type=blog+entry&sort_by=posted_on&user_ids=3015275&page_name=blogs_dermot_cole&limit=10&instance=dermot_cole_blog+rss'),
     #   (u'Dermot Cole Column', u'http://newsminer.com/rss/rss_feeds/Dermot_Cole_column?content_type=article&tags=dermot_cole_column&page_name=rss_feeds&instance=Dermot_Cole_column'),
     #   (u'Sarah Palin', u'http://newsminer.com/rss/rss_feeds/sarah_palin?content_type=article&tags=palin_in_the_news+palin_on_the_issues&page_name=rss_feeds&tag_inclusion=or&instance=sarah_palin')
           ]

Last edited by rogerx; 08-23-2011 at 11:01 AM. Reason: blah comment
Old 08-23-2011, 11:19 AM   #2
Starson17
Wizard
Posts: 4,004
Karma: 177841
Join Date: Dec 2009
Device: WinMo: IPAQ; Android: HTC HD2, Archos 7o; Java:Gravity T
Quote:
Originally Posted by rogerx View Post
how do I just include only the DATE (story_item_date updated) of a span class and omit the rest of a div class?
Something like this:
Code:
date = self.tag_to_string(soup.find('span', attrs={'class':'story_item_date updated'}))
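One way the lookup above might be put to use, offered as a sketch and not as Starson17's intended method: a recipe can override preprocess_html(self, soup) and swap the whole signature_line div for the date span it contains. The class name DateOnlySketch is hypothetical and exists only so the method can be read on its own; in a real recipe the method would sit directly inside FairbanksDailyNewsminer. The sketch assumes the date span always lives inside the signature_line div, and that the 'signature_line' entry has been taken out of remove_tags, since (as far as the calibre documentation describes it) preprocess_html runs after the remove_tags cleanup.

```python
class DateOnlySketch(object):
    """Hypothetical holder; in a real recipe, paste preprocess_html into
    the BasicNewsRecipe subclass."""

    def preprocess_html(self, soup):
        # Locate the div that mixes the date with view/comment counts.
        signature = soup.find('div', attrs={'class': 'signature_line'})
        if signature is None:
            return soup
        # Same lookup as the snippet above, restricted to that div.
        date_span = signature.find(
            'span', attrs={'class': 'story_item_date updated'})
        if date_span is None:
            return soup
        # Detach the span, then swap it in for the whole div so that
        # only the date survives.
        date_span.extract()
        signature.replaceWith(date_span)
        return soup
```

If only the date text is wanted, self.tag_to_string(date_span) from the snippet above would return it as a plain string.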
Old 08-23-2011, 07:54 PM   #3
rogerx
Ah, many thanks! Sorry, I'm a Bash junkie and not a Perl/Python programmer, but I'm patiently trying to learn.

After sleeping on this and comparing it to the Anchorage Daily News (Official Kindle Version feed), I realized I only needed a title page with the published date. However, since individual stories are constantly updated and this date is the actual updated publish/re-edit date, I should probably just use this.

Now I just need to google how to change fonts (<h2> etc.). Once I clean this file up, I'll publish it for other users. I think I'm just going to leave many of the feeds commented out (but still in the file in case others are interested), as the Anchorage Daily News (Official Kindle Version feed) only pushes the News, Opinions, Sports, Outdoors and Letters to the Editor sections. Personally, like most, I just want the news/facts.
Old 08-23-2011, 09:23 PM   #4
rogerx
I've done some research, and it looks like the above snippet is leading me toward an undesirably complex recipe file.

Even though I get almost 100% good results with a basic news recipe, clipping the date from one (1) undesirable line with this snippet looks to require manual rewriting all of the Calibre functions for the entire HTML fetching & rendering operations (similar to the New York Times recipe).

As such, a simple regexp (i.e. calibre's preprocess_regexps option) should be able to clip the date from the following line of HTML tags. (Note: the undesirable tags occur after the date, immediately following the </span> tag.)

Code:
# I just need "story_item_date updated", trash the rest of the line!
    # <div class="signature_line"><span title="2011-08-22T10:35:58Z" class="story_item_date updated">Aug 22, 2011</span>&nbsp;|&nbsp;1463&nbsp;views&nbsp;|&nbsp;19&nbsp;<a href="/pages/full_story/push?article........class="signature_email_message"></span></div>

    #preprocess_regexps
(... well, I need to read up on regexps in more detail later.)
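A minimal sketch of what such a regexp might look like. preprocess_regexps takes a list of (compiled pattern, function) pairs that calibre applies to the raw HTML before parsing; the pattern below keeps only the date span and drops the rest of the signature_line div. SAMPLE is a made-up, shortened stand-in for the line quoted above (its link is a placeholder, not the real markup), apply_regexps is a hypothetical helper that only imitates the substitution outside calibre, and the pattern assumes no nested div inside signature_line.

```python
import re

# Match the whole signature_line div, capturing only the date span.
SIGNATURE_LINE = re.compile(
    r'<div class="signature_line">\s*'
    r'(<span[^>]*class="story_item_date updated"[^>]*>.*?</span>)'
    r'.*?</div>',
    re.IGNORECASE | re.DOTALL)

# In the recipe this list would be a class attribute.
preprocess_regexps = [
    (SIGNATURE_LINE, lambda match: match.group(1)),
]

# Made-up sample input; the <a> tag is a placeholder.
SAMPLE = ('<div class="signature_line">'
          '<span title="2011-08-22T10:35:58Z" '
          'class="story_item_date updated">Aug 22, 2011</span>'
          '&nbsp;|&nbsp;1463&nbsp;views&nbsp;|&nbsp;19&nbsp;'
          '<a href="#">comments</a></div>')


def apply_regexps(html):
    """Hypothetical helper: run each (pattern, function) pair in order."""
    for pattern, func in preprocess_regexps:
        html = pattern.sub(func, html)
    return html


print(apply_regexps(SAMPLE))
```

Because the substitution happens before the page is parsed, the date span ends up outside signature_line and should no longer be caught by a remove_tags entry for that div.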

Last edited by rogerx; 08-23-2011 at 09:24 PM. Reason: grammar
Old 08-24-2011, 04:51 AM   #5
rogerx
I've just tested this recipe file on my Kindle DXG instead of FBReader and it looks really good! Better than I thought, as I had been viewing the resulting .mobi file through FBReader, and it looks like FBReader was really screwing me up!

Viewing the .mobi file on my Kindle DXG, everything looks really good except for:

1) The article title should probably be in a bold font.

2) The newline bug looks to really be an FBReader bug. I don't see it on my Kindle Reader! ;-)

3) About the only anomaly: the number of views and number of comments/posts along with pipe symbols persist.

4) The masthead_url image shows on my Kindle! (Another bug specific to FBReader. ;-)

5) I've got four feeds uncommented and am thinking of uncommenting all or most of them. (I have to check wireless charges first.)

I'm posting this here, instead of under the newer post with the inline posting of this news recipe because it has yet to show up on the list for the past hour. :-(
Old 08-24-2011, 09:12 AM   #6
Starson17
I've read your posts, but can't tell what you want. The snippet of code I posted was to extract the date for you so you can do something with it. Presumably, you want to display it somewhere. You didn't say what you want to do with it. It certainly doesn't require "manual rewriting all of the Calibre functions for the entire HTML fetching & rendering operations."

As for bolding the title, you can use extra_css.
As for "views and number of comments/posts along with pipe symbol," I think you're asking how to remove that "junk", and the answer is you use remove_tags.
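For the bold title, a sketch of what the extra_css setting might look like. The selector is an assumption: it targets the story_item_headline class already listed in the recipe's keep_only_tags, and it would need adjusting if the site marks up headlines differently (for instance as <h2>). In the recipe the string would be a class attribute next to the other settings.

```python
# Hypothetical extra_css value: render the article headline in bold.
# The class name comes from the recipe's keep_only_tags entry.
extra_css = '''
    .story_item_headline { font-weight: bold; font-size: large; }
'''
```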
All times are GMT -4.


MobileRead.com is a privately owned, operated and funded community.