The Bay Citizen - recipe help

noah · 09-24-2010, 02:32 PM

Thanks to TonytheBookworm, whose helpful post got me started with a recipe for The Bay Citizen.

I modified the recipe to extract content from the regular (non-print) story pages, because I wanted the pictures, which aren't included in the print versions.

Spoiler:

Code:

# this block is pretty much standard on all recipes
#------------------------------------------------------------------
from calibre.web.feeds.news import BasicNewsRecipe

class AdvancedUserRecipe1282101454(BasicNewsRecipe):
    title                 = 'The Bay Citizen'
    language              = 'en'
    __author__            = 'TonytheBookworm and noah'
    description           = 'The Bay Citizen'
    publisher             = 'The Bay Citizen'
    category              = 'news'
    oldest_article        = 1     # USE THIS TO DETERMINE HOW FAR BACK YOU WANNA GO IN THE FEED DATE WISE
    max_articles_per_feed = 20    # USE TO DETERMINE HOW MANY ARTICLES YOU WISH TO READ PER FEED
    no_stylesheets        = True  # SKIPS THE SITE'S STYLESHEETS
    masthead_url          = 'http://media.baycitizen.org/images/layout/logo1.png'  # PUTS NICE LOGO ON MAIN MENU PAGE
    #------------------------------------------------------------------
    # here we tell the recipe what feed(s) we wish to obtain
    #------------------------------------------------------------------
    feeds = [
        ('Main Feed', 'http://www.baycitizen.org/feeds/stories/'),
    ]
    #------------------------------------------------------------------
    keep_only_tags = [dict(name='div', attrs={'class':'story'})]
    remove_tags    = [dict(name='div', attrs={'class':'socialBar'})]

It works pretty well, but I have two questions/problems:

1) In both my version and Tony's, Calibre is forming the section menu using the <media:title> element from the feed, instead of the <title> element. How can I get it to use the <title> element, which is what it actually should be doing?
2) In my version, certain stories appear as complete gobbledygook -- huge strings of strange characters. Help!?

Starson17 · 09-24-2010, 02:59 PM

Quote:

Originally Posted by noah

1) In both my version and Tony's, Calibre is forming the section menu using the <media:title> element from the feed, instead of the <title> element. How can I get it to use the <title> element, which is what it actually should be doing?
2) In my version, certain stories appear as complete gobbledygook -- huge strings of strange characters. Help!?

1) You're using the default feed and article parsing system. It parses what it parses, and you'll have to override it to get different behavior. You can use parse_index() to read the feed page and scrape any title you want.

2) I've no idea. Post some links and samples. Use "print" in your recipe to see what's happening.
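
For point 1, here is a minimal sketch of what reading the feed yourself might look like. It assumes the feed is plain RSS 2.0 with <item> entries and that <media:title> lives in the usual Media RSS namespace; I have not checked the real feed, so treat both as assumptions. The helper name articles_from_feed and the sample feed below are made up for illustration. The idea is only that findtext('title') matches the un-namespaced <title> and never <media:title>.

```python
import xml.etree.ElementTree as ET


def articles_from_feed(raw_xml):
    """Build article dicts (the shape parse_index() returns) from RSS text.

    Hypothetical helper: assumes RSS 2.0 <item> elements. Only the plain
    <title> child is read; <media:title> is in another namespace and is
    therefore ignored by findtext('title').
    """
    root = ET.fromstring(raw_xml)
    articles = []
    for item in root.iter('item'):
        title = (item.findtext('title') or '').strip()
        url = (item.findtext('link') or '').strip()
        if not title or not url:
            continue  # skip entries we cannot link to
        articles.append({
            'title': title,
            'url': url,
            'date': (item.findtext('pubDate') or '').strip(),
            'description': (item.findtext('description') or '').strip(),
        })
    return articles


# Made-up sample, only to show the two title elements side by side.
SAMPLE_FEED = """<rss version="2.0" xmlns:media="http://search.yahoo.com/mrss/">
  <channel>
    <title>Example feed</title>
    <item>
      <title>Real headline</title>
      <media:title>Photo caption</media:title>
      <link>http://www.baycitizen.org/example/story/</link>
    </item>
  </channel>
</rss>"""

for article in articles_from_feed(SAMPLE_FEED):
    print(article['title'])

# Inside the recipe this could be wired up roughly like so (untested):
#
#     def parse_index(self):
#         raw = self.browser.open('http://www.baycitizen.org/feeds/stories/').read()
#         return [('Main Feed', articles_from_feed(raw))]
```

As far as I know, once you override parse_index() the feeds list is no longer used, so any limits such as oldest_article or max_articles_per_feed would have to be applied by hand in the helper.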

noah · 10-06-2010, 04:44 PM

Quote:

Originally Posted by Starson17

1) You're using the default feed and article parsing system. It parses what it parses, and you'll have to override it to get different behavior. You can use parse_index() to read the feed page and scrape any title you want.

OK, thanks. I don't think I have the skills for this, so if anyone else wants to take a stab at it, I'd be grateful!

Quote:

Originally Posted by Starson17

2) I've no idea. Post some links and samples. Use "print" in your recipe to see what's happening.

I'm attaching an example. Calibre generated this ebook based on the recipe mentioned above. See p. 43 of 63 ("Meg Whitman"). It's all garbage characters. Can you tell me why this is happening?

Also, where would I put "print" in the recipe? Sorry, I'm not very advanced and I don't see a "print" command mentioned in the documentation.
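
On the "print" question: it is just the ordinary Python print, so it can go inside any method of the recipe that calibre calls for each article, for example preprocess_html(). Below is a rough sketch of a helper that summarizes a page so the garbage articles stand out in the log. The name describe_page and the "odd characters" measure are my own invention, not anything from calibre.

```python
def describe_page(html_text, preview=80):
    """Return a one-line summary of a downloaded page for debugging.

    Hypothetical helper: counts characters outside printable ASCII
    (tabs and newlines are allowed) so that pages made of binary-looking
    garbage show a high percentage.
    """
    odd = sum(
        1 for ch in html_text
        if ord(ch) > 126 or (ord(ch) < 32 and ch not in '\t\r\n')
    )
    ratio = odd / float(len(html_text)) if html_text else 0.0
    return 'length=%d odd_chars=%.0f%% start=%r' % (
        len(html_text), ratio * 100, html_text[:preview])


print(describe_page('<html>ok</html>'))

# Inside the recipe it might be called like this (untested):
#
#     def preprocess_html(self, soup):
#         print(describe_page(str(soup)))
#         return soup
```

The output only shows up if you run the recipe from the command line, something like ebook-convert mybaycitizen.recipe out.epub --test -vv (the file names here are placeholders).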

noah · 03-11-2011, 04:21 AM

The Bay Citizen recently added pagination to their site. Here's a new recipe that handles articles that span multiple pages. (Adapted from the Adventure Gamers recipe.)

Spoiler:

Code:

from calibre.web.feeds.news import BasicNewsRecipe

class TheBayCitizen(BasicNewsRecipe):
    title                 = 'The Bay Citizen'
    language              = 'en'
    __author__            = 'noah'
    description           = 'The Bay Citizen'
    publisher             = 'The Bay Citizen'
    INDEX                 = u'http://www.baycitizen.org'
    category              = 'news'
    oldest_article        = 2
    max_articles_per_feed = 20
    no_stylesheets        = True
    masthead_url          = 'http://media.baycitizen.org/images/layout/logo1.png'
    feeds                 = [('Main Feed', 'http://www.baycitizen.org/feeds/stories/')]
    keep_only_tags        = [dict(name='div', attrs={'class':'story'})]
    remove_tags           = [
                             dict(name='div', attrs={'class':'socialBar'}),
                             dict(name='div', attrs={'id':'text-resize'}),
                             dict(name='div', attrs={'class':'story relatedContent'}),
                             dict(name='div', attrs={'id':'comment_status_loading'}),
                            ]
    
    def append_page(self, soup, appendtag, position):
        # Follow the "next page" link, if any, and splice that page's body into appendtag.
        pager = soup.find('a', attrs={'class':'stry-next'})
        if pager:
            nexturl = self.INDEX + pager['href']
            soup2 = self.index_to_soup(nexturl)
            texttag = soup2.find('div', attrs={'class':'body'})
            if texttag is None:
                # The next page has no story body, so there is nothing to append.
                return
            for it in texttag.findAll(style=True):
                del it['style']
            # Recurse first so that any later pages end up at the end of this page's body.
            newpos = len(texttag.contents)
            self.append_page(soup2, texttag, newpos)
            texttag.extract()
            appendtag.insert(position, texttag)

    def preprocess_html(self, soup):
        # Strip inline styles, pull in continuation pages, then drop pagination leftovers.
        for item in soup.findAll(style=True):
            del item['style']
        self.append_page(soup, soup.body, 3)
        for trash in soup.findAll(id='story-pagination'):
            trash.extract()
        for trash in soup.findAll('em', 'cont-from-prev'):
            trash.extract()
        return soup
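
One caveat about the recipe above: self.INDEX + pager['href'] only works while the "next" link is a site-relative path starting with a slash. I don't know whether the site ever emits absolute links there, so this is just a defensive sketch. The function name next_page_url and the example paths in the comments are made up.

```python
try:
    from urllib.parse import urljoin  # Python 3
except ImportError:
    from urlparse import urljoin      # Python 2

INDEX = 'http://www.baycitizen.org'


def next_page_url(href, base=INDEX):
    """Resolve a pager href against the site root.

    Hypothetical helper: urljoin leaves absolute URLs untouched and
    resolves relative ones against base, so both link styles work.
    """
    return urljoin(base + '/', href)


# Example paths below are invented for illustration.
print(next_page_url('/some/story/?page=2'))
print(next_page_url('http://www.baycitizen.org/some/story/?page=2'))
```

In append_page() the line would then read nexturl = next_page_url(pager['href']).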

