LA Weekly - Trouble - Full articles?

kidblue · 10-07-2010, 12:08 AM

I tried my first recipe today with the L.A. Weekly's RSS feeds.

Unfortunately, their feeds do not link to entire articles, so I was left with a smattering of "one liners" and articles mostly awash with ads, etc.

Has anyone done an L.A. Weekly recipe or is there any kind of template for use specifically with their feeds? I've been reading up on cooking my own, but I can't quite put it to use with their system.

Thanks,

Noah

TonytheBookworm · 10-07-2010, 10:18 AM

Original url is: http://www.laweekly.com/2010-10-07/f...ervation-road/

The print version is kinda tough to get but we can fix that.

print url is: http://www.laweekly.com/content/printVersion/1080621/

just use something along the lines of this:

Spoiler:

kidblue · 10-07-2010, 12:18 PM

Muchas gracias.
I'm trying hard, but still lost - Just adding the above doesn't reference specific articles, which need to be pulled.

Plus, I'm a little boggled as to how to apply the above code to multiple feeds - i.e. the Music, Movies, Calendar sections, etc.

I'm not looking for a handout of a "free" recipe, I'm into learning, but my days of coding Q-BASIC are a little bit behind me

Spoiler:

TonytheBookworm · 10-07-2010, 12:43 PM

I'll look at it when I get time, more than likely over the weekend. I am swamped with work right now.

P.S. There is a little more to it than the code you posted. Again, if I get time I will look at it. It shouldn't be too hard to do. As for the other links, it depends on whether they follow the same ../content/printVersion/IDnumber format or not. If they do, that would be rather simple. If the url is totally different, then a series of if statements could check what the index url is and work accordingly.

In the meantime consider getting these things
1) Ultra-Edit
2) Firebug for firefox

Without the above programs (my personal opinion) you are chasing your tail trying to find the code you want. With Ultra-Edit you can cheat like I do and search the built-in recipes for code. For example, you could look up print_version, see all the recipes that use it, and see how they did it and why they did it that way. Then with Firebug you can right click on an element in the html and get its corresponding tag. For instance, if the content is only in a div tag with the class name of content, like <div class='content'> blah blah blah </div>,
and we don't want anything else, then we could use keep_only_tags and so forth. See what you can come up with in the meantime and I will help when I can.
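To make the keep_only_tags idea concrete, here is a minimal sketch. The class name 'content' is only the illustrative name from the post above, not something confirmed from the L.A. Weekly markup; check the real tag with Firebug first. In a recipe this line would sit as a class attribute inside the BasicNewsRecipe subclass.

```python
# Sketch: keep only <div class="content"> ... </div> and drop everything else.
# 'content' is an assumed class name taken from the example in the post.
keep_only_tags = [dict(name='div', attrs={'class': 'content'})]
```

Each entry is a plain dict describing one tag to keep, so more entries can be appended if the article body is split across several containers.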

kidblue · 10-07-2010, 01:22 PM

Thanks for the vote of confidence and offer to help.

I've actually been playing with Ultra Edit, but I guess the problem is the obvious one: I don't know what code or tags will specifically work with that site. I've been trolling the HTML, and the tags are obvious, but stuff like "print_version" over a whole site is what I'm losing. I guess I may have picked a tough site since I'll have to have "following" code to the full stories (as opposed to a full RSS feed), but the L.A. Weekly is so totally awesome (and unwieldy), it's a necessity!

Thanks again for your help and I appreciate the tutorial! Anything you can offer, I'm very appreciative.

TonytheBookworm · 10-07-2010, 11:12 PM

Hey Starson17, or anyone else for that matter: how do you check a mechanized follow to make sure it is a valid link? More specifically, I have a combination of feeds whose articles mostly contain a link matching the url_regex .*?\\/content\\/printVersion,
but some of the feeds do not have that link inside. How do I test for that?
I keep getting linknotfound errors on the event feeds because they do not contain a /content/printVersion link. In that case I would like it to simply return the calling url.

here is the code I have thus far. Everything works except the music and events feeds because of the above mentioned issue.
Thanks.
Here is the section i'm having issues with

Spoiler:

and here is the whole code

Spoiler:

Code:

#!/usr/bin/env  python
__license__     = 'GPL v3'
__author__      = 'Tony Stegall'
__copyright__   = '2010, Tony Stegall or Tonythebookworm on mobileread.com'
__version__     = 'v1.01'
__date__        = '07, October 2010'
__description__ = 'La weekly mag'

'''
http://www.laweekly.com
'''

from calibre.web.feeds.news import BasicNewsRecipe
from calibre.ptempfile import PersistentTemporaryFile

class LaWeekly(BasicNewsRecipe):
    __author__    = 'Tony Stegall'
    description   = 'La Weekly Mag'
    cover_url     = 'http://assets.laweekly.com/img/citylogo-lg.png'
    

    title          = 'La WeeklyMag '
    publisher      = 'Laweekly.com'
    category       = 'News,US'

    language       = 'en'
    timefmt        = '[%a, %d %b, %Y]'

    oldest_article        = 15
    max_articles_per_feed = 25
    use_embedded_content  = False
    

    remove_javascript     = True
    ######################################################################################################################
    '''
    We need to find all instances of /content/printVersion/
    To do this we set up a temp list.
    Then we turn on the flag that tells calibre the articles are obfuscated.
    Then we fetch the obfuscated article (in our case the print version).
    We create a browser and let calibre do all the work for us. It opens an internal browser and follows
    the first link that matches the regular expression .*?(\\/)(content)(\\/)(printVersion)(\\/)
    so basically any link that looks like this /content/printVersion/
    It writes the fetched page to a temp html file that the recipe/calibre will parse from.
    And that's all that is needed for this recipe.
    '''

    temp_files = []
    articles_are_obfuscated = True

    def get_obfuscated_article(self, url):
        br = self.get_browser()
        print 'THE CURRENT URL IS: ', url
        br.open(url)
        response = br.follow_link(url_regex='.*?(\\/)(content)(\\/)(printVersion)(\\/)', nr = 0)
        
        if response is None:
           response = br.follow_link(url, nr=0)
        html = response.read()
        
        self.temp_files.append(PersistentTemporaryFile('_fa.html'))
        self.temp_files[-1].write(html)
        self.temp_files[-1].close()
        return self.temp_files[-1].name

    ######################################################################################################################

    feeds          = [
                       (u'Complete Issue', u'http://www.laweekly.com/syndication/issue/'),
                       (u'News', u'http://www.laweekly.com/syndication/section/news/'),
                       (u'Music', u'http://www.laweekly.com/syndication/section/music/'),
                       (u'Movies', u'http://www.laweekly.com/syndication/section/film/'),
                       (u'Restaurants', u'http://www.laweekly.com/syndication/section/dining/'),
                       (u'Music Events', u'http://laweekly.com/syndication/events?type=music'),
                       (u'Calendar Events', u'http://laweekly.com/syndication/events'),
                       (u'Restaurant Guide', u'http://laweekly.com/syndication/restaurants/search/'),
                       
                     ]

kidblue · 10-08-2010, 12:58 PM

Tony,
This is above-and-beyond! Thanks not only for the recipe but for the included notations as to how it got put together. It was specifically "content/printVersion" that had thrown me for a loop and I (obviously) have no idea how to invoke something similar for the feeds that don't have it. Is this an instance where you can simply have the whole RSS feed pulled via a mechanized "lookup" for a link?
I'm interested to know how to attack the "lack" of "content/printVersion", as well. Thanks so much again.
Noah

TonytheBookworm · 10-08-2010, 01:38 PM

Quote:

Originally Posted by kidblue

Tony,
This is above-and-beyond! Thanks not only for the recipe but for the included notations as to how it got put together. It was specifically "content/printVersion" that had thrown me for a loop and I (obviously) have no idea how to invoke something similar for the feeds that don't have it. Is this an instance where you can simply have the whole RSS feed pulled via a mechanized "lookup" for a link?
I'm interested to know how to attack the "lack" of "content/printVersion", as well. Thanks so much again.
Noah

Well, I know for certain that the links that DO NOT have /content/printVersion are ones that have ..../event/.... in them,
and so forth, but I wanted to make it more efficient than a bunch of if statements. I wanted to simply check whether linknotfound is thrown. I will work on it more tonight, but mechanize is very new to me so it will take a lot of plug and chug.
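The "check whether linknotfound is thrown" idea could look roughly like the sketch below. It assumes that follow_link raises mechanize's LinkNotFoundError when nothing matches (which is what the errors on the event feeds suggest). So that the sketch runs without mechanize or calibre, LinkNotFoundError, FakeResponse and FakeBrowser here are hypothetical stand-ins; in a recipe the browser would come from self.get_browser() and the exception from mechanize.

```python
import re


class LinkNotFoundError(Exception):
    """Stand-in for mechanize.LinkNotFoundError (assumption, see text above)."""


def read_print_or_original(br, url, pattern=r'/content/printVersion/'):
    """Return the print page's HTML, or the original page's HTML if no print link exists."""
    response = br.open(url)  # keep the original page as the fallback
    try:
        response = br.follow_link(url_regex=pattern, nr=0)
    except LinkNotFoundError:
        pass  # e.g. an /event/ page: stay with the page we already opened
    return response.read()


class FakeResponse(object):
    """Hypothetical response object exposing only read()."""
    def __init__(self, body):
        self._body = body

    def read(self):
        return self._body


class FakeBrowser(object):
    """Hypothetical test double: maps each URL to (html, hrefs found on that page)."""
    def __init__(self, pages):
        self.pages = pages
        self.current = None

    def open(self, url):
        self.current = url
        return FakeResponse(self.pages[url][0])

    def follow_link(self, url_regex, nr=0):
        matches = [h for h in self.pages[self.current][1] if re.search(url_regex, h)]
        if len(matches) <= nr:
            raise LinkNotFoundError()
        return self.open(matches[nr])


# Made-up pages: one article with a print link, one event page without.
pages = {
    '/article': ('<p>teaser</p>', ['/content/printVersion/1080621/']),
    '/content/printVersion/1080621/': ('<p>full text</p>', []),
    '/event/1': ('<p>event listing</p>', []),
}
br = FakeBrowser(pages)
```

The point of the design is that the event pages need no special-case if statements: whatever page lacks a print link simply falls back to itself.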

Starson17 · 10-08-2010, 01:40 PM

Have you read this?

kidblue · 10-08-2010, 01:54 PM

Does the above suggest that "use print version" take the place of /content/printVersion?

Tony, thanks again for looking into this. I'm interested to know what you find out, and how it could be applied to other sites lacking it.

Starson17 · 10-08-2010, 02:13 PM

Quote:

Originally Posted by kidblue

Does the above suggest that "use print version" take the place of /content/printVersion?

I'm not sure how much you understand about recipes or what Tony did in that particular case. The normal use of the print_version method is described in the Calibre API.

When a recipe is run, it takes a list of feeds, parses each feed to find links to articles, and fetches the articles. The print_version method simply modifies the link to the article so that you get a simpler version of the article than the feed link supplies. Instead of going to the article, with all of its junk, Calibre goes to the print version of that article. With me so far?

In some cases, the link to the article is not enough. The link is said to be "obfuscated." There are lots of reasons why the link is not enough. Perhaps cookies need to be handled, perhaps it's a referer issue, but regardless, it's usually possible to follow that link with a browser set up inside the recipe. That's what Tony did. Calibre allows you to set up an internal browser session that will behave in a way that's closer to the way a normal browser works. As he did that, it looks like Tony made the link go to the print version, but, he didn't have to send it there, and it's different from just using the print_version method. The link I gave you describes the bit of Calibre recipe code that Tony used that sets up the internal browser. It gives a good explanation of the code and how it's used. You seemed to be interested.
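For the curious, the last step of the obfuscated-article approach (save what the internal browser fetched, hand back a file path) can be sketched with the Python 3 standard library alone. This is only an illustration: save_article_html is a made-up helper name, and tempfile.NamedTemporaryFile stands in for calibre's PersistentTemporaryFile used in Tony's recipe.

```python
import os
import tempfile


def save_article_html(html_bytes, suffix='_fa.html'):
    """Write fetched HTML to a temp file that outlives this call; return its path."""
    tmp = tempfile.NamedTemporaryFile(suffix=suffix, delete=False)
    try:
        tmp.write(html_bytes)
    finally:
        tmp.close()
    return tmp.name


# Usage: the returned path is what would be handed back for parsing.
path = save_article_html(b'<html><body>print version</body></html>')
with open(path, 'rb') as f:
    saved = f.read()
os.remove(path)  # clean up the demo file
```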

kidblue · 10-08-2010, 02:18 PM

I'm extremely interested - I'm the type who likes to do these sorts of things myself, and not be a total leech off a development community.

I've been trying to learn over the last couple days, reading the API and generally studying the existing recipes. I happened to pick the L.A. Weekly as my first recipe and it just coincidentally happened to be a more complicated one than straight-up RSS-to-full-article conversion.

Tony went to a whole new level by cooking this up and I'm eager to watch this process, as it's obviously a little beyond my rudimentary understanding. Thanks to you for joining him in illustrating what goes into something like this.

Starson17 · 10-08-2010, 02:46 PM

Quote:

Originally Posted by kidblue

I'm extremely interested -

Reading the Advanced Recipe page will clue you in to what Tony did, and how he made the internal browser click on the print link. I have to say, I'm not sure why he chose to do it that way. I haven't checked to see if the links are truly obfuscated, or if there was some reason why just stripping the article page was more difficult than going to the print page.

Based on what I read here, it looks like Tony had some trouble when he told the internal browser to click on a link to the print version, and there was no such link?

Another approach might have been to use print_version, build a soup of the article page, extract the desired print link and return that, but if it's not found, do something else. Each recipe writer has different preferences, and unless you really look closely at the article pages, it's hard to know why they chose to do what they did.

kidblue · 10-08-2010, 02:48 PM

Is there an obvious example of the recipe you're describing - One where the page is extracted as opposed to using the browser?

Starson17 · 10-08-2010, 03:00 PM

Quote:

Originally Posted by kidblue

Is there an obvious example of the recipe you're describing - One where the page is extracted as opposed to using the browser?

Not that I know of, but it's easy to build:
1) let's grab the article url with print_version:

Code:

    def print_version(self, url):
        print 'print_v url is: ', url
        return url

By itself that does nothing, other than print the url of the article.

2) Now, let's turn it into a soup and print that:

Code:

    def print_version(self, url):
        print 'print_v url is: ', url
        soup = self.index_to_soup(url)
        print 'The soup is: ', soup
        return url

Again, this code does nothing other than print the article page source and the url of that page. You'd need to find what you want in the soup, and return that.

Edit: To be clear, I'm not saying this code will work in your case. If the article links are obfuscated, the browser method is needed.
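One way the "find what you want and return that, otherwise do something else" step might be finished is sketched below. It uses only the Python 3 standard library so it can run outside calibre; inside a recipe the same search would be done on the soup from index_to_soup. The /content/printVersion/ pattern comes from this thread, while choose_article_url and the sample HTML are made up for illustration.

```python
import re
from html.parser import HTMLParser
from urllib.parse import urljoin

# Pattern taken from the print links discussed in this thread.
PRINT_LINK = re.compile(r'/content/printVersion/\d+')


class HrefCollector(HTMLParser):
    """Collect the href of every <a> tag on a page."""
    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == 'a':
            href = dict(attrs).get('href')
            if href:
                self.hrefs.append(href)


def choose_article_url(article_url, page_html):
    """Return the absolute print-version URL if the page links to one, else article_url."""
    collector = HrefCollector()
    collector.feed(page_html)
    for href in collector.hrefs:
        if PRINT_LINK.search(href):
            return urljoin(article_url, href)
    return article_url  # no print link (e.g. event pages): keep the original


# Made-up sample pages for the two cases.
with_print = '<a href="/about">About</a><a href="/content/printVersion/1080621/">Print</a>'
without_print = '<a href="/about">About</a>'
```

As the edit above says, this only helps when the article links are not truly obfuscated; otherwise the browser method is still needed.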


TonytheBookworm · 10-07-2010, 10:18 AM · #2

Spoiler:

Code:

    temp_files = []
    articles_are_obfuscated = True

    def get_obfuscated_article(self, url):
        br = self.get_browser()
        br.open(url)
        response = br.follow_link(url_regex = r'/content/printVersion/[0-9]+', nr = 0)
        html = response.read()
        self.temp_files.append(PersistentTemporaryFile('_temparse.html'))
        self.temp_files[-1].write(html)
        self.temp_files[-1].close()
        return self.temp_files[-1].name

kidblue · 10-07-2010, 12:18 PM · #3

Spoiler:

Code:

class AdvancedUserRecipe1286467894(BasicNewsRecipe):
    title          = u'LA Weekly'
    oldest_article = 7
    max_articles_per_feed = 100

    feeds          = [(u'Movies', u'feed://laweekly.com/syndication/section/film')]

    temp_files = []
    articles_are_obfuscated = True

    def get_obfuscated_article(self, url):
        br = self.get_browser()
        br.open(url)
        response = br.follow_link(url_regex = r'/content/printVersion/[0-9]+', nr = 0)
        html = response.read()
        self.temp_files.append(PersistentTemporaryFile('_temparse.html'))
        self.temp_files[-1].write(html)
        self.temp_files[-1].close()
        return self.temp_files[-1].name

