10-16-2010, 01:23 PM | #1 |
Junior Member
Posts: 3
Karma: 10
Join Date: Oct 2010
Device: Kindle
|
So close yet so far... frustrated recipe
Can anyone help me fix this recipe? I'm trying to fetch news from a local newspaper. I think I'm *almost* there, but I suck with the regex because I don't know programming. Thanks
P.S. searched the forums and spent hours and hours doing the recipe before posting here as last resort Code:
class AdvancedUserRecipe1287215970(BasicNewsRecipe):
    title = u'The Star Malaysia'
    oldest_article = 2
    max_articles_per_feed = 1

    feeds = [(u'Nation News', u'http://thestar.com.my/rss/nation.xml'),
             (u'Business News', u'http://thestar.com.my/rss/business.xml'),
             (u'Technology News', u'http://thestar.com.my/rss/technology.xml'),
             (u'World Updates', u'http://thestar.com.my/rss/worldupdates.xml'),
             (u'Sports News', u'http://thestar.com.my/rss/sports.xml'),
             (u'Columnists', u'http://thestar.com.my/rss/columnists.xml'),
             (u'Opinions', u'http://thestar.com.my/rss/opinion.xml')]

    from calibre.ptempfile import PersistentTemporaryFile
    temp_files = []
    articles_are_obfuscated = True

    def get_obfuscated_article(self, url):
        br = self.get_browser()
        br.open(url)
        response = br.follow_link(url_regex = r'/printerfriendly.asp?file=')
        html = response.read()
        self.temp_files.append(PersistentTemporaryFile('_fa.html'))
        self.temp_files[-1].write(html)
        self.temp_files[-1].close()
        return self.temp_files[-1].name
|
10-16-2010, 03:30 PM | #2 |
Wizard
Posts: 4,004
Karma: 177841
Join Date: Dec 2009
Device: WinMo: IPAQ; Android: HTC HD2, Archos 7o; Java:Gravity T
|
|
|
10-16-2010, 06:45 PM | #3 | |
Addict
Posts: 264
Karma: 62
Join Date: May 2010
Device: kindle 2, kindle 3, Kindle fire
|
Quote:
@pip: the reason your code wasn't working is that the regular expression was not escaped correctly. Here is working code using obfuscation. Spoiler:
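For anyone reading along, here is a minimal sketch of why the escaping matters. It is not the code from the spoiler, just plain Python `re`, and the sample `href` is a made-up link, not one copied from the site. In a regex an unescaped `.` matches any character and `p?` means "an optional p", so the original pattern never matches the literal `?` in the link.

```python
import re

# Hypothetical link as it might appear on an article page (an assumption,
# not taken from the real site).
href = "/services/printerfriendly.asp?file=/2010/10/16/nation/example.asp"

# Pattern from the first post: '.' is a wildcard and 'p?' is an optional 'p',
# so the literal '?' in the link is never matched.
unescaped = r'/printerfriendly.asp?file='

# Same pattern with the special characters escaped so they match literally.
escaped = r'/printerfriendly\.asp\?file='

unescaped_hit = re.search(unescaped, href) is not None
escaped_hit = re.search(escaped, href) is not None

# re.escape() builds a literal-matching pattern automatically.
auto_hit = re.search(re.escape('/printerfriendly.asp?file='), href) is not None

print(unescaped_hit, escaped_hit, auto_hit)
```

If you are unsure which characters are special, passing the literal text through `re.escape()` is the safer option.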
|
|
10-16-2010, 08:02 PM | #4 |
Junior Member
Posts: 3
Karma: 10
Join Date: Oct 2010
Device: Kindle
|
Thanks Starson17 and Tony for the recipe!
I have no background in programming, so it's easier for me to copy the example given in the recipe manual than to make one from scratch. By the way, the div version captures pictures and is 1.2 MB in size, whereas the print version is 0.5 MB but has no pictures. I had a look at some of these print versions and they do show pictures. Could it be that the pictures weren't captured because the stylesheet is turned off? For example this url: http://biz.thestar.com.my/news/story...9&sec=business
Last edited by PipSqueak; 10-16-2010 at 09:11 PM. |
10-17-2010, 08:40 AM | #5 |
Wizard
Posts: 4,004
Karma: 177841
Join Date: Dec 2009
Device: WinMo: IPAQ; Android: HTC HD2, Archos 7o; Java:Gravity T
|
If that's the case, then you should show him how to use the tool that's designed to do that job - print_version. It still doesn't look to me like there's any obfuscation going on. I briefly looked at the print link and it appeared to be a simple text substitution in the link.
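A rough sketch of the text-substitution idea, for illustration only. The two path fragments below are assumptions: the thread only confirms that the print link contains `printerfriendly.asp?file=`, so check a real article link and its print link before using anything like this. `to_print_url` is a hypothetical helper name; in a recipe the same one-liner would go in the `print_version` method of your `BasicNewsRecipe` subclass.

```python
def to_print_url(url):
    """Turn an article URL into its print-version URL by text substitution.

    ASSUMPTION: article links contain '/news/story.asp?' and print links
    contain '/services/printerfriendly.asp?'. Verify both on the live site.
    """
    return url.replace('/news/story.asp?', '/services/printerfriendly.asp?')


# Inside a recipe this would look roughly like:
#
#     def print_version(self, url):
#         return to_print_url(url)

# Made-up example URL, not a real article.
example = 'http://thestar.com.my/news/story.asp?file=/2010/10/16/nation/example.asp'
print_url = to_print_url(example)
print(print_url)
```

With `print_version` in place there is no need for `articles_are_obfuscated` or temporary files, because calibre fetches the rewritten URL directly.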
|
|
10-17-2010, 06:41 PM | #6 | |
Addict
Posts: 264
Karma: 62
Join Date: May 2010
Device: kindle 2, kindle 3, Kindle fire
|
Quote:
|
|
10-18-2010, 10:50 AM | #7 | |
Wizard
Posts: 4,004
Karma: 177841
Join Date: Dec 2009
Device: WinMo: IPAQ; Android: HTC HD2, Archos 7o; Java:Gravity T
|
Quote:
For others: what Tony and I are talking about is that Tony has used a sophisticated option to download a page from the article, then "click" on a button on that page to get the print version. It works just like your browser does, by setting up an internal browser session. To use his code, you use a regex to "find" and click the button on the downloaded page that gets the print version.
Kovid has what I consider to be a more straightforward way of getting the print version. You look at the page, find the same link that Tony's code searches for, and tell your recipe to modify the article link to go directly to the print version page. It skips the steps of setting up an internal browser, downloading the page locally, keeping track of cookies, searching in that page via the regex for the link, then clicking the print version button.
Tony's obfuscated method works when there's no way to figure out how to change the article link to the print version link, or where the site requires certain cookies to be set before you can get the print version. Both work for normal print version links, and Tony's code works in more situations than the simpler code (i.e. when the link really is "obfuscated"), but at the cost of slightly greater complexity and slower speed.
Each recipe author uses their own techniques.
Last edited by Starson17; 10-18-2010 at 10:54 AM. |
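To make the "find the link via regex" step concrete, here is a small standalone sketch using only the Python standard library. It only approximates what the internal browser does when it follows a link by `url_regex`; it is not calibre's or the browser's actual implementation, and the page snippet and the names `LinkCollector` and `find_link` are made up for illustration.

```python
import re
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect the href of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == 'a':
            for name, value in attrs:
                if name == 'href' and value:
                    self.hrefs.append(value)


def find_link(html, url_regex):
    """Return the first link whose URL matches url_regex, or None."""
    collector = LinkCollector()
    collector.feed(html)
    pattern = re.compile(url_regex)
    for href in collector.hrefs:
        if pattern.search(href):
            return href
    return None


# Made-up page snippet standing in for a downloaded article page.
page = ('<a href="/news/index.asp">Home</a> '
        '<a href="/services/printerfriendly.asp?file=/example.asp">Print</a>')

found = find_link(page, r'/printerfriendly\.asp\?file=')
print(found)
```

The obfuscated method pays for this extra download and search on every article, which is where the speed difference comes from.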
|
|