MobileRead Forums - View Single Post - Custom recipes (archive, read-only)

Starson17 · 03-06-2010, 04:13 PM

Quote:

Originally Posted by gabe973

Thanks to this thread I was able to get the Detroit Free Press and Detroit News working great. I was also able to get the Flint Journal looking nice (and that had looked really bad before). It's amazing what a couple of nudges in the right direction can do.

Yes, I beat my head against the wall, then someone gives me a nudge, and it turns out I need to add a comma or add one line. It's amazing how hard it can be sometimes to get a little detail, and how helpful a brief tip can be.

Quote:

However, I'm having a lot of trouble getting my local paper, the Davison Index, to work at all. The contents page comes up and it lists the articles, but then when you try to read an article, it's blank. What am I missing to get it to work? Their RSS feed is as follows...

http://davisonindex.mihomepaper.com/current/feed

That's a tricky one. I'll get you started. So you try a simple recipe with:

Code:

feeds = [(u'Current Feed', u'http://davisonindex.mihomepaper.com/current/feed')]

and it pulls the feed page, but no articles. To fix this, I looked at the article links on the RSS page to see whether any fancy cookies or obfuscation were involved, but no, it all looks normal. So I looked at how Calibre parses those links, and it turns out that Calibre's feedparser is failing. You can see how Calibre is parsing the feeds by adding this to the recipe and running it:

Code:

    def get_article_url(self, article):
        # Debugging aid: print what was parsed for each feed entry; no article is actually pulled
        print 'article is:', article
        return article.get(None)

This just tells you what Calibre can parse out, without actually pulling anything. The preferred parsed pointer to the article seems to be the id, at the end of what is printed, but the error messages say that fails. However, inspecting the parsed results shows that the correct link is there as "link", so you want to add this to the recipe:

Code:

    def get_article_url(self, article):
        return article.get('link', None)
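
To see what that lookup does, here is a minimal sketch outside Calibre. It assumes the parsed feed entry behaves like a dictionary; the entry contents and the example.com URL are made up for illustration and are not taken from the real feed.

```python
# Hypothetical parsed entry: keys and values are invented for illustration.
entry = {
    'title': 'Sample headline',
    'id': '',  # assumed: an id that is not usable as a URL
    'link': 'http://example.com/news/some-article',
}


def get_article_url(article):
    # Same lookup as the recipe method (minus self): return the 'link'
    # field if it exists, otherwise None.
    return article.get('link', None)


print(get_article_url(entry))
print(get_article_url({'title': 'No link here'}))
```

The second call shows the fallback: an entry with no "link" gives None instead of raising an error.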

Problem 1 is solved. Unfortunately, you still don't get what you want. The page runs a script that seems to be necessary to pull your article. If you remove the scripts, you get no article data, so you have to add:

Code:

    remove_javascript   = False

That gets through the link parsing and keeps the article... but if you try it, you don't see the article. Where is it?

If you look at the page source, you will see the article data is all there, but it doesn't display.

Try stripping the HTML comments from the page before it is parsed, by adding this:

Code:

    # needs "import re" at the top of the recipe
    preprocess_regexps = [
        (re.compile(r'<!--.*-->', re.DOTALL|re.IGNORECASE), lambda match: ''),
        ]
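
If you want to check what that pattern removes before running a full fetch, here is a small standalone sketch. The page fragments are made up, and the non-greedy variant at the end is only an alternative for comparison, not something from the recipe above. Note that with re.DOTALL the greedy `.*` removes everything from the first `<!--` to the last `-->`, so on a page with several comments it also takes whatever sits between them.

```python
import re

# Pattern and replacement exactly as used in preprocess_regexps.
strip_comments = re.compile(r'<!--.*-->', re.DOTALL | re.IGNORECASE)

# Made-up page fragments for illustration only.
one_comment = '<p>Intro</p><!-- hidden\nnote --><p>Story text</p>'
two_comments = '<!-- a --><p>Story text</p><!-- b -->'

# A single comment, even a multi-line one, is removed cleanly.
print(strip_comments.sub('', one_comment))

# With two comments the greedy match spans both, so the text between goes too.
print(repr(strip_comments.sub('', two_comments)))

# Assumed alternative (not in the original recipe): a non-greedy match
# removes each comment separately and keeps the text in between.
lazy_comments = re.compile(r'<!--.*?-->', re.DOTALL)
print(lazy_comments.sub('', two_comments))
```

Which behaviour suits this particular site is something to confirm against the actual page source.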

Now it's just cleanup. Have fun!
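
Putting the pieces from this post together, the recipe might look roughly like the sketch below. The class name and title are placeholders I picked, and the try/except stand-in is there only so the file can be imported outside Calibre; a real recipe needs just the plain import. Treat it as a starting point, not a finished recipe.

```python
import re

try:
    from calibre.web.feeds.news import BasicNewsRecipe
except ImportError:
    # Stand-in so this file imports outside Calibre; not part of a real recipe.
    class BasicNewsRecipe(object):
        pass


class DavisonIndex(BasicNewsRecipe):  # class name is a placeholder
    title = u'Davison Index'  # placeholder title
    feeds = [(u'Current Feed', u'http://davisonindex.mihomepaper.com/current/feed')]

    # Keep the page scripts; without them no article data comes through.
    remove_javascript = False

    # Strip HTML comments before the page is parsed.
    preprocess_regexps = [
        (re.compile(r'<!--.*-->', re.DOTALL | re.IGNORECASE), lambda match: ''),
    ]

    def get_article_url(self, article):
        # Use the parsed "link" field as the article URL.
        return article.get('link', None)
```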