MobileRead Forums - View Single Post - Custom recipes (archive, read-only)

Starson17 · 06-17-2010, 04:58 PM

Quote:

Originally Posted by lordvetinari2

As always, thanks a lot for your help, Starson17.

You're welcome. Be aware, I'm no expert, but I've been able to make the recipes do anything I've really tried to get them to do, so I've wandered through many different parts.

Quote:

Pre and post stuff is in the ZIP attachment from my previous post. Is that what you mean?

What I mean is that I run preprocess_html(soup) with a simple print command:

Code:

print('The preprocess soup is:', soup)

Then I do the same in postprocess_html. This lets me see the HTML as BeautifulSoup has parsed it at different stages. Your garbled text is presumably not garbled on the source page, so it's getting garbled during processing. Comparing the two printouts would help track down where it's happening.
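As a rough sketch of that debugging approach, the two hooks might look like this inside a recipe. The class name and title are made up for illustration, and the try/except fallback is only there so the snippet can also be run outside calibre; the hook names and arguments follow BasicNewsRecipe.

```python
try:
    from calibre.web.feeds.news import BasicNewsRecipe
except ImportError:
    # Assumption: fall back to a plain base class so the sketch runs outside calibre.
    BasicNewsRecipe = object


class DebugSoupRecipe(BasicNewsRecipe):
    title = 'Debug soup example'  # hypothetical title

    def preprocess_html(self, soup):
        # Called on the parsed page before calibre's own cleanup rules run.
        print('The preprocess soup is:', soup)
        return soup

    def postprocess_html(self, soup, first_fetch):
        # Called after the cleanup rules have been applied.
        print('The postprocess soup is:', soup)
        return soup
```

If the text is fine in the first printout but garbled in the second, the damage is happening somewhere between the two hooks.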

Quote:

The thing is, content from that feed appears in tag names that are also used for elements that I don't need. One of those is the meta name content, which provides an unclosed tag when parsed. Anyway, I'm guessing it means messing about with some deep BeautifulSoup stuff, so I prefer to remove that feed completely and be done with it.

All your questions have answers found only in BeautifulSoup. The worse the site, the more you need it. The entire recipe system uses it under the hood anyway. Each time you asked if there was an easy way to do something, I thought: not unless you think using BeautifulSoup is easy.

Quote:

It's just going to the next article, there's no multipage used in these feeds. Yes, the article is already in the feed, as I can get there one pageturn at a time.

So if it's just going to the next article, why not strip that "Next" element and not worry about whether it links or not?

Three methods of stripping I typically use:

1) Use remove_tags, keep_only_tags, etc. This is easy.

2) Use preprocess_html(soup), find your tag, and call .extract() on it. This is only a bit harder.

3) Get down and dirty with preprocess_regexps. You provide a list of regexp substitution rules to run on the downloaded HTML. Each element of the list is a two-element tuple: the first element is a compiled regular expression, and the second is a callable that takes a single match object and returns a string to replace the match. It's basically a text-based, not tag-based, search and replace on the HTML. You can remove tags, change tags, fix broken tags, change links, etc. It's very flexible for difficult situations.
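Here is a minimal sketch of the three methods side by side, all aimed at the same "Next" link. The markup is an assumption: I'm pretending the link is an <a> tag with class="next", which will certainly differ on the real site. The class name, title, and the apply_rules helper are hypothetical; the helper only mimics the substitution step so the regexp rule can be tried outside calibre.

```python
import re

try:
    from calibre.web.feeds.news import BasicNewsRecipe
except ImportError:
    # Assumption: fall back to a plain base class so the sketch runs outside calibre.
    BasicNewsRecipe = object


class StripNextRecipe(BasicNewsRecipe):
    title = 'Strip Next example'  # hypothetical title

    # Method 1: declarative tag removal (assumes <a class="next">).
    remove_tags = [dict(name='a', attrs={'class': 'next'})]

    # Method 2: find the tag in the soup and pull it out by hand.
    def preprocess_html(self, soup):
        for tag in soup.findAll('a', attrs={'class': 'next'}):
            tag.extract()
        return soup

    # Method 3: text-based search and replace on the raw HTML.
    # Each rule is (compiled regexp, callable taking a match and returning a string).
    preprocess_regexps = [
        (re.compile(r'<a\b[^>]*\bclass="next"[^>]*>.*?</a>',
                    re.DOTALL | re.IGNORECASE),
         lambda match: ''),
    ]


def apply_rules(html, rules):
    # Hypothetical helper: applies each rule in order, for testing outside calibre.
    for pattern, replacement in rules:
        html = pattern.sub(replacement, html)
    return html
```

In a real recipe you would normally pick just one of the three; they are shown together here only for comparison. Calling apply_rules('<p>Body</p><a class="next" href="/2">Next</a>', StripNextRecipe.preprocess_regexps) lets you check a regexp rule against a saved snippet before running the whole recipe.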