Old 08-23-2010, 01:15 PM   #2498
Starson17
Wizard
Starson17 can program the VCR without an owner's manual.
Posts: 4,004
Karma: 177841
Join Date: Dec 2009
Device: WinMo: IPAQ; Android: HTC HD2, Archos 7o; Java:Gravity T
Quote:
Originally Posted by TonytheBookworm
Man i wish there was a way i could ask questions without flooding this board and all.
I've worried about that issue, too, but most responses I received said that they didn't mind when I asked if people thought there was too much of this type of "how to" in this thread. It does provide a lot of good info, which is searchable and helps others write recipes, so I wouldn't worry too much. Just try to use the code and spoiler tags to keep the indents and the length of posts minimized. Personally, I like to read kiklop's recipes to see how he approaches certain problems, then I can ask him when I don't understand something.

Quote:
lets say in every parse i get something that has a doubleclick.net ad in it
I tried
Code:
filter_regexps = [r'feedads\.g\.doubleclick\.net']
and yeah i didn't see any indent errors this time.
thought well maybe if i use preprocess_regexps and remove all the instances of doubleclick first.
So then i looked in the beautiful soup documentation and after a big headache i'm still kinda lost
I tried this as well...
Code:
preprocess_regexps     = [(re.compile(r'feedads\.g\.doubleclick\.net', re.DOTALL), lambda m: '')]
I've never needed to use either of those methods to remove doubleclick ads. For me, it's always been possible to define either a keep_only_tags or a remove_tags. As to Beautiful Soup, I'm still learning, and I've read that page at least 50 times. I expect I'll end up reading it another 50 times eventually.
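
For reference, here is a rough sketch of what a keep-only and a remove definition tend to look like. The tag names, the id and the class values are hypothetical placeholders (the real ones depend on the page source, which hasn't been posted yet), and in a recipe these two lists would sit inside the recipe class rather than at module level.

```python
import re

# In a real recipe these would be class attributes of the recipe class.
# The tag names, id and class values below are hypothetical placeholders.
keep_only_tags = [dict(name='div', attrs={'id': 'article-body'})]

remove_tags = [
    # Drop a wrapper div by its class.
    dict(name='div', attrs={'class': 'ad-box'}),
    # Drop any link whose href mentions doubleclick.net.
    dict(name='a', attrs={'href': re.compile(r'doubleclick\.net')}),
]
```

The regex form of the href match assumes the specs are handed to Beautiful Soup's findAll, which accepts compiled patterns as attribute values.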

Let's start with filter_regexps. I've only used it once. It's used to prevent a link from being followed. Most of the time, you're not following a link because recursion is off and Calibre isn't following links on the pages. What you normally want to do is remove the link or graphic from your page, not prevent it from being followed by Calibre.
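
To make the distinction concrete, here is a small stand-alone simulation of the idea. It is not Calibre's actual code: should_follow is a made-up helper and the URLs and HTML are invented. It only shows that a filter pattern decides whether a link would be followed and leaves the page's HTML alone.

```python
import re

filter_regexps = [r'feedads\.g\.doubleclick\.net']
compiled = [re.compile(p) for p in filter_regexps]


def should_follow(url):
    """Hypothetical helper: a link is skipped if any filter pattern matches it."""
    return not any(p.search(url) for p in compiled)


# Invented page fragment containing an ad link.
html = '<p>Story</p><a href="http://feedads.g.doubleclick.net/~a/abc">ad</a>'

print(should_follow('http://feedads.g.doubleclick.net/~a/abc'))
print(should_follow('http://example.com/story/page2.html'))
# The ad link is still sitting in the page itself.
print('doubleclick' in html)
```

So even when the filter matches, the ad markup stays in the article unless something else removes it.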

OTOH, I use preprocess_regexps a lot, but as a sort of last resort. It's simply a powerful search and replace on the HTML. You could do most of your remove_tags with preprocess_regexps if you wanted to. But it's not tag-aware, so remove_tags is better in most cases (it won't be confused if there's a div tag inside a div tag, where S&R might find the open div tag of an outer tag and the close div of an inner tag). Why don't you show me the actual page source for the doubleclick you want to deal with, or give me a link, so I can understand what you are trying to remove?
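
Here is a small sketch of the nested-div problem described above, using plain re outside of Calibre. The HTML is invented for illustration; the (pattern, replacement function) pair has the same shape as a preprocess_regexps entry.

```python
import re

# Invented markup: an ad div that contains another div.
html = (
    '<div class="ad">'
    '<div class="inner">sponsored link</div>'
    '<p>still inside the ad block</p>'
    '</div>'
    '<p>article text</p>'
)

# Same (pattern, function) shape as a preprocess_regexps entry.
pattern = re.compile(r'<div class="ad">.*?</div>', re.DOTALL)
cleaned = pattern.sub(lambda m: '', html)

# The match stops at the inner div's closing tag, so part of the ad
# and a stray </div> are left behind.
print(cleaned)
```

The regex pairs the outer opening tag with the inner closing tag, which is exactly the confusion a tag-aware removal avoids.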

BTW, if you look at page source with your browser, it may not be the same as what Calibre sees. It may also be wrong if you look at it with FireBug. To see it as Calibre will see it, I like to do this:

Code:
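    def preprocess_html(self, soup):
        # Dump the parsed page so you can see what Calibre is working with.
        # This form of print works under both Python 2 and Python 3.
        print('The soup is: %s' % soup)
        return soup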
If you add this code, it does nothing to the page, but the print statement sends the HTML, in cleaned-up Beautiful Soup form, into your textfile.txt as Calibre will see it. (You are using the ebook-convert ....>textfile.txt format, right?)