MobileRead Forums > E-Book Software > Calibre > Recipes
12-13-2011, 11:35 PM   #1
Olger
Cleaning up tags based on content

Hi there,
I'm still new to Calibre, and haven't used Python before (though plenty of other languages). I'm slowly getting the knack of it, but need some pointers on a few things.

I've got a recipe for time.com RSS feeds, and that all works nicely. However, Time has the annoying habit of inserting a little "advertisement" in their RSS feeds. Every ad starts with "<p><strong>MORE:</strong>" and ends with the closing </p>.
Of course, they use non-classed <p> tags elsewhere so simply removing all <p> tags won't work. I figured postprocess_html is the way to go, then parse the 'soup' for <p> tags and remove the ones that contain "<strong>MORE:</strong>".
But, that just leaves me creating the code... Anyone able to give some pointers?

Cheers! Olger.
12-14-2011, 01:50 PM   #2
Barty
Try using a preprocess regex:

Code:
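    # Needs "import re" at the top of the recipe file.
    preprocess_regexps = [
        # Strip each "MORE:" ad paragraph; .+? stops at the first closing </p>.
        (re.compile(r'<p><strong>MORE:</strong>.+?</p>', re.I | re.DOTALL), lambda match: ''),
    ]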
You can use just re.DOTALL instead of re.I|re.DOTALL if you know the case will always be exactly like that (re.I means ignore case; re.DOTALL lets "." match newlines, so an ad that spans several lines is still caught).
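To see what the pattern does before putting it in a recipe, it can be tried on its own with plain Python. This is only a rough sketch: the sample HTML is made up for illustration (it is not copied from a real Time feed), and the names more_ad, sample and cleaned are just placeholders.

```python
import re

# Same pattern as in the snippet above.
more_ad = re.compile(r'<p><strong>MORE:</strong>.+?</p>', re.I | re.DOTALL)

# Made-up sample HTML (an assumption, not real feed content).
# The ad paragraph deliberately contains a newline to exercise re.DOTALL.
sample = (
    '<p>First paragraph.</p>\n'
    '<p><strong>MORE:</strong> <a href="#">Some other\nstory</a></p>\n'
    '<p>Second paragraph.</p>'
)

# Replace every match with an empty string, leaving ordinary <p> tags alone.
cleaned = more_ad.sub('', sample)
print(cleaned)
```

The non-greedy .+? matters here: a greedy .+ with re.DOTALL would run on to the last </p> in the page and take the real paragraphs with it.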
12-14-2011, 11:56 PM   #3
Olger
Thanks Barty!
Once I figured out that I also needed an "import re" at the beginning of the recipe, it worked well!
Cheers!
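For anyone finding this thread later, the two pieces (the import and the regex list) might fit together roughly as below. This is a sketch under assumptions, not Olger's actual recipe: the class name TimeFeeds, the title and the feed URL are placeholders, and the try/except is only there so the file also loads outside calibre.

```python
import re  # must come before the class, since preprocess_regexps uses it

try:
    from calibre.web.feeds.news import BasicNewsRecipe
except ImportError:
    # Not running inside calibre: plain stand-in so the file still loads.
    BasicNewsRecipe = object


class TimeFeeds(BasicNewsRecipe):  # hypothetical recipe name
    title = 'Time (sketch)'  # placeholder title
    # Placeholder feed (an assumption); substitute the real time.com RSS URLs.
    feeds = [('Example feed', 'https://example.com/rss')]

    # Each entry is a (compiled pattern, replacement function) pair.
    preprocess_regexps = [
        (re.compile(r'<p><strong>MORE:</strong>.+?</p>', re.I | re.DOTALL),
         lambda match: ''),
    ]
```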
All times are GMT -4.