12-13-2011, 10:35 PM | #1 |
Member
Posts: 11
Karma: 10
Join Date: Nov 2011
Device: Kobo Touch
|
Cleaning up tags based on content
Hi There,
I'm still new to Calibre and haven't used Python before (though plenty of other languages). I'm slowly getting the knack of it, but I need some pointers on a few things.

I've got a recipe for the time.com RSS feeds, and that all works nicely. However, Time has the annoying habit of inserting a little "advertisement" into its RSS feeds. Every ad starts with "<p><strong>MORE:</strong>" and ends with the closing "</p>". Of course, they use unclassed <p> tags elsewhere too, so simply removing all <p> tags won't work.

I figured postprocess_html is the way to go: parse the 'soup' for <p> tags and remove the ones that contain "<strong>MORE:</strong>". But that still leaves me having to write the code... Is anyone able to give me some pointers?

Cheers!
Olger. |
12-14-2011, 12:50 PM | #2 |
doofus
Posts: 2,521
Karma: 13088847
Join Date: Sep 2010
Device: Kobo Libra 2, Kindle Voyage
|
Try using a preprocess regex:
Code:
preprocess_regexps = [
    (re.compile(r'<p><strong>MORE:</strong>.+?</p>', re.I | re.DOTALL),
     lambda x: ''),
]
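If you want to check the pattern before running a full fetch, something like the following should work in a plain Python prompt. This is only a sketch: the sample markup is made up for illustration (it is not copied from a real Time feed), and `AD_PATTERN`, `sample` and `cleaned` are names chosen here.

```python
import re

# Same pattern as in the preprocess_regexps entry above.
AD_PATTERN = re.compile(r'<p><strong>MORE:</strong>.+?</p>', re.I | re.DOTALL)

# Made-up sample markup; the ad paragraph deliberately spans two lines.
sample = (
    '<p>First paragraph of the article.</p>\n'
    '<p><strong>MORE:</strong> <a href="#">Some other\n'
    'story</a></p>\n'
    '<p>Second paragraph of the article.</p>'
)

# Replace every match with an empty string, as the lambda in the recipe does.
cleaned = AD_PATTERN.sub('', sample)
print(cleaned)
```

Two details matter here. `re.DOTALL` lets `.` match newlines, so an ad that wraps across lines is still caught. The `?` in `.+?` makes the match non-greedy, so it stops at the first `</p>`; a greedy `.+` would run on to the last `</p>` in the page and take the rest of the article with it.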
|
12-14-2011, 10:56 PM | #3 |
Member
Posts: 11
Karma: 10
Join Date: Nov 2011
Device: Kobo Touch
|
Thanks Barty!
Once I figured out that I also needed an "import re" at the beginning of the recipe, it worked well! Cheers! |
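For anyone finding this later, here is roughly where the import and the regex sit in a recipe. Treat it as a sketch under assumptions: the class name and title are placeholders rather than the actual recipe from this thread, the feed list is left out, and the `try`/`except` around the calibre import is only there so the snippet can also be loaded outside calibre for a quick check.

```python
import re  # must be imported before the class body calls re.compile

try:
    from calibre.web.feeds.news import BasicNewsRecipe
except ImportError:
    # Fallback so the sketch can be loaded outside calibre; not needed in a real recipe.
    BasicNewsRecipe = object


class TimeWithoutAds(BasicNewsRecipe):  # hypothetical class name
    title = 'Time (ads stripped)'  # hypothetical title
    # feeds = [...]  # your existing feed list goes here, unchanged

    # Each entry is (compiled pattern, function returning the replacement text).
    preprocess_regexps = [
        (re.compile(r'<p><strong>MORE:</strong>.+?</p>', re.I | re.DOTALL),
         lambda match: ''),
    ]
```

Because `re.compile` runs when the class body is evaluated, a missing `import re` fails as soon as the recipe is loaded, which is why the import has to be at the top of the file.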
|