Cleaning up tags based on content

Olger · 12-13-2011, 10:35 PM

Hi There,
I'm still new to Calibre and haven't used Python before (though plenty of other languages). I'm slowly getting the knack of it, but need some pointers on a few things.

I've got a recipe for the time.com RSS feeds, and that all works nicely. However, Time has the annoying habit of inserting a little "advertisement" in their RSS feeds. Every ad starts with "<p><strong>MORE:</strong>" and ends with the closing </p>.
Of course, they use non-classed <p> tags elsewhere, so simply removing all <p> tags won't work. I figured postprocess_html is the way to go: parse the 'soup' for <p> tags and remove the ones that contain "<strong>MORE:</strong>".
But that just leaves me creating the code... Anyone able to give some pointers?

Cheers! Olger.

Barty · 12-14-2011, 12:50 PM

Try using a preprocess regex:

Code:

    preprocess_regexps = [
        (re.compile(r'<p><strong>MORE:</strong>.+?</p>', re.I|re.DOTALL), lambda x:''),
        ]

You can use just re.DOTALL instead of re.I|re.DOTALL if you know the case will always be exactly like that (re.I means ignore case).
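
To see what the pattern does outside of a recipe, here is a minimal sketch that applies it to a made-up snippet of HTML. The sample markup and the variable names are assumptions for illustration only; they are not taken from the actual Time feeds.

```python
import re

# Same pattern as in the post above: a <p> that opens with
# <strong>MORE:</strong>, matched non-greedily up to the first </p>.
MORE_AD = re.compile(r'<p><strong>MORE:</strong>.+?</p>', re.I | re.DOTALL)

# Greedy variant, shown only for comparison.
MORE_AD_GREEDY = re.compile(r'<p><strong>MORE:</strong>.+</p>', re.I | re.DOTALL)

# Hypothetical sample markup; the ad spans two lines on purpose,
# which is the case re.DOTALL is there to handle.
sample_html = (
    '<p>First paragraph of the article.</p>\n'
    '<p><strong>MORE:</strong> <a href="#">Some other\nstory</a></p>\n'
    '<p>Second paragraph of the article.</p>'
)

# Replace every match with an empty string, as the lambda in the recipe does.
cleaned = MORE_AD.sub(lambda match: '', sample_html)
over_cleaned = MORE_AD_GREEDY.sub(lambda match: '', sample_html)

print(cleaned)
print(over_cleaned)
```

The non-greedy `.+?` stops at the first `</p>` after "MORE:", so only the ad paragraph goes. The greedy `.+` combined with re.DOTALL runs on to the last `</p>` in the text and would take the rest of the article with it. Note that the pattern only matches when `<p>` and `<strong>` are directly adjacent, with no attributes or whitespace in between.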

Olger · 12-14-2011, 10:56 PM

Thanks Barty!
Once I figured out that I also needed an "import re" at the beginning of the recipe, it worked well!
Cheers!
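
For anyone landing here later, a rough sketch of how the pieces discussed in this thread might sit together in a recipe file, with "import re" at the top. The class name, title and feed URL are placeholders (assumptions, not the real Time recipe), and the try/except around the calibre import is only there so the snippet can also be loaded outside calibre.

```python
import re  # needed because the recipe calls re.compile below

try:
    # Available when the recipe runs inside calibre.
    from calibre.web.feeds.news import BasicNewsRecipe
except ImportError:
    # Stand-in so this sketch can be loaded outside calibre.
    class BasicNewsRecipe(object):
        pass


class TimeWithoutMoreAds(BasicNewsRecipe):  # hypothetical class name
    title = 'Time (MORE: blocks removed)'  # hypothetical title

    # Placeholder feed; replace with the real feed title and URL.
    feeds = [('Example feed', 'https://example.com/feed.rss')]

    # Each entry is a (compiled pattern, replacement function) pair.
    preprocess_regexps = [
        (re.compile(r'<p><strong>MORE:</strong>.+?</p>', re.I | re.DOTALL),
         lambda match: ''),
    ]
```

Without the "import re" line, the class body fails with a NameError as soon as re.compile is reached, which is the problem described above.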
