Selective preprocess_regexps

dasp · 12-05-2011, 04:40 PM

Hi,

Is there a way to selectively turn on/off the usage of preprocess_regexps?

My recipe's parse_index() works as follows:

1. First visit the newspaper's main page and extract the section names and section URLs
2. Visit each section URL to extract the articles within that section

With the latest update to the newspaper's site, step 2 fails because preprocess_regexps strips out the part of the HTML that contains the article titles and URLs.

I need the preprocess_regexps because it strips out all the crap in the actual article contents; however, I don't need it/want it during the parse_index() stage.

Is there a solution for my problem?

Thanks!

kovidgoyal · 12-05-2011, 09:46 PM

Implement preprocess_raw_html() in your recipe and apply the regexes yourself after checking the HTML.
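
A minimal sketch of what "apply the regexes yourself" could look like, assuming the list holds (compiled pattern, replacement function) pairs in the same shape as preprocess_regexps. The helper name apply_regexps and the sample pattern are hypothetical; in a recipe, preprocess_raw_html(self, raw_html, url) would call something like this only for pages that pass your own check, with preprocess_regexps itself left empty.

```python
import re

# Hypothetical stand-in for a recipe's regex list:
# (compiled pattern, replacement callable) pairs.
FULL_REGEX_LIST = [
    (re.compile(r'<script.*?</script>', re.DOTALL | re.IGNORECASE), lambda match: ''),
]


def apply_regexps(raw_html, regexps):
    """Apply each (pattern, replacement) pair in order and return the result."""
    for pattern, replacement in regexps:
        raw_html = pattern.sub(replacement, raw_html)
    return raw_html


cleaned = apply_regexps('<p>Story</p><script>track()</script>', FULL_REGEX_LIST)
print(cleaned)
```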

dasp · 12-06-2011, 05:48 AM

Thanks for the response, Kovid.

In preprocess_raw_html() I can either use the URL or the raw HTML contents to decide between applying the regexes or not.

The section and article URLs are generic across the site, e.g.:

http://example.com/?ItemID=3A62EEC05...0733E0349CBA67
http://example.com/?ItemID=59DB44974...66062D8B9D9422
http://example.com/?ItemID=F15164ADF...8EEFD7ED078DAA

...so consequently I cannot distinguish where I am based on the URLs alone.

That leaves me with searching the raw HTML for some particular pattern/string that would identify the page as either the one containing the section list or the one containing a section's article list. However, this is not the clean solution I was hoping for.
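
For illustration, the content-based check could be as small as the sketch below. The marker string is entirely hypothetical; it stands for whatever text appears only on the section and article-list pages of the real site.

```python
# Hypothetical marker: a string assumed to occur only on listing pages,
# never inside an actual article.
INDEX_MARKER = 'class="article-list"'


def is_index_page(raw_html, marker=INDEX_MARKER):
    """Return True when the raw HTML looks like a listing page."""
    return marker in raw_html


def should_apply_regexps(raw_html):
    """The cleanup regexes are only wanted on article pages."""
    return not is_index_page(raw_html)
```

A plain substring test is the cheapest option; if the marker can also show up inside articles, a more specific pattern would be needed.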

I was hoping for a way to somehow "turn off" calibre's usage of preprocess_regexps when entering parse_index() and turn it back on right before exiting.

Something along these lines:

Code:

FULL_REGEX_LIST = [ re.compile(...) ]

preprocess_regexps = FULL_REGEX_LIST

def parse_index(self):
    self.preprocess_regexps = []
    ## construct feeds list
    self.preprocess_regexps = FULL_REGEX_LIST
    return feeds

I actually tried this and it didn't work. So if you have any idea/suggestion to achieve this, great. If not, I'll do it the way you originally proposed.

Thanks.

kovidgoyal · 12-06-2011, 08:52 AM

Code:

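def parse_index(self, *args, **kwargs):
    # Remember the recipe's regexps, then disable them while the index is built
    original_value = self.preprocess_regexps
    self.preprocess_regexps = []
    ret = BasicNewsRecipe.parse_index(self, *args, **kwargs)
    # Restore the regexps before returning
    self.preprocess_regexps = original_value
    return ret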
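
The same save-and-restore idea can be sketched as a context manager, so that the list is put back even if building the index raises. This is only an illustration: regexps_disabled and DummyRecipe are hypothetical names, and the thread does not confirm at which point calibre actually reads the attribute.

```python
from contextlib import contextmanager


@contextmanager
def regexps_disabled(recipe):
    """Temporarily empty recipe.preprocess_regexps, restoring it on exit."""
    original_value = recipe.preprocess_regexps
    recipe.preprocess_regexps = []
    try:
        yield recipe
    finally:
        # Runs on normal exit and when an exception propagates
        recipe.preprocess_regexps = original_value


class DummyRecipe:
    """Hypothetical stand-in for a recipe object; not calibre's class."""
    preprocess_regexps = ['placeholder']


recipe = DummyRecipe()
with regexps_disabled(recipe):
    inside = recipe.preprocess_regexps  # empty while the block runs
after = recipe.preprocess_regexps       # original list again
```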
