MobileRead Forums - View Single Post

dasp · 12-06-2011, 05:48 AM

Thanks for the response, Kovid.

In preprocess_raw_html() I can either use the URL or the raw HTML contents to decide between applying the regexes or not.

The section and article URLs are generic across the site, e.g.:

http://example.com/?ItemID=3A62EEC05...0733E0349CBA67
http://example.com/?ItemID=59DB44974...66062D8B9D9422
http://example.com/?ItemID=F15164ADF...8EEFD7ED078DAA

...so I cannot tell which kind of page I am on from the URL alone.

That leaves searching the raw HTML for some pattern or string that identifies the page as either the section list or a section's article list. That works, but it is not the clean solution I was hoping for.
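For reference, a minimal sketch of what that marker-based check might look like. The marker string, the sample regex, and the stand-in class are all assumptions for illustration; a real recipe would subclass calibre's BasicNewsRecipe and use whatever string actually appears only on the article pages.

```python
import re

# Hypothetical marker: a string assumed to appear only on article pages.
ARTICLE_MARKER = 'class="article-body"'

# (pattern, replacement) pairs, the same shape used for preprocess_regexps.
# The pattern here (strip HTML comments) is only a placeholder.
ARTICLE_REGEXPS = [
    (re.compile(r'<!--.*?-->', re.DOTALL), lambda match: ''),
]


class RecipeSketch:
    """Stand-in for a BasicNewsRecipe subclass so the sketch runs without calibre."""

    # Left empty so that no regex is applied to every page automatically.
    preprocess_regexps = []

    def preprocess_raw_html(self, raw_html, url):
        # Section and article-list pages lack the marker: return them untouched.
        if ARTICLE_MARKER not in raw_html:
            return raw_html
        # Article pages: apply each regex in order.
        for pattern, replacement in ARTICLE_REGEXPS:
            raw_html = pattern.sub(replacement, raw_html)
        return raw_html
```

The url argument is ignored here because, as noted above, the URLs do not distinguish page types on this site.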

I was hoping for a way to somehow "turn off" calibre's use of preprocess_regexps when entering parse_index() and turn it back on right before exiting.

Something along these lines:

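Code:

import re

# Inside the recipe class:
FULL_REGEX_LIST = [ re.compile(...) ]

preprocess_regexps = FULL_REGEX_LIST

def parse_index(self):
    # disable the regexes while the index pages are fetched
    self.preprocess_regexps = []
    # construct feeds list
    # restore the regexes before the articles are downloaded
    self.preprocess_regexps = self.FULL_REGEX_LIST
    return feeds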

I actually tried this and it didn't work. So if you have any idea/suggestion to achieve this, great. If not, I'll do it the way you originally proposed.

Thanks.

Last edited by dasp; 12-06-2011 at 05:54 AM. Reason: formatting issues