Thanks for the response, Kovid.
In preprocess_raw_html() I can use either the URL or the raw HTML contents to decide whether to apply the regexes.
The section and article URLs are generic across the site, e.g.:
http://example.com/?ItemID=3A62EEC05...0733E0349CBA67
http://example.com/?ItemID=59DB44974...66062D8B9D9422
http://example.com/?ItemID=F15164ADF...8EEFD7ED078DAA
...so I cannot tell where I am based on the URLs alone.
That leaves searching the raw HTML for some particular pattern or string that would identify the page as one containing a section list or one containing a section's article list. However, this is not as clean a solution as I was hoping for.
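For reference, here is a minimal sketch of that marker-based approach. The marker string and the two cleanup patterns are placeholders I made up for illustration, not anything taken from the actual site, and the helper names are hypothetical. The only calibre-specific assumption is that the rules are (pattern, replacement function) pairs, as in preprocess_regexps.

```python
import re

# Hypothetical marker: a string assumed to appear only on article pages.
ARTICLE_MARKER = '<div class="article-body">'

# Hypothetical cleanup rules, written as (pattern, replacement function)
# pairs in the same shape that preprocess_regexps uses.
ARTICLE_REGEXPS = [
    (re.compile(r'<script.*?</script>', re.DOTALL | re.IGNORECASE), lambda m: ''),
    (re.compile(r'<!--.*?-->', re.DOTALL), lambda m: ''),
]


def is_article_page(raw_html):
    """Return True if the page looks like an article rather than a listing."""
    return ARTICLE_MARKER in raw_html


def clean_if_article(raw_html):
    """Apply the cleanup regexes to article pages only; leave listings alone."""
    if not is_article_page(raw_html):
        return raw_html
    for pattern, replacement in ARTICLE_REGEXPS:
        raw_html = pattern.sub(replacement, raw_html)
    return raw_html
```

In the recipe, preprocess_regexps would then stay empty and preprocess_raw_html(self, raw_html, url) would simply return clean_if_article(raw_html), so the section and article-list pages reach parse_index() untouched.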
I was hoping for a way to somehow "turn off" calibre's use of preprocess_regexps on entering parse_index() and turn it back on right before exiting.
Something along these lines:
Code:
FULL_REGEX_LIST = [ re.compile(...) ]
preprocess_regexps = FULL_REGEX_LIST

def parse_index(self):
    # switch the regexes off while the index pages are parsed
    self.preprocess_regexps = []
    ## construct feeds list
    # switch them back on for the article pages
    self.preprocess_regexps = FULL_REGEX_LIST
    return feeds
I actually tried this and it didn't work. If you have any ideas or suggestions for achieving this, great. If not, I'll do it the way you originally proposed.
Thanks.