12-06-2011, 05:48 AM   #3
dasp
Thanks for the response, Kovid.

In preprocess_raw_html() I can use either the URL or the raw HTML contents to decide whether to apply the regexes.

The section and article URLs are generic across the site, e.g.:

http://example.com/?ItemID=3A62EEC05...0733E0349CBA67
http://example.com/?ItemID=59DB44974...66062D8B9D9422
http://example.com/?ItemID=F15164ADF...8EEFD7ED078DAA

...so I cannot tell which kind of page I am on from the URL alone.

That leaves me with searching the raw HTML for some particular pattern or string that identifies the page as either the one containing the section list or the one containing a section's article list. However, this is not as clean a solution as I was hoping for.
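A minimal sketch of that content-based check, written as plain Python so it runs outside calibre. The marker string and the comment-stripping regex are hypothetical placeholders, not taken from the actual site or recipe. The list uses the (compiled pattern, replacement callable) shape that preprocess_regexps expects.

```python
import re

# Hypothetical marker: a string assumed to appear only on section-list pages.
SECTION_LIST_MARKER = '<div id="section-list">'

# Same shape as preprocess_regexps: (compiled pattern, replacement callable).
# The pattern is a placeholder that strips HTML comments.
REGEX_LIST = [
    (re.compile(r'<!--.*?-->', re.DOTALL), lambda match: ''),
]


def is_section_list(raw_html):
    """Guess the page type from its contents, since the URLs all look alike."""
    return SECTION_LIST_MARKER in raw_html


def apply_regexps_selectively(raw_html):
    """Leave section-list pages untouched; run the regexes on everything else."""
    if is_section_list(raw_html):
        return raw_html
    for pattern, replacement in REGEX_LIST:
        raw_html = pattern.sub(replacement, raw_html)
    return raw_html
```

In a recipe, the idea would be to leave preprocess_regexps empty and have preprocess_raw_html(self, raw_html, url) return apply_regexps_selectively(raw_html), so that the decision is made per page. Whether that hook is called for the pages in question is an assumption to verify against the recipe.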

I was hoping for a way to somehow "turn off" calibre's use of preprocess_regexps when entering parse_index() and turn it back on right before exiting.

Something along these lines:

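Code:
FULL_REGEX_LIST = [ re.compile(...) ]
preprocess_regexps = FULL_REGEX_LIST

def parse_index(self):
    # temporarily disable the regexes while building the index
    self.preprocess_regexps = []

    ## construct feeds list

    # restore the full list right before returning
    self.preprocess_regexps = FULL_REGEX_LIST

    return feeds
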
I actually tried this and it didn't work. So if you have any ideas or suggestions for achieving this, great. If not, I'll do it the way you originally proposed.

Thanks.

Last edited by dasp; 12-06-2011 at 05:54 AM. Reason: formatting issues