12-05-2011, 04:40 PM | #1 |
Junior Member
Posts: 7
Karma: 10
Join Date: Jul 2011
Device: Kindle
|
Selective preprocess_regexps
Hi,
Is there a way to selectively turn the use of preprocess_regexps on and off? My recipe's parse_index() works as follows:
1. First, visit the newspaper's main page and extract the section names and section URLs.
2. Then visit each section URL to extract the articles within that section.
With the latest update to the newspaper's site, step 2 fails because preprocess_regexps strips out the part of the HTML containing the article titles and URLs. I need preprocess_regexps because it strips out all the crap in the actual article contents; however, I don't want it applied during the parse_index() stage.
Is there a solution for my problem? Thanks! |
12-05-2011, 09:46 PM | #2 |
creator of calibre
Posts: 43,844
Karma: 22666666
Join Date: Oct 2006
Location: Mumbai, India
Device: Various
|
Implement preprocess_raw_html() in your recipe and apply the regexes yourself after checking the HTML.
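A minimal sketch of that idea, not a tested recipe. The cleanup pattern, the INDEX_MARKER string, and the SelectiveCleanup class are all hypothetical stand-ins so the snippet runs without calibre; in a real recipe the method would sit on your BasicNewsRecipe subclass, with the rules kept under a different name so that preprocess_regexps itself stays empty.

```python
import re

# Hypothetical cleanup rules, in the same (compiled pattern, replacement
# function) shape that preprocess_regexps uses.
ARTICLE_REGEXPS = [
    (re.compile(r'<div class="ads">.*?</div>', re.DOTALL), lambda match: ''),
]

# Hypothetical marker assumed to appear only on section/index pages.
INDEX_MARKER = 'id="article-list"'


class SelectiveCleanup:
    """Stand-in for the relevant part of a recipe class."""

    def preprocess_raw_html(self, raw_html, url):
        # Index pages are returned untouched so parse_index() still sees
        # the article titles and URLs.
        if INDEX_MARKER in raw_html:
            return raw_html
        # Article pages get every cleanup rule applied in order.
        for pattern, replacement in ARTICLE_REGEXPS:
            raw_html = pattern.sub(replacement, raw_html)
        return raw_html
```

The check here is a plain substring test on the page contents; anything that reliably tells the two page types apart would do.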
|
12-06-2011, 05:48 AM | #3 |
Junior Member
Posts: 7
Karma: 10
Join Date: Jul 2011
Device: Kindle
|
Thanks for the response, Kovid.
In preprocess_raw_html() I can use either the URL or the raw HTML contents to decide whether to apply the regexes. The section and article URLs are generic across the site, e.g.:
http://example.com/?ItemID=3A62EEC05...0733E0349CBA67
http://example.com/?ItemID=59DB44974...66062D8B9D9422
http://example.com/?ItemID=F15164ADF...8EEFD7ED078DAA
...so I cannot tell where I am from the URLs alone. That leaves searching the raw HTML for some particular pattern or string that distinguishes a page containing the section list from one containing a section's article list.
However, this is not the clean solution I was hoping for. I was hoping for a way to somehow "turn off" calibre's use of preprocess_regexps on entering parse_index() and turn it back on right before exiting. Something along the lines of:
PHP Code:
Thanks.
Last edited by dasp; 12-06-2011 at 05:54 AM. Reason: formatting issues |
12-06-2011, 08:52 AM | #4 |
creator of calibre
Posts: 43,844
Karma: 22666666
Join Date: Oct 2006
Location: Mumbai, India
Device: Various
|
Code:
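def parse_index(self, *args, **kwargs):
    # Save the regexps, then disable them while the index is parsed
    original_value = self.preprocess_regexps
    self.preprocess_regexps = []
    try:
        ret = BasicNewsRecipe.parse_index(self, *args, **kwargs)
    finally:
        # Restore the regexps even if parse_index() raises
        self.preprocess_regexps = original_value
    return ret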
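If the same switch is needed in more than one place, the save-and-restore step could also be wrapped in a context manager. This is only a sketch: regexps_disabled and FakeRecipe are hypothetical names introduced here so the snippet runs without calibre, and they are not part of calibre's API.

```python
from contextlib import contextmanager


@contextmanager
def regexps_disabled(recipe):
    """Temporarily empty recipe.preprocess_regexps, restoring it on exit."""
    saved = recipe.preprocess_regexps
    recipe.preprocess_regexps = []
    try:
        yield recipe
    finally:
        # Runs even if the body raises, so the rules are never lost.
        recipe.preprocess_regexps = saved


class FakeRecipe:
    """Hypothetical stand-in so the sketch runs without calibre."""
    preprocess_regexps = [('pattern', 'replacement')]

    def parse_index(self):
        with regexps_disabled(self):
            # Index fetching would happen here, with the regexps off.
            return list(self.preprocess_regexps)
```

Inside the with block the list is empty; once the block exits, the original rules are back in place for the article downloads.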
|