12-05-2011, 04:40 PM   #1
dasp
Selective preprocess_regexps

Hi,

Is there a way to selectively turn on/off the usage of preprocess_regexps?

My recipe's parse_index() works as follows:

1. First, visit the newspaper's main page and extract the section names and section URLs.
2. Then visit each section URL to extract the articles within that section.
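
For context, the two-step flow above can be sketched roughly as follows. This is only an outline under assumptions: `fetch_sections` and `fetch_articles` are hypothetical stand-ins for the recipe's own scraping code, and the URLs are placeholders. The return shape, a list of (section title, list of article dicts) tuples, is what parse_index() is expected to produce.

```python
def fetch_sections():
    # Hypothetical stand-in for step 1: scrape the main page for sections.
    return [("World", "http://example.com/?ItemID=AAA"),
            ("Sports", "http://example.com/?ItemID=BBB")]


def fetch_articles(section_url):
    # Hypothetical stand-in for step 2: scrape one section page for articles.
    return [("Sample headline", section_url + "&article=1")]


def build_feeds():
    """Return a list of (section_title, [article_dict, ...]) tuples."""
    feeds = []
    for section_title, section_url in fetch_sections():
        articles = [{"title": title, "url": url}
                    for title, url in fetch_articles(section_url)]
        if articles:  # skip sections with no articles
            feeds.append((section_title, articles))
    return feeds
```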

Since the latest update to the newspaper's site, step 2 fails because preprocess_regexps strips out the part of the HTML that contains the article titles and URLs.

I need the preprocess_regexps because it strips out all the crap in the actual article contents; however, I don't need it/want it during the parse_index() stage.

Is there a solution for my problem?

Thanks!
12-05-2011, 09:46 PM   #2
kovidgoyal
creator of calibre
Implement preprocess_raw_html() in your recipe and apply the regexes yourself after checking the HTML.
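
A minimal sketch of that approach, written as a standalone class so it runs without calibre. It assumes the hook has the signature preprocess_raw_html(self, raw_html, url) and that each regex entry is a (compiled pattern, replacement function) pair, as in preprocess_regexps. The list name `article_regexps`, the sample pattern, and the `is_article_page` check are all hypothetical.

```python
import re


class RecipeSketch:
    # Hypothetical list, kept under a different name than preprocess_regexps
    # so that the regexes are only applied where this class decides to.
    article_regexps = [
        (re.compile(r'<div class="ad">.*?</div>', re.DOTALL), lambda match: ''),
    ]

    def is_article_page(self, raw_html):
        # Hypothetical check; replace with whatever identifies an article page.
        return 'id="article-body"' in raw_html

    def preprocess_raw_html(self, raw_html, url):
        if not self.is_article_page(raw_html):
            return raw_html  # leave section and index pages untouched
        for pattern, replacement in self.article_regexps:
            raw_html = pattern.sub(replacement, raw_html)
        return raw_html
```

In an actual recipe the class would subclass BasicNewsRecipe, and the regex list would presumably live outside preprocess_regexps so that it is not also applied automatically.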
12-06-2011, 05:48 AM   #3
dasp
Thanks for the response, Kovid.

In preprocess_raw_html() I can either use the URL or the raw HTML contents to decide between applying the regexes or not.

The section and article URLs are generic across the site, e.g.:

http://example.com/?ItemID=3A62EEC05...0733E0349CBA67
http://example.com/?ItemID=59DB44974...66062D8B9D9422
http://example.com/?ItemID=F15164ADF...8EEFD7ED078DAA

...so I cannot tell which kind of page I am on from the URLs alone.

That leaves me with searching the raw HTML for some particular pattern or string that identifies the page as either the section list or a section's article list. However, this is not as clean a solution as I was hoping for.
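
Such a marker-based check could look roughly like this. The marker strings are invented for illustration; the real ones would have to be picked by inspecting the site's HTML.

```python
SECTION_LIST_MARKER = '<ul id="sections">'       # hypothetical marker
ARTICLE_LIST_MARKER = '<div class="headlines">'  # hypothetical marker


def classify_page(raw_html):
    """Return 'section_list', 'article_list', or 'article'."""
    if SECTION_LIST_MARKER in raw_html:
        return 'section_list'
    if ARTICLE_LIST_MARKER in raw_html:
        return 'article_list'
    # Anything without a known marker is treated as an article page.
    return 'article'
```

The regexes would then be applied only when classify_page() returns 'article'.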

I was hoping for a way to "turn off" calibre's use of preprocess_regexps on entering parse_index() and turn it back on right before exiting.

Something along these lines:

Code:
FULL_REGEX_LIST = [ re.compile(...) ]
preprocess_regexps = FULL_REGEX_LIST

def parse_index(self):
    self.preprocess_regexps = []
    ## construct feeds list
    self.preprocess_regexps = FULL_REGEX_LIST
    return feeds
I actually tried this and it didn't work. So if you have any idea/suggestion to achieve this, great. If not, I'll do it the way you originally proposed.

Thanks.

Last edited by dasp; 12-06-2011 at 05:54 AM. Reason: formatting issues
12-06-2011, 08:52 AM   #4
kovidgoyal
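Code:
def parse_index(self, *args, **kwargs):
    # Remember the full regex list, clear it while the index is built,
    # and restore it afterwards even if parse_index raises.
    original_value = self.preprocess_regexps
    self.preprocess_regexps = []
    try:
        ret = BasicNewsRecipe.parse_index(self, *args, **kwargs)
    finally:
        self.preprocess_regexps = original_value
    return ret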
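
The same save-and-restore idea can also be packaged as a context manager. This is a sketch only: `regexps_disabled` and `FakeRecipe` are hypothetical names, and the stand-in class exists just so the snippet runs without calibre.

```python
from contextlib import contextmanager


@contextmanager
def regexps_disabled(recipe):
    """Temporarily empty recipe.preprocess_regexps, restoring it afterwards."""
    saved = recipe.preprocess_regexps
    recipe.preprocess_regexps = []
    try:
        yield recipe
    finally:
        # Runs on normal exit and when the body raises.
        recipe.preprocess_regexps = saved


class FakeRecipe:
    # Stand-in object; a real recipe would subclass BasicNewsRecipe.
    preprocess_regexps = ['placeholder-regex']
```

Inside parse_index() this would be used as `with regexps_disabled(self): ...` around the code that builds the feeds list.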