preprocess_regexps and ePub-based Recipes?

tobias2 · 02-12-2011, 02:22 PM

Hi all,

For recipes based on downloading ePubs and then converting them, such as "Now Toronto", preprocess_regexps does not seem to get used. These recipes essentially only implement build_index(self). Any idea what needs to be added to build_index(self) so that the rules defined in preprocess_regexps get processed? Alternatively, I would also be happy with another way to add regular expression processing to this type of recipe that goes beyond the three sr1_search and sr1_replace pairs (1, 2, 3) in conversion_options.

Thanks,

Tobias

kovidgoyal · 02-12-2011, 02:54 PM

You are free to run over the HTML in the downloaded epub and run whatever regexes you like in build_index.

tobias2 · 02-13-2011, 06:25 AM

I am not too familiar with Python programming. Is there a simple call that I can add in build_index such that anything that is defined in preprocess_regexps gets processed? Right now the code (in the "Now Toronto" example) is as follows:

Code:

    preprocess_regexps    = [
        (re.compile(r'foo'), lambda match: 'bar'),
    ]

    def build_index(self):
        epub_feed = "http://feeds.feedburner.com/NowEpubEditions"
        soup = self.index_to_soup(epub_feed)
        url = soup.find(name = 'feedburner:origlink').string
        f = urllib2.urlopen(url)
        tmp = PersistentTemporaryFile(suffix='.epub')
        self.report_progress(0,_('downloading epub'))
        tmp.write(f.read())
        tmp.close()
        zfile = zipfile.ZipFile(tmp.name, 'r')
        self.report_progress(0,_('extracting epub'))
        zfile.extractall(self.output_dir)
        tmp.close()
        index = os.path.join(self.output_dir, 'content.opf')
        self.report_progress(1,_('epub downloaded and extracted'))

        return index

Thanks,

Tobias

kovidgoyal · 02-13-2011, 09:12 AM

There's no simple call; you have to write the code to iterate over all HTML files, read them, run the regexps on them, and write them back.

tobias2 · 02-13-2011, 01:31 PM

I have now looked into the source for a while to get some idea of how to do this, but to no avail. There is too much I would need to do to be able to properly debug things and figure out how this works. Would you (or someone else, for that matter) maybe be able to provide the lines that I would need to "iterate over all html files, read them run the regexps on them and write them back"? I would much appreciate this. I think such an example may be generally helpful for recipes that are based on ePub downloads.

Thanks in advance, to whoever finds the time to provide the lines.

Cheers,

Tobias

kovidgoyal · 02-13-2011, 01:54 PM

Code:

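import os

from calibre import walk

# Meant to run inside build_index, after the epub has been extracted.
for path in walk('.'):
    # splitext returns (root, ext); ext keeps its leading dot.
    if os.path.splitext(path)[1].lower() in ('.html', '.htm'):
        with open(path, 'r+b') as f:
            raw = f.read()
            raw = raw.decode('utf-8')
            for pat, func in self.preprocess_regexps:
                raw = pat.sub(func, raw)
            # Rewind and empty the file before writing the result back.
            f.seek(0)
            f.truncate()
            f.write(raw.encode('utf-8'))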

This will need some adjustments, of course.
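
As an illustration of what those adjustments might look like, here is a minimal sketch of the same loop as a standalone helper. It uses the standard library's os.walk instead of calibre's walk so it can be tried outside calibre. The function name apply_regexps and the extensions parameter are hypothetical, not part of calibre, and the sketch assumes the extracted files are UTF-8 encoded.

```python
import os
import re


def apply_regexps(root_dir, regexps, extensions=('.html', '.htm', '.xhtml')):
    """Run each (pattern, replacement) pair over every HTML file under root_dir.

    Hypothetical helper, not part of calibre. Assumes UTF-8 encoded files.
    Returns the list of paths that were rewritten.
    """
    changed = []
    for dirpath, _dirnames, filenames in os.walk(root_dir):
        for name in filenames:
            # Skip anything that is not an HTML file (images, CSS, OPF, ...).
            if os.path.splitext(name)[1].lower() not in extensions:
                continue
            path = os.path.join(dirpath, name)
            with open(path, 'rb') as f:
                original = f.read().decode('utf-8')
            text = original
            # Apply the rules in order, as with preprocess_regexps.
            for pat, func in regexps:
                text = pat.sub(func, text)
            # Only rewrite files whose content actually changed.
            if text != original:
                with open(path, 'wb') as f:
                    f.write(text.encode('utf-8'))
                changed.append(path)
    return changed
```

In a recipe like the one quoted above, the call would presumably go after zfile.extractall(self.output_dir) and before return index, for example apply_regexps(self.output_dir, self.preprocess_regexps).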

tobias2 · 02-13-2011, 04:59 PM

Awesome, thanks so much, I will give it a try.

Cheers,

Tobias



