Most efficient way to process file contents of exploded ePub

Agama · 09-16-2012, 10:43 AM

I have a plugin which has several functions to support my post-conversion workflow. One function parses contents.opf for xhtml files, then applies a set of regexes to their contents. Each xhtml file is read in turn into a list using readlines, and then processed line by line against all the regexes with a simple "for line in item:" loop.

I am now wondering whether it would be more efficient, and run faster, to read each xhtml file as a single string and apply the regexes to the whole contents at once (using DOTALL to span lines)?

Is there an accepted 'best practise' for doing this sort of file processing in Python or is it just down to programmer preference?

kovidgoyal · 09-16-2012, 10:52 AM

Using a whole-file regex is almost always going to be faster. Note the "almost": the answer actually depends on the regex and on the I/O vs. CPU profile of the machine it is being run on.
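
As a rough illustration of the two approaches being compared, here is a minimal sketch. The pattern and the sample text are hypothetical and are not taken from Agama's plugin. It also shows a side effect of the whole-file approach: with DOTALL, a pattern can match markup that spans a line break, which the line-by-line loop cannot see.

```python
import re

# Hypothetical rule: unwrap <span class="tmp">...</span>, even across lines.
# DOTALL lets "." match newlines as well.
UNWRAP = re.compile(r'<span class="tmp">(.*?)</span>', re.DOTALL)

def process_by_line(lines):
    """Apply the regex to each line separately (the readlines approach)."""
    return "".join(UNWRAP.sub(r"\1", line) for line in lines)

def process_whole(text):
    """Apply the regex once to the whole file contents."""
    return UNWRAP.sub(r"\1", text)

sample = 'a <span class="tmp">one</span>\nb <span class="tmp">two\nthree</span>\n'

by_line = process_by_line(sample.splitlines(True))  # keep line endings
whole = process_whole(sample)

print(repr(by_line))  # the span that crosses a line break is left untouched
print(repr(whole))    # both spans are unwrapped
```

Which version is actually faster for a given set of regexes is best checked by timing both on real files, for example with the standard timeit module.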

Agama · 09-22-2012, 06:08 PM

I've now rewritten this to process each xhtml file as a whole and it's certainly very quick. One thing that I have noticed, however: the files start off with LF as the end-of-line marker, but by the time I have written them back the lines end with CR LF. I can't see how I've achieved this (it's not in my regexes). I use a simple read() and write() and presume these are not the cause.

Any ideas what's doing this? It doesn't seem to stop the ePub from working but it feels messy to have some files with a different line termination.

(Calibre 0.8.69 on Windows 7 x64)

kovidgoyal · 09-22-2012, 11:54 PM

Open the file in binary mode, like this:

f = open('name', 'rb') for reading

and

f = open('name', 'wb') for writing.
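
For context: on Windows, a file opened for writing in text mode has each '\n' translated to '\r\n', which is where the CR LF endings come from. Binary mode turns that translation off. A minimal sketch of a binary-mode round trip follows. The function name, the regex and the temporary demo file are all hypothetical. Note that under Python 3 a file opened with 'rb' yields bytes, so the pattern has to be a bytes pattern; under the Python 2 that calibre used at the time, an ordinary str pattern would work.

```python
import os
import re
import tempfile

# Hypothetical rule: replace the named entity with a numeric one.
# A bytes pattern is needed because 'rb' returns bytes in Python 3.
NBSP = re.compile(br"&nbsp;")

def rewrite_in_place(path):
    """Read the whole file, apply the regex, and write it back unchanged otherwise."""
    with open(path, "rb") as f:   # binary: no newline translation on read
        data = f.read()
    data = NBSP.sub(b"&#160;", data)
    with open(path, "wb") as f:   # binary: '\n' is written as-is, even on Windows
        f.write(data)

# Small demonstration on a temporary file with LF line endings.
fd, demo_path = tempfile.mkstemp(suffix=".xhtml")
os.close(fd)
with open(demo_path, "wb") as f:
    f.write(b"<p>a&nbsp;b</p>\n<p>c</p>\n")

rewrite_in_place(demo_path)

with open(demo_path, "rb") as f:
    result = f.read()
os.remove(demo_path)

print(repr(result))
```

In Python 3, opening in text mode with newline='' is another way to avoid the translation while still working with str rather than bytes.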

Agama · 09-23-2012, 07:49 AM

So simple! I assumed the 'b' flag was only for non-text files. Thanks.

(I'll be 'supporting calibre' as soon as the PayPal issue is resolved. Calibre's facility for user-written plugins makes an already great application simply brilliant, and adds a whole new level of fun.)
