09-16-2012, 10:43 AM | #1 |
Guru
Posts: 776
Karma: 2751519
Join Date: Jul 2010
Location: UK
Device: PW2, Nexus7
|
Most efficient way to process file contents of exploded ePub
I have a plugin with several functions that support my post-conversion workflow. One function parses contents.opf for the xhtml files, then applies a set of regexes to their contents. Each xhtml file is read in turn into a list using readlines, then processed line by line against all the regexes with a simple for line in item: statement.
I am now wondering whether it would be more efficient, and run faster, to read each xhtml file as a single string and apply the regexes to the whole contents at once (using DOTALL to span lines). Is there an accepted 'best practice' for this sort of file processing in Python, or is it just down to programmer preference? |
09-16-2012, 10:52 AM | #2 |
creator of calibre
Posts: 44,323
Karma: 23661992
Join Date: Oct 2006
Location: Mumbai, India
Device: Various
|
Using a whole-file regex is almost always going to be faster. Note the "almost": the answer actually depends on the regex and on the I/O vs. CPU profile of the machine it is being run on.
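As a rough illustration of the two approaches being compared, here is a minimal sketch. The rules in RULES and both function names are made up for the example and are not taken from the plugin. It shows one practical difference besides speed: a pattern compiled with DOTALL can only match across a line break when it is given the whole text.

```python
import re

# Hypothetical rules for illustration: (compiled pattern, replacement) pairs.
RULES = [
    # Drop empty paragraphs.
    (re.compile(r'<p>\s*</p>'), ''),
    # Unwrap bare spans; DOTALL lets '.' match newlines as well.
    (re.compile(r'<span>(.*?)</span>', re.DOTALL), r'\1'),
]

def apply_per_line(lines, rules=RULES):
    """Line-by-line approach: run every rule against each line separately."""
    out = []
    for line in lines:
        for pattern, repl in rules:
            line = pattern.sub(repl, line)
        out.append(line)
    return ''.join(out)

def apply_whole(text, rules=RULES):
    """Whole-file approach: run every rule once over the full text."""
    for pattern, repl in rules:
        text = pattern.sub(repl, text)
    return text

sample = '<p></p>\n<p><span>one\ntwo</span></p>\n'
per_line = apply_per_line(sample.splitlines(True))  # keepends=True, like readlines()
whole = apply_whole(sample)

print(repr(per_line))  # the span split across two lines is left untouched
print(repr(whole))     # the same span is unwrapped
```

One thing to watch when moving to whole-file matching: `^` and `$` anchor to the start and end of the entire string unless the pattern is also compiled with re.MULTILINE.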
|
09-22-2012, 06:08 PM | #3 |
Guru
Posts: 776
Karma: 2751519
Join Date: Jul 2010
Location: UK
Device: PW2, Nexus7
|
I've now rewritten this to run on each whole xhtml file at a time, and it's certainly very quick. One thing I have noticed, however: the files start off with LF as the end-of-line marker, but by the time I have written them back the lines end with CR LF. I can't see how I've achieved this (it's not in my regexes). I use a simple read() and write() and presume these are not the cause.
Any ideas what's doing this? It doesn't seem to stop the ePub from working but it feels messy to have some files with a different line termination. (Calibre 0.8.69 on Windows 7 x64) |
09-22-2012, 11:54 PM | #4 |
creator of calibre
Posts: 44,323
Karma: 23661992
Join Date: Oct 2006
Location: Mumbai, India
Device: Various
|
Open the file in binary mode, like this:
f = open('name', 'rb') for reading and f = open('name', 'wb') for writing. |
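The reason this helps: in text mode on Windows, Python translates each '\n' to '\r\n' when writing, which is how LF files come back as CR LF. Binary mode skips that translation. Below is a minimal sketch of a read, substitute, write cycle in binary mode. The pattern and the process_file name are hypothetical, and the demonstration uses a throwaway temporary file rather than a real ePub. The pattern is written as a bytes literal because binary mode returns bytes (in Python 2 an ordinary str pattern behaves the same way).

```python
import os
import re
import tempfile

# Hypothetical rule for illustration; a bytes pattern, since 'rb' yields bytes.
PATTERN = re.compile(br'<p>\s*</p>')

def process_file(path):
    """Read, rewrite, and save a file without touching its line endings."""
    with open(path, 'rb') as f:   # 'rb': no newline translation on read
        data = f.read()
    data = PATTERN.sub(b'', data)
    with open(path, 'wb') as f:   # 'wb': bytes are written exactly as given
        f.write(data)

# Demonstration on a throwaway file with LF line endings.
fd, demo_path = tempfile.mkstemp(suffix='.xhtml')
os.close(fd)
with open(demo_path, 'wb') as f:
    f.write(b'<p>one</p>\n<p> </p>\n<p>two</p>\n')

process_file(demo_path)

with open(demo_path, 'rb') as f:
    result = f.read()
os.remove(demo_path)

print(repr(result))  # line endings are still bare LF
```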
09-23-2012, 07:49 AM | #5 |
Guru
Posts: 776
Karma: 2751519
Join Date: Jul 2010
Location: UK
Device: PW2, Nexus7
|
So simple! I assumed the 'b' flag was only for non-text files. Thanks.
(I'll be 'supporting calibre' as soon as the PayPal issue is resolved. Calibre's facility for user-written plugins makes an already great application simply brilliant – and adds a whole new level of fun.) |