MobileRead Forums > E-Book Software > Calibre > Development
09-16-2012, 10:43 AM   #1
Agama
Guru
Posts: 776
Karma: 2751519
Join Date: Jul 2010
Location: UK
Device: PW2, Nexus7
Most efficient way to process file contents of exploded ePub

I have a plugin with several functions that support my post-conversion workflow. One function parses contents.opf for the xhtml files and then applies a set of regexes to their contents. Each xhtml file is read in turn into a list with readlines(), and then processed line by line against all the regexes with a simple "for line in item:" loop.

I am now wondering whether it would be more efficient, and run faster, to read each xhtml file as a single string and apply the regexes to the whole contents at once (using DOTALL to span lines).
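The two approaches can be sketched side by side. This is only an illustration under assumptions: the rules in RULES and the sample text are hypothetical (the plugin's real regexes are not shown in this thread), and the code is written for Python 3.

```python
import re

# Hypothetical rules for illustration only.
RULES = [
    (re.compile(r'<b>(.*?)</b>'), r'<strong>\1</strong>'),
    # MULTILINE makes $ match at the end of every line, so this rule
    # behaves the same on a single line and on a whole file.
    (re.compile(r'[ \t]+$', re.MULTILINE), ''),
]

def process_by_line(lines):
    """Apply every rule to each line in turn (the readlines approach)."""
    out = []
    for line in lines:
        for pattern, repl in RULES:
            line = pattern.sub(repl, line)
        out.append(line)
    return ''.join(out)

def process_whole(text):
    """Apply every rule once to the whole file contents."""
    for pattern, repl in RULES:
        text = pattern.sub(repl, text)
    return text

sample = '<p><b>bold</b> text   \n<b>split\nacross lines</b></p>\n'
by_line = process_by_line(sample.splitlines(True))
whole = process_whole(sample)

# With DOTALL, "." also matches newlines, so a whole-file regex can
# match markup that spans lines, which a per-line loop never sees.
spanning = re.compile(r'<b>(.*?)</b>', re.DOTALL)
whole_dotall = spanning.sub(r'<strong>\1</strong>', sample)
```

One point to watch when switching: anchors such as ^ and $ only work per line on a whole-file string when the pattern is compiled with re.MULTILINE, and DOTALL changes what "." matches, so each regex is worth re-checking rather than reusing blindly.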

Is there an accepted 'best practice' for doing this sort of file processing in Python, or is it just down to programmer preference?
09-16-2012, 10:52 AM   #2
kovidgoyal
creator of calibre
Posts: 43,826
Karma: 22666666
Join Date: Oct 2006
Location: Mumbai, India
Device: Various
Using a whole-file regex is almost always going to be faster. Note the "almost": the answer actually depends on the regex and on the I/O vs. CPU profile of the machine it is being run on.
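Since the answer depends on the regex and the machine, the simplest check is to measure both. The following is a rough sketch in Python 3 using the standard timeit module; PATTERN and the generated text are made up for the example, and it times only the in-memory regex work, not file I/O.

```python
import re
import timeit

# Hypothetical rule: strip spaces and tabs before a closing </p> tag.
PATTERN = re.compile(r'[ \t]+</p>')

# Synthetic stand-in for the contents of one xhtml file.
text = ''.join('<p>Paragraph %d   </p>\n' % i for i in range(2000))
lines = text.splitlines(True)

def by_line():
    # One regex call per line.
    return ''.join(PATTERN.sub('</p>', line) for line in lines)

def whole():
    # One regex call for the whole file.
    return PATTERN.sub('</p>', text)

# Both approaches must agree before their timings mean anything.
assert by_line() == whole()

t_line = timeit.timeit(by_line, number=20)
t_whole = timeit.timeit(whole, number=20)
print('line-by-line: %.4fs  whole-file: %.4fs' % (t_line, t_whole))
```

Timings will vary with the pattern, the file size and the machine, so it is worth running this with the plugin's real regexes and a real book.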
09-22-2012, 06:08 PM   #3
Agama
I've now rewritten this to process each xhtml file as a whole, and it's certainly very quick. One thing I have noticed, however: the files start off with LF as the end-of-line marker, but by the time I have written them back the lines end with CR LF. I can't see how I've achieved this (it's not in my regexes). I use a simple read() and write() and presume these are not the cause.

Any ideas what's doing this? It doesn't seem to stop the ePub from working but it feels messy to have some files with a different line termination.

(Calibre 0.8.69 on Windows 7 x64)
09-22-2012, 11:54 PM   #4
kovidgoyal
Open the file in binary mode, like this:

f = open('name', 'rb') for reading

and

f = open('name', 'wb') for writing.
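For context, text mode on Windows translates each LF to CR LF when writing, which is where the extra CR comes from; binary mode writes the bytes exactly as given. Below is a minimal round-trip sketch, assuming Python 3 (where a file opened with 'rb' yields bytes, so the regex must be a bytes pattern). The clean_file helper, the pattern and the temporary file are hypothetical and only there to make the example self-contained.

```python
import os
import re
import tempfile

# Bytes pattern: strips spaces and tabs that sit just before an LF.
PATTERN = re.compile(br'[ \t]+(?=\n)')

def clean_file(path):
    """Read, modify and rewrite a file without touching its line endings."""
    with open(path, 'rb') as f:   # binary read: no newline translation
        data = f.read()
    data = PATTERN.sub(b'', data)
    with open(path, 'wb') as f:   # binary write: LF stays LF on Windows too
        f.write(data)

# Demonstration on a throwaway file with LF line endings.
fd, path = tempfile.mkstemp(suffix='.xhtml')
os.close(fd)
with open(path, 'wb') as f:
    f.write(b'<p>one</p>  \n<p>two</p>\n')

clean_file(path)

with open(path, 'rb') as f:
    result = f.read()
os.remove(path)
```

In Python 3, an alternative that keeps str patterns is to open the file in text mode with newline='' (and an explicit encoding), which also disables newline translation.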
09-23-2012, 07:49 AM   #5
Agama
So simple! I assumed the 'b' flag was only for non-text files. Thanks.

(I'll be 'supporting calibre' as soon as the PayPal issue is resolved. Calibre's facility for user-written plugins makes an already great application simply brilliant – and adds a whole new level of fun.)
All times are GMT -4.