Old 06-22-2011, 02:39 PM   #16
chaley
Grand Sorcerer
Posts: 11,734
Karma: 6690881
Join Date: Jan 2010
Location: Notts, England
Device: Kobo Libra 2
Quote:
Originally Posted by kiwidude View Post
@Kovid - the Find Duplicates plugin for binary comparison does two passes.

The first is to add candidates to a map by getting the os stat size and modified datetime:
Depending on how far we want to look ahead, we might need to abandon this. I can imagine that various cloud implementations don't have this information. On the other hand, there should be no trouble in supplying an API to give some info from 'stat' to the caller.
Quote:
The second pass is on the reduced subset (where size and modified datetime match) to compute a hash for each of those books:
A pipe would work very well for this, as would a memory file.
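For illustration only (this is not the plugin's source): a minimal sketch of the two-pass idea described in the quotes, assuming plain files on a local disk. The function name find_binary_duplicates and the choice of SHA-256 are assumptions made for the example.

```python
import hashlib
import os
from collections import defaultdict


def find_binary_duplicates(paths):
    """Return groups of paths whose contents are byte-identical."""
    # Pass 1: cheap grouping on (size, modified time) taken from os.stat.
    candidates = defaultdict(list)
    for path in paths:
        st = os.stat(path)
        candidates[(st.st_size, int(st.st_mtime))].append(path)

    # Pass 2: hash only the files that share a (size, mtime) key.
    duplicates = []
    for group in candidates.values():
        if len(group) < 2:
            continue
        by_hash = defaultdict(list)
        for path in group:
            digest = hashlib.sha256()
            with open(path, "rb") as f:
                for chunk in iter(lambda: f.read(64 * 1024), b""):
                    digest.update(chunk)
            by_hash[digest.hexdigest()].append(path)
        duplicates.extend(sorted(g) for g in by_hash.values() if len(g) > 1)
    return duplicates
```

With a backend that cannot supply stat information, pass 1 would have to be replaced by whatever cheap metadata that backend can offer.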
Old 06-22-2011, 02:43 PM   #17
chaley
Grand Sorcerer
Quote:
Originally Posted by kovidgoyal View Post
Hmm, would the file descriptors be inherited in child processes? If so, that is the perfect solution.
The usual problem here is not inheritance, it is too much inheritance. The implementation must ensure that only the 'right' descriptors are left open in the child. In a producer-consumer relationship, usually the child inherits the read pipe, and must not inherit the write pipe or it will never close. In all cases, the writer must have only the write end and the reader must have only the read end, or file close semantics will not be respected.

Both Windows and *nix treat pipes as normal file descriptors with subprocess inheritance. I have used pipe inheritance in native code on both systems. I would be astonished if Python doesn't support it. That said, I have been astonished before.
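A rough sketch of the "only the right descriptors" rule in Python, assuming a POSIX system and Python 3 (subprocess.Popen's pass_fds argument is not available on Windows). The helper name send_to_child and the inline child script are made up for the example. The child inherits only the read end; the parent closes its copy of the read end and keeps only the write end, so closing the write end is what delivers EOF to the child.

```python
import os
import subprocess
import sys

# Inline child script (illustrative): read everything from the inherited
# descriptor and echo it to stdout so the parent can check what arrived.
CHILD_CODE = (
    "import os, sys\n"
    "fd = int(sys.argv[1])\n"
    "with os.fdopen(fd, 'rb') as src:\n"
    "    sys.stdout.buffer.write(src.read())\n"
)


def send_to_child(payload):
    """Stream payload to a child over a pipe and return what it echoes back."""
    read_fd, write_fd = os.pipe()
    child = subprocess.Popen(
        [sys.executable, "-c", CHILD_CODE, str(read_fd)],
        pass_fds=(read_fd,),   # the child gets the read end and nothing else
        stdout=subprocess.PIPE,
    )
    os.close(read_fd)          # the parent (writer) keeps only the write end
    with os.fdopen(write_fd, "wb") as sink:
        sink.write(payload)
    # The write end is closed here, so the child's read() sees EOF.
    out, _ = child.communicate()
    return out
```

If the child also held the write end, its read() would never return, which is the "never close" problem described above.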
Old 06-22-2011, 03:00 PM   #18
kovidgoyal
creator of calibre
Posts: 43,843
Karma: 22666666
Join Date: Oct 2006
Location: Mumbai, India
Device: Various
I've added two new functions to the db API

format_metadata() returns size and last modified (as a datetime object)

format_hash()

These should be enough for kiwidude, and we can have them do something sensible with a cloud-based backend.
Old 06-22-2011, 03:10 PM   #19
kovidgoyal
creator of calibre
Coming to the question of pipes: It seems to me that what we really need is some kind of proxy object that allows the calling of methods from the db API in one process from another process. Jobs running in child processes can then pretty much do anything to the db by calling the appropriate API.

The multiprocessing module has the necessary IPC plumbing to implement this, I believe.
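To make the proxy idea concrete, here is a small sketch built on multiprocessing.Pipe. Everything named here (FakeDb, serve, DbProxy) is hypothetical and is not calibre's db API. For simplicity the serving side runs in a thread of the same process; the same Connection objects can also be handed to a child process.

```python
import threading
from multiprocessing import Pipe


class FakeDb:
    """Hypothetical stand-in for the db API; not calibre's real class."""

    def title(self, book_id):
        return {1: "Dune"}.get(book_id)


def serve(db, conn):
    """Answer (method, args) requests on conn until None is received."""
    while True:
        request = conn.recv()
        if request is None:
            break
        method, args = request
        try:
            conn.send(("ok", getattr(db, method)(*args)))
        except Exception as err:
            conn.send(("error", repr(err)))


class DbProxy:
    """Forwards method calls over a Connection to the serving side."""

    def __init__(self, conn):
        self._conn = conn

    def __getattr__(self, method):
        def call(*args):
            self._conn.send((method, args))
            status, value = self._conn.recv()
            if status == "error":
                raise RuntimeError(value)
            return value
        return call

    def close(self):
        self._conn.send(None)


def demo():
    server_end, client_end = Pipe()
    server = threading.Thread(target=serve, args=(FakeDb(), server_end))
    server.start()
    proxy = DbProxy(client_end)
    title = proxy.title(1)   # looks like a local call, runs on the server side
    proxy.close()
    server.join()
    return title
```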
Old 06-22-2011, 03:37 PM   #20
kovidgoyal
creator of calibre
Hmm, multiprocessing has a Connection object that abstracts sockets on Unix and named pipes on Windows. Unfortunately, it only has poll(), not select(), which would have pretty bad performance implications, I imagine.
Old 06-22-2011, 03:58 PM   #21
kovidgoyal
creator of calibre
All in all, there doesn't seem to be any way to do this. I thought of using TCP/IP sockets, but then there will be problems on Windows with antivirus programs blocking the creation of the sockets.

I don't really see any way out for this except to do what calibre does in this kind of situation, which is first run in process, copy out the data that is being worked on, and then launch a worker process that works on the data and returns the results via the filesystem. This does have performance implications, but for the vast majority of ebook files, the performance hit should be very small.

For jobs that don't do a lot of work on the data, like the duplicate finder, run them in process and use either a SpooledTemporaryFile or a special API like format_hash().
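As an illustration of the SpooledTemporaryFile option: a sketch that copies a format's bytes from any file-like source into a spool that stays in memory below a size threshold and only spills to disk above it. The function name hash_via_spool, the 5 MB threshold and the use of SHA-256 are assumptions for the example, not anything calibre defines.

```python
import hashlib
import io
import shutil
from tempfile import SpooledTemporaryFile


def hash_via_spool(source, max_in_memory=5 * 1024 * 1024):
    """Copy source into a spooled temp file, then hash the copy."""
    with SpooledTemporaryFile(max_size=max_in_memory) as spool:
        shutil.copyfileobj(source, spool)   # stays in RAM while small
        spool.seek(0)
        digest = hashlib.sha256()
        for chunk in iter(lambda: spool.read(64 * 1024), b""):
            digest.update(chunk)
        return digest.hexdigest()


# Usage with an in-memory source standing in for a book format.
print(hash_via_spool(io.BytesIO(b"example book bytes")))
```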

That leaves launching external editors on data. At the moment, the only thing I can see is monitoring the temp file for changes. We can add an API to have the temp file created outside the normal calibre temp dir, so that if the user leaves the external editor running when quitting calibre, it will not affect temp file cleanup for the other temp files.
Old 06-22-2011, 04:01 PM   #22
kiwidude
Calibre Plugins Developer
Posts: 4,636
Karma: 2162064
Join Date: Oct 2010
Location: Australia
Device: Kindle Oasis
@Kovid - wrt the external editors. How is this thread going to know when you have stopped editing? As it is fairly common to save often (particularly when using Sigil which is so horribly buggy).
Old 06-22-2011, 04:12 PM   #23
kovidgoyal
creator of calibre
It won't. It will wait a little while after each change and, if there are no more changes, it will update. On calibre shutdown, any waiting files will be updated.
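A sketch of that "wait a little while after each change" behaviour, assuming the temp file is watched by periodically comparing its modification time and size. The class name SettledFileWatcher and the two-second quiet period are invented for the example; the caller is expected to call poll() from a timer.

```python
import os
import time


class SettledFileWatcher:
    """Reports a change only after the file has been quiet for a while."""

    def __init__(self, path, quiet_seconds=2.0):
        self.path = path
        self.quiet_seconds = quiet_seconds
        self._last_seen = self._signature()
        self._changed_at = None   # time of the most recent unreported change

    def _signature(self):
        st = os.stat(self.path)
        return (st.st_mtime_ns, st.st_size)

    def poll(self, now=None):
        """Return True once a change has been followed by a quiet period."""
        now = time.monotonic() if now is None else now
        current = self._signature()
        if current != self._last_seen:
            # Another save arrived: restart the quiet period.
            self._last_seen = current
            self._changed_at = now
            return False
        if self._changed_at is not None and now - self._changed_at >= self.quiet_seconds:
            self._changed_at = None
            return True
        return False
```

Frequent saves simply keep restarting the quiet period, so only the last save in a burst triggers an update.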

Another possibility is to just disable this functionality for network backend dbs and continue to allow direct access for local dbs. The idea being that if you want to work on a set of files, you move them from the network to a local library, work on them, and once you're done, move them back.
Old 06-22-2011, 04:42 PM   #24
chaley
Grand Sorcerer
Quote:
Originally Posted by kovidgoyal View Post
All in all, there doesn't seem to be any way to do this. I thought of using TCP/IP sockets, but then there will be problems on Windows with antivirus programs blocking the creation of the sockets.
Threads & sockets don't work?
Quote:
I don't really see any way out for this except to do what calibre does in this kind of situation, which is first run in process, copy out the data that is being worked on, and then launch a worker process that works on the data and returns the results via the filesystem. This does have performance implications, but for the vast majority of ebook files, the performance hit should be very small.
If the problem is not producer-consumer, then you are probably right. However, if it is, then I don't understand why subprocesses can't be given pipes.
Quote:
That leaves launching external editors on data. At the moment, the only thing I can see is monitoring the temp file for changes. We can add API to have the temp file created outside the normal calibre temp dir, so that if the user leaves the external editor running when quitting calibre, it will not affect temp file cleanup for the other temp files.
You are probably correct. By definition, we have no control over what the external program does, so we must work at its level of abstraction, which is the file.

It might be possible to avoid polling loops by using pessimistic locking and letting the user indicate that s/he is finished. This is similar to what kiwidude didn't want, but might be acceptable because it is non-blocking. In effect, we provide an 'export', locking the book object for update. Whatever application does its thing. When finished, run something that imports the results and breaks the lock.
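Roughly, the pessimistic variant could look like the sketch below. The names CheckoutRegistry, export and import_result are hypothetical and only track which books are locked; the actual copying of data is left out.

```python
class BookLockedError(Exception):
    """Raised when a book is already exported for editing."""


class CheckoutRegistry:
    """Tracks which books are currently exported (locked for update)."""

    def __init__(self):
        self._checked_out = set()

    def export(self, book_id):
        # Taking the lock is immediate: it either succeeds or is refused,
        # so nothing ever blocks waiting for the editor to finish.
        if book_id in self._checked_out:
            raise BookLockedError(book_id)
        self._checked_out.add(book_id)

    def import_result(self, book_id):
        # The explicit import step is what breaks the lock.
        self._checked_out.discard(book_id)

    def is_locked(self, book_id):
        return book_id in self._checked_out
```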

We could choose to go optimistic and not lock anything, with the problem that if the object is updated twice, one of the updates loses, but this also requires the user to indicate that s/he is finished. Optimistic locking works most of the time, but tends to fail spectacularly when it fails.

My thought is that there might be a local 'cache' of the library outside of the temp directory. Exports go there and imports come from there. Both export and import are explicit commands.

Detecting concurrent update isn't hard. The export would have a timestamp/signature. If at import time the object has a different signature, a choice would need to be made -- which wins.
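A sketch of that signature check, using a content hash as the signature. The Library class, its export/import_ methods and the overwrite flag are all assumptions made for the example; a timestamp could serve as the signature equally well.

```python
import hashlib


class ConflictError(Exception):
    """Raised when the book changed between export and import."""


def signature(data):
    return hashlib.sha256(data).hexdigest()


class Library:
    """Hypothetical in-memory store mapping book ids to bytes."""

    def __init__(self, books):
        self._books = dict(books)

    def export(self, book_id):
        data = self._books[book_id]
        return data, signature(data)

    def import_(self, book_id, new_data, exported_signature, overwrite=False):
        # If the stored object no longer matches what was exported,
        # someone else updated it in the meantime.
        if signature(self._books[book_id]) != exported_signature and not overwrite:
            raise ConflictError(book_id)
        self._books[book_id] = new_data

    def get(self, book_id):
        return self._books[book_id]
```

The overwrite flag stands in for the "which wins" choice: the caller decides after seeing the conflict.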

This is fairly classic multi-user DB stuff. Consider airline seat reservation. I see a map, pick a seat, and say 'go', only to be told that someone else has already reserved that seat. Same thing with purchases of items with limited stock.
Old 06-22-2011, 05:04 PM   #25
kovidgoyal
creator of calibre
The problem with threads and sockets is performance: there's no select, only poll for each socket. This will be rather nasty in the thread that manages the connections from child processes, unless we launch a new thread per child. And given that only a single Python thread can run at a time...

Given all these complications, to me it's just more reasonable to export in process -> work out of process -> import in process rather than try to do everything out of process. The cost is one extra copy per file, which given the file sizes of typical books seems fairly reasonable to me.
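The export -> work -> import flow might be sketched like this. It is not calibre's job system: the worker is just an inline script that hashes the exported copy, and run_job, WORKER and the result.json file name are invented for the example.

```python
import json
import os
import shutil
import subprocess
import sys
import tempfile

# Illustrative worker: reads the exported copy, writes its result to a file.
WORKER = (
    "import hashlib, json, sys\n"
    "src, dest = sys.argv[1], sys.argv[2]\n"
    "with open(src, 'rb') as f:\n"
    "    digest = hashlib.sha256(f.read()).hexdigest()\n"
    "with open(dest, 'w') as out:\n"
    "    json.dump({'sha256': digest}, out)\n"
)


def run_job(book_path):
    workdir = tempfile.mkdtemp()
    try:
        exported = os.path.join(workdir, "input")
        result_path = os.path.join(workdir, "result.json")
        # 1) Export in process: the one extra copy per file.
        shutil.copyfile(book_path, exported)
        # 2) Work out of process, on the copy only.
        subprocess.check_call([sys.executable, "-c", WORKER, exported, result_path])
        # 3) Import in process: results come back via the filesystem.
        with open(result_path) as f:
            return json.load(f)
    finally:
        shutil.rmtree(workdir, ignore_errors=True)
```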
Old 06-22-2011, 05:13 PM   #26
chaley
Grand Sorcerer
Quote:
Originally Posted by kovidgoyal View Post
The problem with threads and sockets is performance: there's no select, only poll for each socket. This will be rather nasty in the thread that manages the connections from child processes, unless we launch a new thread per child. And given that only a single Python thread can run at a time...
I don't think the GIL is really an issue. Python will switch between the threads as it wishes, distributing the time to do the copy over several context switches. Yes, multiple cores are not used, but this is true even if the file is copied. The work gets done in the same amount of time, but threaded solutions better spread that work over time.
Quote:
Given all these complications, to me it's just more reasonable to export in process -> work out of process -> import in process rather than try to do everything out of process. The cost is one extra copy per file, which given the file sizes of typical books seems fairly reasonable to me.
If the problem is not producer-consumer, then you are absolutely correct. If it is P/C, then I am not convinced.

One thing I don't know is what percentage of operations are what. Readers are P/C, but modifiers are not.
Old 06-22-2011, 05:27 PM   #27
kovidgoyal
creator of calibre
Quote:
Originally Posted by chaley View Post
I don't think the GIL is really an issue. Python will switch between the threads as it wishes, distributing the time to do the copy over several context switches. Yes, multiple cores are not used, but this is true even if the file is copied. The work gets done in the same amount of time, but threaded solutions better spread that work over time.
If the problem is not producer-consumer, then you are absolutely correct. If it is P/C, then I am not convinced.
I/O operations do use multiple cores, since Python releases the GIL during a read/write. My concern is that you will need to have a loop that looks like

Code:
import time

had_operation = False
for socket in connections:
    if socket.poll():
        # handle the read in a relatively non-blocking manner
        had_operation = True
if not had_operation:
    # sleep briefly so the GIL is released for other threads
    time.sleep(0.01)
That just seems beyond ugly and I really don't see it being performant.
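For comparison, a sketch that assumes a Python version providing multiprocessing.connection.wait (added in Python 3.3, so later than this thread). It blocks on several Connection objects at once, which removes the sleep from a loop like the one above. The function name drain_ready is made up for the example.

```python
from multiprocessing import Pipe
from multiprocessing.connection import wait


def drain_ready(connections, timeout=0.1):
    """Block until some connection is readable, then read one message from each."""
    messages = []
    for conn in wait(connections, timeout):
        try:
            messages.append(conn.recv())
        except EOFError:
            # The other end was closed; stop watching this connection.
            connections.remove(conn)
    return messages


# Usage: two one-way pipes, only the second has data waiting.
a_recv, a_send = Pipe(duplex=False)
b_recv, b_send = Pipe(duplex=False)
b_send.send("hash me")
print(drain_ready([a_recv, b_recv]))
```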

Quote:
One thing I don't know is what percentage of operations are what. Readers are P/C, but modifiers are not.
We can effectively make them all readers by requiring the writes to be run in process.
Old 06-22-2011, 05:45 PM   #28
kovidgoyal
creator of calibre
Look at it like this. A job needs to do some work on data from the library. There are two possibilities:

1) If that work is fast/simple, it can be run in process, as with Find Duplicates.

2) If it is not, then the time taken for an extra disk-to-disk copy is going to be pretty small compared to the time taken to do the actual work.
Old 06-22-2011, 05:53 PM   #29
chaley
Grand Sorcerer
Quote:
Originally Posted by kovidgoyal View Post
I/O operations do use multiple cores, since Python releases the GIL during a read/write. My concern is that you will need to have a loop that looks like

Code:
import time

had_operation = False
for socket in connections:
    if socket.poll():
        # handle the read in a relatively non-blocking manner
        had_operation = True
if not had_operation:
    # sleep briefly so the GIL is released for other threads
    time.sleep(0.01)
That just seems beyond ugly and I really don't see it being performant.
If you are using a thread, then the poll goes away. You would do something like
Code:
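try:
    with open(x, 'rb') as f:
        # copy the file to the pipe in chunks until EOF
        while True:
            d = f.read(someAmount)
            if not d:
                break
            outpipe.write(d)
finally:
    # closing the write end signals EOF to the reader
    outpipe.close()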
The GIL will be released and threads will switch if the write to outpipe blocks because the queue is full. If the reader goes away, the write to outpipe will throw an exception. This should perform very well.
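A self-contained sketch of this arrangement, assuming an os.pipe() inside one process, with a producer thread writing and the calling thread acting as the reader. The names stream_file and read_through_pipe are illustrative. With a file larger than the pipe buffer, the producer's write blocks until the reader catches up, which is where the thread switch happens.

```python
import os
import threading


def stream_file(path, write_fd, chunk_size=64 * 1024):
    """Producer: copy the file at path into the write end of a pipe."""
    # Wrapping the descriptor first guarantees the write end is closed,
    # and the reader sees EOF, even if opening the file fails.
    with os.fdopen(write_fd, "wb") as outpipe:
        with open(path, "rb") as f:
            for chunk in iter(lambda: f.read(chunk_size), b""):
                outpipe.write(chunk)   # blocks while the pipe is full


def read_through_pipe(path):
    """Consumer: read the whole file via a pipe fed by a producer thread."""
    read_fd, write_fd = os.pipe()
    producer = threading.Thread(target=stream_file, args=(path, write_fd))
    producer.start()
    with os.fdopen(read_fd, "rb") as inpipe:
        data = inpipe.read()           # returns once the write end closes
    producer.join()
    return data
```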
Old 06-22-2011, 05:58 PM   #30
kovidgoyal
creator of calibre
Hmm, maybe. Let's table this for now. kiwidude can continue to use format_abspath for the moment. Once the new db backend is nearer completion, we can revisit and write some code so that the performance can actually be measured.