Behavior changes - Page 2

chaley · 06-22-2011, 02:39 PM

Quote:

Originally Posted by kiwidude

@Kovid - the Find Duplicates plugin for binary comparison does two passes.

The first is to add candidates to a map by getting the os stat size and modified datetime:

Depending on how far we want to look ahead, we might need to abandon this. I can imagine that various cloud implementations don't have this information. On the other hand, there should be no trouble in supplying an API to give some info from 'stat' to the caller.

Quote:

The second pass is on the reduced subset (where size and modified datetime match) to compute a hash for each of those books:

A pipe would work very well for this, as would a memory file.
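The two passes described above can be sketched roughly as follows. This is a minimal illustration and not the plugin's actual code: the function name `find_binary_duplicates`, the use of SHA-256, and truncating the modified time to whole seconds are all assumptions made for the example.

```python
import hashlib
import os
from collections import defaultdict


def find_binary_duplicates(paths, chunk_size=1 << 20):
    """Return groups (lists) of paths whose contents are byte-identical."""
    # Pass 1: cheap grouping by (size, mtime) from os.stat, no file reads.
    candidates = defaultdict(list)
    for path in paths:
        st = os.stat(path)
        candidates[(st.st_size, int(st.st_mtime))].append(path)

    # Pass 2: hash only the files that share a (size, mtime) key.
    duplicates = []
    for group in candidates.values():
        if len(group) < 2:
            continue
        by_hash = defaultdict(list)
        for path in group:
            digest = hashlib.sha256()
            with open(path, "rb") as f:
                for chunk in iter(lambda: f.read(chunk_size), b""):
                    digest.update(chunk)
            by_hash[digest.hexdigest()].append(path)
        duplicates.extend(g for g in by_hash.values() if len(g) > 1)
    return duplicates
```

Only the first pass depends on `os.stat`, so it is the part a cloud backend would have to replace with its own size and timestamp lookup.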

chaley · 06-22-2011, 02:43 PM

Quote:

Originally Posted by kovidgoyal

Hmm, would the file descriptors be inherited in child processes? If so, that is the perfect solution.

The usual problem here is not inheritance, it is too much inheritance. The implementation must ensure that only the 'right' descriptors are left open in the child. In a producer-consumer relationship, usually the child inherits the read pipe, and must not inherit the write pipe or it will never close. In all cases, the writer must have only the write end and the reader must have only the read end, or file close semantics will not be respected.

Both Windows and *nix treat pipes as normal file descriptors with subprocess inheritance. I have used pipe inheritance in native code on both systems. I would be astonished if Python doesn't support it. That said, I have been astonished before.
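For what it's worth, Python's standard `subprocess` module does cover the producer-consumer case: with `stdin=subprocess.PIPE` the child receives only the read end as its stdin, and the parent keeps only the write end. The sketch below assumes a throwaway child that hashes whatever arrives on stdin; `CHILD_CODE` and `hash_in_child` are names invented for the example.

```python
import subprocess
import sys

# Hypothetical consumer: reads stdin until EOF, prints a SHA-256 hex digest.
CHILD_CODE = (
    "import sys, hashlib;"
    "print(hashlib.sha256(sys.stdin.buffer.read()).hexdigest())"
)


def hash_in_child(data):
    """Feed bytes to a child process over a pipe and return its digest."""
    proc = subprocess.Popen(
        [sys.executable, "-c", CHILD_CODE],
        stdin=subprocess.PIPE,   # child gets the read end as its stdin
        stdout=subprocess.PIPE,  # parent reads the result from here
    )
    # communicate() writes the data, closes the parent's write end so
    # the child sees EOF, then collects the child's output.
    out, _ = proc.communicate(data)
    if proc.returncode != 0:
        raise RuntimeError("child failed with code %d" % proc.returncode)
    return out.decode("ascii").strip()
```

The close performed by `communicate()` is exactly the "writer must have only the write end" rule: if any other process still held the write end open, the child's read would never see EOF.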

kovidgoyal · 06-22-2011, 03:00 PM

I've added two new functions to the db API

format_metadata() returns size and last modified (as a datetime object)

format_hash()

These should be enough for kiwidude and we can have them do something sensible with a cloud based backend.

kovidgoyal · 06-22-2011, 03:10 PM

Coming to the question of pipes: It seems to me that what we really need is some kind of proxy object that allows the calling of methods from the db API in one process from another process. Jobs running in child processes can then pretty much do anything to the db by calling the appropriate API.

The multiprocessing module has the necessary IPC plumbing to implement this, I believe.
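A rough idea of what such a proxy could look like, using a `multiprocessing.Pipe()` pair of `Connection` objects. Everything here is an assumption for illustration: `FakeDb`, its `format_size` method, and the tiny request protocol are invented and are not calibre's db API.

```python
import threading
from multiprocessing import Pipe


class FakeDb:
    """Stand-in for the real db object; the method name is hypothetical."""

    def __init__(self):
        self._sizes = {1: 1024, 2: 2048}

    def format_size(self, book_id):
        return self._sizes[book_id]


def serve(conn, target):
    """Owner side: execute (method, args) requests until None arrives."""
    while True:
        request = conn.recv()
        if request is None:
            break
        name, args = request
        try:
            conn.send(("ok", getattr(target, name)(*args)))
        except Exception as err:
            conn.send(("error", repr(err)))


class DbProxy:
    """Caller side: turns attribute access into requests over the pipe."""

    def __init__(self, conn):
        self._conn = conn

    def __getattr__(self, name):
        def call(*args):
            self._conn.send((name, args))
            status, value = self._conn.recv()
            if status == "error":
                raise RuntimeError(value)
            return value
        return call

    def close(self):
        # Tell the serving loop to stop.
        self._conn.send(None)
```

In this sketch `serve()` would run in a thread of the process that owns the db, while the other end of the pipe is handed to the child, which wraps it in `DbProxy`. A real version would also need to restrict which method names may be called.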

kovidgoyal · 06-22-2011, 03:37 PM

Hmm, multiprocessing has a Connection object that abstracts sockets on Unix and named pipes on Windows. Unfortunately, it only has poll(), not select(), which would have pretty bad performance implications, I imagine.

kovidgoyal · 06-22-2011, 03:58 PM

All in all, there doesn't seem to be any way to do this. I thought of using TCP/IP sockets, but then there will be problems on Windows with antivirus programs blocking the creation of the sockets.

I don't really see any way out of this except to do what calibre does in this kind of situation, which is to first run in process, copy out the data that is being worked on, and then launch a worker process that works on the data and returns the results via the filesystem. This does have performance implications, but for the vast majority of ebook files, the performance hit should be very small.

For jobs that don't do a lot of work on the data, like the duplicate finder, run them in process and use either a SpooledTemporaryFile or a special API like format_hash().

That leaves launching external editors on data. At the moment, the only thing I can see is monitoring the temp file for changes. We can add API to have the temp file created outside the normal calibre temp dir, so that if the user leaves the external editor running when quitting calibre, it will not affect temp file cleanup for the other temp files.

kiwidude · 06-22-2011, 04:01 PM

@Kovid - wrt the external editors. How is this thread going to know when you have stopped editing, given that it is fairly common to save often (particularly when using Sigil, which is so horribly buggy)?

kovidgoyal · 06-22-2011, 04:12 PM

It won't. It will wait a little while after each change, and if there are no more, it will update. On calibre shutdown, any waiting files will be updated.

Another possibility is to just disable this functionality for network backend dbs and continue to allow direct access for local dbs. The idea being that if you want to work on a set of files, you move them from the network to a local library, work on them, and once you're done, move them back.
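The "wait a little while after each change" idea amounts to debouncing. A minimal sketch, assuming the temp file is checked periodically by comparing its modified time and size; the class name `SettledFileWatcher` and the `quiet_seconds` parameter are invented for the example, and the caller passes in the current time so the logic stays easy to test.

```python
import os


class SettledFileWatcher:
    """Report a file as ready once it has changed and then gone quiet."""

    def __init__(self, path, quiet_seconds=2.0):
        self.path = path
        self.quiet_seconds = quiet_seconds
        self._last_seen = self._signature()
        self._changed_at = None  # time of the latest unprocessed change

    def _signature(self):
        st = os.stat(self.path)
        return (st.st_mtime_ns, st.st_size)

    def check(self, now):
        """Call periodically; returns True when an update should be imported."""
        sig = self._signature()
        if sig != self._last_seen:
            # Another save happened: restart the quiet period.
            self._last_seen = sig
            self._changed_at = now
            return False
        if (self._changed_at is not None
                and now - self._changed_at >= self.quiet_seconds):
            self._changed_at = None
            return True
        return False
```

A caller would invoke `check(time.monotonic())` from a timer. On shutdown, any watcher whose pending change has not yet been reported would be imported immediately.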

chaley · 06-22-2011, 04:42 PM

Quote:

Originally Posted by kovidgoyal

All in all, there doesn't seem to be any way to do this. I thought of using TCP/IP sockets, but then there will be problems on Windows with antivirus programs blocking the creation of the sockets.

Threads & sockets don't work?

Quote:

I don't really see any way out of this except to do what calibre does in this kind of situation, which is to first run in process, copy out the data that is being worked on, and then launch a worker process that works on the data and returns the results via the filesystem. This does have performance implications, but for the vast majority of ebook files, the performance hit should be very small.

If the problem is not producer-consumer, then you are probably right. However, if it is, then I don't understand why subprocesses can't be given pipes.

Quote:

That leaves launching external editors on data. At the moment, the only thing I can see is monitoring the temp file for changes. We can add API to have the temp file created outside the normal calibre temp dir, so that if the user leaves the external editor running when quitting calibre, it will not affect temp file cleanup for the other temp files.

You are probably correct. By definition, we have no control over what the external program does, so we must work at its level of abstraction, which is the file.

It might be possible to avoid polling loops by using pessimistic locking and letting the user indicate that s/he is finished. This is similar to what kiwidude didn't want, but might be acceptable because it is non-blocking. In effect, we provide an 'export', locking the book object for update. Whatever application does its thing. When finished, run something that imports the results and breaks the lock.

We could choose to go optimistic and not lock anything, with the problem that if the object is updated twice, one of the updates loses, but this also requires the user to indicate that s/he is finished. Optimistic locking works most of the time, but tends to fail spectacularly when it fails.

My thought is that there might be a local 'cache' of the library outside of the temp directory. Exports go there and imports come from there. Both export and import are explicit commands.

Detecting concurrent update isn't hard. The export would have a timestamp/signature. If at import time the object has a different signature, a choice would need to be made -- which wins.
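The signature check described above might look something like this. It is a sketch under assumptions: the in-memory `Library` class, the method names, and the use of a SHA-256 content hash as the signature are all made up for illustration.

```python
import hashlib


class ConflictError(Exception):
    """The stored object changed after it was exported."""


class Library:
    """Toy in-memory store standing in for the real library."""

    def __init__(self):
        self._formats = {}

    def put(self, key, data):
        self._formats[key] = data

    @staticmethod
    def _signature(data):
        return hashlib.sha256(data).hexdigest()

    def export_format(self, key):
        """Return the data plus the signature it had at export time."""
        data = self._formats[key]
        return data, self._signature(data)

    def import_format(self, key, new_data, exported_signature, force=False):
        """Store new_data unless the object changed since the export."""
        current = self._signature(self._formats[key])
        if current != exported_signature and not force:
            raise ConflictError("%s was modified after export" % key)
        self._formats[key] = new_data
```

Here `force=True` plays the role of the "which wins" decision: the caller, or the user, explicitly chooses to overwrite the newer copy.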

This is fairly classic multi-user DB stuff. Consider airline seat reservation. I see a map, pick a seat, and say 'go', only to be told that someone else has already reserved that seat. Same thing with purchases of items with limited stock.

kovidgoyal · 06-22-2011, 05:04 PM

The problem with threads and sockets is performance, there's no select, only poll for each socket. This will be rather nasty in the thread that manages the connections from child processes, unless we launch a new thread per child. And given that only a single python thread can run at a time...

Given all these complications, to me it's just more reasonable to export in process -> work out of process -> import in process rather than try to do everything out of process. The cost is one extra copy per file, which, given the file sizes of typical books, seems fairly reasonable to me.
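In outline, the export -> work -> import flow could look like the sketch below. The worker is a placeholder that just upper-cases the bytes, and `WORKER_CODE` and `run_job_out_of_process` are invented names; the point is only that the sole things crossing the process boundary are two file paths.

```python
import os
import subprocess
import sys
import tempfile

# Hypothetical worker: reads the exported copy, writes a result file.
WORKER_CODE = """
import sys
src, dst = sys.argv[1], sys.argv[2]
with open(src, 'rb') as f:
    data = f.read()
with open(dst, 'wb') as f:
    f.write(data.upper())
"""


def run_job_out_of_process(data):
    """Export bytes to a temp file, process them in a child, import the result."""
    with tempfile.TemporaryDirectory() as tdir:
        src = os.path.join(tdir, "input.bin")
        dst = os.path.join(tdir, "output.bin")
        # Export (in process): this is the one extra copy per file.
        with open(src, "wb") as f:
            f.write(data)
        # Work (out of process): the child only ever sees plain files.
        subprocess.run([sys.executable, "-c", WORKER_CODE, src, dst], check=True)
        # Import (in process): read the result back via the filesystem.
        with open(dst, "rb") as f:
            return f.read()
```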

chaley · 06-22-2011, 05:13 PM

Quote:

Originally Posted by kovidgoyal

The problem with threads and sockets is performance, there's no select, only poll for each socket. This will be rather nasty in the thread that manages the connections from child processes, unless we launch a new thread per child. And given that only a single python thread can run at a time...

I don't think the GIL is really an issue. Python will switch between the threads as it wishes, distributing the time to do the copy over several context switches. Yes, multiple cores are not used, but this is true even if the file is copied. The work gets done in the same amount of time, but threaded solutions better spread that work over time.

Quote:

Given all these complications, to me it's just more reasonable to export in process -> work out of process -> import in process rather than try to do everything out of process. The cost is one extra copy per file, which, given the file sizes of typical books, seems fairly reasonable to me.

If the problem is not producer-consumer, then you are absolutely correct. If it is P/C, then I am not convinced.

One thing I don't know is what percentage of operations are what. Readers are P/C, but modifiers are not.

kovidgoyal · 06-22-2011, 05:27 PM

Quote:

Originally Posted by chaley

I don't think the GIL is really an issue. Python will switch between the threads as it wishes, distributing the time to do the copy over several context switches. Yes, multiple cores are not used, but this is true even if the file is copied. The work gets done in the same amount of time, but threaded solutions better spread that work over time.

If the problem is not producer-consumer, then you are absolutely correct. If it is P/C, then I am not convinced.

I/O operations do use multiple cores, since Python releases the GIL during a read/write. My concern is that you will need to have a loop that looks like

Code:

had_operation = False
for socket in connections:
    if socket.poll():
        # handle the read in a relatively non-blocking manner
        had_operation = True
if not had_operation:
    # ensure release of the GIL
    time.sleep(0.01)

That just seems beyond ugly and I really don't see it being performant.
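For readers on later Python versions: `multiprocessing.connection.wait()` was added in Python 3.3, after this thread, and blocks on several `Connection` objects at once, so the poll-and-sleep loop above is not needed there. A small sketch, with `drain_ready` as an invented helper name:

```python
from multiprocessing import Pipe
from multiprocessing.connection import wait


def drain_ready(readers, timeout=1.0):
    """Block until at least one connection is readable, then read from each."""
    messages = []
    # wait() returns the subset of connections that are ready, or an
    # empty list if the timeout expires first.
    for conn in wait(readers, timeout):
        try:
            messages.append(conn.recv())
        except EOFError:
            # The other end was closed: stop watching this connection.
            readers.remove(conn)
    return messages
```

A manager thread would call `drain_ready()` in a loop over the connections from all child processes. While it is blocked inside `wait()`, other Python threads are free to run.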

Quote:

One thing I don't know is what percentage of operations are what. Readers are P/C, but modifiers are not.

We can effectively make them all readers by requiring the writes to be run in process.

kovidgoyal · 06-22-2011, 05:45 PM

Look at it like this. A job needs to do some work on data from the library. There are two possibilities:

1) If that work is fast/simple it can be run in process like with find duplicates.

2) If it is not, then the time taken for an extra disk-to-disk copy is going to be pretty small compared to the time taken to do the actual work.

chaley · 06-22-2011, 05:53 PM

Quote:

Originally Posted by kovidgoyal

I/O operations do use multiple cores, since Python releases the GIL during a read/write. My concern is that you will need to have a loop that looks like

Code:

had_operation = False
for socket in connections:
    if socket.poll():
        # handle the read in a relatively non-blocking manner
        had_operation = True
if not had_operation:
    # ensure release of the GIL
    time.sleep(0.01)

That just seems beyond ugly and I really don't see it being performant.

If you are using a thread, then the poll goes away. You would do something like

Code:

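try:
    with open(x, 'rb') as f:
        while True:
            d = f.read(someAmount)
            if not d:
                break  # end of file
            outpipe.write(d)  # blocks while the pipe is full
finally:
    outpipe.close()  # lets the reader see EOF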

The GIL will be released and threads will switch if the write to outpipe blocks because the queue is full. If the reader goes away, the write to outpipe will throw an exception. This should perform very well.
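A self-contained version of this idea, for experimentation. It is a sketch only: it uses `os.pipe()` within a single process, with a thread as the writer and the caller as the reader, and the function names and the choice of hashing as the consumer's work are assumptions made for the example.

```python
import hashlib
import os
import threading


def pump_file_to_pipe(path, write_fd, chunk_size=64 * 1024):
    """Producer thread: copy a file into the write end of a pipe."""
    try:
        # Closing 'out' on exit is what lets the reader see EOF.
        with os.fdopen(write_fd, "wb") as out, open(path, "rb") as src:
            while True:
                chunk = src.read(chunk_size)
                if not chunk:
                    break
                out.write(chunk)  # blocks, releasing the GIL, when the pipe is full
    except BrokenPipeError:
        pass  # the reader went away; nothing more to do


def hash_via_pipe(path):
    """Consumer: hash the file's bytes as they arrive through the pipe."""
    read_fd, write_fd = os.pipe()
    producer = threading.Thread(target=pump_file_to_pipe, args=(path, write_fd))
    producer.start()
    digest = hashlib.sha256()
    with os.fdopen(read_fd, "rb") as inp:
        for chunk in iter(lambda: inp.read(64 * 1024), b""):
            digest.update(chunk)
    producer.join()
    return digest.hexdigest()
```

With a child process as the consumer, the read end would be inherited by the child instead, and the parent would close its own copy of the read end after launching it.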

kovidgoyal · 06-22-2011, 05:58 PM

Hmm, maybe

Let's table this for now. kiwidude can continue to use format_abspath for the moment. Once the new db backend is nearer completion, we can revisit and write some code so that the performance can actually be measured.
