Added section on benign wakeups and notifyAll
converted to 1.6 markup
|Deletions are marked like this.||Additions are marked like this.|
|Line 324:||Line 324:|
|[http://gatekeeper.dec.com/pub/DEC/SRC/research-reports/SRC-035.pdf "An Introduction to Programming with Threads"]||[[http://gatekeeper.dec.com/pub/DEC/SRC/research-reports/SRC-035.pdf|"An Introduction to Programming with Threads"]]|
The Network Server is capable of managing multiple simultaneous JDBC connections with its clients. We call this "Session Management", and this page discusses some of the intricacies of the implementation.
This page was constructed as part of the work on DERBY-1326, DERBY-1219, DERBY-1338, DERBY-1020, and DERBY-51 (and possibly others?), although the page is intended to endure after the bugs are fixed, providing background information about the implementation.
The principal objects involved in session management are the Session and the DRDAConnThread.
The server considers threads to be more expensive than socket connections. The server is generally willing to accept an arbitrarily large number of socket connections from clients at the same time, but has a limit on the number of threads that it is willing to create. So there may be more Session instances than DRDAConnThread instances.
Also, the server is willing to discard the Session instance when the client disconnects, whereas Thread creation and destruction is considered expensive, so the server caches and re-uses threads.
So the server tracks threads and sessions separately, and uses a queueing model to associate work requests (Session) with workers (Threads).
There is a Session instance for each physical socket connection that the server has with a client. Via the Session object, the Network Server can access the various Database objects that the session is using.
Session instances are created by the Client Thread, and are tracked in the Network Server's sessionTable.
There are typically several DRDAConnThread instances in the Network Server. These threads carry out work on behalf of Session instances.
DRDAConnThread instances are created by the Client Thread, and are tracked in the Network Server's thread list.
Dispatching work to threads
The server uses a very simple mechanism, called the RunQueue, to rendezvous and dispatch work to threads.
Any DRDAConnThread instance can service any Session instance, so the Network Server uses a very simple first-come,first-serve policy for matching up sessions and threads. To avoid starvation, there is a time-slicing mechanism as well.
Conceptually, the RunQueue may be in any of three states at any given time:
If there are more Session needing work than there are available DRDAConnThread instances, than the RunQueue holds the sessions which are awaiting a thread.
If there are more DRDAConnThread instances than there are Session needing work, then the idle threads wait on the RunQueue for work to become ready.
If there are exactly as many active Session instances as there are DRDAConnThread instances, then the RunQueue is empty and inactive.
Thus, depending on the current balance of work-needing-a-thread versus threads-needing-work-to-do, the RunQueue at any given instant may be viewed either as a queue of work (Session instances), or as a queue of threads (DRDAConnThread instances).
Of course, this is strictly a conceptual viewpoint, because at the physical level, the RunQueue as a collection object (it is a Java Vector) only ever actually holds Session instances. When the server is in the state of "there are idle threads waiting for work to do", there is no explicit collection object which contains that list of idle threads; instead there is merely a "virtual" collection defined as "the threads which are currently waiting on the RunQueue in NetworkServerImpl.getNextSession()".
A Session instance is created by the Client Thread when a new socket connection is accepted from a client. The Session object is added to the Network Server sessionTable at that time.
The Session object is held until the client disconnects, which is typically handled by throwing a disconnect exception on the server side. The DRDAConnThread catches the disconnect exception, removes the session from the Network Server sessionTable, closes it, and discards it.
(I think that if the client disconnects at a point where there are currently more Session instances than DRDAConnThread instances, and the client's Session is currently on the RunQueue, then the Network Server will not detect the client disconnect until the next time that the Session moves to the head of the RunQueue and is dispatched to a thread for processing. But I haven't yet set up this situation in the debugger, so I'm not sure about this part.)
Session objects are also closed and discarded when the Network Server shuts down or is restarted.
Session closing and DERBY-1020
Bug DERBY-1020 is related to the proper handling of exceptions when an exception occurs closing the session especially after the database has been shutdown.
There may be some synchronization impact as well because the exceptions that occur when the database is intentionally shutdown are treated as unexpected exceptions and it will not go through the normal session shutdown codepath.
Note that DERBY-1020 is the root cause of DERBY-273 (now fixed) and DERBY-803, so there appear to be a number of problems with exceptions that occur during shutdown processing.
A DRDAConnThread instance is created by the Client Thread when it detects that a new Session has just been created, and there aren't any free threads currently, and the server has not yet reached its maximum number of threads.
Note that since there is no explicit collection object which contains the list of idle threads, the server code maintains a simple counter of the number of idle threads.
DRDAConnThread instances are only destroyed when the Network Server shuts down or is restarted. The mechanism for shutting down a DRDAConnThread instance is cooperative (that is, we don't call Thread.stop()): the Network Server calls the thread's close() method, which sets a flag on the thread, and the thread checks the flag periodically and exits its run() method if it discovers the closed() flag is set.
Note specifically here, with respect to DERBY-1326, that when the DRDAConnThread notices that it has been closed, immediately upon returning from the getNextSession() call in its run() method, that it does not perform any shutdown processing for the Session that it was told to process. This can cause a "hang", as discussed in DERBY-1219.
Network Server startup
When the Network Server starts up, it starts the embedded Derby engine, but does not start any Session or DRDAConnThread. Those objects will be created on demand when new client connections are made, as described above.
Network Server shutdown
When the Network Server shuts down, it makes a concerted effort to clean up its resources:
- It shuts down the Client Thread
It goes through the sessionTable and closes all the Session objects
It goes through the Thread list and closes and interrupts each DRDAConnThread
- It closes the master socket on which it accepts connections
There are several observations worth making here:
- It seems like it might be better to close the master socket first, rather than at the end, since otherwise there is the chance that new connections will continue to accumulate during the shutdown processing (though since the Client Thread has been interrupted, no new work should get started for such connection attempts).
- Note that we do not close down the embedded Derby engine here. This is the essence of bug DERBY-51. Basically, the Network Server does not know whether or not it should shut down the embedded engine, because it is possible to have an application in which there are both local threads accessing the database directly, and client threads accessing the database over the network, and just because we are shutting down the Network Server does not necessarily mean that we no longer want to access the database from the local threads.
Lastly, note that in addition to closing each DRDAConnThread, we also call interrupt() on those threads. This awakens the thread if it was blocked in a wait() call, so that it can notice the fact that it's been closed.
The implications of interrupting the DRDAConnThread instances
Thread.interrupt() is a very powerful call, and the fact that we make this call during server shutdown can have unforeseen consequences. Consider, for example, DERBY-1338, in which the Thread.interrupt() call made during server shutdown has the unwanted side effect of interrupting the loading of a class file, causing the DRDAConnThread instance to report a ClassNotFound exception.
In general, it seems like it would be preferable to design a protocol in which we do not need to interrupt the daemon thread instances, or, if we do, that we only do so at a point where we are certain that it is safe to interrupt the threads.
Network Server restart
The Network Server has a feature by which it can restart itself, without needing to shut down and restart the entire application which contains the Network Server. The Network Server does this automatically when it detects that the underlying Derby engine has been shut down? The checkDataSource test contains some code which triggers this restart processing.
When the Network Server restarts itself, it does the following:
It goes through the RunQueue and closes all sessions in the RunQueue.
It goes through each DRDAConnThread and closes the thread
- It releases its pointer to the Derby engine, to allow it to be collected.
- It then starts up a new instance of the Derby engine.
There are several observations worth making here.
Note that while Network Server shutdown closes every session in the sessionTable, Network Server restart only closes the sessions in the RunQueue.
Note that while Network Server shutdown both closes() and interrupts() each DRDAConnThread instance, Network Server restart just closes() the threads. This means that the threads may not become aware that they have been closed until some later time, when they awaken from whatever wait() call they were blocked in.
Note that Network Server shutdown terminates the Client Thread, while Network Server restart leaves the Client Thread alone. This means that new Session may be coming in during the restart processing.
There are several monitors used by the Network Server session management code for synchronization:
The serverStartSync object has exactly one use, which is to single-thread the flow through the startNetworkServer() method so that the network server is always started (or restarted) exactly once, no matter how many threads call this routine at the same time.
The serverStartSync monitor is used in conjunction with the restartFlag boolean, and there is a comment at the head of the startNetworkServer() method regarding the use of these two variables which I don't understand.
The sessionTable object is a Java Hashtable, which is an automatically-synchronized collection object. Therefore calls to sessionTable.put(), sessionTable.get() and sessionTable.remove() do not need to be synchronized by the caller; the Hashtable will perform internal synchronization automatically.
However, when traversing an enumeration of the elements in the Hashable, the automatic synchronization does not help, so in these cases the Network Server code explicitly encloses these traversals in synchronized blocks.
The threadList object is a Java Vector, which is also an automatically-synchronized collection object. So, again, calls to threadList.addElement(), threadList.get(), threadList.size(), etc. do not need to be synchronized by the caller.
However, when traversing the entire threadList, the automatic synchronization does not help, so in these cases the Network Server code explicitly encloses the traversals in synchronized blocks.
Unfortunately, the Network Server code is inconsistent about these traversals. These traversals are enclosed in synchronized blocks:
However, these traversals are not enclosed in synchronized blocks:
in the ClientThread logic which decides whether to add a thread
This inconsistent synchronization is not explained in the code.
The runQueue object is also a Java Vector, so again it is an automatically-synchronized object, and again the collection traversals need to be explicitly synchronized.
Furthermore, the runQueue object is also used for wait() and notify() calls, so some additional synchronization is needed to perform these operations.
And the runQueue synchronization is also used to protect the freeThreads variable, which is the count of the number of threads currently waiting on the runQueue object.
Unfortunately, the Network Server code is again inconsistent in its use of the explicit synchronization. There is inconsistent or missing synchronization in at least these locations:
in the ClientThread logic which decides whether to add a thread
And it seems to me that the getFreeThreads() method is inherently dangerous, since even though it is internally synchronized, the instant that the method returns to its caller the value could change.
Possible ways to simplify/repair the synchronization
Here are some ideas I've had about possible changes to the synchronization:
I think that the open-coded logic in ClientThread.java which handles creating new Session and DRDAConnThread objects should be moved into the NetworkServerImpl class so that it can be uniformly synchronized with the other logic in that class.
I think that it is confusing to use the runQueue as both a collection object and as a thread wait queue, and it would be simpler and less error-prone to use two separate objects for that.
I think that primitive methods like getThreadList() and getFreeThreads() should be avoided, as they risk exposing too much of the implementation to callers, and allowing race conditions to be accidentally introduced.
freeThreads counter maintenance
As discussed earlier, there is no explicit collection object which contains the set of threads which are currently idle, waiting for work to do.
Instead, the Network Server maintains an integer variable, called freeThreads, to count the number of such threads. Immediately before calling runQueue.wait(), the thread increments this counter, and immediately after returning from runQueue.wait(), the thread decrements this counter.
However, it is also possible for the thread to be interrupted, in which case it doesn't return normally from runQueue.wait(), but rather throws an InterruptedException from that method.
The getNextSession() method handles this possibility by catching the InterruptedException message, and decrementing the counter in this case as well.
However, when the Network Server restarts, it also unconditionally sets the counter to 0. This means that there is a latent problem, which I uncovered when I experimented with having the Network Server interrupt the threads during restart processing. What happens is as follows:
- The Network Server restarts
During restart processing, it interrupts the DRDAConnThreads
- The Network Server then sets the freeThreads counter to 0
The threads which have been interrupted then awaken, catch the InterruptedException, and decrement the counter.
The counter then goes negative, and becomes completely useless. The Client Thread thinks that there are plenty of free threads, because the free threads counter is not 0, so it never starts any new threads, but in fact there aren't any free threads (there are -1 or -2 or some number like that).
The symptom is that, after the Network Server is restarted, assuming that it interrupts the threads during restart processing, it then never starts any new threads, although it continues to accept new TCP/IP connections and creates new Session objects.
The fix, I believe, is simply to remove the unconditional setting of the freeThreads counter to 0 in the restart processing, and let the incrementing and decrementing logic in getNextSession() handle all the maintenance of this counter.
Benign wakeups and notifyAll
There is a standard convention for thread programming, which dates back to Andrew Birrell's classic paper, "An Introduction to Programming with Threads" for handling the waiting and waking up of threads.
As Birrell explains (pages 14-15), threads which are waiting on a resource should always be coded as:
WHILE NOT expression DO Thread.Wait(m,c) END;
You will note that the getNextSession() method is already coded in the correct style, which is good.
When the wait loop is coded in this fashion, Birrell explains, one benefit is (p. 15):
- A final advantage of this convention is that it allows for simple programming of calls to “Signal” or “Broadcast”—extra wake-ups are benign. Carefully coding to ensure that only the correct threads are awoken is now only a performance question, not a correctness one (but of course you must ensure that at least the correct threads are awoken).
Here is where there is a slight issue in the NetworkServer code. In the runQueueAdd() method, note that we currently issue runQueue.notify(). This means that we are waking up at most one thread to go handle the newly-queued Session.
Unfortunately, at this point, runQueueAdd() does not have the proper information to be certain that it is waking up the correct thread. In particular, in the case of a NetworkServer restart, what if some of the waiting threads happened to be "poisoned" (threads that had been closed by the restart code)?
If we simply change runQueueAdd() so that it calls notifyAll(), rather than notify(), then everything will still function properly even if the current mix of threads waiting on the runQueue contains both closed threads and non-closed threads. All of those threads will be awoken; one non-closed thread will eventually grab the session and go run it; all the other threads will either detect that they have been closed and break out of getNextSession() or detect that there are no more Sessions to run and go back to sleep.
Changing the Session Management code
I think that there are a number of problems in the Session Management code which need to be addressed, as part of the bugs DERBY-1326 and DERBY-51, and possibly others.
However, before we proceed to changing the code, I want to make sure that we are correctly understanding the current operation of the code, so I'm only taking the wiki page this far right now, and I intend to return to this page at a later date and start to propose changes to the code, once we've reviewed the material so far.