This page describes, at a high level, two mechanisms that are used by Derby Network Server when building data reply structures to fulfill client requests. The first is DSS "chaining", the second is DSS "continuation".

This page also describes how, in cases where a server protocol failure occurs, one can look at the server traces (see tracing) to help determine if the problem might be caused by bugs in the server-side chaining and/or continuation logic.

I) DSS "Chaining"

When sending requests to the server, a client application is allowed to place multiple request structures (called Data Stream Structures, or DSSes) together into a single buffer and then send the whole group to the server at one time. In such a case, the DSSes make up a "chain" and the server, in turn, must send all corresponding reply DSSes back to the client as a chain, as well.

In the Distributed Data Management (DDM) manual found here, DSS chaining is described in the "Terms" chapter under "DSS".

When chaining, the client specifies a "request correlator" (also known as a "correlation id") for each DSS. In some cases, two or more DSSes might be part of the same command, in which case those DSSes will have identical correlation ids. If two sequential DSSes, DSS_A and DSS_B, are in a chain but are part of different commands, then DSS_B must have a correlation id that is greater than that of DSS_A. The server, then, must retain these correlation ids when sending reply objects back to the client. Thus, if the client sends a DSS request with correlation id "0001", the DSS reply from the server must have correlation id "0001", as well. If multiple request DSSes from the client have the same correlation id, then all of the corresponding reply DSSes from the server must have that same correlation id, as well. And finally, in cases where a single client DSS request necessitates multiple DSS replies from the server, all of the reply DSSes must be chained with the same correlation id (and that id must match the client request id).

All of that said, protocol failures that are caused by incorrect chaining usually manifest themselves in one of two ways within the server trace file: either 1) the chaining state of reply DSSes sent by the server doesn't match that of the request DSSes sent by the client, or 2) the server returns an incorrect correlation id. Both of these problems can be seen by looking at the server-side trace file.

As found in the DDM manual for "DSS", a DSS header contains (among other things) the following info:

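cl Cf rc

where "cl" is the two-byte length field, "C" is the byte "D0", "f" is the one-byte format field, and "rc" is the two-byte request correlator (correlation id).

Example of a DSS Header:

(cl)    (Cf)    (rc)
0048    D042    0001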

In a server trace, it's usually easiest to pick out the DSS header by searching for "D0", which is the "C" in the above format.

Immediately after "D0" is the "f" byte, which has two parts: a) the first hex digit tells whether or not the current DSS is chained to any subsequent DSSes; this will be either "0", "4", or "5" (see "RPYDSS" in the DDM manual); and b) the second hex digit tells what kind of DSS this is: "1" (request), "2" (reply), "3" (object), "4" (encrypted object), or "5" (request that does not expect a reply); see "DSSFMT" in the DDM manual.

Following the "f" byte is the two-byte correlation id of the DSS, listed as "rc" above.

** NOTE: not all "D0"s will be part of a DSS header! Sometimes those characters just show up as part of the data; one must look at the bytes following "D0" and determine whether or not it's part of a DSS header based on whether or not the subsequent bytes look like reasonable "f" and "rc" fields.
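
As a rough illustration of the header layout described above, the following Python sketch decodes six header bytes copied out of a trace. The names DssHeader, parse_dss_header and looks_like_header are hypothetical helpers (they are not part of Derby), and the sketch assumes the fields are stored big-endian, as the example header suggests.

```python
from typing import NamedTuple


class DssHeader(NamedTuple):
    length: int          # low 15 bits of the "cl" field
    continued: bool      # high-order bit of the "cl" field
    chain: int           # first hex digit of "f": 0, 4, or 5
    kind: int            # second hex digit of "f": 1 through 5
    correlation_id: int  # the "rc" field


def parse_dss_header(hex_text: str) -> DssHeader:
    """Decode 'cl Cf rc' from hex text such as '0048 D042 0001'."""
    raw = bytes.fromhex("".join(hex_text.split()))
    if len(raw) < 6:
        raise ValueError("expected six header bytes: cl, Cf, rc")
    if raw[2] != 0xD0:
        raise ValueError("third byte is not D0, so this is not a DSS header")
    cl = int.from_bytes(raw[0:2], "big")  # assumption: big-endian fields
    f = raw[3]
    rc = int.from_bytes(raw[4:6], "big")
    return DssHeader(cl & 0x7FFF, bool(cl & 0x8000), f >> 4, f & 0x0F, rc)


def looks_like_header(header: DssHeader) -> bool:
    """Apply the sanity check from the NOTE: are the 'f' digits plausible?"""
    return header.chain in (0, 4, 5) and header.kind in (1, 2, 3, 4, 5)
```

For the example header shown earlier, parse_dss_header("0048 D042 0001") yields a reply DSS of length 0x48 that is chained to a following DSS with a different correlation id, and whose own correlation id is 1.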

With that in mind, one can say with confidence that a chaining error has occurred if any of the following four conditions is true for a SEND BUFFER in the server trace. (Note that a "_" in the following text means "some hexadecimal character", and "...." is used to indicate data.)

1 - The "f" byte for some DSS is "0_", indicating that there is no chaining--and yet a subsequent DSS is sent to the client, either within the same buffer or within a buffer that immediately follows the DSS in question (i.e. there is no intervening client request). In this case, the server is sending two unchained reply DSSes in a row, which breaks DRDA protocol.

2 - The "f" byte for some DSS, call it DSS_A, is "4_", indicating that another DSS, call it DSS_B, is chained to DSS_A with a different correlation id--and yet either a) DSS_B does not exist (the data ends with DSS_A), or b) the correlation id for DSS_B is less than or equal to the correlation id of DSS_A. In this case the server is sending an invalid correlation id back to the client, which will break protocol (see "RQSCRR" and "RPYDSS" in the DDM manual).

3 - The "f" byte for some DSS, call it DSS_A, is "5_", indicating that another DSS, call it DSS_B, is chained to DSS_A with the same correlation id as DSS_A--and yet either a) DSS_B does not exist (the data ends with DSS_A), or b) the correlation id for DSS_B does not equal the correlation id of DSS_A. In this case the server is sending an invalid correlation id back to the client, which will break protocol (see "RPYDSS" in the DDM manual).

4 - The correlation id(s) for the DSS(es) returned by the server do not match the correlation ids of the DSSes that were received from the client. For every client DSS received, the server should have one or more reply DSSes with the same correlation id, unless the "f" byte in the corresponding request DSS is "_5", in which case no reply DSS is returned. If this isn't true, then the server reply DSSes are not in line with protocol (see "RPYDSS" in the DDM manual).

If none of the above four phenomena shows up in the server trace file, it is still possible that something is wrong with server-side chaining. However, it is more likely that, in such a case, the problem resides elsewhere.
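
The four conditions above can be expressed as a small checker. This is only a sketch under stated assumptions: find_chaining_errors is a hypothetical helper, the reply headers are assumed to have already been picked out of the SEND BUFFERs by hand (with no intervening client request between them), and each header is reduced to the first hex digit of its "f" byte plus its correlation id.

```python
def find_chaining_errors(replies, requests):
    """Return a list of messages, one per violated condition.

    replies:  list of (chain, rc) for the reply DSSes, in the order sent,
              where chain is the first hex digit of the "f" byte.
    requests: list of (kind, rc) for the request DSSes received,
              where kind is the second hex digit of the "f" byte.
    """
    errors = []
    for i, (chain, rc) in enumerate(replies):
        nxt = replies[i + 1] if i + 1 < len(replies) else None
        if chain == 0 and nxt is not None:
            errors.append(f"condition 1: DSS {i} is unchained but another follows")
        elif chain == 4 and (nxt is None or nxt[1] <= rc):
            errors.append(f"condition 2: DSS {i} needs a successor with a greater id")
        elif chain == 5 and (nxt is None or nxt[1] != rc):
            errors.append(f"condition 3: DSS {i} needs a successor with the same id")

    # Condition 4: reply ids must line up with the request ids, except that
    # a request whose kind is 5 does not expect a reply.
    expected = {rc for kind, rc in requests if kind != 5}
    sent = {rc for _, rc in replies}
    for rc in sorted(expected - sent):
        errors.append(f"condition 4: no reply for request id {rc}")
    for rc in sorted(sent - expected):
        errors.append(f"condition 4: reply id {rc} matches no request")
    return errors
```

An empty result does not prove the chaining is correct; it only means none of the four conditions was detected in the headers supplied.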

II) DSS "Continuation"

In situations where the server's reply contains more data than can fit into a single DSS, the server must split the data into multiple DSSes before sending it back to the client. Since these "continuation" DSSes originate on the server side, the server must indicate to the client, via a DSS header flag, that all of the DSSes are part of the same reply.

There are several known issues with DSS continuation in the Network Server. For example, see:

DERBY-125, DERBY-170, DERBY-491, DERBY-492, DERBY-529.

All of these cases deal with the server's attempt to build a reply DSS to hold more than 32k of data. Since the maximum size of a DSS, as defined by the DRDA protocol, is 32767 bytes (because the length field is two bytes long and the high-bit is the continuation flag; see "RQSDSS" or "RPYDSS" in the DDM manual), the server needs to split the data into multiple continuation DSSes when sending it to the client. On top of that, the server also has to send data in such a way as to avoid over-running its own send buffer, which happens to have a max length of 32k, as well (note that a single DSS is allowed to span multiple contiguous send buffers).
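
The 32767-byte limit follows directly from the layout of the length field, which the short Python sketch below works through. The constant and function names are hypothetical and chosen for illustration only; the sketch does not model how the server actually lays out continuation DSSes.

```python
from typing import Tuple

CONTINUATION_FLAG = 0x8000                    # high-order bit of the two-byte field
MAX_DSS_LENGTH = 0xFFFF & ~CONTINUATION_FLAG  # remaining 15 bits: 0x7FFF


def decode_length_field(cl: int) -> Tuple[int, bool]:
    """Split a two-byte length field into (length, continuation flag)."""
    if not 0 <= cl <= 0xFFFF:
        raise ValueError("the length field is exactly two bytes")
    return cl & MAX_DSS_LENGTH, bool(cl & CONTINUATION_FLAG)


def needs_continuation(dss_length: int) -> bool:
    """True if this many bytes cannot be described by a single DSS."""
    return dss_length > MAX_DSS_LENGTH
```

With 15 bits available, the largest length that can be expressed is 0x7FFF, or 32767, so any reply larger than that has to be split.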

All of that said, it appears that in certain situations, the server's attempts to divide data into DSSes of length 32k (or less) while at the same time dividing it up to fit into one or more 32k send buffers are resulting in malformed data structures, which in turn manifest themselves in a variety of ways (including "invalid codepoint" errors, protocol exceptions, server-side ASSERT failures, and hangs--all of which occur in at least one of the above-mentioned Jira issues).

Generally speaking, the following are some things to look for in the server trace file when trying to determine if a failure is caused by incorrect continuation processing:

1 - If a single SEND BUFFER in the trace file has a length approaching (or surpassing) 32767 (0x7FFF) at the time of the failure, then odds are good that processing went awry when the server tried to break the data into 32k DSSes and/or 32k send buffers.

2 - If a single DSS (which may span multiple contiguous send buffers) is longer than 32k, then the continuation logic in the server is incorrect (or missing) somewhere, resulting in a DSS that breaks protocol. The way to check this is to look for two sequential DSS headers (see section I above) in one or more contiguous send buffers and see how many bytes are between them; if it's more than 32k, there's a DSS continuation problem.

3 - If the "cl" field of a DSS header (i.e. the two bytes immediately preceding the "D0" byte) has the high-order bit set to 1, this means that the server was attempting to write a continuation DSS. That in itself doesn't guarantee that the failure was caused by a continuation problem, but it's a good cause for suspicion--especially if the DSS that has the bit set was the last one sent before the failure occurred.
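
The three checks above can be sketched in Python as follows. The function continuation_warnings is a hypothetical helper, and the near parameter (how close to 32767 counts as "approaching") is an arbitrary assumption rather than a value taken from Derby. The SEND BUFFER lengths and the DSS header positions are assumed to have been read out of the trace beforehand.

```python
MAX_DSS_LENGTH = 0x7FFF  # 32767


def continuation_warnings(buffer_lengths, headers, near=0.9):
    """Return (check_number, index) pairs for each suspicious item.

    buffer_lengths: length of each SEND BUFFER, in trace order.
    headers:        (offset, cl) for each DSS header, where offset counts
                    bytes from the start of the first of the contiguous
                    send buffers and cl is the raw two-byte length field.
    near:           assumed fraction of 32767 that counts as "approaching".
    """
    warnings = []
    # Check 1: a send buffer whose length approaches or surpasses 32767.
    for i, length in enumerate(buffer_lengths):
        if length >= near * MAX_DSS_LENGTH:
            warnings.append((1, i))
    # Check 2: more than 32767 bytes between two sequential DSS headers.
    for i in range(len(headers) - 1):
        if headers[i + 1][0] - headers[i][0] > MAX_DSS_LENGTH:
            warnings.append((2, i))
    # Check 3: the high-order bit of "cl" is set (a continuation DSS).
    for i, (_, cl) in enumerate(headers):
        if cl & 0x8000:
            warnings.append((3, i))
    return warnings
```

A hit on check 1 or check 3 is only grounds for suspicion, as noted above, while a hit on check 2 indicates a DSS that breaks protocol.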

If none of these phenomena show up in the server trace file, the next thing to look for is an ASSERT failure in the server log file (assuming "sane" build), such as:

ASSERT FAILED Unexpected data size or position. sz=-29031 count=3736 pos=32767
org.apache.derby.iapi.services.sanity.AssertFailure: ASSERT FAILED Unexpected data size or position. sz=-29031 count=3736 pos=32767

(The above failure is what you will see if you run the repro for DERBY-170).

Outside of that, it's up to the person investigating the failure to sort through server code and client/server traces to determine if the problem is somehow related to continuation DSSes. Unfortunately, it's not always easy to tell...

DssProtocolErrors (last edited 2009-09-20 22:12:22 by localhost)