Differences between revisions 11 and 12
Revision 11 as of 2007-07-26 17:20:13
Size: 10545
Comment: Removed ref to unrelated DERBY-2916
Revision 12 as of 2009-09-20 22:11:52
Size: 10547
Editor: localhost
Comment: converted to 1.6 markup
Deletions are marked like this. Additions are marked like this.
Line 1: Line 1:
As part of the ["OLAPOperations"] effort, we'd like to add support for null ordering. Jira issue [https://issues.apache.org/jira/browse/DERBY-2887 DERBY-2887] tracks this effort. As part of the [[OLAPOperations]] effort, we'd like to add support for null ordering. Jira issue [[https://issues.apache.org/jira/browse/DERBY-2887|DERBY-2887]] tracks this effort.

As part of the OLAPOperations effort, we'd like to add support for null ordering. Jira issue DERBY-2887 tracks this effort.

Rationale

In earlier versions of the SQL standard, the ordering of NULL values was implementation-defined.

The SQL 2003 standard allows the user to explicitly specify how to sort NULL values, by adding the new NULLS FIRST or NULLS LAST specification in the ORDER BY clause.

In Derby versions 10.0 through 10.3, NULL values always sort high, and thus are always ordered after all non-NULL values in ascending order and before all non-NULL values in descending order. In the language of the standard, NULLS LAST is the default for ascending order, and NULLS FIRST is the default for descending order.

Syntax

In the SQL 2003 standard, the <sort specification list> has the following grammar:

<sort specification list> ::=
    <sort specification> [ { <comma> <sort specification> }... ]

<sort specification> ::=
    <sort key> [ <ordering specification> ] [ <null ordering> ]

<sort key> ::= <value expression>

<ordering specification> ::=
    ASC
  | DESC

<null ordering> ::=
    NULLS FIRST
  | NULLS LAST

The new element here is the <null ordering> specification, which is optional.

If the <null ordering> is not specified, then an implementation-defined <null ordering> is used. For Derby, this will continue to be NULLS LAST if the sort is ASCending, and NULLS FIRST if the sort is DESCending.

The tokens FIRST and LAST are already reserved keywords in the Derby SQL grammar; the new token NULLS needs to be added as a new non-reserved keyword.

FIRST and LAST, not LOW and HIGH

Note that the standard carefully uses the terms FIRST and LAST, rather than LOW and HIGH. This is intentional, and highlights that the null ordering interacts with the ascending/descending nature of the sort. It may make it more clear to look at this truth table:

ORDER BY says

ASC

DESC

NULLS FIRST

less

greater

NULLS LAST

greater

less

where "less" means "NULL values shall be considered to be less than non-NULL values", and "greater" means "NULL values shall be considered to be greater than non-NULL values.

Interaction of NULLS FIRST/LAST and ASC/DESC

The results below indicate what I believe to be the intended behavior of NULLS FIRST when combined with ASC/DESC. Note that NULLS FIRST means "the NULL values should be the first ones in the result", regardless of whether the result is ordered ASC or DESC, and NULLS LAST behaves correspondingly.

ij> select * from t1;
C1         |C2
-----------------------
NULL       |NULL
1          |1
NULL       |NULL
2          |1
3          |1
10         |10

6 rows selected
ij> select * from t1 order by c1; -- since it was not specified, the implementation chooses NULLS LAST here
C1         |C2
-----------------------
1          |1
2          |1
3          |1
10         |10
NULL       |NULL
NULL       |NULL

6 rows selected
ij> select * from t1 order by c1 NULLS LAST;
C1         |C2
-----------------------
1          |1
2          |1
3          |1
10         |10
NULL       |NULL
NULL       |NULL

6 rows selected
ij> select * from t1 order by c1 nulls first;
C1         |C2
-----------------------
NULL       |NULL
NULL       |NULL
1          |1
2          |1
3          |1
10         |10

6 rows selected
ij> select * from t1 order by c1 desc; -- since it was not specified, the implementation chooses NULLS FIRST here
C1         |C2
-----------------------
NULL       |NULL
NULL       |NULL
10         |10
3          |1
2          |1
1          |1

6 rows selected
ij> select * from t1 order by c1 desc nulls last;
C1         |C2
-----------------------
10         |10
3          |1
2          |1
1          |1
NULL       |NULL
NULL       |NULL

6 rows selected
ij> select * from t1 order by c1 desc nulls first;
C1         |C2
-----------------------
NULL       |NULL
NULL       |NULL
10         |10
3          |1
2          |1
1          |1

6 rows selected

Null Ordering versus Ordered Nulls

Null Ordering and Ordered Nulls are similar, but not identical, concepts. Ordered Nulls terminology is used when we are discussing whether or not we want the Derby engine to use three-valued logic while performing comparison operations. When using three-valued logic for comparison operations, comparison operators always return the special value unknown if either of the two operands is NULL. Thus the Derby comparison operations can either operate in Ordered Nulls mode, or in three-valued logic mode. Null Ordering is only relevant in Ordered Nulls mode, and it further refines Ordered Nulls mode to describe whether the NULL values should be compared lower than non-NULL values or higher than non-NULL values.

Current Implementation

Currently, Derby always sorts NULL values as greater than non-NULL values.

This logic appears to be located in the various subclasses of org.apache.derby.iapi.types.DataType, such as NumberDataType, SQLDate, and SQLChar.

These classes all implement the compare(DataValueDescriptor other) method, and each class contains logic similar to this snippet from SQLDate:

/*
 * thisNull otherNull   return
 *  T       T           0   (this == other)
 *  F       T           -1  (this < other)
 *  T       F           1   (this > other)
 */
if (thisNull || otherNull)
{
    if (!thisNull)      // otherNull must be true
        return -1;
    if (!otherNull)     // thisNull must be true
        return 1;
    return 0;
}

(As an aside, I wonder if there once was a time when NULL values sorted low in Derby, rather than sorting HIGH. The reason I think this might have been true is that the JavaDoc for DataValueDescriptor.compare() is incorrect, and says that null will be treated as less than all other values, and also the code comment in NumberDataType.java is backward. Thirdly, the code in SQLBoolean.java and in XML.java seems backward, which might be an un-noticed bug because the boolean data type is not surfaced to users in current Derby, and the XML type is new and not much used yet?)

Proposed Changes

There are several parts to implementing the <null ordering> feature for Derby:

  • The new syntax must be added to the SQL grammar
  • The null ordering specification must be passed around through the compiler data structures so that it eventually gets passed to the sorter. This involves data structures like OrderByList and OrderByColumn. Eventually, the compiler must set up the correct ColumnOrdering objects, which I think happens in RowOrderingImpl.addOrderedColumn

  • Note that in the compiler, where we are working with the user's SQL statement, we should use the terms "first" and "last", but once the compiler has finished compiling the statement, and is constructing the data structures for execution, we will switch to using the terms "low" and "high".
  • The data structure which bridges the gap between the compiler and the execution code is the IndexColumnOrder class. This class is used both for describing ORDER BY sorts for queries, as well as for describing the sorts used when creating indexes. Since indexes in Derby will always sort nulls high, the index-creation callers of the IndexColumnOrder class can continue to do this unconditionally, while the query compilation callers must specify either low or high depending on what was specified in the query.

  • The sorter must be changed so that in addition to having a columnOrderingMap and a columnOrderingAscendingMap, it also has a columnOrderingNullOrderingMap, where it records whether nulls should be sorted low or high, for each column in the sort.
  • The sorter must pass the user's choice to the DataValueDescriptor.compare() method call.

  • The DataValueDescriptor.compare(DataValueDescriptor other) method must be changed to allow the null ordering choice to be passed in. We want to be careful here, because the passing of an extra argument to the compare() method will affect performance. For example, within an index btree lookup the compare method will be called a lot, and we don't want to be needlessly passing in an argument which will always be false.

  • Also, wherever possible, we want to consolidate common data type code into the common superclass, rather than repeating redundant code throughout the subclasses. One way or another, all of the types which implement the compare method must be changed to use the specified null ordering (possibly by calling a new common method which implements the null ordering decision, then delegates through to the data-type-specific subclass for comparison of non-null values).
  • The code which decides whether a query can use an index scan instead of a sort needs to consider whether the null ordering allows that. This means that in some cases, depending on how the user has specified the null ordering, we cannot use an index scan and must use a sort instead.
  • A variety of new test cases must be added to the test suites, including:
    • There should be tests of ORDER BY clauses with NULLS FIRST specified, insuring that null values indeed come first.
    • There should be tests of ORDER BY clauses with NULLS LAST specified, insuring that null values indeed come last.
    • There should be tests of ORDER BY clauses with no null ordering clause, insuring that null values indeed come last.
    • The tests should include a variety of different data types:
      • numeric data types
      • character data types
      • date/time data types
      • XML data types?
    • There should be tests with queries involving UNION / EXCEPT / INTERSECT operators, ensuring that the NULLS FIRST / NULLS LAST setting is correctly "pushed down" through the set operation.
    • There should be tests that verify that sort avoidance occurs properly:
      • if an ORDER BY clause can be validly satisfied by an index, it should continue to do so
      • if an ORDER BY clause can not be satisfied by an index, due to the user's <null ordering> specification, the optimizer should force a true sort to occur.

  • The existing test cases should all be run to verify that they still pass.

OLAPNullOrdering (last edited 2009-09-20 22:11:52 by localhost)