As part of the OLAPOperations effort, we'd like to add support for ROLLUP multi-level grouping.

Background

Consider a table of sales transaction data, which tracks information about particular sales:

create table sales_history (region varchar(10), state char(2), product varchar(10), date_of_sale timestamp, sales int);

Such a table has one row for each sales transaction, recording the region and state where the sale occurred, the product that was sold, the date on which it was sold, and the amount of the sale.

Now suppose that we'd like to analyze our sales data, to study the amount of sales that is occurring for different products, in different states and regions. Using the ROLLUP feature of SQL 2003, we could issue the query:

select region, state, product, sum(sales) total_sales
from sales_history 
group by rollup (region, state, product)

Semantically, the above query is equivalent to

select region, state, product, sum(sales) total_sales
from sales_history 
group by region, state, product
union
select region, state, null, sum(sales) total_sales
from sales_history 
group by region, state
union
select region, null, null, sum(sales) total_sales
from sales_history 
group by region
union
select null, null, null, sum(sales) total_sales
from sales_history 

The query might produce results that looked something like:

REGION    STATE    PRODUCT    TOTAL_SALES
------    -----    -------    -----------
null       null     null        6200
EAST       MA       BOATS       100
EAST       MA       CARS        1500
EAST       MA       null        1600
EAST       NY       BOATS       150
EAST       NY       CARS        1000
EAST       NY       null        1150
EAST       null     null        2750
WEST       CA       BOATS       750
WEST       CA       CARS        500
WEST       CA       null        1250
WEST       AZ       BOATS       2000
WEST       AZ       CARS        200
WEST       AZ       null        2200
WEST       null     null        3450

We could actually compute the ROLLUP query by transforming it into the equivalent UNION query. However, we should seek to find an implementation in which the grouped aggregates can be computed using a minimum number of sorts of the data, and a minimum number of passes over the input data. This is discussed in more detail later on this page, but our goal should be that the expense of computing this query is roughly equivalent to the expense of computing the single sub-query

select region, state, product, sum(sales) total_sales
from sales_history 
group by region, state, product

because the other aggregation levels can be computed at the same time.

Syntax

In the SQL 2003 standard, the ROLLUP keyword impacts the following portions of the SQL grammar:

<grouping element> ::=
    <ordinary grouping set>
  | <rollup list>
  | <cube list>
  | <grouping sets specification>
  | <empty grouping set>

<rollup list> ::=
    ROLLUP <left paren> <ordinary grouping set list> <right paren>

<grouping set> ::=
    <ordinary grouping set>
  | <rollup list>
  | <cube list>
  | <grouping sets specification>
  | <empty grouping set>

The above grammar is just a tiny subset of the overall GROUP BY grammar, namely those parts that involve ROLLUP.

Current GROUP BY implementation

The current implementation of GROUP BY aggregation is (over-simplified, but hopefully accurate):

The grouped aggregate processing occurs in GroupedAggregateResultSet.java

Techniques for Implementing Grouped Aggregates

It seems that there are basically 3 ways to perform grouped aggregation:

  1. Inline Aggregation
  2. Duplicate-eliminating sort with sort-observer aggregation
  3. Duplicate-eliminating hash with hash-collision aggregation

There also appear to be numerous ways to combine these three variants, which we refer to as "hybrid" algorithms below.

Inline Aggregation

If the records to be grouped are already in sorted order, perhaps due to a scan of an index, or a previous merge-sort-join operation, or a duplicate-eliminating distinct sort, then the grouping operation can be implemented by comparing each pair of records on the grouping attributes; each time a new value of the grouping attributes is encountered, the computation of the previous group can be finalized and emitted.

This algorithm is currently implemented by Derby for a single set of grouping attributes; it seems possible to extend this logic for a ROLLUP set of groups, as follows:

Duplicate-eliminating Sort with Sort Observer

An unsorted set of records can be grouped by feeding them into the sorter, configured for duplicate elimination, with a sort observer which aggregates each row's values into the retained row's columns during duplicate processing. Each such sort groups the records by a specific set of grouping attributes.

Thus we:

Thus aggregate computation is a side-effect of sorting and duplicate elimination. Note that as a side effect of using the sorter for this purpose, the result set is emitted in ORDER BY the grouping attributes.

This algorithm is currently implemented by Derby for a single set of grouping attributes; it seems possible to extend this logic for implementing a ROLLUP set of groups, as follows:

This should be substantially more efficient than the alternate query-rewriting algorithm of constructing N GROUP BY statements and then UNION'ing their results together, since in this implementation the input rows are materialized and read only once.

However, if multiple sorts are used to compute multiple grouping sets, whether ROLLUP or unrelated, the overall result set may not necessarily be emitted in ORDER BY the grouping attributes.

This algorithm also seems like it could be extended to implement other types of GROUPING SETs and CUBE aggregation, since each sort can independently group the records by a different set of unrelated grouping attributes.

Hash-based Grouped Aggregation

An unsorted set of records can also be grouped by feeding them into a hash table, whose keys are the grouping attributes, with a hash-collision observer which aggregates each row's values into the retained row's columns during duplicate key elimination. This is fundamentally similar to the sort-based technique, with the important difference that at the end of the processing, the rows will NOT be emitted in sorted order.

I am not aware of any code which implements a hash-based technique for grouped aggregates in Derby.

Hybrid Algorithms

The inline-aggreation and sort-based techniques can be combined into an algorithm which computes the N ROLLUP groups using a single sort, by first sorting (and aggregating) the rows on their finest level of grouping detail.

All we have to do is to extend GroupedAggregateResultSet so that instead of computing a single result row at a time, it maintains N pending result rows, where N is the number of levels of ROLLUP in the query (N == 4 in the query above).

So, when GroupedAggregateResultSet is processing a row of data for (region=EAST,state=MA,product=BOATS), it should not only sum the sales for that level of aggregation, but should also at the same time sum the sales for the aggregation levels (region=EAST,state=MA), and (region=EAST), and ().

Then, when GroupedAggregateResultSet encounters the end of the (EAST,MA,BOATS) data, and starts to encounter (EAST,MA,CARS) data, it should finalize the result row for (EAST,MA,BOATS) and start working on computing the result row for (EAST,MA,CARS), but it should simply continue aggregating the result rows for (EAST,MA), (EAST), and ().

And when GroupedAggregateResultSet encounters the end of the (EAST,MA,CARS) data, and starts to encounter (EAST,NY,BOATS) data,it should finalize the result row for (EAST,MA,CARS), and it should also finalize the result row for (EAST,MA), and it should start working on computing the result row for (EAST,NY,BOATS), but it should continue aggregating the result rows for (EAST) and for ().

OLAPRollupLists (last edited 2009-09-20 22:11:48 by localhost)