Differences between revisions 1 and 2
Revision 1 as of 2006-10-31 17:49:50
Size: 14416
Editor: Army
Comment:
Revision 2 as of 2009-09-20 22:12:48
Size: 14422
Editor: localhost
Comment: converted to 1.6 markup
Deletions are marked like this. Additions are marked like this.
Line 1: Line 1:
As described on the [http://wiki.apache.org/db-derby/LanguageOptimize Optimizer Overview] page, there are three primary methods in Derby's !OptimizerImpl class that encapsulate the general process of trying to find the "best join order" for a given !OptimizerImpl's list of Optimizables. Those three methods are: As described on the [[http://wiki.apache.org/db-derby/LanguageOptimize|Optimizer Overview]] page, there are three primary methods in Derby's !OptimizerImpl class that encapsulate the general process of trying to find the "best join order" for a given !OptimizerImpl's list of Optimizables. Those three methods are:
Line 158: Line 158:
What all of this means is that there is no central place in the Derby code where all calculation of estimated cost occurs. Just as every result set node in the tree has to define its own "preprocess()" and "modifyAccessPaths()" methods (see [http://wiki.apache.org/db-derby/LanguageOptimize Optimizer Overview]), every result set has to define--either directly or through inheritance--its own "optimizeIt()" or "optimize()" method, as well. Depending on the type of result set in question, the amount and kind of work that is done to estimate the cost can vary greatly. What all of this means is that there is no central place in the Derby code where all calculation of estimated cost occurs. Just as every result set node in the tree has to define its own "preprocess()" and "modifyAccessPaths()" methods (see [[http://wiki.apache.org/db-derby/LanguageOptimize|Optimizer Overview]]), every result set has to define--either directly or through inheritance--its own "optimizeIt()" or "optimize()" method, as well. Depending on the type of result set in question, the amount and kind of work that is done to estimate the cost can vary greatly.
Line 189: Line 189:
Having completed this call, "optimizable"--which is our MRP-Optimizable--now knows what the estimated cost for its current access path is. We then return from the {{{ optimizer.costPermutation() }}} method shown in the "while" loop at the top of this page, and continue processing as described [http://wiki.apache.org/db-derby/LanguageOptimize here]. Having completed this call, "optimizable"--which is our MRP-Optimizable--now knows what the estimated cost for its current access path is. We then return from the {{{ optimizer.costPermutation() }}} method shown in the "while" loop at the top of this page, and continue processing as described [[http://wiki.apache.org/db-derby/LanguageOptimize|here]].

As described on the Optimizer Overview page, there are three primary methods in Derby's OptimizerImpl class that encapsulate the general process of trying to find the "best join order" for a given OptimizerImpl's list of Optimizables. Those three methods are:

  • - getNextPermutation() - getNextDecoratedPermutation() - costPermutation()

This page describes the general flow of code as it occurs when a class such as SelectNode or TableOperatorNode calls the costPermutation() method of an OptimizerImpl. The expectation is that such a call will occur as part of a block of code similar to the following (copied from SelectNode.optimize()):

/* Optimize this SelectNode */
while (optimizer.getNextPermutation())
{
    while (optimizer.getNextDecoratedPermutation())
    {
        optimizer.costPermutation();
    }
}

Thus when we reach  optimizer.costPermutation()  we know that:

  • OptimizerImpl has an array of Optimizables representing a (potentially incomplete) join order,

  • an Optimizable has just been placed at some position in the join order array, and
  • a specific "decoration", or access path, has been chosen for the most-recently-placed Optimizable (hereafter referred to as "MRP-Optimizable").

At this point, then, what we want to do is estimate the cost of having MRP-Optimizable at its current position in the join order with the current choice of join strategy and, if applicable, the current choice of index/heap.

A look at the code in  OptimizerImpl.costPermutation()  shows that it does three things. First it figures out what the "outerCost" for MRP-Optimizable should be. Then it checks to see if the current choice of join strategy is feasible for MRP-Optimizable. And finally, it makes a call to the "optimizeIt()" method of MRP-Optimizable, which is where the cost estimate is calculated.

Finding the "outer cost"

To start with, the "outer cost" for a given MRP-Optimizable is implicitly defined to be the cost of the left side of the join for which MRP-Optimizable is the right side. This follows from the fact that Derby only considers left-deep trees when estimating costs. So if, for example, we have a join order that is {T2, T1, T4} and T4 is our MRP-Optimizable, the join tree looks like:

      JOIN[0]
     /      \
   JOIN[1]   T4
  /    \
 T2     T1

and the "outer cost" for T4 is simply the cost of JOIN[1]. In addition to this definition of "outer cost", the code in OptimizerImpl implicitly defines the "estimated cost" of an Optimizable Opt[n] as follows:

Let "n" be a zero-based position in the join order and let Opt[n] be the Optimizable that is placed at position "n". The "estimated cost" of Opt[n] is:

  • if  (n == 0)  then it is simply the cost of accessing the rows in Opt[n]

  • else, it is the cost of accessing the rows in Opt[n] and joining them with the rows from Opt[n-1] using whatever join strategy is chosen as part of the "access path" for Opt[n].

So to continue our example, the "estimated cost" of T1 is meant to describe the cost of joining T2 with T1 using whatever join strategy is chosen as part of T1's access path. Similarly, the estimated cost of T4 is the cost of joining T4 with the result of the join between T2 and T1, using whatever join strategy is chosen as part of T4's access path.

What this means is that, according the definitions of "outer cost" and "estimated cost", the "outer cost" for a given Optimizable is simply the cost of the preceding Optimizable in the join order, or 1.0d if the Optimizable is the first in the join order (i.e.  n == 0 ). Thus the outer cost of T2 is 1.0d, the outer cost of T1 is the estimated cost of T2, and the outer cost of T4 is the estimated cost of JOIN[1], which is really just the estimated cost of T1 (according to the definition of "estimated cost").

In the actual code for OptimizerImpl.costPermutation(), this logic is encapsulated by the following:

/*
** Get the cost of the outer plan so far.  This gives us the current
** estimated rows, ordering, etc.
*/
CostEstimate outerCost;
if (joinPosition == 0)
{
    outerCost = outermostCostEstimate;
}
else
{
    outerCost =
        optimizableList.getOptimizable(
            proposedJoinOrder[joinPosition - 1]).
                getBestAccessPath().getCostEstimate();
}

Join Strategy Feasibility

After determining the outer cost for MRP-Optimizable, the second thing OptimizerImpl does as part of its "costPermutation()" method is check to see that the join strategy for the current access path is feasible:

/*
** Don't consider non-feasible join strategies.
*/
if ( ! optimizable.feasibleJoinStrategy(predicateList, this))
{
    return;
}

By this time "optimizable", which is our MRP-Optimizable, has some associated access path for which we want to estimate the cost, and that access path includes a choice of join strategy. Derby currently only supports two join strategies, nested loop and hash, represented by the classes NestedLoopJoinStrategy.java and HashJoinStrategy.java, respectively.

The call to  optimizable.feasibleJoinStrategy()  will ultimately result in a call to the "feasible()" method of the join strategy that is part of MRP-Optimizable's current access path. The JoinStrategy class then determines whether or not the join that it represents is feasible for MRP-Optimizable. As a general rule, a nested loop join is always feasible (though there are some corner-cases; see the code comments in  NestedLoopJoinStrategy.feasible() ). A hash join is only feasible if there exists an equijoin predicate between MRP-Optimizable and some other Optimizable that precedes MRP-Optimizable in the join order. See HashJoinsInDerby for more details on hash join processing during Derby optimization.

Calculating a Cost Estimate

After finding the "outer cost" for MRP-Optimizable and verifying that the current join strategy is feasible, OptimizerImpl then makes the call to estimate the cost of MRP-Optimizable at its current position in the join order with its current access path ("decoration"):

/* Cost the optimizable at the current join position */
optimizable.optimizeIt(this,
                       predicateList,
                       outerCost,
                       currentRowOrdering);

This is the last line of the OptimizerImpl.costPermutation() method.

The thing to note here is that OptimizerImpl does not directly compute the estimated cost of the Optimizable. Instead, the Optimizable is responsible for calculating its *own* estimated cost based on the information passed in from the OptimizerImpl to which the it belongs (i.e. using the arguments passed from the code shown above). An OptimizerImpl then takes the estimated cost calculated by the Optimizable and uses that estimate to calculate the overall estimated cost for the join order in question. That said, it is worth mentioning that the above code does not retrieve or save the actual cost estimate anywhere. This is because the Optimizable is also responsible for saving its own estimated cost, along with saving all information that describes the access path corresponding to the estimated cost. An OptimizerImpl may ask for the estimated cost or access path information for a specific Optimizable, but that information is *not* stored within the OptimizerImpl. This is why we see things like:

outerCost =
    optimizableList.getOptimizable(
        proposedJoinOrder[joinPosition - 1]).
            getBestAccessPath().getCostEstimate();

in many of the OptimizerImpl's methods. In this example the OptimizerImpl wants to know what the best cost estimate for a specific Optimizable is. In order to get that information the OptimizerImpl has to retrieve the target Optimizable from its list and specifically "ask" that Optimizable for the estimated cost of its best access path so far. That's what the above code snippet does.

Okay, so backing up a step, we said that an Optimizable is responsible for calculating its own estimated cost. As part of this responsibility the Optimizable must also ensure that it makes the necessary call(s) to its child(ren) result set(s) so that they, in turn, can calculate their estimated costs, as well. Depending on the type of of the result set, this usually means that the Optimizable will call one of "optimize()" or "optimizeIt()" on every child result set that it has. This effectively means that a call to  optimizable.optimizeIt()  from OptimizerImpl will yield a cost estimate for the entire subtree whose root is the Optimizable in question (i.e. MRP-Optimizable).

As an example, assume MRP-Optimizable is a UNION between two SELECT queries. Then as part of its "optimizeIt()" method the UNION must make a call to the "optimize()" or "optimizeIt()" method of both its left child AND its right child. The UNION will then calculate its own estimated cost based on the costs of its children, and that final cost is what the UnionNode will save for its current access path. We can get a glimpse of how this happens by looking at the following code snippet, pulled from the  UnionNode.optimizeIt()  method:

leftResultSet = optimizeSource(
                    optimizer,
                    leftResultSet,
                    getLeftOptPredicateList(),
                    outerCost);

rightResultSet = optimizeSource(
                    optimizer,
                    rightResultSet,
                    getRightOptPredicateList(),
                    outerCost);

CostEstimate costEstimate = getCostEstimate(optimizer);

/* The cost is the sum of the two child costs */
costEstimate.setCost(leftResultSet.getCostEstimate().getEstimatedCost(),
                     leftResultSet.getCostEstimate().rowCount(),
                     leftResultSet.getCostEstimate().singleScanRowCount() +
                     rightResultSet.getCostEstimate().singleScanRowCount());

costEstimate.add(rightResultSet.costEstimate, costEstimate);

In this case the "optimizeSource()" method comes from UnionNode's parent class, TableOperatorNode, in which we see one of two things, depending on the class of child nodes in question. For a child that is itself an Optimizable, the "optimizeSource()" method creates a new instance of OptimizerImpl and enters a loop that is identical to the while loop shown at the top of this page, thereby leading to recursive optimization. For a child that is not an Optimizable the "optimizeSource()" method makes the following call:

retval = sourceResultSet.optimize(
                            optimizer.getDataDictionary(),
                            predList,
                            outerCost.rowCount());

and thereby fulfills UnionNode's job of calling "optimize()" or "optimizeIt()" on every child.

What all of this means is that there is no central place in the Derby code where all calculation of estimated cost occurs. Just as every result set node in the tree has to define its own "preprocess()" and "modifyAccessPaths()" methods (see Optimizer Overview), every result set has to define--either directly or through inheritance--its own "optimizeIt()" or "optimize()" method, as well. Depending on the type of result set in question, the amount and kind of work that is done to estimate the cost can vary greatly.

That said, though, the "leaves" of most query trees are most often made up of base tables. For this reason it is worth looking more at the code flow that occurs for a base table in Derby, which is represented by an instance of FromBaseTable. In  FromBaseTable.optimizeIt()  we see the following:

optimizer.costOptimizable(
            this,
            tableDescriptor,
            getCurrentAccessPath().getConglomerateDescriptor(),
            predList,
            outerCost);

In other words, a FromBaseTable actually calls back to the OptimizerImpl to which it belongs. At first glance this may seem to contradict the notion of every Optimizable calculating its own estimated cost. If we keep going, though, we see that the "costOptimizable()" method of OptimizerImpl goes through several layers of method calls and ultimately ends up at the private method "costBasedCostOptimizable()" in the same class. In that method the OptimizerImpl does two things:

  1. it first calls "estimateTotalCost()", which in turn calls the "estimateCost()" method of the Optimizable in question--i.e. it calls *back* to the FromBaseTable for calculation of the estimated cost. In this way the calculation is again left up to the Optimizable, as expected.

  2. it then determines if the result of the call to "estimateCost()" on the Optimizable is the best one for that Optimizable so far. If so, the OptimizerImpl will tell the Optimizable to save the current cost (and the associated access path) as the best so far.

Given this code flow, it is not surprising that for most queries, the bulk of the low-level costing information and logic for Derby originates in the  FromBaseTable.estimateCost()  method. The cost estimates returned from that method are then passed back up the query tree to the various result set nodes that rely on it. So in the case of a UNION with two SELECT queries, the cost of each SELECT query is built on the estimated costs returned for the base tables in each query; the cost of the UNION is then build on the costs of the SELECT queries.

Eventually we end up back where we started in OptimizerImpl--namely, back at:

/* Cost the optimizable at the current join position */
optimizable.optimizeIt(this,
                       predicateList,
                       outerCost,
                       currentRowOrdering);

Having completed this call, "optimizable"--which is our MRP-Optimizable--now knows what the estimated cost for its current access path is. We then return from the  optimizer.costPermutation()  method shown in the "while" loop at the top of this page, and continue processing as described here.

PermutationCosting (last edited 2009-09-20 22:12:48 by localhost)