This page describes a small enhancement to Trafodion: Adding support for a regular expression pattern matching predicate. As the enhancement moves forward, we'll document here what we did, and use it to illuminate lower level design aspects. Note: We did this work on a side branch; none of it has been submitted to the master branch as of this writing.

This particular enhancement is to the Trafodion Compiler and Executor, so design illustrations are limited to those components.

 

General Approach

Aside from the fact that pattern matching is a useful feature in its own right, we have two goals for this particular enhancement:

  • To illustrate the lower level design of parts of the Trafodion Compiler and Executor.
  • To illustrate how a feature can be implemented incrementally.

The second point bears a little discussion. The open source development model we use for Trafodion is reviewer-centric: anyone can contribute changes, but in order for them to be merged into the master branch the changes must pass code review. Humans can review code much more quickly if the set of changes is small and crisp in its functionality. A small set of changes is much more likely to be integrated into the master branch, and much less likely to encounter merge issues along the way. The reader will see, as we discuss internal design below, that we have put some thought into how this particular feature can be broken up into a series of small changes.

In developing this feature we took the following steps.

  • We created a Launchpad blueprint to track the feature.
  • We took a first look at externals, observing how other products support regular expression matching, and how the ANSI standard supports it. It turns out that the ANSI standard is not widely followed in the industry; we chose a syntax closer to common industry practice.
  • We looked for analogies in the existing Trafodion code that are similar to pattern matching. Fortunately, there is one that is very much like it: the LIKE predicate.
  • We researched how LIKE is implemented at a high level. That is, we studied the classes that implement LIKE, and where they are invoked in Compiler and Executor processing. As a first cut, we imagined implementing regular expressions similarly. We also imagined breaking up our implementation by passes, for example one change set for parser changes, another for binder changes and so on.
  • We read the LIKE code in more detail. This exposed various subtleties that made us study regular expression externals more closely, to understand where the two kinds of pattern matching were alike and where they differed. It also led us to understand that quite a bit of the LIKE implementation applied to regular expressions as well, which made us consider refactoring the classes that implement LIKE. Finally, it opened up additional ways to break the changes into smaller pieces: for example, certain compile-time optimizations could be deferred to a separate set of changes.
  • We documented the regular expression pattern matching externals.
  • We began making small sets of code changes. With each code change, we unit tested, and ran regression tests. If we were actively developing this on the Master branch, we would have contributed each small set for review.

Launchpad Blueprint

The Launchpad Blueprint under which this work is tracked is at https://blueprints.launchpad.net/trafodion/+spec/regular-expression-predicate.

Externals

To decide on externals we looked at several possibilities.

  • The SIMILAR predicate was introduced into the ANSI SQL standard in its 1999 version. There is a good informal description in the book "SQL:1999: Understanding Relational Language Components" by Jim Melton and Alan R. Simon. A survey of competing database engines found none, however, that had implemented the ANSI syntax.
  • Oracle supports a suite of regular expression functions, starting in 10g. REGEXP_LIKE is a regular expression matching predicate. Other functions count the number of occurrences of a regular expression pattern in a given string, return the position of the first one, or replace a substring matching some pattern with another substring.
  • MySQL supports a REGEXP predicate, which matches a regular expression to a string.
  • SQL Server does not have direct support; however one can call a function in the CLR from SQL syntax to get regular expression pattern matching.
  • DB2 does not have direct support either. IBM's documentation shows how to create a user-defined function that invokes a regular expression pattern matcher of the user's choice.

Except for SIMILAR, all of the above use regular expression matching that conforms to the POSIX specifications. Since that seems to be the trend in the industry, we decided to go that way too.

Another consideration is the difficulty of a run-time implementation. As described below, we chose the RE2 regular expression engine for our run-time semantics. This implements an extension to the POSIX regular expression syntax, and so determines the precise syntax of regular expressions that we will support.

Based on the discussion above, we chose "REGEXP" as the spelling of our pattern-matching predicate.

The reader who is more interested in Compiler and Executor design than in the REGEXP predicate itself may skip this section and go directly to Internals.

Syntax

The syntax consists of two parts: the REGEXP predicate itself and the regular expression used in pattern matching.

The REGEXP Predicate

Here is the syntax for the REGEXP predicate:

<regular expression matching predicate> ::= 
 <character match value> [ NOT ] REGEXP <regular expression pattern> 

Here, <character match value> is a <character value expression> representing the value to be matched. <regular expression pattern> is a <character value expression> representing the regular expression to match the value against. One departure we make from the LIKE predicate is that we do not supply an ESCAPE clause. ANSI LIKE has no escape character by default; LIKE gives the user the ability to specify an optional escape character of their own choosing as needed. POSIX regular expressions, on the other hand, always use '\' as the escape character and do not provide a means to override that choice.

In ANSI SQL, a <character value expression> is any expression that yields a character value. So it could be a literal constant, a column reference, a dynamic parameter, or the result of any sequence of function calls and operators that yield a character value. Since the values of these expressions are not in general known at compile time, some of the error checks must be deferred to run time.

The ANSI SQL syntax rules for LIKE require that all three of its expressions (the match value, the pattern, and the escape character) be "comparable". We will apply the same rule to the two operands of REGEXP. For our purposes this means that they must be of the same character set.

ANSI SQL supports a notion of collating sequences. One could imagine using a case-insensitive collating sequence with REGEXP, getting case-insensitive pattern matching as a result. However, collating sequence support in general turns out to be complicated. To limit the scope of this work, we'll restrict ourselves to the default collating sequence, and will raise a compile time error if a non-default collating sequence is specified or implied.

The Regular Expression

We will use the RE2 package unaltered. The syntax of regular expressions supported by RE2 is available on Google's re2 syntax page.

If the value given for the regular expression is invalid, we will raise the following error:

*** ERROR[8442] Invalid regular expression: <message text from RE2>

Here, the <message text from RE2> gives more specifics. For example, '*a' is an invalid regular expression, because the *-operator is a post-fix operator. In this case, the error message returned is:

*** ERROR[8442] Invalid regular expression: no argument for repetition operator: *

Run Time Semantics

If either operand to REGEXP is a null value, the REGEXP predicate will return null. This is similar to the LIKE predicate behavior.

Assuming both operands are not null, and the regular expression specified is valid, the predicate will return true if the value specified matches the regular expression in its entirety, and false otherwise.
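To make the whole-value semantics concrete, here are a few illustrative cases (these are not from an actual test run; the truth values follow directly from the full-match behavior just described):

'Smith'  REGEXP 'Sm.th'      -- true:  the pattern matches the entire value
'Smith'  REGEXP 'm.th'       -- false: only a substring matches, and the whole value must match
'Smithy' REGEXP 'Sm.th'      -- false: the trailing 'y' is left unmatched; 'Sm.th.*' would match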

Internals

In this section, we give an overview of Compiler and Executor internals, and describe how we implement the REGEXP predicate within these internals.

Query Compilation and Execution Flow

A query in Trafodion passes through several phases:

  1. Compilation
  2. Fix-up
  3. Execution
  4. Awaiting clean-up or reuse

Compilation

During compilation, a query is transformed from query text into various intermediate trees then finally into a query plan. As the query is parsed, it is represented as a syntax tree. This tree is then bound to various metadata, then transformed into a tree suitable for optimization. Optimization explores a query plan space by manipulating this tree further, making partial copies of the tree for each explored plan. Finally, the chosen plan is generated. During compilation one set of C++ classes represents the nodes in the query tree. At code generation, the query plan is transformed into a different representation in a different set of C++ classes. We discuss some of these below.

Fix-up

As mentioned, Trafodion query plans consist of C++ objects generated by the Trafodion Compiler. These objects are organized into fragments, each fragment representing a piece of the query plan that runs in a particular process. Since query plans run in several processes, Trafodion must serialize and deserialize these objects so they can be communicated across process boundaries.

At query execution time, the initial phase of processing is called fix-up. During fix-up, the Trafodion Executor acquires any processes it needs and downloads the query plan fragments to each. Once downloaded, the fragment is deserialized, and additional run-time objects are created. Any files that need to be opened are opened.

Execution

Execution begins by passing an input row consisting of input parameters to the root node of the query. This row then flows down the tree of relational operators with appropriate transformations along the way. Eventually the inputs reach the leaf nodes of the query. For a SELECT these would be scan nodes. Those nodes in turn read rows that are passed up the tree. Each operator informs its parent operator when it has returned all rows satisfying a given input.

A query ends execution in one of three ways. It may end because all the output rows have been processed. It may end because an error was detected during execution. It may end early because the client has closed the output cursor before consuming all the output rows, or because the query has been cancelled. Since operators execute asynchronously and communicate via queues, error conditions or early close/cancel will cause a cancellation indication to be passed down all or part of the query tree. Eventually, though, all operators quiesce.

Awaiting Clean-up or Reuse

Queries can be explicitly cleaned up (for example, by reusing the statement name when preparing a query). If they are not, the query plan remains available in fixed-up state on the chance that the query is re-executed.

ExprNode, RelExpr and ItemExpr Class Hierarchies

Within the Compiler, all SQL expressions are represented by an object in the ExprNode class hierarchy. These can be subdivided into relational expressions (having a value of a relation) and item expressions (having a scalar or row value). This subdivision in turn leads to two sub-trees of the ExprNode hierarchy, the RelExpr and ItemExpr trees respectively.

The REGEXP predicate has a scalar value, so for our purposes here we are most interested in the ItemExpr hierarchy. Given that REGEXP bears many similarities to LIKE, it is worth looking at how LIKE is represented in this hierarchy. The picture below describes it:

(Figure: the Like class hierarchy: ItemExpr -> Function -> BuiltinFunction -> CacheableBuiltinFunction -> Like)

The Function class is a specialization of ItemExpr, representing a function call. This might be a built-in function or a user-defined function. Built-in functions in turn are represented by the BuiltinFunction specialization of Function. The CacheableBuiltinFunction class is a further specialization for those functions that are "cacheable" in the sense of the compiler query cache.

Of course, methods that are specific to the LIKE predicate are implemented in the Like class.

The code for Like is spread over several files. Code in the Compiler tends to be arranged by the kind of processing the methods do. So, for example, all the ItemExpr hierarchy methods dealing with type synthesis are together in one file. The table below gives an overview of where some of the common ItemExpr class methods (including the Like representatives) live:

Method                    | File                            | File Description
bindNode                  | sql/optimizer/BindItemExpr.cpp  | bindNode methods bind column and table references to SQL metadata
normalizeForCache         | sql/optimizer/ItemCache.cpp     | deals with cached query plans
constructor, copyTopNode  | sql/optimizer/ItemExpr.cpp      | all the constructors for ItemExpr classes are here; also copyTopNode, used to duplicate the top node of an ItemExpr tree (sharing the children below it)
synthesizeType            | sql/optimizer/SynthType.cpp     | synthesizeType methods synthesize the type of expressions, based on the node and its children

In our design research, we found that many methods on the Like class also apply to the REGEXP predicate. Therefore, we propose some refactoring. As part of our implementation, we will create a new class Regexp, in parallel with Like. We will also create a new class, PatternMatchingFunction, which inherits from CacheableBuiltinFunction, and has Like and Regexp as subclasses. Some of the existing methods of the Like class will be pushed up to PatternMatchingFunction so we can use them for Regexp as well.
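A minimal sketch of the proposed compile-time hierarchy appears below. It is illustrative only: the real declarations live in sql/optimizer/ItemFunc.h, have many more members, and the applyBeginEndKeys signature shown here is a placeholder.

     // Sketch of the proposed refactoring (not the actual Trafodion declarations).
     class PatternMatchingFunction : public CacheableBuiltinFunction
     {
     public:
       // Methods common to LIKE and REGEXP are pushed up here, for example
       // bindNode, synthesizeType, generateCacheKey and normalizeForCache.
       virtual ItemExpr *bindNode(BindWA *bindWA);
       // Each leaf supplies its own compile-time key optimizations.
       virtual ItemExpr *applyBeginEndKeys(BindWA *bindWA) = 0;   // signature is a placeholder
     protected:
       // Data members formerly private to Like move here so Regexp can share them.
     };

     class Like : public PatternMatchingFunction
     {
     public:
       virtual ItemExpr *applyBeginEndKeys(BindWA *bindWA);   // LIKE-specific optimizations stay here
     };

     class Regexp : public PatternMatchingFunction
     {
     public:
       virtual ItemExpr *applyBeginEndKeys(BindWA *bindWA);   // initially just the collation check
     };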

In our implementation approach, we wanted to implement the code pass by pass. That requires an understanding of which pass each method belongs to. Some of this can be guessed simply from the method name ("bindNode" suggests the Binder pass, and knowing that the Binder pass does type synthesis suggests that "synthesizeType" is also a Binder pass method). To be more certain, though, we used the following technique to figure out when methods are called:

  1. We started a SQL session, then attached gdb to the mxosrvr process.
  2. We put break points in for every non-inlined method defined in the Like class.
  3. We executed a query with a LIKE predicate, and noted the stack trace at each method call.

In each stack trace, we looked for the stack frame for the CmpMain::compile method; this is the method that drives the passes. Using gdb, we could list the CmpMain::compile source around that frame, and the comments there reveal which pass is being invoked.
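A typical session looked roughly like the following (the process id, table name, and breakpoint list are illustrative; the commands are standard gdb usage):

gdb -p 12345                      # attach to the running mxosrvr process
(gdb) break Like::bindNode        # one breakpoint per non-inlined Like method
(gdb) break Like::synthesizeType
(gdb) continue
                                  # in the SQL session: select * from t where b like 'abc%';
(gdb) backtrace                   # note the stack at each breakpoint hit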

One thing to note in this exercise is that some methods are called in multiple passes. This happens because the Compiler sometimes rewrites expressions during its processing. When it creates a new parse sub-tree, it must bind it, normalize it, and so on. So, if a portion of the tree is rewritten due to an Optimizer transformation, for example, one can expect to see bind and normalize methods executed on that sub-tree even though CmpMain::compile shows that we are in the Optimizer pass.

Quite often in these rewrites, the new sub-tree will contain pieces of the old parse tree. Binding, normalizing, etc. are recursive processes. This means that sometimes an old parse node will be revisited (say, for binding). For this reason, you will often see logic in these methods that makes them idempotent; for example, the bindNode methods usually check to see if the node has already been bound before proceeding.
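For example, a bindNode method typically begins with a guard along these lines (a sketch of the common pattern; the class name is hypothetical):

     ItemExpr *SomeItemExpr::bindNode(BindWA *bindWA)
     {
       // If an earlier visit already bound this node, don't bind it again.
       if (nodeIsBound())
         return getValueId().getItemExpr();

       // ... first-time binding work goes here ...

       return ItemExpr::bindNode(bindWA);   // bind children and finish up in the base class
     }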

Run-time Class Hierarchies

The ExprNode class hierarchy is compile-time only. The compiler translates the query tree into a query plan. The main class hierarchies on the query plan and execution side are the ex_tdb and ex_tcb hierarchies (task descriptor blocks and task control blocks) for relational operators, and the ex_clause hierarchy (scalar expression operators). There is a rough correspondence between RelExpr classes and ex_tdb/ex_tcb classes, and also between ItemExpr classes and ex_clause classes.

Unlike RelExpr and ItemExpr, however, ex_tcb and ex_clause do not share a unique base class. (They do both ultimately inherit from NABasicObject, but other hierarchies also inherit from NABasicObject.) The reason is that the query plan implementation of relational operators is quite different from scalar operators. Relational operators implement a data flow architecture, and represent tasks that are executed asynchronously at run-time. Relational operators communicate with each other via queues. Scalar expression operators on the other hand are invoked in a single-threaded manner by an expression evaluation engine. Relational operators are arranged in trees in general, while scalar operators are arranged as sequences within expressions.

ex_tcb, ex_tdb Class Hierarchies

The ex_tdb class hierarchy is used to represent the compile-time state of relational operators. The query plan is represented as a tree of ex_tdb objects. Each node in the plan is some leaf in this class hierarchy which represents the semantics of a particular operator. At fix-up time, Trafodion creates a parallel hierarchy of ex_tcb objects which contain the run-time state of particular operators. For most operators, there is a one-to-one correspondence between an ex_tdb object and an ex_tcb object. Exceptions are operators that deal with parallelized inter-process communication. For example, the "split bottom" operator combines n partitions of a stream of data from lower level fragments into one stream in the current, parent fragment. There is one ex_tdb object for the operator as a whole, but one ex_tcb object for each of the partitions.

ex_clause Class Hierarchy

The ex_clause class is the base class for all individual scalar operations. A scalar expression consists of a sequence of objects from the ex_clause hierarchy. Both compile time and run-time state may be stored within an ex_clause object.

For the LIKE predicate, there is a tree of subclasses. ex_like_clause_base is the base class for LIKE. ex_like_clause_char is the specialization that implements LIKE for the ISO88591 and UTF8 character sets. ex_like_clause_double_byte is the specialization that implements LIKE for the UCS2 double-byte character set.

There does not seem to be any advantage in using ex_like_clause_base as a base class for REGEXP as well, nor does there seem to be a useful run-time abstraction common to LIKE and REGEXP beyond what is already represented by ex_clause. In addition, we decided not to do a run-time implementation for double-byte character set values; instead we decided to translate those values to UTF8 in the query plan. We will create a class hierarchy rooted at ExRegexpClauseBase that parallels ex_like_clause_base, but without a double-byte leaf class.

Memory Management

The Compiler and Executor use an optimized memory management scheme. Both create many objects per SQL statement. We avoid the overhead of calling destructors on most of these objects by placing them in customized heaps, which we simply throw away en masse.

Some care must be taken in using this scheme. One must be careful to use the same heap when deleting an object as when one created it. For most Compiler and Executor classes, this has been simplified by defining these classes as inheriting from NABasicObject. NABasicObject overloads new and delete, and keeps a pointer to the heap that an object was allocated on. Overloaded delete uses this pointer in order to free the object storage to the proper heap. But much of the time overloaded delete is never called; instead the heap itself is destroyed.
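The resulting allocation idiom looks roughly like this (the class being allocated and the heap accessor are just examples for illustration):

     // NABasicObject's overloaded operator new takes the target heap and records
     // which heap the object lives on; overloaded delete (when it is called at all)
     // returns the storage to that same heap.
     CollHeap *heap = bindWA->wHeap();                        // e.g. the Binder's statement heap
     Regexp *regexpNode = new (heap) Regexp(value, pattern);  // allocated on that heap
     // Usually there is no matching delete; the heap is simply discarded wholesale
     // when the statement's objects are no longer needed.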

For Executor classes in particular, these heaps might be destroyed at any time without knowledge within the classes themselves.

This implies that any resources outside the heap that are owned by a class must be handled specially. File opens, for example, are tracked by a separate layer of code that is not based on the heap. That code can detect when a query plan has gone away and close files in a lazy fashion as needed.

For another example, if we have a class that does not inherit from NABasicObject, we must be very careful about using such a class from an Executor class to avoid memory and other resource leaks. In our example, we are using a class RE2 from the RE2 package. If we create this object as a stack variable, we are assured it will be destroyed when it goes out of scope. On the other hand, if we wish to create this object on the C++ heap, we must track its creation outside the plan fragment, and have the agent that destroys the plan fragment's heap destroy this object also.

Error Messages

Our implementation will result in some new error checks. Also, in doing the work incrementally from the parser forward, we will want to issue a stub error message at the point where our changes stop. So, here is some general information about error message generation and handling in the Compiler and Executor.

The Compiler and Executor both use class ComDiagsArea to manage error information. The execution model in both is that when an error occurs, information about the error is inserted into ComDiagsArea, and the calling method typically returns. Later processing often checks to see if any errors have been raised before proceeding.

Creating an Error

To create an error, one uses the << operator to insert a stream of DgBase objects into a ComDiagsArea object. DgBase is a class hierarchy; the leaf classes define different tokens associated with the error. The first DgBase object in the stream is a DgSqlCode object which gives the (integer) SQL error code associated with the error. The remaining DgBase objects vary depending on what SQL error is raised. For example, an error complaining that a column does not exist in a given table would have a DgColumnName object naming the column, and a DgTableName object naming the table. Below is example code creating an error taken from sql/optimizer/BindItemExpr.cpp:

     *CmpCommon::diags() << DgSqlCode(-4002)
       << DgColumnName(nam)
       << DgTableName(getCorrNameObj().getExposedNameAsAnsiString())
       << DgString0(fmtdList)
       << DgString1(bindWA->getDefaultSchema().getSchemaNameAsAnsiString());
     bindWA->setErrStatus();

Things to note in this code: The Compiler, being in a single process, typically uses one ComDiagsArea object for all its errors. The CmpCommon class has a static method, diags(), that returns a reference to this object. Another thing to note is the setErrStatus call after the error is created. The object referenced here is the Binder work area. Each pass of the Compiler has a work area object, which serves as a set of "globals" for that pass. The Binder keeps track of whether any errors have been raised in its work area.

In this example, a constant was hard-coded for the SQL error code. The Compiler presently lacks an enumeration for SQL error codes. The Executor on the other hand does have one; you can find it in sql/exp/ExpErrorEnum.h. (In another accident of history, this file is in the exp (expressions) directory, but it includes all Executor errors.)

The ComDiagsArea class definition and the declaration of the << operator are in file sql/export/ComDiagsArea.h. One interesting thing to note about ComDiagsArea is that it inherits from IpcMessageObj. This is because we sometimes pass error information across process boundaries. The DgBase class hierarchy is defined in file sql/common/DgBaseType.h.

Defining the Text for an Error

SQL errors are usually presented to end users in text form. The code that translates error diagnostics to text form uses a template file found in sql/bin/SqlciErrors.txt. There is a line in this file for each SQL error code. The format of each line is:

<SQL error code> <SQL state> <SQL sub-state> <BEGINNER/ADVANCED> <severity> <reporting mechanism> <message text>

An example is given below:

1009 ZZZZZ 99999 BEGINNER MINOR DBADMIN Column $0~ColumnName does not exist in the specified table.

In the text part of the example, notice the string "$0~ColumnName". The first DgColumnName object inserted into the error stream will have its value substituted for this token.
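For comparison, the template line we add later for the invalid-regular-expression error (8442, described under the run-time changes below) would look something like this; the SQL state, sub-state, and severity fields, as well as the exact parameter token spelling, are placeholders following the pattern above rather than the values actually chosen:

8442 ZZZZZ 99999 BEGINNER MINOR DBADMIN Invalid regular expression: $0~String0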

By convention, ranges of SQL error codes have been dedicated to certain components.

  • 1000-1999 are reserved for DDL errors.
  • 2000-2999 are reserved for Compiler main errors.
  • 3000-3999 are reserved for Parser errors.
  • 4000-4999 are reserved for Binder errors.
  • 5000-5999 are reserved for Normalizer errors.
  • 6000-6999 are reserved for Optimizer errors.
  • 7000-7999 are reserved for Generator errors.
  • 8000-8999 are reserved for Executor errors.
  • 9200-9250 are reserved for the UPDATE STATISTICS utilities.
  • 10000-10199 are reserved for Sort errors.
  • 11000-11099 are reserved for run-time Triggers errors.
  • 11100-11199 are reserved for UDR server errors.
  • 11200-11299 are reserved for Language Manager errors.
  • 12000-12499 are reserved for Materialized View errors.
  • 13000-13999 are reserved for SQL Preprocessor errors.
  • 15000-15099 are reserved for Sqlci errors.
  • 15200-15999 are reserved for Report Writer errors.
  • 19000 and up are used by various utilities.

REGEXP Run Time Internals

We use the open source RE2 package for run-time semantics. The eval method of our new ExRegexpClauseChar class will create an RE2 object on the stack, pass the regular expression and the value to be matched to it, and handle the results passed back.

RE2 does not support UCS2. Rather than adding such support, it seems simpler to convert UCS2 strings in the query plan to UTF8 strings instead, and use the UTF8 support in RE2.
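A rough sketch of what that eval logic amounts to, using the public RE2 API, is shown below. The helper function, its signature, and the way operands and errors are surfaced are simplifications for illustration; only the RE2 calls themselves are the real library API.

     #include <string>
     #include <re2/re2.h>

     // Illustrative stand-in for the heart of ExRegexpClauseChar::eval.
     // 'value'/'pattern' point at the operand strings already extracted from the
     // operand tuples; 'isUTF8' reflects the operands' character set.
     // Returns 1 (match), 0 (no match), or -1 (invalid pattern; errorText is set).
     static int evalRegexpMatch(const char *value, int valueLen,
                                const char *pattern, int patternLen,
                                bool isUTF8, std::string &errorText)
     {
       RE2::Options options;
       options.set_encoding(isUTF8 ? RE2::Options::EncodingUTF8
                                   : RE2::Options::EncodingLatin1);
       options.set_log_errors(false);   // we report errors ourselves

       // The RE2 object is a stack variable, so no Executor heap bookkeeping is
       // needed; the DFA it builds goes away when we return.
       RE2 re(re2::StringPiece(pattern, patternLen), options);
       if (!re.ok())
       {
         errorText = re.error();        // e.g. "no argument for repetition operator: *"
         return -1;                     // caller raises error 8442 with this text
       }
       return RE2::FullMatch(re2::StringPiece(value, valueLen), re) ? 1 : 0;
     }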

The Code Changes

Here we list the code changes to implement REGEXP.

We actually didn't develop the changes quite this way. We made some wrong design choices along the way, and refactored our work as a result. The interested reader can jump to Wrong Turns to see what we really did.

Syntax Plus Stub

The first change we did was to add the REGEXP predicate syntax to the parser. In this step we have not created classes to represent REGEXP yet. So we added stub logic to the parser actions to raise an error if REGEXP is specified. Specific changes:

  1. sql/parser/ParKeyWords.cpp: This file contains a table of keywords used in the Trafodion SQL language. This table maps a string to a token enum. It also suggests whether the token is regarded as a reserved word or not. When a keyword is encountered in lexical analysis, we do a lookup in this table to see what token it maps to. We added an entry for "REGEXP", mapping to the TOK_REGEXP token. We flagged this keyword as non-reserved. (REGEXP is not a reserved word in the ANSI standard.) By defining this token as non-reserved, users can continue to create columns named REGEXP, without using delimited identifier syntax.
  2. sql/parser/sqlparser.y: This file contains the Bison parser specification for Trafodion. We made changes to this file to recognize the REGEXP predicate. Specifically:
    1. Defined the TOK_REGEXP token (by adding a %token <tokval> statement)
    2. Defined the types returned by our new productions below (by adding %type statements)
    3. Added productions to recognize the REGEXP predicate. We mimicked what was already done for LIKE, except we omitted the ESCAPE clause. We commented out the semantic actions, though; and instead added code to raise an error -9980 if that production is activated.
    4. Added TOK_REGEXP as an alternative in the nonreserved_word production.
  3. sql/bin/SqlciErrors.txt: Added a line to this file defining a new error, 9980, with the text, "The REGEXP predicate is not yet implemented."

With these changes, the REGEXP predicate is now recognized by Trafodion. We get output like this:

>>select * from temp where b regexp '%xyz%';
 
*** ERROR[9980] The REGEXP predicate is not yet implemented.
 
*** ERROR[8822] The statement was not prepared.

Initial Refactoring of the Like Class

Study of the Like methods revealed that much of the code there also applies to Regexp. Rather than create Regexp as a peer class and clone the code, we've chosen to refactor the Like class into a parent class, PatternMatchingFunction with a leaf class Like. In the next set of changes, we'll add Regexp as a second leaf class. The methods and data members that look useful for both LIKE and REGEXP are pushed up into the PatternMatchingFunction class, while methods that seem LIKE-specific will remain in the Like class. Specific changes:

  1. sql/optimizer/ItemFunc.h: Added the definition of the PatternMatchingFunction class. Changed the Like class to inherit from it. Moved all the data members from Like up into PatternMatchingFunction, as they look like they will have relevance to REGEXP as well. Moved many of the methods up. Changed the Like::applyBeginEndKeys method to virtual; made that method pure virtual in the PatternMatchingFunction class. At this stage, we made most of the data members "protected" rather than "private"; in a later set of changes we'll modify the remaining Like class methods to use getters and setters on these members instead. (One member already had adequate getters and setters which were being used everywhere, and was painlessly made private in this set of changes.)
  2. sql/optimizer/BindItemExpr.cpp: We renamed the Like::bindNode and Like::beginEndKeysApplied methods to PatternMatchingFunction::bindNode and PatternMatchingFunction::beginEndKeysApplied respectively. Note that the Like::applyBeginEndKeys method remains in the Like class; it contains optimizations that are LIKE-specific. (We may create some analogous optimizations for REGEXP later.)
  3. sql/optimizer/ItemExpr.cpp: Added destructor for PatternMatchingFunction.
  4. sql/optimizer/SynthType.cpp: We renamed the Like::synthesizeType method to PatternMatchingFunction::synthesizeType.

We ran the full regression suite after this set of changes to make sure our refactoring didn't break LIKE functionality.

Regexp Class and its Binder Methods

With this set of changes, we add support for static semantic checks for the REGEXP predicate. That is, we do type checking of its operands. It turns out to be convenient to take care of the query caching logic at this time as well. Our stub to raise an error when REGEXP is specified is moved out of the parser and into the code generator.

  1. sql/optimizer/ItemFunc.h: Add the Regexp class. We clone the (refactored) Like class, then comment out all but the constructor/destructor and binder methods. (We also delete the two constructors that take an escapeChar argument, because REGEXP does not have an ESCAPE clause.) It turns out to be hard to stub out the cache analysis methods, so we don't comment those out either. Also, pushed the generateCacheKey and normalizeForCache methods up from the Like class to the PatternMatchingFunction class.
  2. sql/common/OperTypeEnum.h: Added ITM_REGEXP to the operator type enumeration.
  3. sql/parser/sqlparser.y: Uncommented the semantic actions in the REGEXP predicate productions, so that a Regexp class object will be created when such productions are reduced. Removed the call to generate a stub error message.
  4. sql/optimizer/ItemCache.cpp: Decided that the two cache methods, Like::generateCacheKey and Like::normalizeForCache applied equally well to Regexp, and so pushed these up to the PatternMatchingFunction class. (The original intention was to keep these specialized, and put stub logic in there to raise an error for REGEXP; however the cacheable query logic does not expect errors, so our stub error would not be caught. The methods turn out to be simple; so we just took care of their implementations now.)
  5. sql/optimizer/BindItemExpr.cpp: Implement Regexp::applyBeginEndKeys method. For now, it does not do any optimizations. It does however contain the check restricting the collating sequence to the default collation. Also, we added logic here to raise our stub error (9980) in the case that REGEXP on double-byte character set operands is specified. (We add the double-byte support in a later change set.)
  6. sql/optimizer/ItemExpr.cpp: Add Regexp destructor and Regexp::copyTopNode methods.
  7. sql/bin/SqlciErrors.txt: Added error 4392, "The REGEXP predicate only supports the default collating sequence."
  8. sql/generator/GenItemFunc.cpp: Added a case to BuiltinFunction::codeGen, to raise our stub error (9980) in the ITM_REGEXP case.

With these code changes, we can now do negative tests on type checking. For example:

>>create table test1(a integer not null, b date not null) no partition;
 
--- SQL operation complete.
>>select * from test1 where a regexp '[x]*';
 
*** ERROR[4041] Type INTEGER cannot be compared with type CHAR(4).
 
*** ERROR[4050] The operands of the REGEXP predicate must be comparable character data types (that is, of the 
same character set and collation).
 
*** ERROR[8822] The statement was not prepared.
                       
>>select * from test1 where b regexp '[x]*';
 
*** ERROR[4041] Type DATE cannot be compared with type CHAR(4).
 
*** ERROR[4050] The operands of the REGEXP predicate must be comparable character data types (that is, of the 
same character set and collation).
 
*** ERROR[8822] The statement was not prepared.
>>

Code Generation, Stubbed Run-Time Evaluation

The next set of changes will define run-time clause classes for the REGEXP predicate, though we will stub their evaluation routines so that any attempt to execute a REGEXP predicate will result in our friendly -9980 stub error.

Note that we have skipped over the implementation of normalizer and optimizer methods for the moment. We do this because testing these requires the ability to generate a plan, and this can't be done until we can generate the run-time classes. What this means, though, is that we'll get default selectivity for all REGEXP predicates, and we won't get such transformations as the left join-to-inner join transformation when a REGEXP predicate references a null-instantiated column.

We make the following changes:

  1. sql/optimizer/BindItemExpr.cpp: While checking to see what code uses ITM_LIKE, we came across the function ItemExpr::shouldPushTranslateDown, which figures out whether to push a TRANSLATE operation down an operator tree. We added cases here to treat REGEXP in the same way as LIKE.
  2. sql/optimizer/ItemExpr.cpp: We also came across the function, BuiltinFunction::getText, which is used by certain compiler debugging tools. We added a case here for REGEXP.
  3. sql/generator/GenItemFunc.cpp: Added code to BuiltinFunction::codeGen to generate an ExRegexpClauseChar object for a REGEXP predicate on a single-byte character set value. (Note that REGEXP predicates on double-byte character set values don't get past the Binder.)
  4. sql/exp/exp_clause_derived.h: Added class definitions for ExRegexpClauseBase and ExRegexpClauseChar. These class definitions mirror those for ex_like_clause_base etc, except that we didn't create an ExRegexpClauseDoubleByte class. We made this a class hierarchy anyway though (rather than collapse Base and Char together) on the off-chance that in the future we might want a double-byte run-time implementation. Note that we have ExRegexpClauseBase inheriting directly from ex_clause (as does ex_like_clause_base); there didn't seem to be any useful refactoring of the LIKE run-time classes that would benefit REGEXP. Note also that we used the camel-case naming convention for our new classes; the ex_like_clause_base classes were defined very early before the camel-case naming convention was put in place.
  5. sql/exp/exp_like.cpp: This file contains run-time methods used for LIKE processing. We chose to add the methods for REGEXP here also; it would have been just as valid to create a new module instead. We implemented ExRegexpClauseBase::processNulls here. This method gets called before the eval method in order to deal with null values. If any of the operands to REGEXP are null, the predicate returns null. (This is the one method that we could have shared with LIKE; LIKE's null semantics are similar. We decided that this wasn't enough to justify refactoring.) Also in this file, we implemented the non-trivial constructor for ExRegexpClauseChar.
  6. sql/exp/exp_clause.h: Added REGEXP_CLAUSE_CHAR_ID to the clause_type enum. These are class IDs needed by fix-up logic in order to instantiate the proper class upon deserialization.
  7. sql/exp/exp_clause.cpp: Added a case to ex_clause::ex_clause for ITM_REGEXP. This case sets the class ID for the object. The class ID is used for deserialization when the object is sent to another process. Added a case to ex_clause::findVTblPtr for REGEXP_CLAUSE_CHAR_ID. Added the trivial constructor and the displayContents method for ExRegexpClauseChar. Added a case for ITM_REGEXP to the getOperTypeEnumAsString function. This function is used by the SHOWPLAN utility to display expressions generated in a plan.
  8. sql/exp/ExpPackDefs.cpp: Added a pack method for ExRegexpClauseChar.

With these changes, a query using the REGEXP predicate can be compiled. The query will even execute successfully against an empty table or a table containing only null values in the expressions referenced by REGEXP.

Run-Time Changes for Single-Byte Character Sets

When considering how to implement the run-time semantics, we briefly considered implementing our own regular expression engine. For the basic regular expression operators, +, *, ? and |, this is straightforward. One can find, for example, an algorithm to do this in Aho, Sethi and Ullman's book "Compilers: Principles, Techniques, and Tools", pp. 135-141. To make this industrial-strength, though, one needs to add support for wildcards and character classes. A regular expression engine must parse the pattern and implement matching semantics. Fortunately, there is an open source engine available, RE2, with an appropriate license. This engine is DFA-based (like the Aho, Sethi, Ullman algorithm) and has a small memory footprint. It implements a very rich set of regular expression features, close to the POSIX standard. We chose to use this package for our run-time semantics.

This set of changes adds the run-time semantics for single-byte character sets. The double-byte support remains stubbed.

  1. sql/exp/exp_like.cpp: Added include for re2/re2.h. Added ExRegexpClauseChar::eval implementation. The code changes are much simpler than ex_like_clause_char::eval, because we use the RE2 engine. The code changes consist of code to obtain the addresses and lengths of the operands (cloned from ex_like_clause_char::eval), then a call to the RE2 engine. One detail is that ExRegexpClauseChar::eval handles both the ISO88591 and UTF8 character sets. Fortunately, RE2 does also. But we have to pass in an option to RE2 to indicate what encoding we are passing. The engine may return an error if the regular expression pattern is syntactically invalid; we have logic to report such errors. Our implementation is somewhat inefficient, actually: We build the DFA each time ExRegexpClauseChar::eval is called. This is unnecessary for most use, however; often the regular expression is supplied as a constant string or an input parameter to the SQL statement. It would be better if we constructed the DFA once, and reused it for subsequent pattern matching. We leave that refinement to a later change. This implementation does avoid the memory management complications described earlier, since the DFA is built by an RE2 object that is declared as a stack variable.
  2. sql/nskgmake/Makerules.linux: Added logic to copy RE2 library libre2.so to appropriate place. Added -I, -l and -L flags so header file and library can be accessed on compiles and links.
  3. sql/bin/SqlciErrors.txt: Added an error message to be issued when an invalid regular expression is specified. The error message includes as its String0 parameter the text message returned from RE2.
  4. sql/exp/ExpErrorEnums.h: Added a literal for the error message number.

Double-Byte Character Set Support

Trafodion supports UCS2 (an encoding of Unicode) as its only double-byte character set. This set of changes adds the run-time semantics for UCS2. Since this is the final bit of code required to have a fully-functional implementation, we also remove the stub error message we added in our very first step.

The RE2 library does not support UCS2-encoded characters. So, to support REGEXP for UCS2, we need to translate the UCS2 strings to UTF8 strings, which RE2 does understand. This is always possible, because every UCS2 character can be represented in UTF8.

We have a design choice to make here. We could mimic LIKE, and let REGEXP have two run-time implementations. The double-byte implementation would translate UCS2 strings to UTF8 strings on the fly. Alternatively, we could let the Compiler do the work for us. If we find a REGEXP predicate with UCS2 operands, we do the following transformation: We stick TRANSLATE nodes on top of each operand sub-tree, making them UTF8 strings, and generate a new REGEXP node with these TRANSLATE nodes as children. Now, the rest of compilation and execution deals only with the single-byte variation of REGEXP.

Besides simplifying the run-time implementation, doing this in the compiler has some other benefits: If an operand is constant, the TRANSLATE function may be constant-folded at compile time, so the need to translate it on the fly at run-time is eliminated.
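In sketch form, the Binder-time rewrite amounts to something like the following. The Translate node's constructor arguments and the name of the conversion enum are assumptions made for illustration and were not verified against the source; only the overall shape of the rewrite follows the description above.

     // Sketch: rewrite   <ucs2Value> REGEXP <ucs2Pattern>
     //            into   TRANSLATE(<ucs2Value>) REGEXP TRANSLATE(<ucs2Pattern>)
     // so that only the single-byte (UTF8) run-time path is ever generated.
     // Constructor arguments and the UCS2_TO_UTF8 enum are assumed, not verified.
     ItemExpr *utf8Value   = new (heap) Translate(valueExpr,   Translate::UCS2_TO_UTF8);
     ItemExpr *utf8Pattern = new (heap) Translate(patternExpr, Translate::UCS2_TO_UTF8);
     ItemExpr *newPred     = new (heap) Regexp(utf8Value, utf8Pattern);
     newPred = newPred->bindNode(bindWA);   // newly created nodes must themselves be bound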

The set of code changes made here include:

  1. sql/optimizer/BindItemExpr.cpp: In Regexp::applyBeginEndKeys, add logic to check for the UCS2 character set. If it is specified, transform the REGEXP tree to one on UTF8 operands as described above.
  2. sql/bin/SqlciErrors.txt: Remove our stub error code, 9980.

Normalizer Changes

The previous set of changes gives us a fully functional implementation of REGEXP. The remaining changes add certain optimizations, inspired by those that Trafodion already implements for LIKE.

The Normalizer pass of the compiler performs heuristic optimizations, that is, optimizations that are thought to be always good. One of these is the left join to inner join transformation. If a query has a predicate on a null-instantiated value that eliminates rows having a null for that value, then the left join query is equivalent to an inner join query. One can simply change the left join to an inner join. This has some run-time benefits: We can eliminate logic dealing with null-instantiation (a path length reduction), and we can enable the Optimizer to consider a richer set of join orders (for example, one can commute the operands of an inner join but not a left join).

The Like class already has a virtual method, Like::predicateEliminatesNullAugmentedRows, that enables this transformation for LIKE. Fortunately for us, the same code works for REGEXP. So, we can simply move this method up into the PatternMatchingFunction class. The specific code changes:

  1. sql/optimizer/ItemFunc.h: Move the method declaration for predicateEliminatesNullAugmentedRows up from the Like class into the PatternMatchingFunction class.
  2. sql/optimizer/NormItemExpr.cpp: Rename Like::predicateEliminatesNullAugmentedRows to PatternMatchingFunction::predicateEliminatesNullAugmentedRows.

To unit test this change, we can use the showplan utility to see the plan change. For example, before this change we will see a left join in the generated plan for the following query:

>>select * from temp1 left join temp2 on temp1.a = temp2.b
+> where temp2.b regexp 'a*[ ]*';
 
A     B                     A     B                   
----  --------------------  ----  --------------------
 
aaaa  aaaa                  aaaa  aaaa                
 
--- 1 row(s) selected.
>>showplan select * from temp1 left join temp2 on temp1.a = temp2.b
+> where temp2.b regexp 'a*[ ]*';
 
... a bunch of output, but ultimately you'll see a join node...
 
Contents of EX_HASHJ [6]:
-------------------------
 
For ComTdb :
Class Version = 1, Class Size = 528
InitialQueueSizeDown = 4, InitialQueueSizeUp = 4
queueResizeLimit = 9, queueResizeFactor = 4
queueSizeDown = 2048, queueSizeUp = 2621, numBuffers = 1, bufferSize = 262144
estimatedRowUsed = 100, estimatedRowsAccessed = 0, expressionMode = 0
Flag = 0000000000101001
 
For ComTdbHashj :
hjFlags = 0000000000000010, isSemiJoin = 0, isLeftJoin = 2, isRightJoin = 0
isAntiSemiJoin = 0, isUniqueHashJoin = 0, isNoOverflow = 0, isReuse = 0
bufferedWrites = 0, logDiagnostics = 0, hashBufferSize = 262144
memoryQuotaMB = 1200, numClusters = 4
 
...a bunch more output...

After the change, we see an inner join:

 

Contents of EX_HASHJ [6]:
-------------------------
 
For ComTdb :
Class Version = 1, Class Size = 528
InitialQueueSizeDown = 4, InitialQueueSizeUp = 4
queueResizeLimit = 9, queueResizeFactor = 4
queueSizeDown = 2048, queueSizeUp = 3542, numBuffers = 1, bufferSize = 262144
estimatedRowUsed = 100, estimatedRowsAccessed = 0, expressionMode = 0
Flag = 0000000000101001
 
For ComTdbHashj :
hjFlags = 0001000000000000, isSemiJoin = 0, isLeftJoin = 0, isRightJoin = 0
isAntiSemiJoin = 0, isUniqueHashJoin = 1, isNoOverflow = 0, isReuse = 0
bufferedWrites = 0, logDiagnostics = 0, hashBufferSize = 262144
memoryQuotaMB = 1200, numClusters = 4

Optimizer Changes

As of this writing, we have not attempted these changes yet. There are two changes that should be considered.

  1. We should consider adding logic to Regexp::applyBeginEndKeys to check for a constant regular expression pattern, and if one is found, do a light parse of it to see if it begins with a single string of characters (as opposed, say, to a *-operator expression). If so, we can introduce comparison predicates which are more efficient. This is similar to the LIKE logic. Information about characters in the pattern could be stored in the Regexp object for Regexp::defaultSel below.
  2. We should consider implementing a Regexp::defaultSel method. This method would return 1.0 if comparison predicates were introduced in Regexp::applyBeginEndKeys. Otherwise, it would check for places in the pattern where there are single characters (as computed by Regexp::applyBeginEndKeys), and compute a selectivity from that, similar to the LIKE logic. If there is no constant pattern, then the default selectivity from ItemExpr::defaultSel could be returned. A sketch of such a method follows below.
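Here is a sketch of what such a method might look like; the member names are placeholders, and the actual selectivity computation in the LIKE logic is more involved.

     // Hypothetical sketch of Regexp::defaultSel; member names are illustrative.
     double Regexp::defaultSel()
     {
       if (comparisonPredsAdded_)         // applyBeginEndKeys already introduced comparison predicates
         return 1.0;                      // let those predicates carry the selectivity

       if (patternIsConstant_)
         // Derive a rough selectivity from the literal characters found in the
         // pattern, analogous to what the LIKE logic does.
         return selectivityFromPatternLiterals_;

       return ItemExpr::defaultSel();     // no constant pattern: fall back to the generic default
     }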

Run-time Optimization

As of this writing, we have not attempted these changes yet.

We mentioned earlier that our implementation is inefficient: For each row, we build a DFA, even when the regular expression is constant in the query. Almost always, the regular expression will be constant (at least this is the common usage pattern with LIKE), so it pays to consider constructing the DFA just once (per unit of parallelism) and re-using it. This should result in substantial path length reduction for the REGEXP predicate.

To build the DFA just once requires that we create the RE2 object on the heap instead of as a stack variable. This forces us to consider the peculiarities of Executor memory management. The RE2 class and the objects it creates use the usual C++ new and delete, and it would be a major effort to change them to use Executor memory management. Instead, we propose a new scheme for cleaning up these objects; a sketch of the classes involved follows the list below:

  1. Create an abstract base class, Cleaner, which is responsible for cleaning up some Executor resource. The base class can link Cleaner objects into a list. Perhaps the object inherits from NABasicObject, so we don't have to worry about cleaning it up. The class would have a virtual method, cleanup(), which cleans up the objects that Cleaner is responsible for.
  2. Create a leaf class, RE2Cleaner, which encapsulates a delete of an RE2 object. It would have one data member: A pointer to a pointer to an RE2 object. Its cleanup method deletes the RE2 object, and zeroes out the pointer to the RE2 object.
  3. Add a member to the ExRegexpClauseChar class to point to the RE2 object. The fixup method either initializes this member to null, or creates an RE2 object and stores a pointer to it there. In the latter case, it also creates an RE2Cleaner object and passes it back to the Executor globals.
  4. In those places in the Executor where we delete a plan fragment by deleting its heap, we would walk the list of Cleaner objects in the Executor globals for that fragment, executing their cleanup methods.
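A minimal sketch of the proposed classes, following the description above (the integration with the Executor globals and fragment teardown is only hinted at in comments):

     // Abstract base class: each Cleaner owns the job of releasing one resource
     // that lives outside the fragment's heap. Cleaners are linked into a
     // per-fragment list kept in the Executor globals.
     class Cleaner : public NABasicObject
     {
     public:
       Cleaner() : next_(NULL) {}
       virtual ~Cleaner() {}
       virtual void cleanup() = 0;
       Cleaner *next_;
     };

     // Leaf class: encapsulates the delete of one RE2 object.
     class RE2Cleaner : public Cleaner
     {
     public:
       RE2Cleaner(RE2 **re2) : re2_(re2) {}
       virtual void cleanup()
       {
         delete *re2_;      // RE2 uses ordinary C++ new/delete
         *re2_ = NULL;      // zero the clause's pointer so it is not reused
       }
     private:
       RE2 **re2_;          // points at ExRegexpClauseChar's RE2 pointer member
     };

     // At fragment teardown, before the fragment's heap is destroyed, the Executor
     // would walk the fragment's list of Cleaners and call cleanup() on each.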

Wrong Turns

When we developed this exercise, we made some wrong design choices along the way. As we learned more, we backtracked and refactored. We thought these were instructive, so here's a list of the wrong turns we made:

  1. Initially, we implemented the ANSI SIMILAR syntax in the compiler. Only when we got to the run-time implementation did we realize that writing our own custom DFA generator would be a lot of work. That's when we found RE2, which led us to examine the externals more critically. Finding that the industry had trended toward POSIX-style regular expressions, we decided RE2 was the way to go. That resulted in a day of refactoring: renaming Similar to Regexp, and simplifying logic to remove handling of the ESCAPE clause.
  2. Then we completed run-time support for single-byte character sets. We created a run-time class hierarchy imitating LIKE, with a single-byte leaf class and a double-byte leaf class. We stubbed the double-byte support and merrily got the single-byte support working. When we got around to doing the double-byte support, we discovered that since RE2 didn't support UCS2 encoding, we needed to translate strings to UTF8 first. With a little thought, we realized that it was better to compile that translation into the query plan rather than do it in the run-time. It was a rather simple change, actually; less than 15 lines of code. The refactoring that resulted here was to delete the double-byte run-time leaf class.