Versions Compared

Key

  • This line was added.
  • This line was removed.
  • Formatting was changed.

...

This particular enhancement is to the Trafodion Compiler and Executor, so design illustrations are limited to those components.

 

General Approach

Aside from the fact that pattern matching is a useful feature in its own right, we have two goals for this particular enhancement:

...

  • We created a Launchpad blueprint to track the feature.
  • We took a first look at externals, observing how other products support regular expression matching, and how the ANSI standard supports it. It turns out that the ANSI standard is not followed in the industry; we chose a syntax that is more like the industry.
  • We looked for analogies in the existing Trafodion code that are similar to pattern matching. Fortunately, there is one that is very much like it: the LIKE predicate.
  • We researched how LIKE is implemented at a high level. That is, we studied the classes that implement LIKE, and where they are invoked in Compiler and Executor processing. As a first cut, we imagined implementing regular expressions similarly. We also imagined breaking up our implementation by passes, for example one change set for parser changes, another for binder changes and so on.
  • We read the LIKE code in more detail. This led to knowledge of various subtleties that made us study regular expressions externals more closely, to understand how they matched and how they differed. It also led us to understand that quite a bit of the LIKE implementation also applied to regular expressions, which made us consider refactoring the classes that implement LIKE. Finally, it opened up to us additional ways to break up the changes into smaller pieces: For example, certain compile time optimizations could be deferred to a separate set of changes.
  • We documented the regular expression pattern matching externals.
  • We began making small sets of code changes. With each code change, we unit tested, and ran regression tests. If we were actively developing this on the Master branch, we would have contributed each small set for review.

Launchpad Blueprint

The Launchpad Blueprint under which this work is tracked is at https://blueprints.launchpad.net/trafodion/+spec/regular-expression-predicate.

Externals

To decide on externals we looked at several possibilities.

...

The reader who is more interested in Compiler and Executor design rather than the REGEXP predicate itself may skip this section and go directly to Internals.

Syntax

The syntax consists of two parts: the REGEXP predicate itself and the regular expression used in pattern matching.

The REGEXP Predicate

Here is the syntax for the REGEXP predicate:

...

ANSI SQL supports a notion of collating sequences. One could imagine using a case-insensitive collating sequence with REGEXP, getting case-insensitive pattern matching as a result. However, collating sequence support in general turns out to be complicated. To limit the scope of this work, we'll restrict ourselves to the default collating sequence, and will raise a compile time error if a non-default collating sequence is specified or implied.

The Regular Expression

We will use the RE2 package unaltered. The syntax of regular expressions supported by RE2 is available on Google's re2 syntax page.

...

*** ERROR[8442] Invalid regular expression: no argument for repetition operator: *

Run Time Semantics

If either operand to REGEXP is a null value, the REGEXP predicate will return null. This is similar to the LIKE predicate behavior.

Assuming both operands are not null, and the regular expression specified is valid, the predicate will return true if the value specified matches the regular expression in its entirety, and false otherwise.

Internals

In this section, we give an overview of Compiler and Executor internals, and describe how we implement the REGEXP predicate within these internals.

Query Compilation and Execution Flow

A query in Trafodion passes through several phases:

  1. Compilation
  2. Fix-up
  3. Execution
  4. Awaiting clean-up or reuse

Compilation

During compilation, a query is transformed from query text into various intermediate trees then finally into a query plan. As the query is parsed, it is represented as a syntax tree. This tree is then bound to various metadata, then transformed into a tree suitable for optimization. Optimization explores a query plan space by manipulating this tree further, making partial copies of the tree for each explored plan. Finally, the chosen plan is generated. During compilation one set of C++ classes represents the nodes in the query tree. At code generation, the query plan is transformed into a different representation in a different set of C++ classes. We discuss some of these below.

Fix-up

As mentioned, Trafodion query plans consist of C++ objects generated by the Trafodion Compiler. These objects are organized into fragments, each fragment representing a piece of the query plan that runs in a particular process. Since query plans run in several processes, Trafodion must serialize and deserialize these objects so they can be communicated across process boundaries.

At query execution time, the initial phase of processing is called fix-up. During fix-up, the Trafodion Executor acquires any processes it needs and downloads the query plan fragments to each. Once downloaded, the fragment is deserialized, and additional run-time objects are created. Any files that need to be opened are opened.

Execution

Execution begins by passing an input row consisting of input parameters to the root node of the query. This row then flows down the tree of relational operators with appropriate transformations along the way. Eventually the inputs reach the leaf nodes of the query. For a SELECT these would be scan nodes. Those nodes in turn read rows that are passed up the tree. Each operator informs its parent operator when it has returned all rows satisfying a given input.

A query ends execution in one of three ways. It may end because all the output rows have been processed. It may end because an error was detected during execution. It may end early because the client has closed the output cursor before consuming all the output rows, or because the query has been cancelled. Since operators execute asynchronously and communicate via queues, error conditions or early close/cancel will cause a cancellation indication to be passed down all or part of the query tree. Eventually, though, all operators quiesce.

Awaiting Clean-up or Reuse

Queries can be explicitly cleaned up (for example, by reusing the statement name when preparing a query). If they are not, the query plan remains available in fixed-up state on the chance that the query is re-executed.

ExprNode, RelExpr and ItemExpr Class Hierarchies

Within the Compiler, all SQL expressions are represented by an object in the ExprNode class hierarchy. These can be subdivided into relational expressions (having a value of a relation) and item expressions (having a scalar or row value). This subdivision in turn leads to two sub-trees of the ExprNode hierarchy, the RelExpr and ItemExpr trees respectively.

...

Quite often in these rewrites, the new sub-tree will contain pieces of the old parse tree. Binding, normalizing, etc. are recursive processes. This means that sometimes an old parse node will be revisited (say, for binding). For this reason, you will often see logic in these methods that makes them idempotent; for example, the bindNode methods usually check to see if the node has already been bound before proceeding.

Run-time Class Hierarchies

The ExprNode class hierarchy is compile-time only. The compiler translates the query tree into a query plan. The main class hierarchies in the query plan are the ex_tcb hierarchy (task control blocks), and the ex_clause hierarchy (scalar expression operators). There is a rough correspondence between RelExpr classes and ex_tcb classes, and also between ItemExpr classes and ex_clause classes.

Unlike RelExpr and ItemExpr, however, ex_tcb and ex_clause do not share a unique base class. (They do both ultimately inherit from NABasicObject, but other hierarchies also inherit from NABasicObject.) The reason is that the query plan implementation of relational operators is quite different from scalar operators. Relational operators implement a data flow architecture, and represent tasks that are executed asynchronously at run-time. Relational operators communicate with each other via queues. Scalar expression operators on the other hand are invoked in a single-threaded manner by an expression evaluation engine. Relational operators are arranged in trees in general, while scalar operators are arranged as sequences within expressions.

ex_tcb, ex_tdb Class Hierarchies

The ex_tcb class hierarchy is used to represent compile-time state of relational operators. The query plan is represented as a tree of ex_tcb objects. Each node in the plan is some leaf in this class hierarchy which represents the semantics of a particular operator. At fix-up time, Trafodion creates a parallel hierarchy of ex_tdb objects which contain the run-time state of particular operators. For most operators, there is a one-to-one correspondence between an ex_tcb object and an ex_tdb object. Exceptions are operators that deal with parallelized inter-process communication. For example, the "split bottom" operator combines n partitions of a stream of data from lower level fragments into one stream in the current, parent fragment. There is one ex_tdb object for each of the partitions, but one ex_tcb object for the operator as a whole.

ex_clause Class Hierarchy

The ex_clause class is the base class for all individual scalar operations. A scalar expression consists of a sequence of objects from the ex_clause hierarchy. Both compile time and run-time state may be stored within an ex_clause object.

...

There does not seem to be any advantage in using ex_like_clause_base as a base class for REGEXP also, nor does there seem to be a useful run-time abstraction common to LIKE and REGEXP beyond that already represented by ex_clause. Too, we decided not to do a run-time implementation for double-byte character set values; instead we decided to translate these values to UTF8 in the query plan. We will create a class hierarchy ExRegexpClauseBase that parallels the ex_like_clause_base class, but lacking a double-byte leaf class.

Memory Management

The Compiler and Executor use an optimized memory management scheme. Both create many objects per SQL statement. We sometimes avoid the overhead of calling the destructors on most of these objects by placing them in customized heaps, which we simply throw away en masse.

...

For another example, if we have a class that does not inherit from NABasicObject, we must be very careful about using such a class from an Executor class to avoid memory and other resource leaks. In our example, we are using a class RE2 from the RE2 package. If we create this object as a stack variable, we are assured it will be destroyed when it goes out of scope. On the other hand, if we wish to create this object on the C++ heap, we must track its creation outside the plan fragment, and have the agent that destroys the plan fragment's heap destroy this object also.

Error Messages

Our implementation will result in some new error checks. Also, in doing the work incrementally from the parser forward, we will want to issue a stub error message at the point where our changes stop. So, here is some general information about error message generation and handling in the Compiler and Executor.

The Compiler and Executor both use class ComDiagsArea to manage error information. The execution model in both is that when an error occurs, information about the error is inserted into ComDiagsArea, and the calling method typically returns. Later processing often checks to see if any errors have been raised before proceeding.

Creating an Error

To create an error, one uses the << operator to insert a stream of DgBase objects into a ComDiagsArea object. DgBase is a class hierarchy; the leaf classes define different tokens associated with the error. The first DgBase object in the stream is a DgSqlCode object which gives the (integer) SQL error code associated with the error. The remaining DgBase objects vary depending on what SQL error is raised. For example, an error complaining that a column does not exist in a given table would have a DgColumnName object naming the column, and a DgTableName object naming the table. Below is example code creating an error taken from sql/optimizer/BindItemExpr.cpp:

...

The ComDiagsArea class definition and the declaration of the << operator are in file sql/export/ComDiagsArea.h. One interesting thing to note about ComDiagsArea is that it inherits from IpcMessageObj. This is because we sometimes pass error information across process boundaries. The DgBase class hierarchy is defined in file sql/common/DgBaseType.h.

Defining the Text for an Error

SQL errors are usually presented to end users in text form. The code that translates error diagnostics to text form uses a template file found in sql/bin/SqlciErrors.txt. There is a line in this file for each SQL error code. The format of each line is:

...

  • 1000-1999 are reserved for DDL errors.
  • 2000-2999 are reserved for Compiler main errors.
  • 3000-3999 are reserved for Parser errors.
  • 4000-4999 are reserved for Binder errors.
  • 5000-5999 are reserved for Normalizer errors.
  • 6000-6999 are reserved for Optimizer errors.
  • 7000-7999 are reserved for Generator errors.
  • 8000-8999 are reserved for Executor errors.
  • 9200-9250 are reserved for the UPDATE STATISTICS utilities.
  • 10000-10199 are reserved for Sort errors.
  • 11000-11099 are reserved for run-time Triggers errors.
  • 11100-11199 are reserved for UDR server errors.
  • 11200-11299 are reserved for Language Manager errors.
  • 12000-12499 are reserved for Materialized View errors.
  • 13000-13999 are reserved for SQL Preprocessor errors.
  • 15000-15099 are reserved for Sqlci errors.
  • 15200-15999 are reserved for Report Writer errors.
  • 19000 and up are used by various utilities.

REGEXP Run Time Internals

We use the open source RE2 package for run-time semantics. The eval method of our new ExRegexpClauseChar class will create an RE2 object on the stack, pass the regular expression and the value to be matched to it, and handle the results passed back.

RE2 does not support UCS2. Rather than adding such support, it seems simpler to convert UCS2 strings in the query plan to UTF8 strings instead, and use the UTF8 support in RE2.

The Code Changes

Here we list the code changes to implement REGEXP.

We actually didn't develop them quite this way. We made some wrong turns along the way as far as design choices, and refactored our work as a result. The interested reader can jump to Wrong Turns to see what we really did.

Syntax Plus Stub

The first change we did was to add the REGEXP predicate syntax to the parser. In this step we have not created classes to represent REGEXP yet. So we added stub logic to the parser actions to raise an error if REGEXP is specified. Specific changes:

...

>>select * from temp where b regexp '%xyz%';
 
*** ERROR[9980] The REGEXP predicate is not yet implemented.
 
*** ERROR[8822] The statement was not prepared.

Initial Refactoring of the Like Class

Study of the Like methods revealed that much of the code there also applies to Regexp. Rather than create Regexp as a peer class and clone the code, we've chosen to refactor the Like class into a parent class, PatternMatchingFunction with a leaf class Like. In the next set of changes, we'll add Regexp as a second leaf class. The methods and data members that look useful for both LIKE and REGEXP are pushed up into the PatternMatchingFunction class, while methods that seem LIKE-specific will remain in the Like class. Specific changes:

...

We ran the full regression suite after this set of changes to make sure our refactoring didn't break LIKE functionality.

Regexp Class and its Binder Methods

With this set of changes, we add support for static semantic checks for the REGEXP predicate. That is, we do type checking of its operands. It turns out to be convenient to take care of the query caching logic at this time as well. Our stub to raise an error when REGEXP is specified is moved out of the parser and into the code generator.

...

>>create table test1(a integer not null, b date not null) no partition;
 
--- SQL operation complete.
>>select * from test1 where a regexp '[x]*';
 
*** ERROR[4041] Type INTEGER cannot be compared with type CHAR(4).
 
*** ERROR[4050] The operands of the REGEXP predicate must be comparable character data types (that is, of the 
same character set and collation).
 
*** ERROR[8822] The statement was not prepared.
                       
>>select * from test1 where b regexp '[x]*';
 
*** ERROR[4041] Type DATE cannot be compared with type CHAR(4).
 
*** ERROR[4050] The operands of the REGEXP predicate must be comparable character data types (that is, of the 
same character set and collation).
 
*** ERROR[8822] The statement was not prepared.
>>

Code Generation, Stubbed Run-Time Evaluation

The next set of changes will define run-time clause classes for the REGEXP predicate, though we will stub their evaluation routines so that any attempt to execute a REGEXP predicate will result in our friendly -9980 stub error.

...

With these changes, a query using the REGEXP predicate can be compiled. The query will even execute successfully against an empty table or a table containing only null values in the expressions referenced by REGEXP.

Run-Time Changes for Single-Byte Character Sets

When considering how to implement the run-time semantics, we briefly considered implementing our own regular expression engine. For the basic regular expression operators, +, *, ? and |, this is straight-forward. One can find, for example, an algorithm to do this in Aho, Sethi and Ullman's book, "Compilers", on p. 135-141. To make this industrial-strength, though, one needs to add support for wild cards and character classes. A regular expression engine must parse the pattern, and implement matching semantics. Fortunately, there is an open source engine available, RE2, with an appropriate license. This engine is DFA-based (like the Aho, Sethi, Ullman algorithm), and has a small memory footprint. It implements a very rich set of regular expression support, close to the Posix standard. We chose to use this package for our run-time semantics.

...

  1. sql/exp/exp_like.cpp: Added include for re2/re2.h. Added ExRegexpClauseChar::eval implementation. The code changes are much simpler than ex_like_clause_char::eval, because we use the RE2 engine. The code changes consist of code to obtain the addresses and lengths of the operands (cloned from ex_like_clause_char::eval), then a call to the RE2 engine. One detail is that ExRegexpClauseChar::eval handles both the ISO88591 and UTF8 character sets. Fortunately, RE2 does also. But we have to pass in an option to RE2 to indicate what encoding we are passing. The engine may return an error if the regular expression pattern is syntactically invalid; we have logic to report such errors. Our implementation is somewhat inefficient, actually: We build the DFA each time ExRegexpClauseChar::eval is called. This is unnecessary for most use, however; often the regular expression is supplied as a constant string or an input parameter to the SQL statement. It would be better if we constructed the DFA once, and reused it for subsequent pattern matching. We leave that refinement to a later change. This implementation does avoid the memory management complications described earlier, since the DFA is built by an RE2 object that is declared as a stack variable.
  2. sql/nskgmake/Makerules.linux: Added logic to copy RE2 library libre2.so to appropriate place. Added -I, -l and -L flags so header file and library can be accessed on compiles and links.
  3. sql/bin/SqlciErrors.txt: Added an error message to be issued when an invalid regular expression is specified. The error message includes as its String0 parameter the text message returned from RE2.
  4. exp/ExpErrorEnums.h: Added a literal for the error message number.

Double-Byte Character Set Support

Trafodion supports UCS2 (an encoding of Unicode) as its only double-byte character set. This set of changes adds the run-time semantics for UCS2. Since this is the final bit of code required to have a fully-functional implementation, we also remove the stub error message we added in our very first step.

...

  1. sql/optimizer/BindItemExpr: In Regexp::applyBeginEndKeys, add logic to check for UCS2 character set. If specified, transform the REGEXP tree to one on UTF8 operands as described above.
  2. sql/bin/SqlciErrors.txt: Remove our stub error code, 9980.

Normalizer Changes

The previous set of changes gives us a fully functional implementation of REGEXP. The remaining changes add certain optimizations, inspired by those that Trafodion already implements for LIKE.

...

Contents of EX_HASHJ [6]:
-------------------------
 
For ComTdb :
Class Version = 1, Class Size = 528
InitialQueueSizeDown = 4, InitialQueueSizeUp = 4
queueResizeLimit = 9, queueResizeFactor = 4
queueSizeDown = 2048, queueSizeUp = 3542, numBuffers = 1, bufferSize = 262144
estimatedRowUsed = 100, estimatedRowsAccessed = 0, expressionMode = 0
Flag = 0000000000101001
 
For ComTdbHashj :
hjFlags = 0001000000000000, isSemiJoin = 0, isLeftJoin = 0, isRightJoin = 0
isAntiSemiJoin = 0, isUniqueHashJoin = 1, isNoOverflow = 0, isReuse = 0
bufferedWrites = 0, logDiagnostics = 0, hashBufferSize = 262144
memoryQuotaMB = 1200, numClusters = 4

Optimizer Changes

As of this writing, we have not attempted these changes yet. There are two changes that should be considered.

  1. We should consider adding logic to Regexp::applyBeginEndKeys to check for a constant regular expression pattern, and if one is found, do a light parse of it to see if it begins with a single string of characters (as opposed, say, to a *-operator expression). If so, we can introduce comparison predicates which are more efficient. This is similar to the LIKE logic. Information about characters in the pattern could be stored in the Regexp object for Regexp::defaultSel below.
  2. We should consider implementing a Regexp::defaultSel method. This method would return 1.0 if comparison predicates were introduced in Regexp::applyBeginEndKeys. Otherwise, it would check for places in the pattern where there are single characters (as computed by Regexp::applyBeginKeys), and compute a selectivity from that, similar to the LIKE logic. If there is no constant pattern, then the default selectivity from ItemExpr::defaultSel could be returned.

Run-time Optimization

As of this writing, we have not attempted these changes yet.

...

  1. Create an abstract base class, Cleaner, which is responsible for cleaning up some Executor resource. The base class can link Cleaner objects into a list. Perhaps the object inherits from NABasicObject, so we don't have to worry about cleaning it up. The class would have a virtual method, cleanup(), which cleans up the objects that Cleaner is responsible for.
  2. Create a leaf class, RE2Cleaner, which encapsulates a delete of an RE2 object. It would have one data member: A pointer to a pointer to an RE2 object. Its cleanup method deletes the RE2 object, and zeroes out the pointer to the RE2 object.
  3. Add a member to the ExRegexpClauseChar class to point to the RE2 object. The fixup method initializes it to either null or creates an RE2 object and stores a pointer there. In the latter case, it creates an RE2Cleaner object and passes it back to the Executor globals.
  4. In those places in the Executor where we delete a plan fragment by deleting its heap, we would walk the list of Cleaner objects in the Executor globals for that fragment, executing their cleanup methods.

Wrong Turns

When we developed this exercise, we made some wrong design choices along the way. As we learned more, we backtracked and refactored. We thought these were instructive, so here's a list of the wrong turns we made:

...