...
- Pushing filters down into Hive's builtin storage formats such as RCFile
- Pushing filters down into storage handlers such as the HBase handler (http://issues.apache.org/jira/browse/HIVE-1226)
- Pushing filters down into index access plans once an indexing framework is added to Hive (http://issues.apache.org/jira/browse/HIVE-417)
Components Involved
There are a number of different parts to the overall effort.
- Propagating the result of Hive's existing predicate pushdown. Hive's optimizer already takes care of the hard work of pushing predicates down through the query plan (controlled via configuration parameter hive.optimize.ppd=true/false). The "last mile" remaining is to send the table-level filters down into the corresponding input formats.
- Selection of a primary filter representation to be passed to input formats. This representation needs to be neutral (independent of the access plans which will use it) and loosely coupled with Hive (so that storage handlers can choose to minimize their dependencies on Hive internals).
- Helper classes for interpreting the primary representation. Many access plans will need to analyze filters in a similar fashion, e.g. decomposing conjunctions and detecting supported column comparison patterns. Hive should provide sharable utilities for such cases so that they don't need to be duplicated in each access method's code.
- Converting filters into a form specific to the access method. This part is dependent on the particular access method; e.g. for HBase, it involves converting the filter condition into corresponding calls to set up an HBase scan object.
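To make the helper-class idea above concrete, here is a minimal, self-contained sketch (not Hive code; all class and method names are hypothetical) of the kind of shared utility the third bullet describes: flattening a conjunction and partitioning its conjuncts into those a given access method supports and those it does not.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Predicate;

// Hypothetical helper sketch: decompose an AND filter into the conjuncts an
// access method can evaluate (to be pushed down) and the rest (left to Hive).
public class ConjunctionSplitter {
    // One conjunct of an already-flattened AND filter, e.g. "x > 3".
    // The isSimpleComparison flag stands in for real pattern detection.
    public record Conjunct(String text, boolean isSimpleComparison) {}

    // Partition conjuncts by whether the access method supports them.
    // Returns [pushed, residual].
    public static List<List<Conjunct>> split(List<Conjunct> conjuncts,
                                             Predicate<Conjunct> supported) {
        List<Conjunct> pushed = new ArrayList<>();
        List<Conjunct> residual = new ArrayList<>();
        for (Conjunct c : conjuncts) {
            (supported.test(c) ? pushed : residual).add(c);
        }
        return List.of(pushed, residual);
    }

    public static void main(String[] args) {
        List<Conjunct> filter = List.of(
            new Conjunct("x > 3", true),
            new Conjunct("upper(y) = 'XYZ'", false));
        List<List<Conjunct>> parts = split(filter, Conjunct::isSimpleComparison);
        System.out.println("pushed:   " + parts.get(0));
        System.out.println("residual: " + parts.get(1));
    }
}
```

A shared utility of this shape keeps each access method from re-implementing conjunction flattening and comparison detection; only the `supported` predicate varies per storage handler.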
Primary Filter Representation
To achieve the loosest possible coupling, we are going to use a string as the primary representation for the filter. In particular, the string will be in the form produced when Hive unparses an ExprNodeDesc, e.g.
...
Suppose a storage handler is capable of implementing the range scan for {{x > 3}}, but does not have a facility for evaluating {{upper(y) = 'XYZ'}}. In this case, the optimal plan would involve decomposing the filter, pushing just the first part down into the storage handler, and leaving only the remainder for Hive to evaluate via its own executor.
In order for this to be possible, the storage handler needs to be able
to negotiate the decomposition with Hive. This means that Hive gives
the storage handler the entire filter, and the storage handler passes
back a "residual": the portion that needs to be evaluated by Hive. A null residual indicates that the storage handler was able to deal with the entire
filter on its own (in which case no FilterOperator
is needed).
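The negotiation contract can be sketched as follows. This is an illustrative, self-contained toy (the interface, class names, and string-splitting logic are all hypothetical, with a plain string standing in for the unparsed ExprNodeDesc): Hive hands the whole filter to the handler, and the handler returns the residual it cannot evaluate, or null if it consumed everything.

```java
// Hypothetical sketch of residual negotiation between Hive and a storage
// handler. Names and logic are illustrative, not actual Hive APIs.
public class ResidualNegotiation {
    interface PredicateHandler {
        // Returns the residual filter Hive must still evaluate via its own
        // executor, or null if the handler can evaluate the entire filter
        // (in which case no FilterOperator is needed).
        String decomposePredicate(String filter);
    }

    // A toy handler that can push down simple range comparisons on column x
    // but nothing else, mirroring the x > 3 / upper(y) = 'XYZ' example above.
    static class RangeOnlyHandler implements PredicateHandler {
        public String decomposePredicate(String filter) {
            // Split a top-level AND; any conjunct that is not a range
            // comparison on x becomes part of the residual.
            StringBuilder residual = new StringBuilder();
            for (String conjunct : filter.split(" AND ")) {
                if (!conjunct.trim().startsWith("x >")) {
                    if (residual.length() > 0) residual.append(" AND ");
                    residual.append(conjunct.trim());
                }
            }
            return residual.length() == 0 ? null : residual.toString();
        }
    }

    public static void main(String[] args) {
        PredicateHandler handler = new RangeOnlyHandler();
        // Fully pushed: null residual, so Hive plans no FilterOperator.
        System.out.println(handler.decomposePredicate("x > 3"));
        // Partially pushed: Hive keeps the residual for its own executor.
        System.out.println(handler.decomposePredicate(
            "x > 3 AND upper(y) = 'XYZ'"));
    }
}
```

The key design point is that the handler never mutates Hive's plan directly; it only reports what it could not handle, and Hive decides whether a FilterOperator is still required.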
...