...

Components Involved

There are a number of different parts to the overall effort.

  1. Propagating the result of Hive's existing predicate pushdown. Hive's optimizer already takes care of the hard work of pushing predicates down through the query plan (controlled via configuration parameter hive.optimize.ppd=true/false). The "last mile" remaining is to send the table-level filters down into the corresponding input formats.
  2. Selection of a primary filter representation to be passed to input formats. This representation needs to be neutral (independent of the access plans which will use it) and loosely coupled with Hive (so that storage handlers can choose to minimize their dependencies on Hive internals).
  3. Helper classes for interpreting the primary representation. Many access plans will need to analyze filters in a similar fashion, e.g. decomposing conjunctions and detecting supported column comparison patterns. Hive should provide sharable utilities for such cases so that they don't need to be duplicated in each access method's code.
  4. Converting filters into a form specific to the access method. This part is dependent on the particular access method; e.g. for HBase, it involves converting the filter condition into corresponding calls to set up an HBase scan object.
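The decomposition work described in item 3 can be sketched as a small shared utility. This is an illustrative example only, not an actual Hive API: it models filter clauses as strings and lets a storage handler supply a capability test that splits the top-level conjunction into a pushable part and a residual part.

```java
// Hypothetical sketch of a shared helper for item 3: splitting an ANDed
// filter into clauses the storage handler supports and a residual for
// Hive. Class and method names are illustrative, not Hive internals.
import java.util.ArrayList;
import java.util.List;
import java.util.function.Predicate;

public class ConjunctionSplitter {
    // Partitions the clauses of a conjunction according to the storage
    // handler's capability test; returns [pushed, residual].
    public static List<List<String>> split(List<String> clauses,
                                           Predicate<String> supported) {
        List<String> pushed = new ArrayList<>();
        List<String> residual = new ArrayList<>();
        for (String clause : clauses) {
            if (supported.test(clause)) {
                pushed.add(clause);
            } else {
                residual.add(clause);
            }
        }
        return List.of(pushed, residual);
    }

    public static void main(String[] args) {
        // e.g. filter (x > 3) AND (upper(y) = 'XYZ'), where the handler
        // only supports plain column comparisons (toy capability test)
        List<List<String>> parts = split(
            List.of("x > 3", "upper(y) = 'XYZ'"),
            clause -> !clause.contains("("));
        System.out.println("pushed:   " + parts.get(0));
        System.out.println("residual: " + parts.get(1));
    }
}
```

Each access method would supply its own capability test; only the classification logic is shared, keeping it out of each storage handler's code.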

Primary Filter Representation

To achieve the loosest possible coupling, we are going to use a string
as the primary representation for the filter. In particular, the string
will be in the form produced when Hive unparses an ExprNodeDesc, e.g.

...

Suppose a storage handler is capable of implementing the range scan
for x > 3, but does not have a facility for evaluating {{upper(y) =
'XYZ'}}. In this case, the optimal plan would involve decomposing the
filter, pushing just the first part down into the storage handler, and
leaving only the remainder for Hive to evaluate via its own executor.

In order for this to be possible, the storage handler needs to be able
to negotiate the decomposition with Hive. This means that Hive gives
the storage handler the entire filter, and the storage handler passes
back a "residual": the portion that needs to be evaluated by Hive. A null residual indicates that the storage handler was able to deal with the entire
filter on its own (in which case no FilterOperator is needed).
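The shape of this negotiation can be sketched as follows. This is a simplified, string-based model for illustration; the interface and handler names here are assumptions, and Hive's actual negotiation operates on expression trees rather than strings. The key contract is the one described above: Hive hands over the entire filter, and the handler returns the residual, with null meaning the whole filter was handled.

```java
// Minimal sketch of the Hive / storage handler negotiation, using a
// string-based filter model. Names are illustrative, not real Hive APIs.
public class PushdownNegotiation {
    interface StoragePredicateHandler {
        // Returns the residual filter Hive must still evaluate, or null
        // if the handler can evaluate the entire filter itself (in which
        // case Hive plans no FilterOperator).
        String decomposePredicate(String filter);
    }

    // Toy handler that can only push the range clause "x > 3".
    static class RangeOnlyHandler implements StoragePredicateHandler {
        public String decomposePredicate(String filter) {
            if (filter.equals("x > 3")) {
                return null;  // fully handled by the storage handler
            }
            if (filter.startsWith("x > 3 AND ")) {
                // push the range clause, hand back the rest as residual
                return filter.substring("x > 3 AND ".length());
            }
            return filter;    // nothing pushable; Hive evaluates it all
        }
    }

    public static void main(String[] args) {
        StoragePredicateHandler handler = new RangeOnlyHandler();
        System.out.println(
            handler.decomposePredicate("x > 3 AND upper(y) = 'XYZ'"));
        System.out.println(handler.decomposePredicate("x > 3"));
    }
}
```

In the first call the handler keeps the range scan and returns the {{upper(y) = 'XYZ'}} clause as the residual; in the second it returns null, so no FilterOperator is needed.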

...