Versions Compared

Key

  • This line was added.
  • This line was removed.
  • Formatting was changed.

...

This document explains how we are planning to add support in Hive's
optimizer for pushing filters down into physical access methods. This
is an important optimization for minimizing the amount of data scanned
and processed by an access method (e.g. for an indexed key lookup), as
well as reducing the amount of data passed into Hive for further query
evaluation.

Use Cases

Below are the main use cases we are targeting.

...

To achieve the loosest possible coupling, we are going to use a string
as the primary representation for the filter. In particular, the string
will be in the form produced when Hive unparses an ExprNodeDesc, e.g.

...

In general, this comes out as valid SQL, although it may not always match
the original SQL exactly, e.g.

...

Column names in this string are unqualified references to the columns
of the table over which the filter operates, as they are known in the
Hive metastore. These column names may be different from those known
to the underlying storage; for example, the HBase storage handler maps
Hive column names to HBase column names (qualified by column family).
Mapping from Hive column names is the responsibility of the
code interpreting the filter string.

...

As mentioned above, we want to avoid duplication in code which
interprets the filter string (e.g. parsing). As a first cut, we will
provide access to the ExprNodeDesc tree by passing it along in
serialized form as an optional companion to the filter string. In followups, we will provide parsing utilities for the string form.

...

The approach for passing the filter down to the input format will
follow a pattern similar to what is already in place for pushing
column projections down.

  • org.apache.hadoop.hive.serde2.ColumnProjectionUtils encapsulates the pushdown communication
  • classes such as HiveInputFormat call ColumnProjectionUtils to set the projection pushdown property (READ_COLUMN_IDS_CONF_STR) on a jobConf before instantiating a RecordReader
  • the factory method for the RecordReader calls ColumnProjectionUtils to access this property

...

So, where will HiveInputFormat get the filter expression to be
passed down? Again, we can start with the pattern for column projections:

...

For filter pushdown, the equivalent is TableScanPPD in
org.apache.hadoop.hive.ql.ppd.OpProcFactory. Currently, it calls
createFilter, which collapsed expressions into a single expression
called condn, and then sticks that on a new FilterOperator. We can
call condn.getExprString() and store the result on TableScanOperator.

...