Versions Compared

Key

  • This line was added.
  • This line was removed.
  • Formatting was changed.

...

CBO will be introduced in to Hive in three different phases. Following provides high-level overview of these phases: 

Phase 1

 Image Added

Statistics:

  • Table Cardinality
  • Column Boundary Stats: min, max, avg, number of distinct values

...

  • Introduce Operators to represent hive relational operators. Table Scan, Join, Union, Select, Filter, Group By, Distinct, Order By. These operators would implement a calling convention with physical cost for each of these operators.
  • Introduce rules to convert Joins from CommonJoin to MapJoin, MapJoin to BucketJoin, BucketJoin to SMBJoin, CommonJoin to SkewJoin.
  • Introduce rule to merge joins so that a single join operator will represent multi-way join (similar to MergedJoin in Hive).
  • Merged-Join in Hive will be translated to MultiJoinRel in Optiq.

 

Phase 2

Image Added

Statistics:

  • Histograms

 

Cost Based Optimizations:

  • Join ordering based on histograms
  • Join Algorithm – histograms are used for estimating join selectivity
  • Take advantage of additional optimizations in Optiq. The actual rules to use is TBD.

Phase 3

Image Added

Proposed Cost Model

Hive employs parallel-distributed query execution using Hadoop cluster. This implies for a given query operator tree different operators could be running on different nodes. Also same operator may be running in parallel on different nodes in the cluster, processing different partitions of the original relation. This parallel execution model induces high I/O and CPU costs. Hive query execution cost tends to be more I/O centric due to following reasons.

...