Abstract
Apache Hadoop is a framework for the distributed processing of large data sets on clusters typically composed of commodity hardware. Over the last few years, Apache Hadoop has become the de facto platform for distributed data processing on commodity hardware. Apache Hive is a popular SQL interface for data processing on Apache Hadoop.
...
CBO will be introduced into Hive in three phases. The following provides a high-level overview of these phases:
Phase 1
Statistics:
- Table Cardinality
- Column Boundary Stats: min, max, avg, number of distinct values
...
- Join ordering based on histograms
- Join Algorithm – histograms are used for estimating join selectivity
- Take advantage of additional optimizations in Optiq. The actual rules to use are TBD.
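The distinct-value counts collected in Phase 1 are what make join-selectivity estimation possible. As a minimal sketch (the classic textbook estimate, not necessarily the exact formula Hive or Optiq uses), an equi-join's selectivity can be approximated as 1 / max(NDV(left key), NDV(right key)):

```python
def join_selectivity(ndv_left: int, ndv_right: int) -> float:
    """Estimated fraction of the cross product surviving an equi-join,
    assuming uniform key distribution and containment of key domains."""
    return 1.0 / max(ndv_left, ndv_right)

def join_cardinality(rows_left: int, rows_right: int,
                     ndv_left: int, ndv_right: int) -> float:
    """Estimated output row count of an equi-join."""
    return rows_left * rows_right * join_selectivity(ndv_left, ndv_right)

# Example: a 1M-row fact table joined to a 10K-row dimension table on a
# key with 10,000 distinct values on both sides.
print(join_cardinality(1_000_000, 10_000, 10_000, 10_000))  # 1000000.0
```

Histograms refine this estimate when the key distribution is skewed, which is why the phases above move from simple boundary stats to histogram-based join ordering.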
Phase 3
Configuration
The configuration parameter hive.cbo.enable determines whether cost-based optimization is enabled.
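For example, CBO can be toggled per session like any other Hive configuration parameter. The additional statistics-related parameters shown below are not part of this page's discussion but exist in Hive and are commonly set alongside CBO, since the optimizer needs table and column statistics to produce useful plans:

```sql
-- Enable cost-based optimization for the current session
SET hive.cbo.enable=true;

-- Related statistics parameters (shown for illustration): let the
-- optimizer fetch table and column stats from the metastore
SET hive.compute.query.using.stats=true;
SET hive.stats.fetch.column.stats=true;
```

Statistics themselves are gathered with ANALYZE TABLE ... COMPUTE STATISTICS.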
Proposed Cost Model
Hive employs parallel, distributed query execution on a Hadoop cluster. This implies that, for a given query's operator tree, different operators may run on different nodes. Moreover, the same operator may run in parallel on multiple nodes, each processing a different partition of the original relation. This parallel execution model induces high I/O and CPU costs. Hive query execution cost tends to be I/O centric for the following reasons.
...