Versions Compared

Key

  • This line was added.
  • This line was removed.
  • Formatting was changed.
Comment: add table of contents & hive.cbo.enable

Table of Contents
maxLevel1

Abstract

Apache Hadoop is a framework for the distributed processing of large data sets using clusters of computers typically composed of commodity hardware. Over last few years Apache Hadoop has become the de facto platform for distributed data processing using commodity hardware. Apache Hive is a popular SQL interface for data processing using Apache Hadoop.

...

CBO will be introduced in to Hive in three different phases. Following provides high-level overview of these phases: 

Phase 1

 Image Modified

Statistics:

  • Table Cardinality
  • Column Boundary Stats: min, max, avg, number of distinct values

...

  • Join ordering based on histograms
  • Join Algorithm – histograms are used for estimating join selectivity
  • Take advantage of additional optimizations in Optiq. The actual rules to use is TBD.

Phase 3

Configuration

The configuration parameter hive.cbo.enable determines whether cost-based optimization is enabled or not.

Proposed Cost Model

Hive employs parallel-distributed query execution using Hadoop cluster. This implies for a given query operator tree different operators could be running on different nodes. Also same operator may be running in parallel on different nodes in the cluster, processing different partitions of the original relation. This parallel execution model induces high I/O and CPU costs. Hive query execution cost tends to be more I/O centric due to following reasons.

...