Abstract
Apache Hadoop is a framework for the distributed processing of large data sets on clusters typically composed of commodity hardware. Over the last few years, Apache Hadoop has become the de facto platform for distributed data processing on commodity hardware. Apache Hive is a popular SQL interface for data processing on Apache Hadoop.
...
CBO will be introduced into Hive in three phases. The following provides a high-level overview of these phases:
Phase 1
Statistics:
- Table Cardinality
- Column Boundary Stats: min, max, avg, number of distinct values
...
- Join ordering based on histograms
- Join Algorithm – histograms are used for estimating join selectivity
- Take advantage of additional optimizations in Optiq. The actual rules to use are TBD.
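The distinct-value counts collected in Phase 1 are what make join-selectivity estimation possible. As a minimal sketch (the classic textbook estimate, not necessarily the exact formula Hive or Optiq uses), an equi-join's selectivity can be approximated as 1 / max(NDV(left key), NDV(right key)):

```python
def join_selectivity(ndv_left: int, ndv_right: int) -> float:
    """Estimated fraction of the cross product surviving an equi-join,
    assuming uniform key distribution and containment of key domains."""
    return 1.0 / max(ndv_left, ndv_right)

def join_cardinality(rows_left: int, rows_right: int,
                     ndv_left: int, ndv_right: int) -> float:
    """Estimated output row count of an equi-join."""
    return rows_left * rows_right * join_selectivity(ndv_left, ndv_right)

# Example: a 1M-row fact table joined to a 10K-row dimension table on a
# key with 10,000 distinct values on both sides.
print(join_cardinality(1_000_000, 10_000, 10_000, 10_000))  # 1000000.0
```

Histograms refine this estimate when the key distribution is skewed, which is why the phases above move from simple boundary stats to histogram-based join ordering.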
Phase 3
Configuration
The configuration parameter hive.cbo.enable determines whether cost-based optimization is enabled.
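For example, CBO can be toggled per session like any other Hive configuration parameter. The additional statistics-related parameters shown below are not part of this page's discussion but exist in Hive and are commonly set alongside CBO, since the optimizer needs table and column statistics to produce useful plans:

```sql
-- Enable cost-based optimization for the current session
SET hive.cbo.enable=true;

-- Related statistics parameters (shown for illustration): let the
-- optimizer fetch table and column stats from the metastore
SET hive.compute.query.using.stats=true;
SET hive.stats.fetch.column.stats=true;
```

Statistics themselves are gathered with ANALYZE TABLE ... COMPUTE STATISTICS.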
Proposed Cost Model
Hive employs parallel, distributed query execution on a Hadoop cluster. This implies that, for a given query's operator tree, different operators may run on different nodes. Moreover, the same operator may run in parallel on multiple nodes, each processing a different partition of the original relation. This parallel execution model induces high I/O and CPU costs. Hive query execution cost tends to be I/O centric for the following reasons.
...