Versions Compared

Key

  • This line was added.
  • This line was removed.
  • Formatting was changed.
Comment: Document HIVE-1653 and the actual behavior of HIVE-1361

...

This document describes the support of statistics for Hive tables (JIRA-see HIVE-33).

Motivation

Statistics such as the number of rows of a table or partition and the histograms of a particular interesting column are important in many ways. One of the key use cases of statistics is query optimization. Statistics serve as the input to the cost functions of the optimizer so that it can compare different plans and choose among them. Statistics may sometimes meet the purpose of the users' queries. Users can quickly get the answers for some of their queries by only querying stored statistics rather than firing long-running execution plans. Some examples are getting the quantile of the users' age distribution, the top 10 apps that are used by people, and the number of distinct sessions.

...

  • Number of Rows
  • Number of files
  • Size in Bytes.
    For tables, the same statistics are supported with the addition of the number of partitions of the table.

Implementation

The way the statistics are calculated is similar for both newly created and existing tables.
For newly created tables, the job that creates a new table is a MapReduce job. During the creation, every mapper while copying the rows from the source table in the FileSink operator, gathers statistics for the rows it encounters and publishes them into a Database (possibly MySQL). At the end of the MapReduce job, published statistics are aggregated and stored in the MetaStore.
A similar process happens in the case of already existing tables, where a Map-only job is created and every mapper while processing the table in the TableScan operator, gathers statistics for the rows it encounters and the same process continues.

...

For newly created tables and/or partitions (that are populated through the INSERT OVERWRITE command), statistics are not automatically computed by default. The user has to explicitly set the boolean variable hive.stats.autogather to true false so that statistics are not automatically computed and stored into Hive MetaStore.

Code Block
set hive.stats.autogather=truefalse;

The user can also specify the implementation to be used for the storage of temporary statistics setting the variable hive.stats.dbclass. For example, to set HBase as the implementation (the default is {{jdbc:derby}}of temporary statistics storage the user should issue the following command:

...

Code Block
set hive.stats.dbclass=jdbc:derby;
set hive.stats.dbconnectionstring="jdbc:derby:;databaseName=TempStatsStore;create=true";
set hive.stats.jdbcdriver="org.apache.derby.jdbc.EmbeddedDriver";

Queries can fail to collect stats completely accurately. There is a setting hive.stats.reliable that fails queries if the stats can't be reliably collected. This is false by default.

Existing Tables

For existing tables and/or partitions, the user can issue the ANALYZE command to gather statistics and write them into Hive MetaStore. The syntax for that command is described below:

...