h1. Statistics in Hive
{toc}


This document describes the support of statistics for Hive tables (see [HIVE-33|http://issues.apache.org/jira/browse/HIVE-33]).

h2. Motivation

Statistics such as the number of rows of a table or partition and the histograms of a particular interesting column are important in many ways. One of the key use cases of statistics is query optimization. Statistics serve as the input to the cost functions of the optimizer so that it can compare different plans and choose among them. Statistics may sometimes meet the purpose of the users' queries. Users can quickly get the answers for some of their queries by only querying stored statistics rather than firing long-running execution plans. Some examples are getting the quantile of the users' age distribution, the top 10 apps that are used by people, and the number of distinct sessions.

h2. Scope

The first milestone in supporting statistics was to support table and partition level statistics. Table and partition statistics are now stored in Hive Metastore for either newly created or existing tables. The following statistics are currently supported for partitions:
 * Number of rows
 * Number of files
 * Size in bytes
For tables, the same statistics are supported with the addition of the number of partitions of the table.

Column level top K values can also be gathered along with partition level statistics. See [Top K Statistics|https://cwiki.apache.org/confluence/display/Hive/Top+K+Stats].

h2. Implementation

The way the statistics are calculated is similar for both newly created and existing tables.
For newly created tables, the job that creates the table is a MapReduce job. During creation, every mapper, while copying the rows from the source table in the FileSink operator, gathers statistics for the rows it encounters and publishes them into a database (possibly [MySQL|http://www.mysql.com/]). At the end of the MapReduce job, the published statistics are aggregated and stored in the MetaStore.
A similar process happens for already existing tables: a map-only job is created, and every mapper, while processing the table in the TableScan operator, gathers statistics for the rows it encounters, and the same aggregation process follows.

It is clear that there is a need for a database that stores temporarily gathered statistics. Currently there are two implementations: one uses [MySQL|http://www.mysql.com/] and the other uses [HBase|http://wiki.apache.org/hadoop/Hbase]. There are two pluggable interfaces, IStatsPublisher and IStatsAggregator, that the developer can implement to support any other storage. The interfaces are listed below:

{code}

package org.apache.hadoop.hive.ql.stats;

import org.apache.hadoop.conf.Configuration;

/**
 * An interface for any possible implementation for publishing statistics.
 */

public interface IStatsPublisher {

  /**
   * This method does the necessary initializations according to the implementation requirements.
   */
  public boolean init(Configuration hconf);

  /**
   * This method publishes a given statistic into a disk storage, possibly HBase or MySQL.
   *
   * rowID : a string identifying the statistic to be published and later gathered, possibly the table name + the partition specs.
   *
   * key : a string noting the key to be published. Ex: "numRows".
   *
   * value : a string noting the value of the published key.
   */
  public boolean publishStat(String rowID, String key, String value);

  /**
   * This method executes the necessary termination procedures, possibly closing all database connections.
   */
  public boolean terminate();

}

{code}

{code}

package org.apache.hadoop.hive.ql.stats;

import org.apache.hadoop.conf.Configuration;

/**
 * An interface for any possible implementation for gathering statistics.
 */

public interface IStatsAggregator {

  /**
   * This method does the necessary initializations according to the implementation requirements.
   */
  public boolean init(Configuration hconf);

  /**
   * This method aggregates a given statistic from a disk storage.
   * After aggregation, this method does cleaning by removing all records from the disk storage that have the same given rowID.
   *
   * rowID : a string identifying the statistic to be gathered, possibly the table name + the partition specs.
   *
   * key : a string noting the key to be gathered. Ex: "numRows".
   */
  public String aggregateStats(String rowID, String key);

  /**
   * This method executes the necessary termination procedures, possibly closing all database connections.
   */
  public boolean terminate();

}
{code}
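To make the publish/aggregate contract above concrete, here is a minimal in-memory sketch. It is an illustration only, not code from Hive: the interface shapes are simplified (init() takes no Hadoop Configuration here), and MemoryStatsStore is a hypothetical stand-in for the MySQL/HBase backends. Each mapper publishes a partial value for a rowID/key; aggregation sums all the partials and then removes them, mirroring the cleanup step described in the IStatsAggregator javadoc.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Simplified versions of the two pluggable interfaces listed above.
// Hive's real interfaces pass a Hadoop Configuration to init(); that
// parameter is dropped here to keep the sketch self-contained.
interface StatsPublisher {
    boolean init();
    boolean publishStat(String rowID, String key, String value);
    boolean terminate();
}

interface StatsAggregator {
    boolean init();
    String aggregateStats(String rowID, String key);
    boolean terminate();
}

// Hypothetical in-memory store standing in for the MySQL/HBase backends.
class MemoryStatsStore implements StatsPublisher, StatsAggregator {
    private final Map<String, List<Long>> store = new HashMap<>();

    private static String cell(String rowID, String key) {
        return rowID + "/" + key;
    }

    public boolean init() { return true; }

    // Each call records one mapper's partial value for the given cell.
    public boolean publishStat(String rowID, String key, String value) {
        store.computeIfAbsent(cell(rowID, key), k -> new ArrayList<>())
             .add(Long.parseLong(value));
        return true;
    }

    // Sum all published partials for the cell, then delete them (cleanup).
    public String aggregateStats(String rowID, String key) {
        List<Long> partials = store.remove(cell(rowID, key));
        if (partials == null) return "0";
        long total = 0;
        for (long v : partials) total += v;
        return String.valueOf(total);
    }

    public boolean terminate() { store.clear(); return true; }
}

class MemoryStatsDemo {
    public static void main(String[] args) {
        MemoryStatsStore store = new MemoryStatsStore();
        store.init();

        // Two mappers of the same job publish partial row counts for one
        // partition; the rowID is the table name plus the partition specs.
        store.publishStat("Table1/ds=2008-04-09/hr=11", "numRows", "300");
        store.publishStat("Table1/ds=2008-04-09/hr=11", "numRows", "200");

        // At the end of the job the partials are summed and cleaned up.
        System.out.println(store.aggregateStats("Table1/ds=2008-04-09/hr=11", "numRows")); // 500
        System.out.println(store.aggregateStats("Table1/ds=2008-04-09/hr=11", "numRows")); // 0

        store.terminate();
    }
}
```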

h2. Usage

h3. Newly Created Tables

For newly created tables and/or partitions (that are populated through the INSERT OVERWRITE command), statistics are automatically computed by default. The user has to explicitly set the boolean variable hive.stats.autogather to false so that statistics are not automatically computed and stored into Hive MetaStore.

{code}
set hive.stats.autogather=false;
{code}

The user can also specify the implementation to be used for the storage of temporary statistics by setting the variable hive.stats.dbclass. For example, to set HBase as the implementation of temporary statistics storage (the default is {{jdbc:derby}}), the user should issue the following command:

{code}
set hive.stats.dbclass=hbase;
{code}

In case of JDBC implementations of temporary stored statistics (ex. Derby or MySQL), the user should specify the appropriate connection string to the database by setting the variable hive.stats.dbconnectionstring. Also the user should specify the appropriate JDBC driver by setting the variable hive.stats.jdbcdriver.

{code}
set hive.stats.dbclass=jdbc:derby;
set hive.stats.dbconnectionstring="jdbc:derby:;databaseName=TempStatsStore;create=true";
set hive.stats.jdbcdriver="org.apache.derby.jdbc.EmbeddedDriver";
{code}

Queries can fail to collect stats completely accurately. There is a setting, hive.stats.reliable, that fails queries if the stats can't be reliably collected. This is false by default.

h3. Existing Tables

For existing tables and/or partitions, the user can issue the ANALYZE command to gather statistics and write them into Hive MetaStore. The syntax for that command is described below:

{code}
ANALYZE TABLE tablename [PARTITION(partcol1[=val1], partcol2[=val2], ...)] COMPUTE STATISTICS [noscan];
{code}

When the user issues that command, he may or may not specify the partition specs. If the user doesn't specify any partition specs, statistics are gathered for the table as well as all the partitions (if any). If certain partition specs are specified, then statistics are gathered for only those partitions. When computing statistics across all partitions, the partition columns still need to be listed.

When the optional parameter noscan is specified, the command won't scan files, so it's supposed to be fast. Instead of all statistics, it just gathers the following statistics:
 * Number of files
 * Physical size in bytes

h2. Examples

Suppose table Table1 has 4 partitions with the following specs:
 * Partition1: (ds='2008-04-08', hr=11)
 * Partition2: (ds='2008-04-08', hr=12)
 * Partition3: (ds='2008-04-09', hr=11)
 * Partition4: (ds='2008-04-09', hr=12)

and the user issues the following command:

{code}
ANALYZE TABLE Table1 PARTITION(ds='2008-04-09', hr=11) COMPUTE STATISTICS;
{code}

then statistics are gathered for partition3 (ds='2008-04-09', hr=11) only.

If the user issues the command:

{code}
ANALYZE TABLE Table1 PARTITION(ds='2008-04-09', hr) COMPUTE STATISTICS;
{code}

then statistics are gathered for partitions 3 and 4 only.

If the user issues the command:

{code}
ANALYZE TABLE Table1 PARTITION(ds, hr) COMPUTE STATISTICS;
{code}

then statistics are gathered for all the 4 partitions.

For a non-partitioned table, you can issue the command:

{code}
ANALYZE TABLE Table1 COMPUTE STATISTICS;
{code}

to gather statistics of the table. If Table1 is a partitioned table, you have to specify partition specifications like above. Otherwise a semantic analyzer exception will be thrown.

...

The

...

user

...

can

...

view

...

the

...

stored

...

statistics

...

by

...

issuing

...

the

...

DESCRIBE

...

command.

...

Statistics

...

are

...

stored

...

in

...

the

...

Parameters

...

array.

...

Suppose

...

the

...

user

...

issues

...

the

...

analyze

...

command

...

for

...

the

...

whole

...

table

...

Table1,

...

then

...

issues

...

the

...

command:

{
Code Block
}
DESCRIBE EXTENDED TABLE1;
{code}

then among the output, the following would be displayed:

{code}
 ... , parameters:{numPartitions=4, numFiles=16, numRows=2000, totalSize=16384, ...}, ....
{code}

If the user issues the command:

{code}
DESCRIBE EXTENDED TABLE1 PARTITION(ds='2008-04-09', hr=11);
{code}

then among the output, the following would be displayed:

{code}
 ... , parameters:{numFiles=4, numRows=500, totalSize=4096, ...}, ....
{code}

If the user issues the command:

{code}
ANALYZE TABLE Table1 PARTITION(ds='2008-04-09', hr) COMPUTE STATISTICS noscan;
{code}

then the statistics (number of files and physical size in bytes) are gathered for partitions 3 and 4 only.

h2. Current Status (JIRA)

{jiraissues:url=https://issues.apache.org/jira/sr/jira.issueviews:searchrequest-xml/temp/SearchRequest.xml?jqlQuery=project+%3D+HIVE+AND+component+in+%28%22Statistics%22%29&tempMax=1000}