
...

This document describes changes to a) HiveQL, b) the metastore schema, and c) the metastore Thrift API to support column-level statistics in Hive. Please note that the document doesn’t yet describe the changes needed to persist histograms in the metastore.

Version information

Column statistics are introduced in Hive 0.10.0 by HIVE-1362. This is the design document.

For general information about Hive statistics, see Statistics in Hive. For information about top K statistics, see Column Level Top K Statistics.

HiveQL changes

HiveQL currently supports the analyze command to compute statistics on tables and partitions. HiveQL’s analyze command will be extended to trigger statistics computation on one or more columns in a Hive table or partition. The necessary change to HiveQL is shown below:

analyze table t [partition p] compute statistics for [columns c,...];

Please note that table and column aliases are not supported in the analyze statement.
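
As an illustration of the syntax above, here is a minimal Python sketch that assembles an analyze statement from a table name, a column list, and an optional partition. The helper name `build_analyze_statement`, the dict representation of the partition, and the `partition (key='value')` rendering with quoted string values are assumptions made for this example; they are not part of the design.

```python
def build_analyze_statement(table, columns, partition=None):
    """Build an analyze statement that requests column statistics.

    `partition` is an optional dict of partition key -> value. This
    representation, and quoting every value as a string literal, are
    assumptions of this sketch.
    """
    if not columns:
        raise ValueError("at least one column is required")
    parts = ["analyze table", table]
    if partition:
        # Render each key/value pair as key='value'.
        spec = ", ".join(f"{key}='{value}'" for key, value in partition.items())
        parts.append(f"partition ({spec})")
    # Plain column names only: aliases are not supported in the statement.
    parts.append("compute statistics for columns " + ", ".join(columns))
    return " ".join(parts) + ";"
```

For example, `build_analyze_statement("t", ["c1", "c2"])` yields `analyze table t compute statistics for columns c1, c2;`.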

...

We propose to add the following Thrift structs to transport column statistics:

struct BooleanColumnStatsData {
1: required i64 numTrues,
2: required i64 numFalses,
3: required i64 numNulls
}
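
To make the three fields of BooleanColumnStatsData concrete, the following Python sketch counts them over an in-memory column. The class `BooleanColumnStats` and the function `compute_boolean_stats` are hypothetical stand-ins written for this example. They are not the Thrift-generated code, and they do not show how Hive itself computes the values.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass
class BooleanColumnStats:
    """Hypothetical Python mirror of the BooleanColumnStatsData struct."""
    num_trues: int
    num_falses: int
    num_nulls: int


def compute_boolean_stats(values: Iterable[Optional[bool]]) -> BooleanColumnStats:
    """Count true, false, and NULL (None) entries in a boolean column."""
    trues = falses = nulls = 0
    for value in values:
        if value is None:
            nulls += 1
        elif value:
            trues += 1
        else:
            falses += 1
    return BooleanColumnStats(trues, falses, nulls)
```

For a column holding `[True, False, None, True]`, this gives two trues, one false, and one null.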

...

We propose to add the following Thrift APIs to persist, retrieve and delete column statistics:

bool update_table_column_statistics(1:ColumnStatistics stats_obj) throws (1:NoSuchObjectException o1,
2:InvalidObjectException o2, 3:MetaException o3, 4:InvalidInputException o4)
bool update_partition_column_statistics(1:ColumnStatistics stats_obj) throws (1:NoSuchObjectException o1,
2:InvalidObjectException o2, 3:MetaException o3, 4:InvalidInputException o4)

...

Note that delete_column_statistics is needed to remove the entries from the metastore when a table is dropped. Also note that currently Hive doesn’t support drop column.

Note that in V1 of the project, we will support only scalar statistics. Furthermore, we will support only static partitions, i.e., both the partition key and the partition value must be specified in the analyze command. In a following version, we will add support for height-balanced histograms as well as support for dynamic partitions in the analyze command for column-level statistics.
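
The static-partition restriction can be sketched as a simple check. In this Python sketch, a partition specification is assumed to be a dict from partition key to value, with `None` standing for a key given without a value (a dynamic partition). Both that representation and the name `is_static_partition_spec` are hypothetical.

```python
def is_static_partition_spec(partition):
    """Return True when every partition key is paired with a value.

    `partition` maps partition key -> value; None marks a key with no
    value, which this sketch treats as a dynamic partition.
    """
    return bool(partition) and all(
        value is not None for value in partition.values()
    )
```

Under this assumption, `{"ds": "2012-01-01", "hr": "11"}` passes the check, while `{"ds": "2012-01-01", "hr": None}` does not.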