...

This document describes changes to a) HiveQL, b) the metastore schema, and c) the metastore Thrift API to support column-level statistics in Hive. Please note that the document does not yet describe the changes needed to persist histograms in the metastore. This is still an open item.

Proposed HiveQL changes

HiveQL currently supports the analyze command to compute statistics on tables and partitions. The analyze command will be extended to trigger statistics computation on one or more columns in a Hive table/partition. The necessary changes to HiveQL are as follows:

analyze table t [partition p] compute statistics for [columns c,...];
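
For illustration, the extended command might be invoked as in the sketch below. The table name page_views, the partition column ds, and the column names user_id and view_time are hypothetical and are used only to make the syntax concrete.

```sql
-- Hypothetical table and column names, for illustration only.

-- Compute column statistics for two columns of an unpartitioned table.
analyze table page_views compute statistics for columns user_id, view_time;

-- Compute column statistics for one column of a single partition.
analyze table page_views partition (ds='2012-08-01') compute statistics for columns user_id;
```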

Proposed Metastore Schema

...

CREATE TABLE TAB_COL_STATS
(
CS_ID NUMBER NOT NULL,
TBL_ID NUMBER NOT NULL,
COLUMN_NAME VARCHAR(128) NOT NULL,
COLUMN_TYPE VARCHAR(128) NOT NULL,
TABLE_NAME VARCHAR(128) NOT NULL,
DB_NAME VARCHAR(128) NOT NULL,

...

DB_NAME VARCHAR(128) NOT NULL,
COLUMN_NAME VARCHAR(128) NOT NULL,
COLUMN_TYPE VARCHAR(128) NOT NULL,
TABLE_NAME VARCHAR(128) NOT NULL,
PART_NAME VARCHAR(128) NOT NULL,

...

Possible values for the histogram column are NONE and HEIGHT-BALANCED. Currently only NONE is a valid option. When we implement support for histograms, we will extend the metastore schema to persist the histogram buckets. We will check the value of the histogram column in TAB_COL_STATS and PART_COL_STATS to decide whether valid histogram buckets exist for the column in question.

Proposed Metastore Thrift API

...

If null is passed in place of col_name to the delete column statistics APIs, the column statistics, when present, for all columns in the table/partition are deleted. This is provided to make deletion easy during a drop table/drop partition operation.
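
The difference between the two cases can be pictured roughly in terms of the proposed TAB_COL_STATS table. This is only a sketch of the semantics, not a description of how the metastore actually issues the deletes; the TBL_ID value 42 and the column name user_id are hypothetical.

```sql
-- Illustration only: hypothetical TBL_ID and column name.

-- col_name given: remove the statistics for that one column.
DELETE FROM TAB_COL_STATS
WHERE TBL_ID = 42
  AND COLUMN_NAME = 'user_id';

-- col_name is null: remove the statistics for every column of the table,
-- which is what a drop table operation needs.
DELETE FROM TAB_COL_STATS
WHERE TBL_ID = 42;
```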