h3. *Introduction*

This document describes the changes to a) HiveQL, b) the metastore schema, and c) the metastore Thrift API needed to support column-level statistics in Hive. Note that the document does not yet describe the changes needed to persist histograms in the metastore.

h3. *HiveQL changes*

HiveQL currently supports the analyze command to compute statistics on tables and partitions. The analyze command will be extended to trigger statistics computation on one or more columns of a Hive table or partition. The necessary change to HiveQL is as follows,

analyze table t [partition p] compute statistics for [columns c,...];
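
For illustration, assuming a hypothetical table page_views partitioned by ds with columns user_id and referrer (the partition specification below follows the syntax of the existing analyze command), the extended command could be used at the table or partition level as follows,

analyze table page_views compute statistics for columns user_id, referrer;
analyze table page_views partition (ds='2012-06-01') compute statistics for columns user_id;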


h3. *Metastore Schema*

To persist column-level statistics, we propose to add the following new tables,

CREATE TABLE TAB_COL_STATS
(
 CS_ID NUMBER NOT NULL,
 TBL_ID NUMBER NOT NULL,
 COLUMN_NAME VARCHAR(128) NOT NULL,
 COLUMN_TYPE VARCHAR(128) NOT NULL,
 TABLE_NAME VARCHAR(128) NOT NULL,
 DB_NAME VARCHAR(128) NOT NULL,
 LOW_VALUE RAW,
 HIGH_VALUE RAW,
 NUM_NULLS BIGINT,
 NUM_DISTINCTS BIGINT,
 AVG_COL_LEN DOUBLE,
 MAX_COL_LEN BIGINT,
 NUM_TRUES BIGINT,
 NUM_FALSES BIGINT,
 LAST_ANALYZED BIGINT NOT NULL
)

ALTER TABLE TAB_COL_STATS ADD CONSTRAINT TAB_COL_STATS_PK PRIMARY KEY (CS_ID);

ALTER TABLE TAB_COL_STATS ADD CONSTRAINT TAB_COL_STATS_FK1 FOREIGN KEY (TBL_ID) REFERENCES TBLS (TBL_ID) INITIALLY DEFERRED;

CREATE TABLE PART_COL_STATS
(
 CS_ID NUMBER NOT NULL,
 PART_ID NUMBER NOT NULL,
 DB_NAME VARCHAR(128) NOT NULL,
 COLUMN_NAME VARCHAR(128) NOT NULL,
 COLUMN_TYPE VARCHAR(128) NOT NULL,
 TABLE_NAME VARCHAR(128) NOT NULL,
 PART_NAME VARCHAR(128) NOT NULL,
 LOW_VALUE RAW,
 HIGH_VALUE RAW,
 NUM_NULLS BIGINT,
 NUM_DISTINCTS BIGINT,
 AVG_COL_LEN DOUBLE,
 MAX_COL_LEN BIGINT,
 NUM_TRUES BIGINT,
 NUM_FALSES BIGINT,
 LAST_ANALYZED BIGINT NOT NULL
)

ALTER TABLE PART_COL_STATS ADD CONSTRAINT PART_COL_STATS_PK PRIMARY KEY (CS_ID);

ALTER TABLE PART_COL_STATS ADD CONSTRAINT PART_COL_STATS_FK1 FOREIGN KEY (PART_ID) REFERENCES PARTITIONS (PART_ID) INITIALLY DEFERRED;

Possible values for the histogram column are NONE and HEIGHT-BALANCED; currently only NONE is a valid option. When support for histograms is implemented, the metastore schema will be extended to persist the histogram buckets, and the value of the histogram column in TAB_COL_STATS and PART_COL_STATS will be checked to decide whether valid histogram buckets exist for the column in question.
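
For example, assuming a table t in database default whose column c1 has already been analyzed (the names are purely illustrative), a tool could read back the table-level statistics with a plain SQL query against the new table,

SELECT NUM_NULLS, NUM_DISTINCTS, LOW_VALUE, HIGH_VALUE, LAST_ANALYZED
FROM TAB_COL_STATS
WHERE DB_NAME = 'default'
  AND TABLE_NAME = 't'
  AND COLUMN_NAME = 'c1';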


h3. *Metastore Thrift API*

We propose to add the following Thrift structs to transport column statistics,

struct BooleanColumnStatsData {
1: required i64 numTrues,
2: required i64 numFalses,
3: required i64 numNulls
}

struct DoubleColumnStatsData {
1: required double lowValue,
2: required double highValue,
3: required i64 numNulls,
4: required i64 numDVs
}

struct LongColumnStatsData {
1: required i64 lowValue,
2: required i64 highValue,
3: required i64 numNulls,
4: required i64 numDVs
}

struct StringColumnStatsData {
1: required i64 maxColLen,
2: required double avgColLen,
3: required i64 numNulls,
4: required i64 numDVs
}

struct BinaryColumnStatsData {
1: required i64 maxColLen,
2: required double avgColLen,
3: required i64 numNulls
}

union ColumnStatisticsData {
1: BooleanColumnStatsData booleanStats,
2: LongColumnStatsData longStats,
3: DoubleColumnStatsData doubleStats,
4: StringColumnStatsData stringStats,
5: BinaryColumnStatsData binaryStats
}

struct ColumnStatisticsObj {
1: required string colName,
2: required string colType,
3: required ColumnStatisticsData statsData
}

struct ColumnStatisticsDesc {
1: required bool isTblLevel, 
2: required string dbName,
3: required string tableName,
4: optional string partName,
5: optional i64 lastAnalyzed
} 
  
struct ColumnStatistics {
1: required ColumnStatisticsDesc statsDesc,
2: required list<ColumnStatisticsObj> statsObj;
}
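
To illustrate how these structs fit together, the following is a rough sketch of assembling table-level statistics for a single bigint column using the Java classes that the Thrift compiler generates from the definitions above. The package name, the sample database/table/column names, and the statistic values are assumptions for illustration only.

import java.util.Collections;
import org.apache.hadoop.hive.metastore.api.ColumnStatistics;
import org.apache.hadoop.hive.metastore.api.ColumnStatisticsData;
import org.apache.hadoop.hive.metastore.api.ColumnStatisticsDesc;
import org.apache.hadoop.hive.metastore.api.ColumnStatisticsObj;
import org.apache.hadoop.hive.metastore.api.LongColumnStatsData;

public class ColumnStatsSketch {
  public static ColumnStatistics longColumnStats() {
    // Type-specific statistics for a bigint column (values are illustrative)
    LongColumnStatsData longStats = new LongColumnStatsData();
    longStats.setLowValue(1L);
    longStats.setHighValue(100000L);
    longStats.setNumNulls(12L);
    longStats.setNumDVs(95000L);

    // Wrap the type-specific data in the union and tag it with the column name and type
    ColumnStatisticsData data = new ColumnStatisticsData();
    data.setLongStats(longStats);
    ColumnStatisticsObj obj = new ColumnStatisticsObj("c1", "bigint", data);

    // Descriptor identifying the (hypothetical) table the statistics belong to
    ColumnStatisticsDesc desc = new ColumnStatisticsDesc();
    desc.setIsTblLevel(true);
    desc.setDbName("default");
    desc.setTableName("t");
    desc.setLastAnalyzed(System.currentTimeMillis() / 1000);

    // One ColumnStatistics message can carry stats for several columns; here just one
    return new ColumnStatistics(desc, Collections.singletonList(obj));
  }
}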

We propose to add the following Thrift APIs to persist, retrieve and delete column statistics,

bool update_table_column_statistics(1:ColumnStatistics stats_obj) throws
              (1:NoSuchObjectException o1, 2:InvalidObjectException o2,
               3:MetaException o3, 4:InvalidInputException o4)

bool update_partition_column_statistics(1:ColumnStatistics stats_obj) throws
              (1:NoSuchObjectException o1, 2:InvalidObjectException o2,
               3:MetaException o3, 4:InvalidInputException o4)

ColumnStatistics get_table_column_statistics(1:string db_name, 2:string tbl_name, 3:string col_name) throws
              (1:NoSuchObjectException o1, 2:MetaException o2,
               3:InvalidInputException o3, 4:InvalidObjectException o4)

ColumnStatistics get_partition_column_statistics(1:string db_name, 2:string tbl_name, 3:string part_name,
              4:string col_name) throws
              (1:NoSuchObjectException o1, 2:MetaException o2,
               3:InvalidInputException o3, 4:InvalidObjectException o4)

bool delete_partition_column_statistics(1:string db_name, 2:string tbl_name, 3:string part_name, 4:string col_name) throws
              (1:NoSuchObjectException o1, 2:MetaException o2,
               3:InvalidObjectException o3, 4:InvalidInputException o4)

bool delete_table_column_statistics(1:string db_name, 2:string tbl_name, 3:string col_name) throws
              (1:NoSuchObjectException o1, 2:MetaException o2,
               3:InvalidObjectException o3, 4:InvalidInputException o4)


Note that the delete_table_column_statistics and delete_partition_column_statistics calls are needed to remove the corresponding entries from the metastore when a table or partition is dropped. Also note that Hive does not currently support dropping a column.
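
Finally, a rough sketch of how a client might exercise the retrieval and deletion calls through the Thrift client stub generated from the metastore service definition. The ThriftHiveMetastore.Client stub name, the host and port, and the sample database/table/partition/column names are assumptions for illustration only.

import org.apache.hadoop.hive.metastore.api.ColumnStatistics;
import org.apache.hadoop.hive.metastore.api.ThriftHiveMetastore;
import org.apache.thrift.protocol.TBinaryProtocol;
import org.apache.thrift.transport.TSocket;

public class ColumnStatsClientSketch {
  public static void main(String[] args) throws Exception {
    // Open a connection to the metastore Thrift service (host/port are assumptions)
    TSocket transport = new TSocket("localhost", 9083);
    transport.open();
    ThriftHiveMetastore.Client client =
        new ThriftHiveMetastore.Client(new TBinaryProtocol(transport));

    // Read back table-level statistics for column c1 of default.t
    ColumnStatistics tblStats = client.get_table_column_statistics("default", "t", "c1");
    System.out.println(tblStats);

    // Read back partition-level statistics for the same column in partition ds=2012-06-01
    ColumnStatistics partStats =
        client.get_partition_column_statistics("default", "t", "ds=2012-06-01", "c1");
    System.out.println(partStats);

    // Drop the statistics entries, e.g. as part of dropping the table
    client.delete_table_column_statistics("default", "t", "c1");

    transport.close();
  }
}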