You are viewing an old version of this page. View the current version.

Compare with Current View Page History

« Previous Version 3 Next »

Introduction

This document describes changes to a) HiveQL, b) metastore schema, and c) metastore thrift API to support column level statistics in Hive. Please note that the document doesn’t describe the changes needed to persist histograms in the metastore yet. This is still an open item.

Proposed HiveQL changes

HiveQL currently supports analyze command to compute statistics on tables and partitions. HiveQL’s analyze command will be extended to trigger statistics computation on a column in a Hive table. The necessary changes to HiveQL are as below,

analyze table t [partition p] [column c] compute statistics;

Proposed Metastore Schema

To persist column level statistics, we propose to add the following new tables,

CREATE TABLE TAB_COL_STATS
(
CS_ID NUMBER NOT NULL,
TBL_ID NUMBER NOT NULL,
COLUMN_NAME VARCHAR(128) NOT NULL,
TABLE_NAME VARCHAR(128) NOT NULL,
DB_NAME VARCHAR(128) NOT NULL,

LOW_VALUE NUMBER,
HIGH_VALUE NUMBER,
NUM_NULLS NUMBER,
NUM_DISTINCTS NUMBER,
AVG_COL_LEN NUMBER,
MAX_COL_LEN NUMBER,
NUM_TRUES NUMBER,
NUM_FALSES NUMBER,
LAST_ANALYZED NUMBER NOT NULL ,
HISTOGRAM VARCHAR2(15) NOT NULL)

ALTER TABLE COLUMN_STATISTICS ADD CONSTRAINT COLUMN_STATISTICS_PK PRIMARY KEY (CS_ID);

ALTER TABLE COLUMN_STATISTICS ADD CONSTRAINT COLUMN_STATISTICS_FK1 FOREIGN KEY (TBL_ID) REFERENCES TBLS (TBL_ID) INITIALLY DEFERRED ;

CREATE TABLE PART_COL_STATS
(
CS_ID NUMBER NOT NULL,
PART_ID NUMBER NOT NULL,

DB_NAME VARCHAR(128) NOT NULL,
COLUMN_NAME VARCHAR(128) NOT NULL,
TABLE_NAME VARCHAR(128) NOT NULL,
PART_NAME VARCHAR(128) NOT NULL,

LOW_VALUE NUMBER,
HIGH_VALUE NUMBER,
NUM_NULLS NUMBER,
NUM_DISTINCTS NUMBER,
AVG_COL_LEN NUMBER,
MAX_COL_LEN NUMBER,
NUM_TRUES NUMBER,
NUM_FALSES NUMBER,
LAST_ANALYZED NUMBER NOT NULL ,
HISTOGRAM VARCHAR2(15) NOT NULL)

ALTER TABLE COLUMN_STATISTICS ADD CONSTRAINT COLUMN_STATISTICS_PK PRIMARY KEY (CS_ID);

ALTER TABLE COLUMN_STATISTICS ADD CONSTRAINT COLUMN_STATISTICS_FK1 FOREIGN KEY (PART_ID) REFERENCES PARTITIONS (PART_ID) INITIALLY DEFERRED ;

Possible values for the histogram column are NONE, HEIGHT-BALANCED.

Proposed Metastore Thrift API

We propose to add the following Thrift struct to transport column statistics,

struct ColumnStatistics

(1: string tableName,2: string colName, 3: string dbName, 4: string partName, 5: i64 lowValue, 6: i64 highValue, 7: i64 numNulls, 8: i64 numDVs, 9: double avgColLen, 10: i64 maxColLen, 11: i64 numTrues, 12: i64 numFalses, 13: i64 lastAnalyzed, 14: string histogram, 15: bool isTblLevel )

Field histogram in the above thrift struct takes values of “NONE” or “HEIGHT-BALANCED”.

We propose to add the following Thrift APIs to persist, retrieve and delete column statistics,

bool write_column_statistics (1:ColumnStatistics stats_obj) throws

(1:NoSuchObjectException o1, 2:InvalidObjectException o2, 3:MetaException o3)

ColumnStatistics get_column_statistics_by_table (1:string db_name, 2:string tbl_name,

3:string col_name) throws
(1:NoSuchObjectException o1, 2:MetaException o2, 3:InvalidInputException o3)

ColumnStatistics get_column_statistics_by_partition (1:string db_name, 2:string tbl_name,

3:string part_name, 4:string col_name) throws (1:NoSuchObjectException o1, 2:MetaException o2,
3:InvalidInputException o3)

bool delete_column_statistics_by_partition (1:string db_name, 2:string tbl_name,

3:string part_name, 4:string col_name) throws
(1:NoSuchObjectException o1, 2:MetaException o2, 3:InvalidObjectException o3,
4:InvalidInputException o4)

bool delete_column_statistics_by_table (1:string db_name, 2:string tbl_name, 3:string col_name) throws
(1:NoSuchObjectException o1, 2:MetaException o2, 3:InvalidObjectException o3,
4:InvalidInputException o4)

Note that delete_column_statistics is needed to remove the entries from the meta-store when a table is dropped. Also note that currently Hive doesn’t support drop column.

If null is passed in the place of col_name to the delete column statistics APIs, column statistics for all columns in the table/partition is deleted. This is provided for ease of delete during a drop table/drop partition operation.

  • No labels