Introduction

This document describes changes to (a) HiveQL, (b) the metastore schema, and (c) the metastore Thrift API to support column-level statistics in Hive. It does not yet describe the changes needed to persist histograms in the metastore.

Proposed HiveQL changes

HiveQL currently supports the analyze command to compute statistics on tables and partitions. The analyze command will be extended to trigger statistics computation on one or more columns of a Hive table or partition. The necessary change to HiveQL is as follows:

analyze table t [partition p] compute statistics for [columns c,...];
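
As an illustration of the proposed syntax, the statements below sketch how the extended command might be invoked. The table name page_views, the partition column ds, and the column names user_id and url are hypothetical, and the partition specification is assumed to take the same form as in the existing analyze command.

```sql
-- Hypothetical table and column names, for illustration only.

-- Compute column-level statistics for two columns of an unpartitioned table.
analyze table page_views compute statistics for columns user_id, url;

-- Compute column-level statistics for one column of a single partition
-- (assumes the partition spec form used by the existing analyze command).
analyze table page_views partition (ds='2012-01-01') compute statistics for columns user_id;
```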

Proposed Metastore Schema

To persist column-level statistics, we propose adding the following new tables:

...

Possible values for the histogram column are NONE and HEIGHT-BALANCED. Currently, only NONE is a valid option. When we implement support for histograms, we will extend the metastore schema to persist the histogram buckets. We will then check the value of the histogram column in TAB_COL_STATS and PART_COL_STATS to decide whether valid histogram buckets exist for the column in question.

Proposed Metastore Thrift API

We propose adding the following Thrift struct to transport column statistics:

...