...

In addition to the partition statistics, column-level top K values can also be estimated for Hive tables.
The name and top K values of the most skewed column are stored in the skewed information of the partition or non-partitioned table, provided the user did not specify skew. This works for both newly created and existing tables.
The algorithm for computing top K is based on this paper: top-k.
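
To make the idea of estimating top K values concrete, here is a minimal sketch of a bounded-memory frequency estimator. The paper referenced above is not identified in this text, so the sketch uses the well-known Space-Saving approach as a stand-in; it is not necessarily the algorithm Hive implements. The class name TopKSketch and its methods are hypothetical and are not part of Hive.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical helper (not a Hive class): Space-Saving style top K estimator. */
public class TopKSketch {
  private final int capacity;                       // maximum number of tracked values
  private final Map<String, Long> counts = new HashMap<>();

  public TopKSketch(int capacity) {
    if (capacity <= 0) {
      throw new IllegalArgumentException("capacity must be positive");
    }
    this.capacity = capacity;
  }

  /** Records one occurrence of a column value. */
  public void offer(String value) {
    Long current = counts.get(value);
    if (current != null) {
      counts.put(value, current + 1);
    } else if (counts.size() < capacity) {
      counts.put(value, 1L);
    } else {
      // Table is full: evict the least frequent tracked value and let the
      // newcomer inherit its count plus one (an overestimate by design).
      String minKey = null;
      long minCount = Long.MAX_VALUE;
      for (Map.Entry<String, Long> e : counts.entrySet()) {
        if (e.getValue() < minCount) {
          minCount = e.getValue();
          minKey = e.getKey();
        }
      }
      counts.remove(minKey);
      counts.put(value, minCount + 1);
    }
  }

  /** Returns up to k tracked values, highest estimated count first. */
  public List<String> topK(int k) {
    List<Map.Entry<String, Long>> entries = new ArrayList<>(counts.entrySet());
    entries.sort((a, b) -> Long.compare(b.getValue(), a.getValue()));
    List<String> result = new ArrayList<>();
    for (int i = 0; i < Math.min(k, entries.size()); i++) {
      result.add(entries.get(i).getKey());
    }
    return result;
  }
}
```

The capacity bounds memory regardless of how many distinct values the column holds, at the cost of overestimated counts for values that entered through eviction. A heavily skewed value tends to stay in the table, which is the property skew detection relies on.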

...

Code Block
...
  /**
   * This method aggregates top K statistics.
   */
  public List<String> aggregateStatsTopK(String keyPrefix, String statType);
...

Usage

Top K statistics are not enabled by default. The user can set the boolean variable hive.stats.topk.collect to true to enable computing top K values and storing them in the skewed information.

...