...
In addition to the partition statistics, column level top K values can also be estimated for Hive tables.
The name and top K values of the most skewed column is stored in the partition or non-partitioned table’s skewed information, if user did not specify skew. This works for both newly created and existing tables.
The algorithm for computing top K is based on this paper: top-k.
...
Code Block |
---|
... /** * This method aggregates top K statistics. * * */ public List<String> aggregateStatsTopK(String keyPrefix, String statType); ... |
Usage
Top K statistic is statistics are not computed enabled by default. The user can set the boolean variable hive.stats.topk.collect to be true to enable computing top K and putting top K into skewed information.
...