Apache DataSketches (https://datasketches.apache.org/) is integrated into Hive via HIVE-22939.
This enables various kind of sketch operations thru regular sql statement.
Sketch functions
Naming convention
All sketch functions are registered using the following naming convention:
ds_{sketchType}_{functionName}
For example we have a function called: ds_hll_estimate which could be used to estimate the distinct values from an hll sketch.
sketchType
For detailed info about the sketches themself please refer to the datasketches site!
- frequency
- hll
- cpc
- theta
- frequent items
- freq
- histograms
- kll
functionName
name | description |
---|---|
sketch | generates sketch data from input |
estimate | computes the estimate for frequency related sketches |
union | aggregate function to merge multiple sketches |
union_f | unions 2 sketches given in the arguments |
n | number of elements |
cdf | cumulative distribution |
rank | estimates the rank of the given element; returns a value in the range of 0~1 |
intersect | aggregate to intersect multiple sketches |
intersect_f | intersect 2 sketches given in the arguments |
stringify | returns the the sketch in a more readable form |
List declared sketch functions
Given that we have ~60 functions registered I would recommend to also consider listing/getting info about a single udf.
You could list all functions prefixed by ds_ using:
show functions like 'ds_%';
And you can access the description of a function like:
desc function ds_freq_sketch;