Page History

...

Info
These functions can be used without GROUP BY as well.

Return Type	Name(Signature)	Description	OSS
bigint	count(*) count(expr) count(DISTINCT expr[, expr...])	count(*) - Returns the total number of retrieved rows, including rows containing NULL values. count(expr) - Returns the number of rows for which the supplied expression is non-NULL. count(DISTINCT expr[, expr]) - Returns the number of rows for which the supplied expression(s) are unique and non-NULL. Execution of this can be optimized with hive.optimize.distinct.rewrite.	GenericUDAFCount
double	sum(col), sum(DISTINCT col)	Returns the sum of the elements in the group or the sum of the distinct values of the column in the group.	GenericUDAFSum
double	avg(col), avg(DISTINCT col)	Returns the average of the elements in the group or the average of the distinct values of the column in the group.	GenericUDAFAverage
double	min(col)	Returns the minimum of the column in the group.	GenericUDAFMin
double	max(col)	Returns the maximum value of the column in the group.	GenericUDAFMax
double	variance(col), var_pop(col)	Returns the variance of a numeric column in the group.	GenericUDAFVariance
double	var_samp(col)	Returns the unbiased sample variance of a numeric column in the group.	GenericUDAFVarianceSample
double	stddev_pop(col)	Returns the standard deviation of a numeric column in the group.	GenericUDAFStd
double	stddev_samp(col)	Returns the unbiased sample standard deviation of a numeric column in the group.	GenericUDAFStdSample
double	covar_pop(col1, col2)	Returns the population covariance of a pair of numeric columns in the group.	GenericUDAFCovariance
double	covar_samp(col1, col2)	Returns the sample covariance of a pair of numeric columns in the group.	GenericUDAFCovarianceSample
double	corr(col1, col2)	Returns the Pearson coefficient of correlation of a pair of numeric columns in the group.	GenericUDAFCorrelation
double	percentile(bigint col, p)	Returns the exact p^th percentile of a column in the group (does not work with floating point types). p must be between 0 and 1. NOTE: A true percentile can only be computed for integer values. Use PERCENTILE_APPROX if your input is non-integral.	UDAFPercentile
array<double>	percentile(bigint col, array(p1 [, p2]...))	Returns the exact percentiles p₁, p₂, ... of a column in the group (does not work with floating point types). p_i must be between 0 and 1. NOTE: A true percentile can only be computed for integer values. Use PERCENTILE_APPROX if your input is non-integral.
double	percentile_approx(double col, p [, B])	Returns an approximate p^th percentile of a numeric column (including floating point types) in the group. The B parameter controls approximation accuracy at the cost of memory. Higher values yield better approximations, and the default is 10,000. When the number of distinct values in col is smaller than B, this gives an exact percentile value.	GenericUDAFPercentileApprox
array<double>	percentile_approx(double col, array(p1 [, p2]...) [, B])	Same as above, but accepts and returns an array of percentile values instead of a single one.
double	regr_avgx(independent, dependent)	Equivalent to avg(dependent).
double	regr_avgy(independent, dependent)	Equivalent to avg(independent).
double	regr_count(independent, dependent)	Returns the number of non-null pairs used to fit the linear regression line.
double	regr_intercept(independent, dependent)	Returns the y-intercept of the linear regression line, i.e. the value of b in the equation dependent = a * independent + b.
double	regr_r2(independent, dependent)	Returns the coefficient of determination for the regression.
double	regr_slope(independent, dependent)	Returns the slope of the linear regression line, i.e. the value of an in the equation dependent = a * independent + b.
double	regr_sxx(independent, dependent)	Equivalent to regr_count(independent, dependent) * var_pop(dependent).
double	regr_sxy(independent, dependent)	Equivalent to regr_count(independent, dependent) * covar_pop(independent, dependent).
double	regr_syy(independent, dependent)	Equivalent to regr_count(independent, dependent) * var_pop(independent).
array<struct {`'x','y'`}>	histogram_numeric(col, b)	Computes a histogram of a numeric column in the group using b non-uniformly spaced bins. The output is an array of size b of double-valued (x,y) coordinates that represent the bin centers and heights	GenericUDAFHistogramNumeric
array	collect_set(col)	Returns a set of objects with duplicate elements eliminated.	GenericUDAFCollectSet
array	collect_list(col)	Returns a list of objects with duplicates.	GenericUDAFCollectList
int	ntile(integer x)	Divides an ordered partition into `x` groups called buckets and assign a bucket number to each row in the partition. This allows easy calculation of tertiles, quartiles, deciles, percentiles, and other common summary statistics.	GenericUDAFNTile

Tip
Most of the UDAFs ignore NULL values.

...

Space shortcuts

Child pages

Versions Compared

Old Version 52

New Version 53

Key