...
=
...
Hive
...
User
...
Defined
...
Functions
...
=
...
Table of Contents |
---|
In the CLI,
...
use
...
the
...
commands
...
below
...
to
...
show
...
the
...
latest
...
documentation:
Code Block |
---|
} SHOW FUNCTIONS; DESCRIBE FUNCTION <function_name>; {code} h2. |
Built-in
...
Operators
...
Relational
...
Operators
...
The
...
following
...
operators
...
compare
...
the
...
passed
...
operands
...
and
...
generate
...
a
...
TRUE
...
or
...
FALSE
...
value
...
depending
...
on
...
whether
...
the
...
comparison
...
between
...
the
...
operands
...
holds.
...
Operator | Operand types | Description |
A = B | All primitive types | TRUE if expression A is equal to expression B otherwise FALSE |
A == B | None! | Fails because of invalid syntax. SQL uses =, not == |
A <> B | All primitive types | NULL if A or B is NULL, TRUE if expression A is NOT equal to expression B otherwise FALSE |
A < B | All primitive types | NULL if A or B is NULL, TRUE if expression A is less than expression B otherwise FALSE |
A <= B | All primitive types | NULL if A or B is NULL, TRUE if expression A is less than or equal to expression B otherwise FALSE |
A > B | All primitive types | NULL if A or B is NULL, TRUE if expression A is greater than expression B otherwise FALSE |
A >= B | All primitive types | NULL if A or B is NULL, TRUE if expression A is greater than or equal to expression B otherwise FALSE |
A IS NULL | all types | TRUE if expression A evaluates to NULL otherwise FALSE |
A IS NOT NULL | All types | FALSE if expression A evaluates to NULL otherwise TRUE |
A LIKE B | strings | NULL if A or B is NULL, TRUE if string A matches the SQL simple regular expression B, otherwise FALSE. The comparison is done character by character. The _ character in B matches any character in A(similar to . in posix regular expressions) while the % character in B matches an arbitrary number of characters in A(similar to .* in posix regular expressions) e.g. 'foobar' like 'foo' evaluates to FALSE where as 'foobar' like 'foo_ _ _' evaluates to TRUE and so does 'foobar' like 'foo%' |
A RLIKE B | strings | NULL if A or B is NULL, TRUE if string A matches the Java regular expression B(See Java regular expressions syntax), otherwise FALSE e.g. 'foobar' rlike 'foo' evaluates to FALSE where as 'foobar' rlike '^f.*r$' evaluates to TRUE |
A REGEXP B | strings | Same as RLIKE |
Arithmetic Operators
The following operators support various common arithmetic operations on the operands. All return number types; if any of the operands are NULL, then the result is also NULL.
Operator | Operand types | Description |
A + B | All number types | Gives the result of adding A and B. The type of the result is the same as the common parent(in the type hierarchy) of the types of the operands. e.g. since every integer is a float, therefore float is a containing type of integer so the + operator on a float and an int will result in a float. |
A - B | All number types | Gives the result of subtracting B from A. The type of the result is the same as the common parent(in the type hierarchy) of the types of the operands. |
A * B | All number types | Gives the result of multiplying A and B. The type of the result is the same as the common parent(in the type hierarchy) of the types of the operands. Note that if the multiplication causing overflow, you will have to cast one of the operators to a type higher in the type hierarchy. |
A / B | All number types | Gives the result of dividing B from A. The result is a double type. |
A % B | All number types | Gives the reminder resulting from dividing A by B. The type of the result is the same as the common parent(in the type hierarchy) of the types of the operands. |
A & B | All number types | Gives the result of bitwise AND of A and B. The type of the result is the same as the common parent(in the type hierarchy) of the types of the operands. |
A | B | All number types | Gives the result of bitwise OR of A and B. The type of the result is the same as the common parent(in the type hierarchy) of the types of the operands. |
A ^ B | All number types | Gives the result of bitwise XOR of A and B. The type of the result is the same as the common parent(in the type hierarchy) of the types of the operands. |
~A | All number types | Gives the result of bitwise NOT of A. The type of the result is the same as the type of A. |
Logical Operators
The following operators provide support for creating logical expressions. All of them return boolean TRUE, FALSE, or NULL depending upon the boolean values of the operands. NULL behaves as an "unknown" flag, so if the result depends on the state of an unknown, the result itself is unknown.
Operator | Operand types | Description |
A AND B | boolean | TRUE if both A and B are TRUE, otherwise FALSE. NULL if A or B is NULL |
A && B | boolean | Same as A AND B |
A OR B | boolean | TRUE if either A or B or both are TRUE; FALSE OR NULL is NULL; otherwise FALSE |
A || B | boolean | Same as A OR B |
NOT A | boolean | TRUE if A is FALSE or NULL if A is NULL. Otherwise FALSE. |
! A | boolean | Same as NOT A |
Complex Type Constructors
The following functions construct instances of complex types.
Constructor Function | Operands | Description |
map | (key1, value1, key2, value2, ...) | Creates a map with the given key/value pairs |
struct | (val1, val2, val3, ...) | Creates a struct with the given field values. Struct field names will be col1, col2, ... |
array | (val1, val2, ...) | Creates an array with the given elements |
Operators on Complex Types
The following operators provide mechanisms to access elements in Complex Types
Operator | Operand types | Description |
An | A is an Array and n is an int | Returns the nth element in the array A. The first element has index 0 e.g. if A is an array comprising of 'foo', 'bar' then A0 returns 'foo' and A1 returns 'bar' |
Mkey | M is a Map<K, V> and key has type K | Returns the value corresponding to the key in the map e.g. if M is a map comprising of {'f' -> 'foo', 'b' -> 'bar', 'all' -> 'foobar'} then M'all' returns 'foobar' |
S.x | S is a struct | Returns the x field of S. e.g for struct foobar {int foo, int bar} foobar.foo returns the integer stored in the foo field of the struct. |
Built-in Functions
Mathematical Functions
The following built-in mathematical functions are supported in hive; most return NULL when the argument(s) are NULL:
Return Type | Name(Signature) | Description |
BIGINT | round(double a) | Returns the rounded BIGINT value of the double |
DOUBLE | round(double a, int d) | Returns the double rounded to d decimal places |
BIGINT | floor(double a) | Returns the maximum BIGINT value that is equal or less than the double |
BIGINT | ceil(double a), ceiling(double a) | Returns the minimum BIGINT value that is equal or greater than the double |
double | rand(), rand(int seed) | Returns a random number (that changes from row to row) that is distributed uniformly from 0 to 1. Specifiying the seed will make sure the generated random number sequence is deterministic. |
double | exp(double a) | Returns e^a where e is the base of the natural logarithm |
double | ln(double a) | Returns the natural logarithm of the argument |
double | log10(double a) | Returns the base-10 logarithm of the argument |
double | log2(double a) | Returns the base-2 logarithm of the argument |
double | log(double base, double a) | Return the base "base" logarithm of the argument |
double | pow(double a, double p) power(double a, double p) | Return a^p |
double | sqrt(double a) | Returns the square root of a |
string | bin(BIGINT a) | Returns the number in binary format (see [http://dev.mysql.com/doc/refman/5.0/en/string-functions.html#function_bin |
...
]) |
...
string | hex(BIGINT |
...
a) |
...
hex(string |
...
a) |
...
If |
...
the |
...
argument |
...
is |
...
an |
...
int, |
...
hex |
...
returns |
...
the |
...
number |
...
as |
...
a |
...
string |
...
in |
...
hex |
...
format. |
...
Otherwise |
...
if |
...
the |
...
number |
...
is |
...
a |
...
string, |
...
it |
...
converts |
...
each |
...
character |
...
into |
...
its |
...
hex |
...
representation |
...
and |
...
returns |
...
the |
...
resulting |
...
string. |
...
(see |
...
[ |
...
...
) |
...
string | unhex(string |
...
a) |
...
Inverse |
...
of |
...
hex. |
...
Interprets |
...
each |
...
pair |
...
of |
...
characters |
...
as |
...
a |
...
hexidecimal |
...
number |
...
and |
...
converts |
...
to |
...
the |
...
character |
...
represented |
...
by |
...
the |
...
number. |
...
string | conv(BIGINT |
...
num, |
...
int |
...
from_base, |
...
int |
...
to_base) |
...
Converts |
...
a |
...
number |
...
from |
...
a |
...
given |
...
base |
...
to |
...
another |
...
(see |
...
[ |
...
...
]) |
...
double | abs(double |
...
a) |
...
Returns |
...
the |
...
absolute |
...
value |
...
int |
...
double |
...
pmod(int |
...
a, |
...
int |
...
b) |
...
pmod(double |
...
a, |
...
double |
...
b) |
...
Returns |
...
the |
...
positive |
...
value |
...
of |
...
a |
...
mod |
...
b |
...
double | sin(double |
...
a) |
...
Returns |
...
the |
...
sine |
...
of |
...
a |
...
(a |
...
is |
...
in |
...
radians) |
...
double | asin(double |
...
a) |
...
Returns |
...
the |
...
arc |
...
sin |
...
of |
...
x |
...
if |
...
-1<=a<=1 |
...
or |
...
null |
...
otherwise |
...
double | cos(double |
...
a) |
...
Returns |
...
the |
...
cosine |
...
of |
...
a |
...
(a |
...
is |
...
in |
...
radians) |
...
double | acos(double |
...
a) |
...
Returns |
...
the |
...
arc |
...
cosine |
...
of |
...
x |
...
if |
...
-1<=a<=1 |
...
or |
...
null |
...
otherwise |
...
int |
...
double |
...
positive(int |
...
a) |
...
positive(double |
...
a) |
...
Returns |
...
a |
...
int |
...
double |
...
negative(int |
...
a) |
...
negative(double |
...
a) |
...
Returns |
...
-a |
...
Collection Functions
The following built-in
...
collection
...
functions
...
are
...
supported
...
in
...
hive:
...
Return |
...
Type |
...
Name(Signature) |
...
Description | |
int | size(Map<K.V>) |
...
Returns |
...
the |
...
number |
...
of |
...
elements |
...
in |
...
the |
...
map |
...
type |
...
int | size(Array<T>) |
...
Returns |
...
the |
...
number |
...
of |
...
elements |
...
in |
...
the |
...
array |
...
type |
Type Conversion Functions
The following type conversion functions are supported in hive:
Return Type | Name(Signature) | Description |
Expected "=" to follow "type" | cast(expr as <type>) | Converts the results of the expression expr to <type> e.g. cast('1' |
...
as |
...
BIGINT) |
...
will |
...
convert |
...
the |
...
string |
...
'1' |
...
to |
...
it |
...
integral |
...
representation. |
...
A |
...
null |
...
is |
...
returned |
...
if |
...
the |
...
conversion |
...
does |
...
not |
...
succeed. |
...
Date Functions
The following built-in
...
date
...
functions
...
are
...
supported
...
in
...
hive:
...
string |
...
from_unixtime(bigint |
...
unixtime |
...
...
...
...
) |
...
Converts |
...
the |
...
number |
...
of |
...
seconds |
...
from |
...
unix |
...
epoch |
...
(1970-01-01 |
...
00:00:00 |
...
UTC) |
...
to |
...
a |
...
string |
...
representing |
...
the |
...
timestamp |
...
of |
...
that |
...
moment |
...
in |
...
the |
...
current |
...
system |
...
time |
...
zone |
...
in |
...
the |
...
format |
...
of |
...
"1970-01-01 |
...
00:00:00" |
...
bigint | unix_timestamp() |
...
Gets |
...
current |
...
time |
...
stamp |
...
using |
...
the |
...
default |
...
time |
...
zone. |
...
bigint | unix_timestamp(string |
...
date) |
...
Converts |
...
time |
...
string |
...
in |
...
format |
...
|
...
|
...
to |
...
Unix |
...
time |
...
stamp, |
...
return |
...
0 |
...
if |
...
fail: |
...
unix_timestamp('2009-03-20 |
...
11:30:01') |
...
= |
...
1237573801 |
...
bigint | unix_timestamp(string |
...
date, |
...
string |
...
pattern) |
...
Convert |
...
time |
...
string |
...
with |
...
given |
...
pattern |
...
(see |
...
...
]) |
...
to |
...
Unix |
...
time |
...
stamp, |
...
return |
...
0 |
...
if |
...
fail: |
...
unix_timestamp('2009-03-20', |
...
'yyyy-MM-dd') |
...
= |
...
1237532400 |
...
string | to_date(string |
...
timestamp) |
...
Returns |
...
the |
...
date |
...
part |
...
of |
...
a |
...
timestamp |
...
string: |
...
to_date("1970-01-01 |
...
00:00:00") |
...
= |
...
"1970-01-01" |
...
int | year(string |
...
date) |
...
Returns |
...
the |
...
year |
...
part |
...
of |
...
a |
...
date |
...
or |
...
a |
...
timestamp |
...
string: |
...
year("1970-01-01 |
...
00:00:00") |
...
= |
...
1970, |
...
year("1970-01-01") |
...
= |
...
1970 |
...
int | month(string |
...
date) |
...
Returns |
...
the |
...
month |
...
part |
...
of |
...
a |
...
date |
...
or |
...
a |
...
timestamp |
...
string: |
...
month("1970-11-01 |
...
00:00:00") |
...
= |
...
11, |
...
month("1970-11-01") |
...
= |
...
11 |
...
int | day(string |
...
date) |
...
dayofmonth(date) |
...
Return |
...
the |
...
day |
...
part |
...
of |
...
a |
...
date |
...
or |
...
a |
...
timestamp |
...
string: |
...
day("1970-11-01 |
...
00:00:00") |
...
= |
...
1, |
...
day("1970-11-01") |
...
= |
...
1 |
...
int | hour(string |
...
date) |
...
Returns |
...
the |
...
hour |
...
of |
...
the |
...
timestamp: |
...
hour('2009-07-30 |
...
12:58:59') |
...
= |
...
12, |
...
hour('12:58:59') |
...
= |
...
12 |
...
int | minute(string |
...
date) |
...
Returns |
...
the |
...
minute |
...
of |
...
the |
...
timestamp |
...
int | second(string |
...
date) |
...
Returns |
...
the |
...
second |
...
of |
...
the |
...
timestamp |
...
int | weekofyear(string |
...
date) |
...
Return |
...
the |
...
week |
...
number |
...
of |
...
a |
...
timestamp |
...
string: |
...
weekofyear("1970-11-01 |
...
00:00:00") |
...
= |
...
44, |
...
weekofyear("1970-11-01") |
...
= |
...
44 |
...
int | datediff(string |
...
enddate, |
...
string |
...
startdate) |
...
Return |
...
the |
...
number |
...
of |
...
days |
...
from |
...
startdate |
...
to |
...
enddate: |
...
datediff('2009-03-01', |
...
'2009-02-27') |
...
= |
...
2 |
...
int | date_add(string |
...
startdate, |
...
int |
...
days) |
...
Add |
...
a |
...
number |
...
of |
...
days |
...
to |
...
startdate: |
...
date_add('2008-12-31', |
...
1) |
...
= |
...
'2009-01-01' |
...
int | date_sub(string |
...
startdate, |
...
int |
...
days) |
...
Subtract |
...
a |
...
number |
...
of |
...
days |
...
to |
...
startdate: |
...
date_sub('2008-12-31', |
...
1) |
...
= |
...
'2008-12-30' |
...
Conditional Functions
T | if(boolean |
...
testCondition, |
...
T |
...
valueTrue, |
...
T |
...
valueFalseOrNull) |
...
Return |
...
valueTrue |
...
when |
...
testCondition |
...
is |
...
true, |
...
returns |
...
valueFalseOrNull |
...
otherwise |
...
T | COALESCE(T |
...
v1, |
...
T |
...
v2, |
...
...) |
...
Return |
...
the |
...
first |
...
v |
...
that |
...
is |
...
not |
...
NULL, |
...
or |
...
NULL |
...
if |
...
all |
...
v's |
...
are |
...
NULL |
...
T | CASE a WHEN b THEN c WHEN d THEN e* ELSE f END | When a = b, returns c; when a = d, return e; else return f |
T | CASE WHEN a THEN b WHEN c THEN d* ELSE e END | When a = true, returns b; when c = true, return d; else return e |
String Functions
The following are built-in String functions are supported in hive:
Return Type | Name(Signature) | Description |
int | length(string A) | Returns the length of the string |
string | reverse(string A) | Returns the reversed string |
string | concat(string A, string B...) | Returns the string resulting from concatenating the strings passed in as parameters in order. e.g. concat('foo', 'bar') results in 'foobar'. Note that this function can take any number of input strings. |
string | concat_ws(string SEP, string A, string B...) | Like concat() above, but with custom separator SEP. |
string | substr(string A, int start) substring(string A, int start) | Returns the substring of A starting from start position till the end of string A e.g. substr('foobar', 4) results in 'bar' (see [http://dev.mysql.com/doc/refman/5.0/en/string-functions.html#function_substr |
...
]) |
...
string | substr(string |
...
A, |
...
int |
...
start, |
...
int |
...
len) |
...
substring(string |
...
A, |
...
int |
...
start, |
...
int |
...
len) |
...
Returns |
...
the |
...
substring |
...
of |
...
A |
...
starting |
...
from |
...
start |
...
position |
...
with |
...
length |
...
len |
...
e.g. |
...
substr('foobar', |
...
4, |
...
1) |
...
results |
...
in |
...
'b' |
...
(see |
...
[ |
...
...
]) |
...
string | upper(string |
...
A) |
...
ucase(string |
...
A) |
...
Returns |
...
the |
...
string |
...
resulting |
...
from |
...
converting |
...
all |
...
characters |
...
of |
...
A |
...
to |
...
upper |
...
case |
...
e.g. |
...
upper('fOoBaR') |
...
results |
...
in |
...
'FOOBAR' |
...
string | lower(string |
...
A) |
...
lcase(string |
...
A) |
...
Returns |
...
the |
...
string |
...
resulting |
...
from |
...
converting |
...
all |
...
characters |
...
of |
...
B |
...
to |
...
lower |
...
case |
...
e.g. |
...
lower('fOoBaR') |
...
results |
...
in |
...
'foobar' |
...
string | trim(string |
...
A) |
...
Returns |
...
the |
...
string |
...
resulting |
...
from |
...
trimming |
...
spaces |
...
from |
...
both |
...
ends |
...
of |
...
A |
...
e.g. |
...
trim(' |
...
foobar |
...
') |
...
results |
...
in |
...
'foobar' |
...
string | ltrim(string |
...
A) |
...
Returns |
...
the |
...
string |
...
resulting |
...
from |
...
trimming |
...
spaces |
...
from |
...
the |
...
beginning(left |
...
hand |
...
side) |
...
of |
...
A |
...
e.g. |
...
ltrim(' |
...
foobar |
...
') |
...
results |
...
in |
...
'foobar |
...
' |
...
string | rtrim(string |
...
A) |
...
Returns |
...
the |
...
string |
...
resulting |
...
from |
...
trimming |
...
spaces |
...
from |
...
the |
...
end(right |
...
hand |
...
side) |
...
of |
...
A |
...
e.g. |
...
rtrim(' |
...
foobar |
...
') |
...
results |
...
in |
...
' |
...
foobar' |
...
string | regexp_replace(string |
...
A, |
...
string |
...
B, |
...
string |
...
C) |
...
Returns |
...
the |
...
string |
...
resulting |
...
from |
...
replacing |
...
all |
...
substrings |
...
in |
...
B |
...
that |
...
match |
...
the |
...
Java |
...
regular |
...
expression |
...
syntax(See |
...
Java |
...
regular |
...
expressions |
...
syntax) |
...
with |
...
C |
...
e.g. |
...
regexp_replace("foobar", |
...
"oo |
...
|ar", |
...
"") |
...
returns |
...
'fb.' |
...
Note |
...
that |
...
some |
...
care |
...
is |
...
necessary |
...
in |
...
using |
...
predefined |
...
character |
...
classes: |
...
using |
...
'\s' |
...
as |
...
the |
...
second |
...
argument |
...
will |
...
match |
...
the |
...
letter |
...
s; |
...
' |
...
|
...
is |
...
necessary |
...
to |
...
match |
...
whitespace, |
...
etc. |
...
string | regexp_extract(string |
...
subject, |
...
string |
...
pattern, |
...
int |
...
index) |
...
Returns |
...
the |
...
string |
...
extracted |
...
using |
...
the |
...
pattern. |
...
e.g. |
...
regexp_extract('foothebar', |
...
'foo(.*?)(bar)', |
...
2) |
...
returns |
...
'bar.' |
...
Note |
...
that |
...
some |
...
care |
...
is |
...
necessary |
...
in |
...
using |
...
predefined |
...
character |
...
classes: |
...
using |
...
'\s' |
...
as |
...
the |
...
second |
...
argument |
...
will |
...
match |
...
the |
...
letter |
...
s; |
...
' |
...
|
...
is |
...
necessary |
...
to |
...
match |
...
whitespace, |
...
etc. |
...
The |
...
'index' |
...
parameter |
...
is |
...
the |
...
Java |
...
regex |
...
Matcher |
...
group() |
...
method |
...
index. |
...
See |
...
docs/api/java/util/regex/Matcher.html |
...
for |
...
more |
...
information |
...
on |
...
the |
...
'index' |
...
or |
...
Java |
...
regex |
...
group() |
...
method. |
...
string | parse_url(string |
...
urlString, |
...
string |
...
partToExtract |
...
...
...
...
) |
...
Returns |
...
the |
...
specified |
...
part |
...
from |
...
the |
...
URL. |
...
Valid |
...
values |
...
for |
...
partToExtract |
...
include |
...
HOST, |
...
PATH, |
...
QUERY, |
...
REF, |
...
PROTOCOL, |
...
AUTHORITY, |
...
FILE, |
...
and |
...
USERINFO. |
...
e.g. |
...
parse_url('http://facebook.com/path1/p.php?k1=v1&k2=v2#Ref1', |
...
'HOST') |
...
returns |
...
'facebook.com'. |
...
Also |
...
a |
...
value |
...
of |
...
a |
...
particular |
...
key |
...
in |
...
QUERY |
...
can |
...
be |
...
extracted |
...
by |
...
providing |
...
the |
...
key |
...
as |
...
the |
...
third |
...
argument, |
...
e.g. |
...
parse_url('http://facebook.com/path1/p.php?k1=v1&k2=v2#Ref1', |
...
'QUERY', |
...
'k1') |
...
returns |
...
'v1'. |
...
string | get_json_object(string |
...
json_string, |
...
string |
...
path) |
...
Extract |
...
json |
...
object |
...
from |
...
a |
...
json |
...
string |
...
based |
...
on |
...
json |
...
path |
...
specified, |
...
and |
...
return |
...
json |
...
string |
...
of |
...
the |
...
extracted |
...
json |
...
object. |
...
It |
...
will |
...
return |
...
null |
...
if |
...
the |
...
input |
...
json |
...
string |
...
is |
...
invalid. |
...
NOTE: |
...
The |
...
json |
...
path |
...
can |
...
only |
...
have |
...
the |
...
characters |
...
...
, |
...
i.e., |
...
no |
...
upper-case |
...
or |
...
special |
...
characters. |
...
Also, |
...
the |
...
keys |
...
*cannot |
...
start |
...
with |
...
numbers.* |
...
This |
...
is |
...
due |
...
to |
...
restrictions |
...
on |
...
Hive |
...
column |
...
names. |
...
string | space(int |
...
n) |
...
Return |
...
a |
...
string |
...
of |
...
n |
...
spaces |
...
string | repeat(string |
...
str, |
...
int |
...
n) |
...
Repeat |
...
str |
...
n |
...
times |
...
int | ascii(string |
...
str) |
...
Returns |
...
the |
...
numeric |
...
value |
...
of |
...
the |
...
first |
...
character |
...
of |
...
str |
...
string | lpad(string |
...
str, |
...
int |
...
len, |
...
string |
...
pad) |
...
Returns |
...
str, |
...
left-padded |
...
with |
...
pad |
...
to |
...
a |
...
length |
...
of |
...
len |
...
string | rpad(string |
...
str, |
...
int |
...
len, |
...
string |
...
pad) |
...
Returns |
...
str, |
...
right-padded |
...
with |
...
pad |
...
to |
...
a |
...
length |
...
of |
...
len |
...
array | split(string |
...
str, |
...
string |
...
pat) |
...
Split |
...
str |
...
around |
...
pat |
...
(pat |
...
is |
...
a |
...
regular |
...
expression) |
...
int | find_in_set(string |
...
str, |
...
string |
...
strList) |
...
Returns |
...
the |
...
first |
...
occurance |
...
of |
...
str |
...
in |
...
strList |
...
where |
...
strList |
...
is |
...
a |
...
comma-delimited |
...
string. |
...
Returns |
...
null |
...
if |
...
either |
...
argument |
...
is |
...
null. |
...
Returns |
...
0 |
...
if |
...
the |
...
first |
...
argument |
...
contains |
...
any |
...
commas. |
...
e.g. |
...
find_in_set('ab', |
...
'abc,b,ab,c,def') |
...
returns |
...
3 |
...
array<array<string>> | sentences(string |
...
str, |
...
string |
...
lang, |
...
string |
...
locale) |
...
Tokenizes |
...
a |
...
string |
...
of |
...
natural |
...
language |
...
text |
...
into |
...
words |
...
and |
...
sentences, |
...
where |
...
each |
...
sentence |
...
is |
...
broken |
...
at |
...
the |
...
appropriate |
...
sentence |
...
boundary |
...
and |
...
returned |
...
as |
...
an |
...
array |
...
of |
...
words. |
...
The |
...
'lang' |
...
and |
...
'locale' |
...
are |
...
optional |
...
arguments. |
...
e.g. |
...
sentences('Hello |
...
there! |
...
How |
...
are |
...
you?') |
...
returns |
...
( |
...
("Hello", |
...
"there"), |
...
("How", |
...
"are", |
...
"you") |
...
) |
...
array<struct<string,double>> |
...
ngrams(array<array<string>>, |
...
int |
...
N, |
...
int |
...
K, |
...
int |
...
pf) |
...
Returns |
...
the |
...
top-k |
...
N-grams |
...
from |
...
a |
...
set |
...
of |
...
tokenized |
...
sentences, |
...
such |
...
as |
...
those |
...
returned |
...
by |
...
the |
...
sentences() |
...
UDAF. |
...
See |
...
...
for |
...
more |
...
information. |
...
array<struct<string,double>> |
...
context_ngrams(array<array<string>>, |
...
array<string>, |
...
int |
...
K, |
...
int |
...
pf) |
...
Returns |
...
the |
...
top-k |
...
contextual |
...
N-grams |
...
from |
...
a |
...
set |
...
of |
...
tokenized |
...
sentences, |
...
given |
...
a |
...
string |
...
of |
...
"context". |
...
See |
...
...
for |
...
more |
...
information. |
...
Misc.
...
Functions
...
xpath
The following functions are described in Hive-LanguageManual-XPathUDF
...
:
...
- xpath,
...
- xpath_short,
...
- xpath_int,
...
- xpath_long,
...
- xpath_float,
...
- xpath_double,
...
- xpath_number,
...
- xpath_string
...
get_json_object
...
- A
...
- limited
...
- version
...
- of
...
- JSONPath
...
- is
...
- supported:
- $ : Root object
- . : Child operator
- [] : Subscript operator for array
- * : Wildcard for []
Syntax not supported that's worth noticing:
- : Zero length string as key
- .. : Recursive descent
- @ : Current object/element
- () : Script expression
- ?() : Filter (script) expression.
- , : Union operator
- end.step : array slice operator
Example: src_json table is a single column (json), single row table:
Code Block |
---|
* $ : Root object * . : Child operator * [] : Subscript operator for array * * : Wildcard for [] Syntax not supported that's worth noticing: * : Zero length string as key * .. : Recursive descent * @ : Current object/element * () : Script expression * ?() : Filter (script) expression. * [,] : Union operator * [start:end.step] : array slice operator Example: src_json table is a single column (json), single row table: {code} +----+ json +----+ {"store": {"fruit":[{"weight":8,"type":"apple"},{"weight":9,"type":"pear"}], "bicycle":{"price":19.95,"color":"red"} }, "email":"amy@only_for_json_udf_test.net", "owner":"amy" } +----+ {code} |
The
...
fields
...
of
...
the
...
json
...
object
...
can
...
be
...
extracted
...
using
...
these
...
queries:
Code Block |
---|
} hive> SELECT get_json_object(src_json.json, '$.owner') FROM src_json; amy hive> SELECT get_json_object(src_json.json, '$.store.fruit[0]') FROM src_json; {"weight":8,"type":"apple"} hive> SELECT get_json_object(src_json.json, '$.non_exist_key') FROM src_json; NULL {code} h2. |
Built-in
...
Aggregate
...
Functions
...
(UDAF)
...
The
...
following
...
are
...
built-in
...
aggregate
...
functions
...
are
...
supported
...
in
...
Hive:
...
Return |
...
Type* |
...
* |
...
Name(Signature)* |
...
* |
...
Description* |
...
* |
...
bigint | count(*), |
...
count(expr), |
...
count(DISTINCT |
...
expr |
...
...
...
) |
...
count(*) |
...
- |
...
Returns |
...
the |
...
total |
...
number |
...
of |
...
retrieved |
...
rows, |
...
including |
...
rows |
...
containing |
...
NULL |
...
values; |
...
count(expr) |
...
- |
...
Returns |
...
the |
...
number |
...
of |
...
rows |
...
for |
...
which |
...
the |
...
supplied |
...
expression |
...
is |
...
non-NULL; |
...
count(DISTINCT |
...
expr |
...
...
...
) |
...
- |
...
Returns |
...
the |
...
number |
...
of |
...
rows |
...
for |
...
which |
...
the |
...
supplied |
...
expression(s) |
...
are |
...
unique |
...
and |
...
non-NULL. |
...
double | sum(col), |
...
sum(DISTINCT |
...
col) |
...
Returns |
...
the |
...
sum |
...
of |
...
the |
...
elements |
...
in |
...
the |
...
group |
...
or |
...
the |
...
sum |
...
of |
...
the |
...
distinct |
...
values |
...
of |
...
the |
...
column |
...
in |
...
the |
...
group |
...
double | avg(col), |
...
avg(DISTINCT |
...
col) |
...
Returns |
...
the |
...
average |
...
of |
...
the |
...
elements |
...
in |
...
the |
...
group |
...
or |
...
the |
...
average |
...
of |
...
the |
...
distinct |
...
values |
...
of |
...
the |
...
column |
...
in |
...
the |
...
group |
...
double | min(col) |
...
Returns |
...
the |
...
minimum |
...
of |
...
the |
...
column |
...
in |
...
the |
...
group |
...
double | max(col) |
...
Returns |
...
the |
...
maximum |
...
value |
...
of |
...
the |
...
column |
...
in |
...
the |
...
group |
...
double | var_pop(col) |
...
Returns |
...
the |
...
variance |
...
of |
...
a |
...
numeric |
...
column |
...
in |
...
the |
...
group |
...
double | var_samp(col) |
...
Returns |
...
the |
...
unbiased |
...
sample |
...
variance |
...
of |
...
a |
...
numeric |
...
column |
...
in |
...
the |
...
group |
...
double | stddev_pop(col) |
...
Returns |
...
the |
...
standard |
...
deviation |
...
of |
...
a |
...
numeric |
...
column |
...
in |
...
the |
...
group |
...
double | stddev_samp(col) |
...
Returns |
...
the |
...
unbiased |
...
sample |
...
standard |
...
deviation |
...
of |
...
a |
...
numeric |
...
column |
...
in |
...
the |
...
group |
...
double | covar_pop(col1, |
...
col2) |
...
Returns |
...
the |
...
population |
...
covariance |
...
of |
...
a |
...
pair |
...
of |
...
numeric |
...
columns |
...
in |
...
the |
...
group |
...
double | covar_samp(col1, |
...
col2) |
...
Returns |
...
the |
...
sample |
...
covariance |
...
of |
...
a |
...
pair |
...
of |
...
a |
...
numeric |
...
columns |
...
in |
...
the |
...
group |
...
double | corr(col1, |
...
col2) |
...
Returns |
...
the |
...
Pearson |
...
coefficient |
...
of |
...
correlation |
...
of |
...
a |
...
pair |
...
of |
...
a |
...
numeric |
...
columns |
...
in |
...
the |
...
group |
...
double | percentile(col, |
...
p) |
...
Returns |
...
the |
...
exact |
...
p^th^ |
...
percentile |
...
of |
...
an |
...
integer |
...
column |
...
in |
...
the |
...
group |
...
(does |
...
not |
...
work |
...
with |
...
floating |
...
point |
...
types). |
...
p |
...
must |
...
be |
...
between |
...
0 |
...
and |
...
1. |
...
array<double> | percentile(col, |
...
array(p~1,, |
...
...
...
...)) |
...
Returns |
...
the |
...
exact |
...
percentiles |
...
p,,1,,, |
...
p,,2,,, |
...
... |
...
of |
...
an |
...
integer |
...
column |
...
in |
...
the |
...
group |
...
(does |
...
not |
...
work |
...
with |
...
floating |
...
point |
...
types). |
...
p,,i~ |
...
must |
...
be |
...
between |
...
0 |
...
and |
...
1. |
...
double | percentile_approx(col, |
...
p |
...
...
...
) |
...
Returns |
...
an |
...
approximate |
...
p^th^ |
...
percentile |
...
of |
...
a |
...
numeric |
...
column |
...
(including |
...
floating |
...
point |
...
types) |
...
in |
...
the |
...
group. |
...
The |
...
B |
...
parameter |
...
controls |
...
approximation |
...
accuracy |
...
at |
...
the |
...
cost |
...
of |
...
memory. |
...
Higher |
...
values |
...
yield |
...
better |
...
approximations, |
...
and |
...
the |
...
default |
...
is |
...
10,000. |
...
When |
...
the |
...
number |
...
of |
...
distinct |
...
values |
...
in |
...
col |
...
is |
...
smaller |
...
than |
...
B, |
...
this |
...
gives |
...
an |
...
exact |
...
percentile |
...
value. |
...
array<double> | percentile_approx(col, |
...
array(p~1,, |
...
...
...
...) |
...
...
...
) |
...
Same |
...
as |
...
above, |
...
but |
...
accepts |
...
and |
...
returns |
...
an |
...
array |
...
of |
...
percentile |
...
values |
...
instead |
...
of |
...
a |
...
single |
...
one. |
...
array<struct { |
...
} |
...
> |
...
histogram_numeric(col, |
...
b) |
...
Computes |
...
a |
...
histogram |
...
of |
...
a |
...
numeric |
...
column |
...
in |
...
the |
...
group |
...
using |
...
b |
...
non-uniformly |
...
spaced |
...
bins. |
...
The |
...
output |
...
is |
...
an |
...
array |
...
of |
...
size |
...
b |
...
of |
...
double-valued |
...
(x,y) |
...
coordinates |
...
that |
...
represent |
...
the |
...
bin |
...
centers |
...
and |
...
heights |
...
array | collect_set(col) |
...
Returns |
...
a |
...
set |
...
of |
...
objects |
...
with |
...
duplicate |
...
elements |
...
eliminated |
Built-in
...
Table-Generating
...
Functions
...
(UDTF)
...
<<Anchor(UDTF)>>
...
Normal
...
user-defined
...
functions,
...
such
...
as
...
concat(),
...
take
...
in
...
a
...
single
...
input
...
row
...
and
...
output
...
a
...
single
...
output
...
row.
...
In
...
contrast,
...
table-generating
...
functions
...
transform
...
a
...
single
...
input
...
row
...
to
...
multiple
...
output
...
rows.
...
explode
explode()
...
takes
...
in
...
an
...
array
...
as
...
an
...
input
...
and
...
outputs
...
the
...
elements
...
of
...
the
...
array
...
as
...
separate
...
rows.
...
UDTF's
...
can
...
be
...
used
...
in
...
the
...
SELECT
...
expression
...
list
...
and
...
as
...
a
...
part
...
of
...
LATERAL
...
VIEW.
...
An
...
example
...
use
...
of
...
explode()
...
in
...
the
...
SELECT
...
expression
...
list
...
is
...
as
...
follows:
...
Consider
...
a
...
table
...
named
...
myTable
...
that
...
has
...
a
...
single
...
column
...
(myCol)
...
and
...
two
...
rows:
...
Array<int> |
...
myCol |
...
...
...
Then running the query:
SELECT explode(myCol)
...
AS
...
myNewCol
...
FROM
...
myTable;
...
Will
...
produce:
...
int |
...
myNewCol |
...
1 |
2 |
3 |
4 |
5 |
6 |
Using the syntax "SELECT udtf(col)
...
AS
...
colAlias..."
...
has
...
a
...
few
...
limitations:
...
- No
...
- other
...
- expressions
...
- are
...
- allowed
...
- in
...
- SELECT
...
- SELECT
...
- pageid,
...
- explode(adid_list)
...
- AS
...
- myCol...
...
- is
...
- not
...
- supported
...
- UDTF's
...
- can't
...
- be
...
- nested
...
- SELECT
...
- explode(explode(adid_list))
...
- AS
...
- myCol...
...
- is
...
- not
...
- supported
...
- GROUP
...
- BY
...
- /
...
- CLUSTER
...
- BY
...
- /
...
- DISTRIBUTE
...
- BY
...
- /
...
- SORT
...
- BY
...
- is
...
- not supported
- SELECT explode(adid_list)
...
- AS
...
- myCol
...
- ...
...
- GROUP
...
- BY
...
- myCol
...
- is
...
- not
...
- supported
...
Please
...
see
...
Hive-LanguageManual-LateralView
...
for
...
an
...
alternative
...
syntax
...
that
...
does
...
not
...
have
...
these
...
limitations.
...
The
...
following
...
are
...
built-in
...
table-generating
...
functions
...
are
...
supported
...
in
...
Hive:
...
Return |
...
Type* |
...
* |
...
Name(Signature)* |
...
* |
...
Description* |
...
* |
...
myType | explode(array<myType> |
...
a) |
...
<<Anchor(explode)>> |
...
For |
...
each |
...
element |
...
in |
...
a, |
...
explode() |
...
generates |
...
a |
...
row |
...
containing |
...
that |
...
element |
json_tuple
...
A
...
new
...
json_tuple()
...
UDTF
...
is
...
introduced
...
in
...
hive
...
0.7.
...
It
...
takes
...
a
...
set
...
of
...
names
...
(keys)
...
and
...
return
...
a
...
tuple
...
of
...
values
...
in
...
one
...
function.
...
If
...
you
...
are
...
using
...
get_json_object()
...
and
...
want
...
to
...
replace
...
it
...
with
...
json_tuple,
...
the
...
only
...
changes
...
is
...
that
...
your
...
query
...
will
...
be
...
using
...
json_tuple()
...
in
...
lateral
...
view
...
rather
...
than
...
multiple
...
get_json_object()
...
in
...
the
...
select
...
clause.
...
For
...
example,
Code Block |
---|
} select a.timestamp, get_json_object(a.appevents, '$.eventid'), get_json_object(a.appenvets, '$.eventname') from log a; {code} |
should
...
be
...
changed
...
to
Code Block |
---|
} select a.timestamp, b.* from log a lateral view json_tuple(a.appevent, 'eventid', 'eventname') b as f1, f2; {code} h2. GROUPing and SORTing on |
GROUPing and SORTing on f(column)
...
A
...
typical
...
OLAP
...
pattern
...
is
...
that
...
you
...
have
...
a
...
timestamp
...
column
...
and
...
you
...
want
...
to
...
group
...
by
...
daily
...
or
...
other
...
less
...
granular
...
date
...
windows
...
than
...
by
...
second.
...
So
...
you
...
might
...
want
...
to
...
select
...
concat(year(dt),month(dt))
...
and
...
then
...
group
...
on
...
that
...
concat().
...
But
...
if
...
you
...
attempt
...
to
...
GROUP
...
BY
...
or
...
SORT
...
BY
...
a
...
column
...
on
...
which
...
you've
...
applied
...
a
...
function
...
and
...
alias,
...
like
...
this:
Code Block |
---|
} select f(col) as fc, count\(*\) from table_name group by fc {code} |
You
...
will
...
get
...
an
...
error:
Code Block |
---|
} FAILED: Error in semantic analysis: line 1:69 Invalid Table Alias or Column Reference fc {code} |
Because
...
you
...
are
...
not
...
able
...
to
...
GROUP
...
BY
...
or
...
SORT
...
BY
...
a
...
column
...
alias
...
on
...
which
...
a
...
function
...
has
...
been
...
applied.
...
There
...
are
...
two
...
workarounds.
...
First,
...
you
...
can
...
reformulate
...
this
...
query
...
with
...
subqueries,
...
which
...
is
...
somewhat
...
complicated:
Code Block |
---|
} select sq.fc,col1,col2,...,colN,count\(*\) from (select f(col) as fc,col1,col2,...,colN from table_name) sq group by sq.fc,col1,col2,...,colN {code} |
Or
...
you
...
can
...
make
...
sure
...
not
...
to
...
use
...
a
...
column
...
alias,
...
which
...
is
...
simpler:
Code Block |
---|
} select f(col) as fc, count\(*\) from table_name group by f(col) {code} |
Contact
...
Tim
...
Ellis
...
(tellis)
...
at
...
RiotGames
...
dot
...
com
...
if
...
you
...
would
...
like
...
to
...
discuss
...
this
...
in
...
further
...
detail.
...