Hive Operators and User-Defined Functions (UDFs)
> **Tip:** All Hive keywords are case-insensitive, including the names of Hive operators and functions.
In the CLI, use the commands below to show the latest documentation:
```sql
SHOW FUNCTIONS;
DESCRIBE FUNCTION <function_name>;
DESCRIBE FUNCTION EXTENDED <function_name>;
```
...
Return Type | Name(Signature) | Description |
---|---|---|
varies | java_method(class, method[, arg1[, arg2..]]) | Synonym for reflect(). (As of Hive 0.9.0.) |
varies | reflect(class, method[, arg1[, arg2..]]) | Use this UDF to call Java methods by matching the argument signature (uses reflection). (as of Hive 0.7.0) |
int | hash(a1[, a2...]) | Returns a hash value of the arguments (as of Hive 0.4) |
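For intuition, the behavior of reflect() can be sketched in Python, where getattr plays the role of Java reflection. This is an illustrative analogue only, not Hive code; the object and method names below are arbitrary examples:

```python
# Illustrative Python analogue of Hive's reflect()/java_method() UDFs:
# resolve a method by name at runtime and invoke it with the given arguments.
def reflect(obj, method_name, *args):
    method = getattr(obj, method_name)  # runtime lookup, like Java reflection
    return method(*args)

# Roughly what reflect("java.lang.String", "toUpperCase") does per row:
print(reflect("hello", "upper"))  # HELLO
```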
xpath
The following functions are described in LanguageManual XPathUDF:
...
Example: the src_json table is a single-column (json), single-row table:

```
+----+
                               json
+----+
{"store":
  {"fruit":[{"weight":8,"type":"apple"},{"weight":9,"type":"pear"}],
   "bicycle":{"price":19.95,"color":"red"}
  },
 "email":"amy@only_for_json_udf_test.net",
 "owner":"amy"
}
+----+
```
The fields of the JSON object can be extracted using these queries:

```sql
hive> SELECT get_json_object(src_json.json, '$.owner') FROM src_json;
amy
hive> SELECT get_json_object(src_json.json, '$.store.fruit[0]') FROM src_json;
{"weight":8,"type":"apple"}
hive> SELECT get_json_object(src_json.json, '$.non_exist_key') FROM src_json;
NULL
```
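The path traversal that get_json_object performs can be sketched in Python. This is a simplified illustration handling only dotted keys and a single [n] index (Hive's implementation also supports wildcards, and returns JSON strings rather than Python objects):

```python
import json

# Minimal sketch of get_json_object's '$.a.b[n]' path traversal.
def get_json_object(json_str, path):
    obj = json.loads(json_str)
    assert path.startswith('$.')
    for part in path[2:].split('.'):
        if '[' in part:                      # e.g. 'fruit[0]'
            key, idx = part[:-1].split('[')
            obj = obj.get(key) if isinstance(obj, dict) else None
            obj = obj[int(idx)] if obj is not None else None
        else:
            obj = obj.get(part) if isinstance(obj, dict) else None
        if obj is None:
            return None                      # Hive returns NULL for missing keys
    return obj

doc = '{"store":{"fruit":[{"weight":8,"type":"apple"}]},"owner":"amy"}'
print(get_json_object(doc, '$.owner'))           # amy
print(get_json_object(doc, '$.store.fruit[0]'))  # {'weight': 8, 'type': 'apple'}
print(get_json_object(doc, '$.non_exist_key'))   # None
```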
...
explode()
explode() takes an array (or a map) as input and outputs the elements of the array (or the key/value pairs of the map) as separate rows. UDTFs can be used in the SELECT expression list and as part of a LATERAL VIEW.
...
Then running the query:
```sql
SELECT explode(myCol) AS myNewCol FROM myTable;
```
...
myNewCol (int) |
---|
100 |
200 |
300 |
400 |
500 |
600 |
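The row-expansion semantics can be sketched in Python. The table contents below are assumed (two rows whose array column together holds the six values shown above):

```python
# Sketch of explode() semantics: one input row whose column holds an array
# becomes one output row per array element.
def explode(rows, col):
    for row in rows:
        for element in row[col]:
            yield element

# Hypothetical contents of myTable, consistent with the result above.
my_table = [{"myCol": [100, 200, 300]}, {"myCol": [400, 500, 600]}]
print(list(explode(my_table, "myCol")))  # [100, 200, 300, 400, 500, 600]
```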
The usage with Maps is similar:
```sql
SELECT explode(myMap) AS (myMapKey, myMapValue) FROM myMapTable;
```
posexplode
> **Info:** Available as of Hive 0.13.0. See HIVE-4943.
...
Then running the query:
```sql
SELECT posexplode(myCol) AS (pos, myNewCol) FROM myTable;
```
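posexplode is explode plus the element's position within the array, much like Python's enumerate. A sketch, with an assumed single-row table:

```python
# Sketch of posexplode() semantics: emit (position, element) pairs,
# analogous to Python's enumerate().
def posexplode(rows, col):
    for row in rows:
        for pos, element in enumerate(row[col]):
            yield pos, element

my_table = [{"myCol": [100, 200, 300]}]  # hypothetical contents
print(list(posexplode(my_table, "myCol")))  # [(0, 100), (1, 200), (2, 300)]
```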
...
A new json_tuple() UDTF was introduced in Hive 0.7. It takes a set of names (keys) and a JSON string, and returns a tuple of values in a single function call. This is much more efficient than calling get_json_object() to retrieve more than one key from a single JSON string: whenever a single JSON string would be parsed more than once, the query is more efficient if it is parsed only once, which is what json_tuple() is for. Because json_tuple() is a UDTF, you need to use the LATERAL VIEW syntax to achieve the same goal.
For example,
```sql
select a.timestamp, get_json_object(a.appevents, '$.eventid'), get_json_object(a.appevents, '$.eventname') from log a;
```
should be changed to
should be changed to

```sql
select a.timestamp, b.*
from log a lateral view json_tuple(a.appevents, 'eventid', 'eventname') b as f1, f2;
```
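The efficiency argument can be made concrete in Python: the JSON string is parsed once, and every requested key is read from the parsed object, rather than re-parsing the string per key. A sketch (Hive returns the values as strings; this analogue returns native types):

```python
import json

# Sketch of json_tuple(): a single parse serves all requested keys.
def json_tuple(json_str, *keys):
    parsed = json.loads(json_str)              # parse once
    return tuple(parsed.get(k) for k in keys)  # then just dictionary lookups

event = '{"eventid": 42, "eventname": "click"}'  # hypothetical appevents value
print(json_tuple(event, "eventid", "eventname"))  # (42, 'click')
```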
...
The parse_url_tuple() UDTF is similar to parse_url(), but can extract multiple parts of a given URL, returning the data in a tuple. Values for a particular key in QUERY can be extracted by appending a colon and the key to the partToExtract argument, e.g. parse_url_tuple('http://facebook.com/path1/p.php?k1=v1&k2=v2#Ref1', 'QUERY:k1', 'QUERY:k2') returns a tuple with values of 'v1','v2'. This is more efficient than calling parse_url() multiple times. All the input parameters and output column types are string.
```sql
SELECT b.*
FROM src LATERAL VIEW parse_url_tuple(fullurl, 'HOST', 'PATH', 'QUERY', 'QUERY:id') b as host, path, query, query_id LIMIT 1;
```
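The same parse-once pattern can be sketched with Python's standard urllib.parse: split the URL a single time, then answer each requested part ('HOST', 'PATH', 'QUERY', 'QUERY:&lt;key&gt;') from the result. An illustrative analogue, not Hive's implementation:

```python
from urllib.parse import urlparse, parse_qs

# Sketch of parse_url_tuple(): parse the URL once, serve all requested parts.
def parse_url_tuple(url, *parts):
    p = urlparse(url)
    query = parse_qs(p.query)
    out = []
    for part in parts:
        if part == 'HOST':
            out.append(p.hostname)
        elif part == 'PATH':
            out.append(p.path)
        elif part == 'QUERY':
            out.append(p.query)
        elif part.startswith('QUERY:'):        # e.g. 'QUERY:k1'
            values = query.get(part.split(':', 1)[1])
            out.append(values[0] if values else None)
    return tuple(out)

url = 'http://facebook.com/path1/p.php?k1=v1&k2=v2#Ref1'
print(parse_url_tuple(url, 'HOST', 'PATH', 'QUERY', 'QUERY:k1'))
# ('facebook.com', '/path1/p.php', 'k1=v1&k2=v2', 'v1')
```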
...
A typical OLAP pattern is that you have a timestamp column and you want to group by daily or other date windows less granular than a second. So you might want to select concat(year(dt), month(dt)) and then group on that concat(). But if you attempt to GROUP BY or SORT BY the alias of a column to which a function has been applied, like this:
```sql
select f(col) as fc, count(*) from table_name group by fc;
```
You will get an error:
```
FAILED: Error in semantic analysis: line 1:69 Invalid Table Alias or Column Reference fc
```
This is because you cannot GROUP BY or SORT BY a column alias on which a function has been applied. There are two workarounds. First, you can reformulate the query with a subquery, which is somewhat complicated:
```sql
select sq.fc, col1, col2, ..., colN, count(*) from
  (select f(col) as fc, col1, col2, ..., colN from table_name) sq
group by sq.fc, col1, col2, ..., colN;
```
Or you can make sure not to use a column alias, which is simpler:
Code Block |
---|
select f(col) as fc, count(*) from table_name group by f(col);
|
...
The context of a UDF's evaluate method is one row at a time. A simple invocation of a UDF like
```sql
SELECT length(string_col) FROM table_name;
```
...
If you would like to control which rows get sent to the same UDF (and possibly in what order), you can force the UDF to evaluate during the reduce phase. This is achievable with DISTRIBUTE BY, DISTRIBUTE BY + SORT BY, or CLUSTER BY. An example query would be:
```sql
SELECT reducer_udf(my_col, distribute_col, sort_col) FROM
  (SELECT my_col, distribute_col, sort_col FROM table_name DISTRIBUTE BY distribute_col SORT BY distribute_col, sort_col) t;
```
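What DISTRIBUTE BY + SORT BY guarantees to the reduce-side UDF can be sketched in Python with sorting and itertools.groupby: all rows sharing distribute_col reach the same group, ordered by sort_col, so the UDF can accumulate state per group. The table contents and the summing UDF below are hypothetical:

```python
from itertools import groupby

# Hypothetical rows: (distribute_col, sort_col, my_col)
rows = [("a", 2, 10), ("b", 1, 20), ("a", 1, 30)]

# Emulate DISTRIBUTE BY distribute_col SORT BY distribute_col, sort_col:
# rows are co-located by key and ordered within each key.
ordered = sorted(rows, key=lambda r: (r[0], r[1]))

def reducer_udf(group):
    # Hypothetical reduce-side UDF: aggregate my_col over one group's rows.
    return sum(r[2] for r in group)

result = {k: reducer_udf(g) for k, g in groupby(ordered, key=lambda r: r[0])}
print(result)  # {'a': 40, 'b': 20}
```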
...