Hive Operators and User-Defined Functions (UDFs)
> **Tip:** All Hive keywords are case-insensitive, including the names of Hive operators and functions.
In the CLI, use the commands below to show the latest documentation:
```sql
SHOW FUNCTIONS;
DESCRIBE FUNCTION <function_name>;
DESCRIBE FUNCTION EXTENDED <function_name>;
```
...
Return Type | Name(Signature) | Description |
---|---|---|
varies | java_method(class, method[, arg1[, arg2..]]) | Synonym for reflect(). (As of Hive 0.9.0.) |
varies | reflect(class, method[, arg1[, arg2..]]) | Use this UDF to call Java methods by matching the argument signature (uses reflection). (as of Hive 0.7.0) |
int | hash(a1[, a2...]) | Returns a hash value of the arguments (as of Hive 0.4) |
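For intuition, the behavior of reflect() can be sketched in Python, where getattr plays the role of Java reflection. This is an illustrative analogue only, not Hive code; the object and method names below are arbitrary examples:

```python
# Illustrative Python analogue of Hive's reflect()/java_method() UDFs:
# resolve a method by name at runtime and invoke it with the given arguments.
def reflect(obj, method_name, *args):
    method = getattr(obj, method_name)  # runtime lookup, like Java reflection
    return method(*args)

# Roughly what reflect("java.lang.String", "toUpperCase") does per row:
print(reflect("hello", "upper"))  # HELLO
```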
xpath
The following functions are described in LanguageManual XPathUDF:
...
Example: the src_json table is a single-column (json), single-row table:

```
+----+
                               json
+----+
{"store":
  {"fruit":[{"weight":8,"type":"apple"},{"weight":9,"type":"pear"}],
   "bicycle":{"price":19.95,"color":"red"}
  },
 "email":"amy@only_for_json_udf_test.net",
 "owner":"amy"
}
+----+
```
The fields of the JSON object can be extracted using these queries:

```sql
hive> SELECT get_json_object(src_json.json, '$.owner') FROM src_json;
amy
hive> SELECT get_json_object(src_json.json, '$.store.fruit[0]') FROM src_json;
{"weight":8,"type":"apple"}
hive> SELECT get_json_object(src_json.json, '$.non_exist_key') FROM src_json;
NULL
```
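The path traversal that get_json_object performs can be sketched in Python. This is a simplified illustration handling only dotted keys and a single [n] index (Hive's implementation also supports wildcards, and returns JSON strings rather than Python objects):

```python
import json

# Minimal sketch of get_json_object's '$.a.b[n]' path traversal.
def get_json_object(json_str, path):
    obj = json.loads(json_str)
    assert path.startswith('$.')
    for part in path[2:].split('.'):
        if '[' in part:                      # e.g. 'fruit[0]'
            key, idx = part[:-1].split('[')
            obj = obj.get(key) if isinstance(obj, dict) else None
            obj = obj[int(idx)] if obj is not None else None
        else:
            obj = obj.get(part) if isinstance(obj, dict) else None
        if obj is None:
            return None                      # Hive returns NULL for missing keys
    return obj

doc = '{"store":{"fruit":[{"weight":8,"type":"apple"}]},"owner":"amy"}'
print(get_json_object(doc, '$.owner'))           # amy
print(get_json_object(doc, '$.store.fruit[0]'))  # {'weight': 8, 'type': 'apple'}
print(get_json_object(doc, '$.non_exist_key'))   # None
```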
...
explode()
explode() takes an array (or a map) as input and outputs the elements of the array (or the key/value pairs of the map) as separate rows. UDTFs can be used in the SELECT expression list and as part of a LATERAL VIEW.
...
Then running the query:
```sql
SELECT explode(myCol) AS myNewCol FROM myTable;
```
...
myNewCol (int) |
---|
100 |
200 |
300 |
400 |
500 |
600 |
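The row-expansion semantics can be sketched in Python. The table contents below are assumed (two rows whose array column together holds the six values shown above):

```python
# Sketch of explode() semantics: one input row whose column holds an array
# becomes one output row per array element.
def explode(rows, col):
    for row in rows:
        for element in row[col]:
            yield element

# Hypothetical contents of myTable, consistent with the result above.
my_table = [{"myCol": [100, 200, 300]}, {"myCol": [400, 500, 600]}]
print(list(explode(my_table, "myCol")))  # [100, 200, 300, 400, 500, 600]
```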
The usage with Maps is similar:
```sql
SELECT explode(myMap) AS (myMapKey, myMapValue) FROM myMapTable;
```
posexplode
> **Info:** Available as of Hive 0.13.0. See HIVE-4943.
...
Then running the query:
```sql
SELECT posexplode(myCol) AS (pos, myNewCol) FROM myTable;
```
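posexplode is explode plus the element's position within the array, much like Python's enumerate. A sketch, with an assumed single-row table:

```python
# Sketch of posexplode() semantics: emit (position, element) pairs,
# analogous to Python's enumerate().
def posexplode(rows, col):
    for row in rows:
        for pos, element in enumerate(row[col]):
            yield pos, element

my_table = [{"myCol": [100, 200, 300]}]  # hypothetical contents
print(list(posexplode(my_table, "myCol")))  # [(0, 100), (1, 200), (2, 300)]
```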
...
A new json_tuple() UDTF was introduced in Hive 0.7. It takes a set of names (keys) and a JSON string, and returns a tuple of values in a single function call. This is much more efficient than calling get_json_object() to retrieve more than one key from a single JSON string: whenever a single JSON string would be parsed more than once, the query is more efficient if it is parsed only once, which is what json_tuple() is for. Because json_tuple() is a UDTF, you need to use the LATERAL VIEW syntax to achieve the same goal.
For example,
```sql
select a.timestamp, get_json_object(a.appevents, '$.eventid'), get_json_object(a.appevents, '$.eventname') from log a;
```
should be changed to
should be changed to

```sql
select a.timestamp, b.*
from log a lateral view json_tuple(a.appevents, 'eventid', 'eventname') b as f1, f2;
```
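The efficiency argument can be made concrete in Python: the JSON string is parsed once, and every requested key is read from the parsed object, rather than re-parsing the string per key. A sketch (Hive returns the values as strings; this analogue returns native types):

```python
import json

# Sketch of json_tuple(): a single parse serves all requested keys.
def json_tuple(json_str, *keys):
    parsed = json.loads(json_str)              # parse once
    return tuple(parsed.get(k) for k in keys)  # then just dictionary lookups

event = '{"eventid": 42, "eventname": "click"}'  # hypothetical appevents value
print(json_tuple(event, "eventid", "eventname"))  # (42, 'click')
```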
...
The parse_url_tuple() UDTF is similar to parse_url(), but can extract multiple parts of a given URL, returning the data in a tuple. Values for a particular key in QUERY can be extracted by appending a colon and the key to the partToExtract argument, e.g. parse_url_tuple('http://facebook.com/path1/p.php?k1=v1&k2=v2#Ref1', 'QUERY:k1', 'QUERY:k2') returns a tuple with values of 'v1','v2'. This is more efficient than calling parse_url() multiple times. All the input parameters and output column types are string.
```sql
SELECT b.*
FROM src LATERAL VIEW parse_url_tuple(fullurl, 'HOST', 'PATH', 'QUERY', 'QUERY:id') b as host, path, query, query_id LIMIT 1;
```
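The same parse-once pattern can be sketched with Python's standard urllib.parse: split the URL a single time, then answer each requested part ('HOST', 'PATH', 'QUERY', 'QUERY:&lt;key&gt;') from the result. An illustrative analogue, not Hive's implementation:

```python
from urllib.parse import urlparse, parse_qs

# Sketch of parse_url_tuple(): parse the URL once, serve all requested parts.
def parse_url_tuple(url, *parts):
    p = urlparse(url)
    query = parse_qs(p.query)
    out = []
    for part in parts:
        if part == 'HOST':
            out.append(p.hostname)
        elif part == 'PATH':
            out.append(p.path)
        elif part == 'QUERY':
            out.append(p.query)
        elif part.startswith('QUERY:'):        # e.g. 'QUERY:k1'
            values = query.get(part.split(':', 1)[1])
            out.append(values[0] if values else None)
    return tuple(out)

url = 'http://facebook.com/path1/p.php?k1=v1&k2=v2#Ref1'
print(parse_url_tuple(url, 'HOST', 'PATH', 'QUERY', 'QUERY:k1'))
# ('facebook.com', '/path1/p.php', 'k1=v1&k2=v2', 'v1')
```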
...
A typical OLAP pattern is that you have a timestamp column and you want to group by daily or other date windows less granular than a second. So you might want to select concat(year(dt), month(dt)) and then group on that concat(). But if you attempt to GROUP BY or SORT BY the alias of a column to which a function has been applied, like this:
```sql
select f(col) as fc, count(*) from table_name group by fc;
```
You will get an error:
```
FAILED: Error in semantic analysis: line 1:69 Invalid Table Alias or Column Reference fc
```
This is because you cannot GROUP BY or SORT BY a column alias on which a function has been applied. There are two workarounds. First, you can reformulate the query with a subquery, which is somewhat complicated:
```sql
select sq.fc, col1, col2, ..., colN, count(*) from
  (select f(col) as fc, col1, col2, ..., colN from table_name) sq
group by sq.fc, col1, col2, ..., colN;
```
Or you can make sure not to use a column alias, which is simpler:
Code Block |
---|
select f(col) as fc, count(*) from table_name group by f(col);
|
...
The context of a UDF's evaluate method is one row at a time. A simple invocation of a UDF like
```sql
SELECT length(string_col) FROM table_name;
```
...
If you would like to control which rows get sent to the same UDF (and possibly in what order), you can force the UDF to evaluate during the reduce phase. This is achievable with DISTRIBUTE BY, DISTRIBUTE BY + SORT BY, or CLUSTER BY. An example query would be:
```sql
SELECT reducer_udf(my_col, distribute_col, sort_col) FROM
  (SELECT my_col, distribute_col, sort_col FROM table_name DISTRIBUTE BY distribute_col SORT BY distribute_col, sort_col) t;
```
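What DISTRIBUTE BY + SORT BY guarantees to the reduce-side UDF can be sketched in Python with sorting and itertools.groupby: all rows sharing distribute_col reach the same group, ordered by sort_col, so the UDF can accumulate state per group. The table contents and the summing UDF below are hypothetical:

```python
from itertools import groupby

# Hypothetical rows: (distribute_col, sort_col, my_col)
rows = [("a", 2, 10), ("b", 1, 20), ("a", 1, 30)]

# Emulate DISTRIBUTE BY distribute_col SORT BY distribute_col, sort_col:
# rows are co-located by key and ordered within each key.
ordered = sorted(rows, key=lambda r: (r[0], r[1]))

def reducer_udf(group):
    # Hypothetical reduce-side UDF: aggregate my_col over one group's rows.
    return sum(r[2] for r in group)

result = {k: reducer_udf(g) for k, g in groupby(ordered, key=lambda r: r[0])}
print(result)  # {'a': 40, 'b': 20}
```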
...