THIS IS A TEST INSTANCE. ALL YOUR CHANGES WILL BE LOST!!!!
Table of Contents |
---|
Transform/Map-Reduce Syntax
...
Please also see Sort By / Cluster By / Distribute By and Larry Ogrodnek's blog post.
Code Block |
---|
clusterBy: CLUSTER BY colName (',' colName)*
distributeBy: DISTRIBUTE BY colName (',' colName)*
sortBy: SORT BY colName (ASC | DESC)? (',' colName (ASC | DESC)?)*
rowFormat
: ROW FORMAT
(DELIMITED [FIELDS TERMINATED BY char]
[COLLECTION ITEMS TERMINATED BY char]
[MAP KEYS TERMINATED BY char]
[ESCAPED BY char]
[LINES SEPARATED BY char]
|
SERDE serde_name [WITH SERDEPROPERTIES
property_name=property_value,
property_name=property_value, ...])
outRowFormat : rowFormat
inRowFormat : rowFormat
outRecordReader : RECORDREADER className
query:
FROM (
FROM src
MAP expression (',' expression)*
(inRowFormat)?
USING 'my_map_script'
( AS colName (',' colName)* )?
(outRowFormat)? (outRecordReader)?
( clusterBy? | distributeBy? sortBy? ) src_alias
)
REDUCE expression (',' expression)*
(inRowFormat)?
USING 'my_reduce_script'
( AS colName (',' colName)* )?
(outRowFormat)? (outRecordReader)?
FROM (
FROM src
SELECT TRANSFORM '(' expression (',' expression)* ')'
(inRowFormat)?
USING 'my_map_script'
( AS colName (',' colName)* )?
(outRowFormat)? (outRecordReader)?
( clusterBy? | distributeBy? sortBy? ) src_alias
)
SELECT TRANSFORM '(' expression (',' expression)* ')'
(inRowFormat)?
USING 'my_reduce_script'
( AS colName (',' colName)* )?
(outRowFormat)? (outRecordReader)?
|
SQL Standard Based Authorization Disallows TRANSFORM
The TRANSFORM clause is disallowed when SQL standard based authorization is configured in Hive 0.13.0 and later releases (HIVE-6415).
TRANSFORM Examples
Example #1:
Code Block |
---|
FROM (
FROM pv_users
MAP pv_users.userid, pv_users.date
USING 'map_script'
AS dt, uid
CLUSTER BY dt) map_output
INSERT OVERWRITE TABLE pv_users_reduced
REDUCE map_output.dt, map_output.uid
USING 'reduce_script'
AS date, count;
FROM (
FROM pv_users
SELECT TRANSFORM(pv_users.userid, pv_users.date)
USING 'map_script'
AS dt, uid
CLUSTER BY dt) map_output
INSERT OVERWRITE TABLE pv_users_reduced
SELECT TRANSFORM(map_output.dt, map_output.uid)
USING 'reduce_script'
AS date, count;
|
Example #2
Code Block |
---|
FROM (
FROM src
SELECT TRANSFORM(src.key, src.value) ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.TypedBytesSerDe'
USING '/bin/cat'
AS (tkey, tvalue) ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.TypedBytesSerDe'
RECORDREADER 'org.apache.hadoop.hive.ql.exec.TypedBytesRecordReader'
) tmap
INSERT OVERWRITE TABLE dest1 SELECT tkey, tvalue
|
...
Note that we can directly do CLUSTER BY key without specifying the output schema of the scripts.
Code Block |
---|
FROM (
FROM pv_users
MAP pv_users.userid, pv_users.date
USING 'map_script'
CLUSTER BY key) map_output
INSERT OVERWRITE TABLE pv_users_reduced
REDUCE map_output.key, map_output.value
USING 'reduce_script'
AS date, count;
|
...
The output fields from a script are typed as strings by default; for example in
Code Block |
---|
SELECT TRANSFORM(stuff)
USING 'script'
AS thing1, thing2
|
They can be immediately casted with the syntax:
Code Block |
---|
SELECT TRANSFORM(stuff)
USING 'script'
AS (thing1 INT, thing2 INT)
|