Versions Compared

Key

  • This line was added.
  • This line was removed.
  • Formatting was changed.
Comment: Removed incorrect mention of 2.0 - maybe it meant 0.2? Anyway, transform has been around in hive from before hive was an apache project (from when it was a sub-project of hadoop)

Table of Contents

Transform/Map-Reduce Syntax

Users can also plug in their own custom mappers and reducers in the data stream by using features natively supported in the Hive 2.0 language. e.g. in order to run a custom mapper script - map_script - and a custom reducer script - reduce_script - the user can issue the following command which uses the TRANSFORM clause to embed the mapper and the reducer scripts.

...

Please also see Sort By / Cluster By / Distribute By and Larry Ogrodnek's blog post.

Code Block

clusterBy: CLUSTER BY colName (',' colName)*
distributeBy: DISTRIBUTE BY colName (',' colName)*
sortBy: SORT BY colName (ASC | DESC)? (',' colName (ASC | DESC)?)*

rowFormat
  : ROW FORMAT
    (DELIMITED [FIELDS TERMINATED BY char] 
               [COLLECTION ITEMS TERMINATED BY char]
               [MAP KEYS TERMINATED BY char]
               [ESCAPED BY char]
               [LINES SEPARATED BY char]
     | 
     SERDE serde_name [WITH SERDEPROPERTIES 
                            property_name=property_value, 
                            property_name=property_value, ...])

outRowFormat : rowFormat
inRowFormat : rowFormat
outRecordReader : RECORDREADER className

query:
  FROM (
    FROM src
    MAP expression (',' expression)*
    (inRowFormat)?
    USING 'my_map_script'
    ( AS colName (',' colName)* )?
    (outRowFormat)? (outRecordReader)?
    ( clusterBy? | distributeBy? sortBy? ) src_alias
  )
  REDUCE expression (',' expression)*
    (inRowFormat)?
    USING 'my_reduce_script'
    ( AS colName (',' colName)* )?
    (outRowFormat)? (outRecordReader)?

  FROM (
    FROM src
    SELECT TRANSFORM '(' expression (',' expression)* ')'
    (inRowFormat)?
    USING 'my_map_script'
    ( AS colName (',' colName)* )?
    (outRowFormat)? (outRecordReader)?
    ( clusterBy? | distributeBy? sortBy? ) src_alias
  )
  SELECT TRANSFORM '(' expression (',' expression)* ')'
    (inRowFormat)? 
    USING 'my_reduce_script'
    ( AS colName (',' colName)* )?
    (outRowFormat)? (outRecordReader)?

SQL Standard Based Authorization Disallows TRANSFORM

The TRANSFORM clause is disallowed when SQL standard based authorization is configured in Hive 0.13.0 and later releases (HIVE-6415).

TRANSFORM Examples

Example #1:

Code Block

  FROM (
    FROM pv_users
    MAP pv_users.userid, pv_users.date
    USING 'map_script'
    AS dt, uid
    CLUSTER BY dt) map_output
  INSERT OVERWRITE TABLE pv_users_reduced
    REDUCE map_output.dt, map_output.uid
    USING 'reduce_script'
    AS date, count;
  FROM (
    FROM pv_users
    SELECT TRANSFORM(pv_users.userid, pv_users.date)
    USING 'map_script'
    AS dt, uid
    CLUSTER BY dt) map_output
  INSERT OVERWRITE TABLE pv_users_reduced
    SELECT TRANSFORM(map_output.dt, map_output.uid)
    USING 'reduce_script'
    AS date, count;

Example #2

Code Block

  FROM (
    FROM src
    SELECT TRANSFORM(src.key, src.value) ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.TypedBytesSerDe'
    USING '/bin/cat'
    AS (tkey, tvalue) ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.TypedBytesSerDe'
    RECORDREADER 'org.apache.hadoop.hive.qlcontrib.util.exectypedbytes.TypedBytesRecordReader'
  ) tmap
  INSERT OVERWRITE TABLE dest1 SELECT tkey, tvalue

...

Note that we can directly do CLUSTER BY key without specifying the output schema of the scripts.

Code Block

  FROM (
    FROM pv_users
    MAP pv_users.userid, pv_users.date
    USING 'map_script'
    CLUSTER BY key) map_output
  INSERT OVERWRITE TABLE pv_users_reduced
    REDUCE map_output.key, map_output.value
    USING 'reduce_script'
    AS date, count;

...

The output fields from a script are typed as strings by default; for example in

Code Block

  SELECT TRANSFORM(stuff)
  USING 'script'
  AS thing1, thing2

They can be immediately casted with the syntax:

Code Block

  SELECT TRANSFORM(stuff)
  USING 'script'
  AS (thing1 INT, thing2 INT)