Versions Compared

Key

  • This line was added.
  • This line was removed.
  • Formatting was changed.

...

  • Infers the schema of the Hive table from the Avro schema. Since Hive 0.14, Avro schema can be inferred from Hive table schema.
  • Reads all Avro files within a table against a specified schema, taking advantage of Avro's backwards compatibility abilities
  • Supports arbitrarily nested schemas.
  • Translates all Avro data types into equivalent Hive types. Most types map exactly, but some Avro types don't exist in Hive and are automatically converted by the AvroSerde.
  • Understands compressed Avro files.
  • Transparently converts the Avro idiom of handling nullable types as Union[T, null] into just T and returns null when appropriate.
  • Writes any Hive table to Avro files.
  • Has worked reliably against our most convoluted Avro schemas in our ETL process.

...

Creating Avro-backed Hive tables

Avro-backed table can be created in Hive using AvroSerDe.

All Hive versions

To create a the Avro-backed table, specify the serde as org.apache.hadoop.hive.serde2.avro.AvroSerDe, specify the inputformat as org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat, and the outputformat as org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat. Also provide a location from which the AvroSerde will pull the most current schema for the table. For example:

Code Block
CREATE TABLE kst
  PARTITIONED BY (ds string)
  ROW FORMAT SERDE
  'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
  STORED AS INPUTFORMAT
  'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'
  OUTPUTFORMAT
  'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat'
  TBLPROPERTIES (
    'avro.schema.url'='http://schema_provider/kst.avsc');

In this example we're pulling the source-of-truth reader schema from a webserver. Other options for providing the schema are described below.
Add the Avro files to the database (or create an external table) using standard Hive operations (http://wiki.apache.org/hadoop/Hive/LanguageManual/DML).
This table might result in a description as below:

Code Block
hive> describe kst;
OK
string1 string  from deserializer
string2 string  from deserializer
int1    int     from deserializer
boolean1        boolean from deserializer
long1   bigint  from deserializer
float1  float   from deserializer
double1 double  from deserializer
inner_record1   

 
For Hive 0.14 and later:
CREATE TABLE kst (
  string1 string,
  string2 string,
  int1 int,
  boolean1 boolean,
  long1 bigint,
  float1 float,
  double1 double,
  inner_record1 struct<int_in_inner_record1:int,string_in_inner_record1:string>,
 from deserializer
enum1 string,
  string  from deserializer
array1  array<string>,
   from deserializer
map1    map<string,string>      from deserializer
union1 ,
 union1 uniontype<float,boolean,string>,
 fromfixed1 deserializerbinary,
fixed1 null1 binaryvoid,
 unionnullint from deserializerint,
null1 bytes1 binary)
 void PARTITIONED BY  from deserializer
unionnullint(ds string)
  STORED  int     from deserializer
bytes1  binary  from deserializer

At this point, the Avro-backed table can be worked with in Hive like any other table.

Writing tables to Avro files

The AvroSerde can serialize any Hive table to Avro files. This makes it effectively an any-Hive-type to Avro converter. In order to write a table to an Avro file, you must first create an appropriate Avro schema. Create as select type statements are not currently supported. Types translate as detailed in the table above. For types that do not translate directly, there are a few items to keep in mind:

  • Types that may be null must be defined as a union of that type and Null within Avro. A null in a field that is not so defined with result in an exception during the save. No changes need be made to the Hive schema to support this, as all fields in Hive can be null.
  • Avro Bytes type should be defined in Hive as lists of tiny ints. the AvroSerde will convert these to Bytes during the saving process.
  • Avro Fixed type should be defined in Hive as lists of tiny ints. the AvroSerde will convert these to Fixed during the saving process.
  • Avro Enum type should be defined in Hive as strings, since Hive doesn't have a concept of enums. Ensure that only valid enum values are present in the table - trying to save a non-defined enum will result in an exception.
Example

Consider the following Hive table, which coincidentally covers all types of Hive data types, making it a good example:

AS AVRO;

In this example we're pulling the source-of-truth reader schema from a webserver. Other options for providing the schema are described below.
Add the Avro files to the database (or create an external table) using standard Hive operations (http://wiki.apache.org/hadoop/Hive/LanguageManual/DML).
This table might result in a description as below:

Code Block
hive> describe kst;
OK
string1 string  from deserializer
string2 string  from deserializer
int1    int     from deserializer
boolean1        boolean from deserializer
long1   bigint  from deserializer
float1  float   from deserializer
double1 double  from deserializer
inner_record1   struct<int_in_inner_record1:int,string_in_inner_record1:string> from deserializer
enum1   string  from deserializer
array1  array<string>   from deserializer
map1    map<string,string>      from deserializer
union1  uniontype<float,boolean,string> from deserializer
fixed1  binary  from deserializer
null1   void    from deserializer
unionnullint    int     from deserializer
bytes1  binary  from deserializer

At this point, the Avro-backed table can be worked with in Hive like any other table.

Hive 0.14 and later versions

Since Hive 0.14, Avro-backed tables can simply be created by using "STORED AS AVRO" in DDL statement. AvroSerDe takes care of creating appropriate Avro schema from Hive table schema, a big win in terms of Avro usability in Hive.

For example:

Code Block
CREATE TABLE kst (
  string1 string,
  string2 string,
  int1 int,
  boolean1 boolean,
  long1 bigint,
  float1 float,
  double1 double,
  inner_record1 struct<int_in_inner_record1:int,string_in_inner_record1:string>,
  enum1 string,
  array1 array<string>,
  map1 map<string,string>,
 union1 uniontype<float,boolean,string>,
 fixed1 binary,
 null1 void,
 unionnullint int,
 bytes1 binary)
  PARTITIONED BY (ds string)
  STORED AS AVRO;

This table might result in a description as below:

Code Block
hive> describe kst;
OK
string1 string  from deserializer
string2 string  from deserializer
int1    int     from deserializer
boolean1        boolean from deserializer
long1   bigint  from deserializer
float1  float   from deserializer
double1 double  from deserializer
inner_record1   struct<int_in_inner_record1:int,string_in_inner_record1:string> from deserializer
enum1   string  from deserializer
array1  array<string>   from deserializer
map1    map<string,string>      from deserializer
union1  uniontype<float,boolean,string> from deserializer
fixed1  binary  from deserializer
null1   void    from deserializer
unionnullint    int     from deserializer
bytes1  binary  from deserializer

Writing tables to Avro files

The AvroSerde can serialize any Hive table to Avro files. This makes it effectively an any-Hive-type to Avro converter. In order to write a table to an Avro file, you must first create an appropriate Avro schema. Create as select type statements are not currently supported. Types translate as detailed in the table above. For types that do not translate directly, there are a few items to keep in mind:

  • Types that may be null must be defined as a union of that type and Null within Avro. A null in a field that is not so defined with result in an exception during the save. No changes need be made to the Hive schema to support this, as all fields in Hive can be null.
  • Avro Bytes type should be defined in Hive as lists of tiny ints. the AvroSerde will convert these to Bytes during the saving process.
  • Avro Fixed type should be defined in Hive as lists of tiny ints. the AvroSerde will convert these to Fixed during the saving process.
  • Avro Enum type should be defined in Hive as strings, since Hive doesn't have a concept of enums. Ensure that only valid enum values are present in the table - trying to save a non-defined enum will result in an exception.
Example

Consider the following Hive table, which coincidentally covers all types of Hive data types, making it a good example:

Code Block
CREATE TABLE test_serializer(string1 STRING,
  
Code Block
CREATE TABLE test_serializer(string1 STRING,
                             int1 INT,
                             tinyint1 TINYINT,
                             smallint1 SMALLINT,
                             bigint1 BIGINT,
                             boolean1 BOOLEAN,
                             float1 FLOAT,
                             double1 DOUBLE,
                             list1 ARRAY<STRING>,
                             map1 MAP<STRING,INT>,
                             struct1 STRUCT<sint:INT,sboolean:BOOLEAN,sstring:STRING>,
                             union1 uniontype<FLOAT, BOOLEAN, STRING>,
                             enum1 STRING,
                             nullableint INT,
                             bytes1 BINARY,
                             fixed1int1 BINARY)INT,
 ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' COLLECTION ITEMS TERMINATED BY ':' MAP KEYS TERMINATED BY '#' LINES TERMINATED BY '\n'
 STORED AS TEXTFILE;
 

To save this table as an Avro file, create an equivalent Avro schema (the namespace and actual name of the record are not important):

Code Block
{
  "namespace": "com.linkedin.haivvreo",
  "name": "test_serializer",
  "type": "record",
  "fields": [
    { "name":"string1", "type":"string" },
    { "name":"int1", "type":"int" },
    { "name":"tinyint1", "type":"int" },
    { "name":"smallint1", "type":"int" },
    { "name":"bigint1", "type":"long" },
    { "name":"boolean1", "type":"boolean" },
    { "name":"float1", "type":"float" },
    { "name":"double1", "type":"double" },
    { "name":"list1", "type":{"type":"array", "items":"string"} },
    { "name":"map1", "type":{"type":"map", "values":"int"} },
    { "name":"struct1", "type":{"type":"record", "name":"struct1_name", "fields": [
                             tinyint1 TINYINT,
                             smallint1 SMALLINT,
                             bigint1 BIGINT,
                             boolean1 BOOLEAN,
                             float1 FLOAT,
  { "name":"sInt", "type":"int" }, { "name":"sBoolean", "type":"boolean" }, { "name":"sString", "type":"string" } ] } },
                     { "name":"union1", "type":["float", "boolean", "string"] }     double1 DOUBLE,
    { "name":"enum1", "type":{"type":"enum", "name":"enum1_values", "symbols":["BLUE","RED", "GREEN"]} },
               { "name":"nullableint", "type":["int", "null"] },
    { "name":"bytes1", "type":"bytes" } list1 ARRAY<STRING>,
    { "name":"fixed1", "type":{"type":"fixed", "name":"threebytes", "size":3} }
  ] }

If the table were backed by a csv such as:

why hello there

42

3

100

1412341

true

42.43

85.23423424

alpha:beta:gamma

Earth#42:Control#86:Bob#31

17:true:Abe Linkedin

0:3.141459

BLUE

72

^A^B^C

^A^B^C

another record

98

4

101

9999999

false

99.89

0.00000009

beta

Earth#101

1134:false:wazzup

1:true

RED

NULL

^D^E^F^G

^D^E^F

third record

45

5

102

999999999

true

89.99

0.00000000000009

alpha:gamma

Earth#237:Bob#723

102:false:BNL

2:Time to go home

GREEN

NULL

^H

^G^H^I

one can write it out to Avro with:

                         map1 MAP<STRING,INT>,
                             struct1 STRUCT<sint:INT,sboolean:BOOLEAN,sstring:STRING>,
                             union1 uniontype<FLOAT, BOOLEAN, STRING>,
                             enum1 STRING,
                             nullableint INT,
                             bytes1 BINARY,
                             fixed1 BINARY)
 ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' COLLECTION ITEMS TERMINATED BY ':' MAP KEYS TERMINATED BY '#' LINES TERMINATED BY '\n'
 STORED AS TEXTFILE;
 

If the table were backed by a csv such as:

why hello there

42

3

100

1412341

true

42.43

85.23423424

alpha:beta:gamma

Earth#42:Control#86:Bob#31

17:true:Abe Linkedin

0:3.141459

BLUE

72

^A^B^C

^A^B^C

another record

98

4

101

9999999

false

99.89

0.00000009

beta

Earth#101

1134:false:wazzup

1:true

RED

NULL

^D^E^F^G

^D^E^F

third record

45

5

102

999999999

true

89.99

0.00000000000009

alpha:gamma

Earth#237:Bob#723

102:false:BNL

2:Time to go home

GREEN

NULL

^H

^G^H^I

All Hive versions

To save this table as an Avro file, create an equivalent Avro schema (the namespace and actual name of the record are not important):

Code Block
{
  "namespace": "com.linkedin.haivvreo",
  "name": "test_serializer",
  "type": "record",
  "fields": [
    { "name":"string1", "type":"string" },
    { "name":"int1", "type":"int" },
    { "name":"tinyint1", "type":"int" },
    { "name":"smallint1", "type":"int" },
    { "name":"bigint1", "type":"long" },
    { "name":"boolean1", "type":"boolean" },
    { "name":"float1", "type":"float" },
    { "name":"double1", "type":"double" },
    { "name":"list1", "type":{"type":"array", "items":"string"} },
    { "name":"map1", "type":{"type":"map", "values":"int"} },
    { "name":"struct1", "type":{"type":"record", "name":"struct1_name", "fields": [
          { "name":"sInt", "type":"int" }, { "name":"sBoolean", "type":"boolean" }, { "name":"sString", "type":"string" } ] } },
    { "name":"union1", "type":["float", "boolean", "string"] },
    { "name":"enum1", "type":{"type":"enum", "name":"enum1_values", "symbols":["BLUE","RED", "GREEN"]} },
    { "name":"nullableint", "type":["int", "null"] },
    { "name":"bytes1", "type":"bytes" },
    { "name":"fixed1", "type":{"type":"fixed", "name":"threebytes", "size":3} }
  ] }

one can write it out to Avro with:

Code Block
CREATE TABLE as_avro
  ROW FORMAT SERDE
  'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
  STORED as INPUTFORMAT
  'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'
  OUTPUTFORMAT
  'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat'
  TBLPROPERTIES (
    'avro.schema.url'='file:///path/to/the/schema/test_serializer.avsc');
insert overwrite table as_avro select * from test_serializer;

Hive 0.14 and later

In Hive versions 0.14 and later, one does not need to create Avro schema manually. Above procedure to save a table as an Avro file reduces to just a DDL statement followed by an insert into the table.

Code Block
CREATE TABLE as_avro(string1 STRING,
                     int1 INT,
                     tinyint1 TINYINT,
                     smallint1 SMALLINT,
                     bigint1 BIGINT,
                     boolean1 BOOLEAN,
                     float1 FLOAT,
                     double1 DOUBLE,
                     list1 ARRAY<STRING>,
                     map1 MAP<STRING,INT>,
                     struct1 STRUCT<sint:INT,sboolean:BOOLEAN,sstring:STRING>,
                     union1 uniontype<FLOAT, BOOLEAN, STRING>,
                     enum1 STRING,
                     nullableint INT,
                     bytes1 BINARY,
                     fixed1 BINARY)
STORED AS AVRO
Code Block
CREATE TABLE as_avro
  ROW FORMAT SERDE
  'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
  STORED as INPUTFORMAT
  'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'
  OUTPUTFORMAT
  'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat'
  TBLPROPERTIES (
    'avro.schema.url'='file:///path/to/the/schema/test_serializer.avsc');
insert overwrite table as_avro select * from test_serializer;

...