Versions Compared

Key

  • This line was added.
  • This line was removed.
  • Formatting was changed.
Comment: add link to DDL doc for JsonSerde

...

  • MetadataTypedColumnsetSerDe: This SerDe is used to read/write delimited records like CSV, tab-separated control-A separated records (sorry, quote is not supported yet).)ThriftSerDe
  • LazySimpleSerDe: This SerDe

    is

    can be used to read

    /write Thrift serialized objects. The class file for the Thrift object must be loaded first.
  • DynamicSerDe: This SerDe also read/write Thrift serialized objects, but it understands Thrift DDL so the schema of the object can be provided at runtime. Also it supports a lot of different protocols, including TBinaryProtocol, TJSONProtocol, TCTLSeparatedProtocol (which writes data in delimited records).

Also:

  • For JSON files, an Amazon SerDe is available at s3://elasticmapreduce/samples/hive-ads/libs/jsonserde.jar.
  • An Avro SerDe was added in Hive 0.9.1.
  • A SerDe for the ORC file format was added in Hive 0.11.0.

See SerDe for detailed information about input and output processing. Also see Storage Formats in the HCatalog manual, including CTAS Issue with JSON SerDe. For information about how to create a table with a custom or native SerDe, see Row Format, Storage Format, and SerDe.

How to write your own SerDe:

  • In most cases, users want to write a Deserializer instead of a SerDe, because users just want to read their own data format instead of writing to it.
  • For example, the RegexDeserializer will deserialize the data using the configuration parameter 'regex', and possibly a list of column names (see serde2.MetadataTypedColumnsetSerDe). Please see serde2/Deserializer.java for details.
  • If your SerDe supports DDL (basically, SerDe with parameterized columns and column types), you probably want to implement a Protocol based on DynamicSerDe, instead of writing a SerDe from scratch. The reason is that the framework passes DDL to SerDe through "Thrift DDL" format, and it's non-trivial to write a "Thrift DDL" parser.
  • For examples, see SerDe - how to add a new SerDe below.

Some important points about SerDe:

  • SerDe, not the DDL, defines the table schema. Some SerDe implementations use the DDL for configuration, but the SerDe can also override that.
  • Column types can be arbitrarily nested arrays, maps, and structures.
  • The callback design of ObjectInspector allows lazy deserialization with CASE/IF or when using complex or nested types.

ObjectInspector

Hive uses ObjectInspector to analyze the internal structure of the row object and also the structure of the individual columns.

ObjectInspector provides a uniform way to access complex objects that can be stored in multiple formats in the memory, including:

  • Instance of a Java class (Thrift or native Java)
  • A standard Java object (we use java.util.List to represent Struct and Array, and use java.util.Map to represent Map)
  • A lazily-initialized object (For example, a Struct of string fields stored in a single Java string object with starting offset for each field)

A complex object can be represented by a pair of ObjectInspector and Java Object. The ObjectInspector not only tells us the structure of the Object, but also gives us ways to access the internal fields inside the Object.

...

  • the same data format as MetadataTypedColumnsetSerDe and TCTLSeparatedProtocol, however, it creates Objects in a lazy way which provides better performance. Starting in Hive 0.14.0 it also supports read/write data with a specified encode charset, for example:

    Code Block
    ALTER TABLE person SET SERDEPROPERTIES ('serialization.encoding'='GBK');

    LazySimpleSerDe can treat 'T', 't', 'F', 'f', '1', and '0' as extended, legal boolean literals if the configuration property hive.lazysimple.extended_boolean_literal is set to true (Hive 0.14.0 and later). The default is false, which means only 'TRUE' and 'FALSE' are treated as legal boolean literals.

  • ThriftSerDe: This SerDe is used to read/write Thrift serialized objects. The class file for the Thrift object must be loaded first.
  • DynamicSerDe: This SerDe also read/write Thrift serialized objects, but it understands Thrift DDL so the schema of the object can be provided at runtime. Also it supports a lot of different protocols, including TBinaryProtocol, TJSONProtocol, TCTLSeparatedProtocol (which writes data in delimited records).

Also:

  • For JSON files, JsonSerDe was added in Hive 0.12.0. An Amazon SerDe is available at s3://elasticmapreduce/samples/hive-ads/libs/jsonserde.jar for releases prior to 0.12.0.
  • An Avro SerDe was added in Hive 0.9.1.  Starting in Hive 0.14.0 its specification is implicit with the STORED AS AVRO clause.
  • A SerDe for the ORC file format was added in Hive 0.11.0.
  • A SerDe for Parquet was added via plug-in in Hive 0.10 and natively in Hive 0.13.0.
  • A SerDe for CSV was added in Hive 0.14.

See SerDe for detailed information about input and output processing. Also see Storage Formats in the HCatalog manual, including CTAS Issue with JSON SerDe. For information about how to create a table with a custom or native SerDe, see Row Format, Storage Format, and SerDe.

How to Write Your Own SerDe

  • In most cases, users want to write a Deserializer instead of a SerDe, because users just want to read their own data format instead of writing to it.
  • For example, the RegexDeserializer will deserialize the data using the configuration parameter 'regex', and possibly a list of column names (see serde2.MetadataTypedColumnsetSerDe). Please see serde2/Deserializer.java for details.
  • If your SerDe supports DDL (basically, SerDe with parameterized columns and column types), you probably want to implement a Protocol based on DynamicSerDe, instead of writing a SerDe from scratch. The reason is that the framework passes DDL to SerDe through "Thrift DDL" format, and it's non-trivial to write a "Thrift DDL" parser.
  • For examples, see SerDe - how to add a new SerDe below.

Some important points about SerDe:

  • SerDe, not the DDL, defines the table schema. Some SerDe implementations use the DDL for configuration, but the SerDe can also override that.
  • Column types can be arbitrarily nested arrays, maps, and structures.
  • The callback design of ObjectInspector allows lazy deserialization with CASE/IF or when using complex or nested types.

ObjectInspector

Hive uses ObjectInspector to analyze the internal structure of the row object and also the structure of the individual columns.

ObjectInspector provides a uniform way to access complex objects that can be stored in multiple formats in the memory, including:

  • Instance of a Java class (Thrift or native Java)
  • A standard Java object (we use java.util.List to represent Struct and Array, and use java.util.Map to represent Map)
  • A lazily-initialized object (for example, a Struct of string fields stored in a single Java string object with starting offset for each field)

A complex object can be represented by a pair of ObjectInspector and Java Object. The ObjectInspector not only tells us the structure of the Object, but also gives us ways to access the internal fields inside the Object.

NOTE: Apache Hive recommends that custom ObjectInspectors created for use with custom SerDes have a no-argument constructor in addition to their normal constructors for serialization purposes. See HIVE-5380 for more details.

Registration of Native SerDes

As of Hive 0.14 a registration mechanism has been introduced for native Hive SerDes.  This allows dynamic binding between a "STORED AS" keyword in place of a triplet of {SerDe, InputFormat, and OutputFormat} specification, in CreateTable statements.

The following mappings have been added through this registration mechanism:

SyntaxEquivalent

STORED AS AVRO /

STORED AS AVROFILE

ROW FORMAT SERDE
  'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
  STORED AS INPUTFORMAT
  'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'
  OUTPUTFORMAT
  'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat'

STORED AS ORC /

STORED AS ORCFILE

ROW FORMAT SERDE
  'org.apache.hadoop.hive.ql.io.orc.OrcSerde'
  STORED AS INPUTFORMAT
  'org.apache.hadoop.hive.ql.io.orc.OrcInputFormat'
  OUTPUTFORMAT
  'org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat'

STORED AS PARQUET /

STORED AS PARQUETFILE

ROW FORMAT SERDE
  'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
  STORED AS INPUTFORMAT
  'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat'
  OUTPUTFORMAT
  'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
STORED AS RCFILE
STORED AS INPUTFORMAT
  'org.apache.hadoop.hive.ql.io.RCFileInputFormat'
  OUTPUTFORMAT
  'org.apache.hadoop.hive.ql.io.RCFileOutputFormat'
STORED AS SEQUENCEFILE
STORED AS INPUTFORMAT
  'org.apache.hadoop.mapred.SequenceFileInputFormat'
  OUTPUTFORMAT
  'org.apache.hadoop.mapred.SequenceFileOutputFormat'
STORED AS TEXTFILE
STORED AS INPUTFORMAT
  'org.apache.hadoop.mapred.TextInputFormat'
  OUTPUTFORMAT
  'org.apache.hadoop.hive.ql.io.IgnoreKeyTextOutputFormat'

To add a new native SerDe with STORED AS keyword, follow these steps:

  1. Create a storage format descriptor class extending from AbstractStorageFormatDescriptor.java that returns a "stored as" keyword and the names of InputFormat, OutputFormat, and SerDe classes.

  2. Add the name of the storage format descriptor class to the StorageFormatDescriptor registration file

...

  1. .

MetaStore

MetaStore contains metadata regarding tables, partitions and databases. This is used by Query Processor during plan generation.

...

  • Test Queries:
    • queries/clientnegative - This directory contains the query files (.q files) for the negative test cases. These are run through the CLI classes and therefore test the entire query processor stack.
    • queries/clientpositive - This directory contains the query files (.q files) for the positive test cases. Thesre are run through the CLI classes and therefore test the entire query processor stack.
    • qureies/positive (Will be deprecated) - This directory contains the query files (.q files) for the positive test cases for the compiler. These only test the compiler and do not run the execution code.
    • queries/negative (Will be deprecated) - This directory contains the query files (.q files) for the negative test cases for the compiler. These only test the compiler and do not run the execution code.
  • Test Results:
    • results/clientnegative - The expected results from the queries in queries/clientnegative.
    • results/clientpositive - The expected results from the queries in queries/clientpositive.
    • results/compiler/errors - The expected results from the queries in queries/negative.
    • results/compiler/parse - The expected Abstract Syntax Tree output for the queries in queries/positive.
    • results/compiler/plan - The expected query plans for the queries in queries/positive.
  • Velocity Templates to Generate the Tests:
    • templates/TestCliDriver.vm - Generates the tests from queries/clientpositive.
    • templates/TestNegativeCliDriver.vm - Generates the tests from queries/clientnegative.
    • templates/TestParse.vm - Generates the tests from queries/positive.
    • templates/TestParseNegative.vm - Generates the tests from queries/negative.

...

Running unit tests

Note
titleAnt to Maven

As of version 0.13 Hive uses Maven instead of Ant for its build. The following instructions are not up to date.

See the Hive Developer FAQ for updated instructions.

...

Note that this option matches against the basename of the test without the .q suffix.

Apparently the Hive tests do not run successfully after a clean unless you run ant package first. Not sure why build.xml doesn't encode this dependency.

Adding new unit tests

.

Apparently the Hive tests do not run successfully after a clean unless you run ant package first. Not sure why build.xml doesn't encode this dependency.

Adding new unit tests

Note
titleAnt to Maven

As of version 0.13 Hive uses Maven instead of Ant for its build. The following instructions are not up to date.

See the Hive Developer FAQ for updated instructions. See also Tips for Adding New Tests in Hive and How to Contribute: Add a Unit Test.

First, write a new myname.q in ql/src/test/queries/clientpositive.

...

Similarly, to add negative client tests, write a new query input file in ql/src/test/queries/clientnegative and run the same command, this time specifying the testcase name as TestNegativeCliDriver instead of TestCliDriver. Note that for negative client tests, the output file if created using the overwrite flag can be be found in the directory ql/src/test/results/clientnegative.See also Tips for Adding New Tests in Hive.

Debugging Hive Code

Anchor
DebuggingHiveCode
DebuggingHiveCode

Hive code includes both client-side code (e.g., compiler, semantic analyzer, and optimizer of HiveQL) and server-side code (e.g., operator/task/SerDe implementations). Debugging is different for client-side and server-side code, as described below.

...

Please refer to Hive User Group Meeting August 2009 Page 74-87.