Comment: rename SerDe section, link to "how to add" & other docs, reformat (best guess – NEEDS WORK)

...

  • trunk/conf - This directory contains the packaged hive-default.xml and hive-site.xml.
  • trunk/data - This directory contains some data sets and configurations used in the Hive tests.
  • trunk/ivy - This directory contains the Ivy files used by the build infrastructure to manage dependencies on different Hadoop versions.
  • trunk/lib - This directory contains the runtime libraries needed by Hive.
  • trunk/testlibs - This directory contains the junit.jar used by the JUnit target in the build infrastructure.
  • trunk/testutils (Deprecated)

Hive SerDe

What is a SerDe?

  • SerDe is a short name for "Serializer and Deserializer."
  • Hive uses SerDe (and FileFormat) to read and write table rows.
  • HDFS files --(InputFileFormat)--> <key, value> --(Deserializer)--> Row object
  • Row object --(Serializer)--> <key, value> --(OutputFileFormat)--> HDFS files
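
The two data paths above can be illustrated with a minimal, self-contained Java sketch. It does not use the real Hive interfaces: ToyDeserializer, ToySerializer, and CtrlASerDe are hypothetical names made up for this illustration, and the record value is simplified to a plain String. The sketch only shows the idea that the file format hands a record value to the Deserializer, which produces a row, and that the Serializer turns the row back into a record value.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class SerDeFlowSketch {

    /** Hypothetical stand-in for a Deserializer: record value -> row. */
    interface ToyDeserializer {
        List<String> deserialize(String value);
    }

    /** Hypothetical stand-in for a Serializer: row -> record value. */
    interface ToySerializer {
        String serialize(List<String> row);
    }

    /** Toy SerDe for control-A separated records. No quoting or escaping. */
    static class CtrlASerDe implements ToyDeserializer, ToySerializer {
        private static final String SEPARATOR = "\u0001";

        @Override
        public List<String> deserialize(String value) {
            // A limit of -1 keeps trailing empty columns instead of dropping them.
            return new ArrayList<>(Arrays.asList(value.split(SEPARATOR, -1)));
        }

        @Override
        public String serialize(List<String> row) {
            return String.join(SEPARATOR, row);
        }
    }

    public static void main(String[] args) {
        CtrlASerDe serde = new CtrlASerDe();

        // Read path: <key, value> --(Deserializer)--> Row object
        String value = "1\u0001alice\u0001";
        List<String> row = serde.deserialize(value);
        System.out.println("columns: " + row.size());

        // Write path: Row object --(Serializer)--> <key, value>
        String roundTrip = serde.serialize(row);
        System.out.println("round trip equal: " + roundTrip.equals(value));
    }
}
```

In this sketch the row is a list of strings; a real SerDe also describes the row's structure and column types, which this toy version leaves out.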

...

  • MetadataTypedColumnsetSerDe: This SerDe is used to read/write delimited records such as CSV, tab-separated, or control-A separated records (quoting is not supported yet).
  • ThriftSerDe: This SerDe is used to read/write Thrift serialized objects. The class file for the Thrift object must be loaded first.
  • DynamicSerDe: This SerDe also reads/writes Thrift serialized objects, but it understands Thrift DDL, so the schema of the object can be provided at runtime. It also supports many different protocols, including TBinaryProtocol, TJSONProtocol, and TCTLSeparatedProtocol (which writes data in delimited records).

An Avro SerDe was added in Hive 0.9.1, and a SerDe for the ORC file format was added in Hive 0.11.0.

See SerDe for detailed information about input and output processing. Also see Storage Formats in the HCatalog manual, including CTAS Issue with JSON SerDe.

How to write your own SerDe:

  • In most cases, users want to write a Deserializer rather than a full SerDe, because they only need to read their own data format, not write to it.
  • For example, the RegexDeserializer deserializes the data using the configuration parameter 'regex' and, optionally, a list of column names (see serde2.MetadataTypedColumnsetSerDe). See serde2/Deserializer.java for details.
  • If your SerDe supports DDL (that is, a SerDe with parameterized columns and column types), you probably want to implement a Protocol based on DynamicSerDe instead of writing a SerDe from scratch. The framework passes DDL to the SerDe in "thrift DDL" format, and writing a "thrift DDL" parser is non-trivial.
  • For examples, see SerDe - how to add a new SerDe below.
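
To make the regex-based approach more concrete, here is a rough, self-contained sketch of a read-only deserializer driven by a 'regex' parameter and a list of column names. It is not the actual RegexDeserializer and does not implement the interface in serde2/Deserializer.java; the class name RegexDeserializerSketch, its constructor, and the choice to return null for non-matching lines are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical read-only deserializer: one capture group per column. */
public class RegexDeserializerSketch {
    private final Pattern pattern;
    private final List<String> columnNames;

    public RegexDeserializerSketch(String regex, List<String> columnNames) {
        this.pattern = Pattern.compile(regex);
        this.columnNames = new ArrayList<>(columnNames);
        // groupCount() depends only on the pattern, so it can be checked up front.
        int groups = pattern.matcher("").groupCount();
        if (groups < this.columnNames.size()) {
            throw new IllegalArgumentException(
                "regex has " + groups + " groups but " + this.columnNames.size() + " columns were given");
        }
    }

    /** Returns column name -> value, or null if the whole line does not match (an assumption). */
    public Map<String, String> deserialize(String line) {
        Matcher matcher = pattern.matcher(line);
        if (!matcher.matches()) {
            return null;
        }
        Map<String, String> row = new LinkedHashMap<>();
        for (int i = 0; i < columnNames.size(); i++) {
            // Capture groups are 1-based; group 0 is the whole match.
            row.put(columnNames.get(i), matcher.group(i + 1));
        }
        return row;
    }

    public static void main(String[] args) {
        RegexDeserializerSketch deserializer = new RegexDeserializerSketch(
            "(\\S+) (\\S+) (\\d+)", Arrays.asList("host", "method", "status"));
        System.out.println(deserializer.deserialize("10.0.0.1 GET 200"));
        System.out.println(deserializer.deserialize("not a matching line"));
    }
}
```

Because the class only reads data, it has no serialize method, which mirrors the point above that most users need a Deserializer rather than a full SerDe.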

Some important points about SerDe:

...