Versions Compared

Comment: fix link to Row Formats & SerDe after heading changed in DDL doc

...

See Hive SerDe for an introduction to SerDes.

Built-in and Custom SerDes

The Hive SerDe library is in org.apache.hadoop.hive.serde2. (The old SerDe library in org.apache.hadoop.hive.serde is deprecated.)

...

Third-party SerDes

Note: For Hive releases prior to 0.12, Amazon provides a JSON SerDe for JSON files, available at s3://elasticmapreduce/samples/hive-ads/libs/jsonserde.jar.

Custom SerDes

For information about custom SerDes, see How to Write Your Own SerDe in the Developer Guide.

...

For the HiveQL statements that specify SerDes and their properties, see Create Table (particularly Row Formats & SerDe) and Alter Table (Add SerDe Properties).

...

  • Hive's execution engine (referred to as just engine henceforth) first uses the configured InputFormat to read in a record of data (the value object returned by the RecordReader of the InputFormat).
  • The engine then invokes SerDe.deserialize() to deserialize the record. The object returned by this method is not required to be fully deserialized. For instance, Hive's LazySimpleSerDe represents the deserialized object as a LazyStruct, which does not deserialize the bytes up front but does so when a field is accessed.
  • The engine also obtains the ObjectInspector to use by invoking SerDe.getObjectInspector(). This must be a subclass of StructObjectInspector, since a record representing a row of input data is essentially a struct type.
  • The engine passes the deserialized object and the object inspector to all operators, which use them to get the data they need from the record. The object inspector knows how to construct individual fields out of a deserialized record. For example, StructObjectInspector has a method called getStructFieldData() which returns a given field in the record. This is the mechanism for accessing individual fields. For instance, the ExprNodeColumnEvaluator class, which can extract a column from the input row, uses this mechanism to get the real column object from the (possibly lazily) deserialized row object. This real column object can in turn be a complex type (like a struct). To access subfields in such complex typed objects, an operator uses the object inspector associated with that field. (The top-level StructObjectInspector for the row maintains a list of field-level object inspectors which can be used to interpret individual fields.)
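
The steps above can be illustrated with a minimal, self-contained Java sketch. It does not use Hive's real classes: LazyRow, RowInspector, and SketchSerDe are hypothetical stand-ins for LazyStruct, StructObjectInspector, and a SerDe, and the comma-separated record layout is an assumption made only for illustration. In particular, the getStructFieldData() here looks a field up by name, which simplifies Hive's actual method. The sketch shows only the idea that deserialize() may return an object whose bytes are parsed on first field access, and that operators read fields through the inspector rather than from the row directly.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class LazyDeserializationSketch {

    /** Hypothetical stand-in for LazyStruct: keeps the raw bytes and parses them on first access. */
    static final class LazyRow {
        private final byte[] raw;
        private String[] fields;   // stays null until a field is first requested
        int parseCount = 0;        // exposed only so the laziness can be observed

        LazyRow(byte[] raw) {
            this.raw = raw;
        }

        String field(int index) {
            if (fields == null) {
                // Assumed record layout: comma-separated UTF-8 text.
                fields = new String(raw, StandardCharsets.UTF_8).split(",", -1);
                parseCount++;
            }
            return index < fields.length ? fields[index] : null;
        }
    }

    /** Hypothetical stand-in for StructObjectInspector: knows how to pull a field out of a row. */
    static final class RowInspector {
        private final List<String> fieldNames;

        RowInspector(List<String> fieldNames) {
            this.fieldNames = new ArrayList<>(fieldNames);
        }

        // Simplified: looks the field up by name instead of by a field reference.
        Object getStructFieldData(LazyRow row, String fieldName) {
            int index = fieldNames.indexOf(fieldName);
            if (index < 0) {
                throw new IllegalArgumentException("Unknown field: " + fieldName);
            }
            return row.field(index);
        }
    }

    /** Hypothetical stand-in for a SerDe: hands back a lazy row plus the inspector for it. */
    static final class SketchSerDe {
        private final RowInspector inspector;

        SketchSerDe(List<String> columnNames) {
            this.inspector = new RowInspector(columnNames);
        }

        LazyRow deserialize(byte[] record) {
            return new LazyRow(record);   // no parsing happens here
        }

        RowInspector getObjectInspector() {
            return inspector;
        }
    }

    public static void main(String[] args) {
        // Step 1: a record as it might arrive from the InputFormat's RecordReader.
        byte[] record = "42,alice,NYC".getBytes(StandardCharsets.UTF_8);

        // Steps 2 and 3: deserialize the record and obtain the inspector.
        SketchSerDe serde = new SketchSerDe(Arrays.asList("id", "name", "city"));
        LazyRow row = serde.deserialize(record);
        RowInspector inspector = serde.getObjectInspector();
        System.out.println("parses before access: " + row.parseCount);

        // Step 4: an operator reads a field through the inspector, which triggers parsing.
        System.out.println("name = " + inspector.getStructFieldData(row, "name"));
        System.out.println("parses after access: " + row.parseCount);
    }
}
```

Keeping field access behind the inspector is what lets the row object defer work: an operator that never touches a row's fields never pays for parsing them.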

...