...

Code Organization and a Brief Architecture

Introduction

Hive has 3 main components:

  • Serializers/Deserializers (trunk/serde) - This component has the framework libraries that allow users to develop serializers and deserializers for their own data formats. This component also contains some built-in serialization/deserialization families.
  • MetaStore (trunk/metastore) - This component implements the metadata server, which is used to hold all the information about the tables and partitions that are in the warehouse.
  • Query Processor (trunk/ql) - This component implements the processing framework for converting SQL to a graph of map/reduce jobs, as well as the execution-time framework to run those jobs in dependency order.

...

  • trunk/conf - This directory contains the packaged hive-default.xml and hive-site.xml.
  • trunk/data - This directory contains some data sets and configurations used in the Hive tests.
  • trunk/ivy - This directory contains the Ivy files used by the build infrastructure to manage dependencies on different Hadoop versions.
  • trunk/lib - This directory contains the run-time libraries needed by Hive.
  • trunk/testlibs - This directory contains the junit.jar used by the JUnit target in the build infrastructure.
  • trunk/testutils (Deprecated)

SerDe

What is a SerDe?

  • SerDe is a short name for "Serializer and Deserializer."
  • Hive uses SerDe (and FileFormat) to read and write table rows.
  • HDFS files -(InputFileFormat)-> <key, value> -(Deserializer)-> Row object
  • Row object -(Serializer)-> <key, value> -(OutputFileFormat)-> HDFS files

Note that the "key" part is ignored when reading, and is always a constant when writing. Basically, the row object is stored only in the "value".
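To make that data flow concrete, here is an abridged sketch of the two serde2 interfaces that sit on either side of the row object. The shapes follow org.apache.hadoop.hive.serde2.Deserializer and Serializer, but treat the exact signatures as illustrative; they have varied across Hive versions, so check the actual source for the authoritative definitions.

    import java.util.Properties;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hive.serde2.SerDeException;
    import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspector;
    import org.apache.hadoop.io.Writable;

    // Read path: the "value" Writable handed over by the InputFileFormat
    // is turned into a row object, which the engine then examines through
    // the ObjectInspector returned below.
    interface Deserializer {
      void initialize(Configuration conf, Properties tbl) throws SerDeException;
      Object deserialize(Writable blob) throws SerDeException;     // value -> row object
      ObjectInspector getObjectInspector() throws SerDeException;  // describes the row
    }

    // Write path: a row object (plus its inspector) is turned back into
    // the Writable that the OutputFileFormat stores as the "value"; the
    // key is a constant, as noted above.
    interface Serializer {
      void initialize(Configuration conf, Properties tbl) throws SerDeException;
      Class<? extends Writable> getSerializedClass();
      Writable serialize(Object obj, ObjectInspector objInspector) throws SerDeException;
    }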

One principle of Hive is that Hive does not own the HDFS file format. Users should be able to read the HDFS files in Hive tables directly with other tools, or use other tools to write HDFS files that Hive can then read via "CREATE EXTERNAL TABLE", or load into Hive with "LOAD DATA INPATH", which just moves the file into Hive's table directory.
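Both paths can be exercised from a short program. The following is a minimal sketch, assuming a HiveServer2 endpoint reachable over JDBC; the connection URL, table name, column layout, and HDFS paths are all hypothetical placeholders.

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.Statement;

    public class ExternalTableExample {
      public static void main(String[] args) throws Exception {
        // Hypothetical HiveServer2 endpoint; adjust host, port, database.
        // Older hive-jdbc versions may need the driver loaded explicitly.
        try (Connection conn = DriverManager.getConnection(
                 "jdbc:hive2://localhost:10000/default");
             Statement stmt = conn.createStatement()) {

          // Path 1: point Hive at files written by another tool.
          // The files stay where they are; Hive only reads them.
          stmt.execute(
              "CREATE EXTERNAL TABLE page_views (ts STRING, url STRING) "
              + "ROW FORMAT DELIMITED FIELDS TERMINATED BY '\\t' "
              + "LOCATION '/data/page_views'");

          // Path 2: move an existing HDFS file into the table's directory.
          // LOAD DATA INPATH does a move, not a copy or format conversion.
          stmt.execute(
              "LOAD DATA INPATH '/staging/page_views/part-00000' "
              + "INTO TABLE page_views");
        }
      }
    }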

Note that org.apache.hadoop.hive.serde is the deprecated old serde library. Please look at org.apache.hadoop.hive.serde2 for the latest version.

...

  • In most cases, users want to write a Deserializer instead of a full SerDe, because they just want to read their own data format rather than write to it (a minimal skeleton appears after this list).
  • For example, the RegexDeserializer will deserialize the data using the configuration parameter 'regex', and possibly a list of column names (see serde2.MetadataTypedColumnsetSerDe). Please see serde2/Deserializer.java for details.
  • If your SerDe supports DDL (basically, a SerDe with parameterized columns and column types), you probably want to implement a protocol based on DynamicSerDe, instead of writing a SerDe from scratch. The reason is that the framework passes DDL to the SerDe through the "thrift DDL" format, and it's non-trivial to write a "thrift DDL" parser.
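For instance, a minimal read-only Deserializer might look like the sketch below. The class name and the all-strings schema are illustrative assumptions, not Hive's actual RegexDeserializer, and the 'field.delim'/'columns' property keys mirror common serde table properties but should also be treated as illustrative. Note too that newer Hive versions extend the Deserializer contract (e.g. with getSerDeStats()), so a real implementation typically builds on the framework's base classes instead.

    import java.util.ArrayList;
    import java.util.Arrays;
    import java.util.List;
    import java.util.Properties;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hive.serde2.Deserializer;
    import org.apache.hadoop.hive.serde2.SerDeException;
    import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspector;
    import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspectorFactory;
    import org.apache.hadoop.hive.serde2.objectinspector.primitive.PrimitiveObjectInspectorFactory;
    import org.apache.hadoop.io.Writable;

    // Hypothetical read-only deserializer: splits each line on a
    // configurable delimiter and exposes every field as a string column.
    public class DelimitedDeserializer implements Deserializer {
      private String delimiter;
      private List<String> columnNames;
      private ObjectInspector inspector;
      private final List<Object> row = new ArrayList<Object>();

      public void initialize(Configuration conf, Properties tbl) throws SerDeException {
        // Table properties come from the DDL; defaults here are made up.
        delimiter = tbl.getProperty("field.delim", ",");
        columnNames = Arrays.asList(tbl.getProperty("columns", "col1,col2").split(","));
        List<ObjectInspector> fieldInspectors = new ArrayList<ObjectInspector>();
        for (int i = 0; i < columnNames.size(); i++) {
          fieldInspectors.add(PrimitiveObjectInspectorFactory.javaStringObjectInspector);
        }
        inspector = ObjectInspectorFactory.getStandardStructObjectInspector(
            columnNames, fieldInspectors);
      }

      public Object deserialize(Writable blob) throws SerDeException {
        // The "value" half of the <key, value> pair carries the row bytes.
        // Note: String.split treats the delimiter as a regex.
        String[] fields = blob.toString().split(delimiter, -1);
        row.clear();
        for (int i = 0; i < columnNames.size(); i++) {
          row.add(i < fields.length ? fields[i] : null);
        }
        return row;
      }

      public ObjectInspector getObjectInspector() throws SerDeException {
        return inspector;
      }
    }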

Some important points about SerDe:

  • SerDe, not the DDL, defines the table schema. Some SerDe implementations use the DDL for configuration, but the SerDe can also override that.
  • Column types can be arbitrarily nested arrays, maps, and structures.
  • The callback design of ObjectInspector allows lazy deserialization with CASE/IF or when using complex or nested types.

ObjectInspector

Hive uses ObjectInspector to analyze the internal structure of the row object and also the structure of the individual columns.
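As a sketch of how that analysis works in practice, the example below builds a standard struct ObjectInspector for a two-column row and reads one field through it. The factory and inspector classes named here are from serde2, but the row contents and column names are made up for illustration.

    import java.util.Arrays;
    import java.util.List;
    import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspector;
    import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspectorFactory;
    import org.apache.hadoop.hive.serde2.objectinspector.StructField;
    import org.apache.hadoop.hive.serde2.objectinspector.StructObjectInspector;
    import org.apache.hadoop.hive.serde2.objectinspector.primitive.PrimitiveObjectInspectorFactory;

    public class ObjectInspectorExample {
      public static void main(String[] args) {
        // Describe a row with two string columns. The inspector, not the
        // row object itself, carries the structural information.
        StructObjectInspector rowOI =
            ObjectInspectorFactory.getStandardStructObjectInspector(
                Arrays.asList("url", "referrer"),
                Arrays.<ObjectInspector>asList(
                    PrimitiveObjectInspectorFactory.javaStringObjectInspector,
                    PrimitiveObjectInspectorFactory.javaStringObjectInspector));

        // A row object matching that layout; for the standard inspector
        // it is just a List, but other inspectors can wrap lazy byte
        // ranges that are only decoded when a field is accessed.
        List<Object> row = Arrays.<Object>asList("http://example.com", null);

        // Access one column through the inspector. With a lazy
        // implementation, only this field would be deserialized.
        StructField urlField = rowOI.getStructFieldRef("url");
        Object url = rowOI.getStructFieldData(row, urlField);
        System.out.println("url = " + url);
      }
    }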

...