...

  • SerDe is a short name for Serializer and Deserializer.
  • Hive uses SerDe (and FileFormat) to read from/write to tables.
  • HDFS files --(InputFileFormat)--> <key, value> --(Deserializer)--> Row object
  • Row object --(Serializer)--> <key, value> --(OutputFileFormat)--> HDFS files

Note that the "key" part is ignored when reading, and is always a constant when writing. Basically the row object is only stored into the "value".
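
As a concrete illustration of this read/write path, here is a minimal sketch of a custom SerDe, assuming the AbstractSerDe base class found in more recent Hive versions (older versions use the SerDe interface, and exact signatures vary across releases); the class name and the single-string-column schema are hypothetical:

Code Block
import java.util.Arrays;
import java.util.List;
import java.util.Properties;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hive.serde2.AbstractSerDe;
import org.apache.hadoop.hive.serde2.SerDeException;
import org.apache.hadoop.hive.serde2.SerDeStats;
import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspector;
import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspectorFactory;
import org.apache.hadoop.hive.serde2.objectinspector.StructObjectInspector;
import org.apache.hadoop.hive.serde2.objectinspector.primitive.PrimitiveObjectInspectorFactory;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.Writable;

// Hypothetical SerDe: each "value" read from the file becomes one row
// with a single string column named "line".
public class LineSerDe extends AbstractSerDe {
  private StructObjectInspector inspector;
  private final List<Object> row = Arrays.asList(new Object[1]);

  @Override
  public void initialize(Configuration conf, Properties tbl) throws SerDeException {
    // A real SerDe would read the column names and types from the table
    // properties (tbl); this sketch hard-codes a one-column schema.
    inspector = ObjectInspectorFactory.getStandardStructObjectInspector(
        Arrays.asList("line"),
        Arrays.<ObjectInspector>asList(
            PrimitiveObjectInspectorFactory.javaStringObjectInspector));
  }

  @Override
  public Object deserialize(Writable blob) throws SerDeException {
    // The "value" produced by the InputFileFormat is turned into a row object.
    row.set(0, blob.toString());
    return row;
  }

  @Override
  public Writable serialize(Object obj, ObjectInspector oi) throws SerDeException {
    // The row object is flattened back into the "value"; the "key" is ignored.
    StructObjectInspector soi = (StructObjectInspector) oi;
    Object field = soi.getStructFieldData(obj, soi.getAllStructFieldRefs().get(0));
    return new Text(field == null ? "" : field.toString());
  }

  @Override
  public ObjectInspector getObjectInspector() { return inspector; }

  @Override
  public Class<? extends Writable> getSerializedClass() { return Text.class; }

  @Override
  public SerDeStats getSerDeStats() { return null; }
}

The deserialize method corresponds to the first arrow diagram above (value to row object), and serialize to the second (row object back to value).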

...

  • Parse and SemanticAnalysis (ql/parse) - This component contains the code for parsing SQL, converting it into Abstract Syntax Trees, converting the Abstract Syntax Trees into Operator Plans, and finally converting the operator plans into a directed graph of tasks that are executed by Driver.java (see the parsing sketch after this list).
  • Optimizer (ql/optimizer) - This component contains some simple rule-based optimizations, like pruning non-referenced columns from table scans (column pruning), that the Hive Query Processor applies while converting SQL to a series of map/reduce tasks.
  • Plan Components (ql/plan) - This component contains the classes (called descriptors) that the compiler (Parser, SemanticAnalysis and Optimizer) uses to pass information to the operator trees, which the execution code then consumes.
  • MetaData Layer (ql/metadata) - This component is used by the query processor to interface with the MetaStore in order to retrieve information about tables, partitions, and columns. The compiler uses this information to compile SQL into a series of map/reduce tasks (see the metadata sketch after this list).
  • Map/Reduce Execution Engine (ql/exec) - This component contains all the query operators and the framework used to invoke those operators from within the map/reduce tasks.
  • Hadoop Record Readers, Input and Output Formatters for Hive (ql/io) - This component contains the record readers and the input and output formatters that Hive registers with a Hadoop job.
  • Sessions (ql/session) - A rudimentary session implementation for Hive.
  • Type interfaces (ql/typeinfo) - This component provides all the type information for table columns that is retrieved from the MetaStore and the SerDes.
  • Hive Function Framework (ql/udf) - Framework and implementation of Hive operators, functions, and aggregate functions. This component also contains the interfaces that a user can implement to create user-defined functions (see the UDF sketch after this list).
  • Tools (ql/tools) - Some simple tools provided by the query processing framework. Currently, this component contains the implementation of the lineage tool, which parses a query and shows its source and destination tables.
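
To make the first stage of the compiler concrete, here is a minimal sketch that parses a SQL string into an Abstract Syntax Tree using ParseDriver from ql/parse (the query string and class name are arbitrary examples, and the exact API may differ between Hive versions):

Code Block
import org.apache.hadoop.hive.ql.parse.ASTNode;
import org.apache.hadoop.hive.ql.parse.ParseDriver;
import org.apache.hadoop.hive.ql.parse.ParseException;

// Minimal sketch: turn a query string into the AST that Semantic Analysis consumes.
public class ParseExample {
  public static void main(String[] args) throws ParseException {
    ParseDriver pd = new ParseDriver();
    ASTNode tree = pd.parse("SELECT key, count(1) FROM src GROUP BY key");
    // toStringTree() prints the AST in nested (LISP-like) form.
    System.out.println(tree.toStringTree());
  }
}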
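
Similarly, a minimal sketch of the MetaData Layer in use, assuming the Hive and Table classes in ql/metadata and a reachable MetaStore (the table name "src" is just an example):

Code Block
import org.apache.hadoop.hive.conf.HiveConf;
import org.apache.hadoop.hive.metastore.api.FieldSchema;
import org.apache.hadoop.hive.ql.metadata.Hive;
import org.apache.hadoop.hive.ql.metadata.HiveException;
import org.apache.hadoop.hive.ql.metadata.Table;

// Minimal sketch: resolve a table and list its columns, the same lookup the
// compiler performs when compiling SQL into map/reduce tasks.
public class MetadataExample {
  public static void main(String[] args) throws HiveException {
    Hive db = Hive.get(new HiveConf());
    Table table = db.getTable("src");
    for (FieldSchema col : table.getCols()) {
      System.out.println(col.getName() + ": " + col.getType());
    }
  }
}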
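
And a minimal sketch of a user-defined function built on the old-style UDF base class in ql/exec (the class name and behavior are hypothetical; Hive resolves evaluate methods by reflection):

Code Block
import org.apache.hadoop.hive.ql.exec.UDF;
import org.apache.hadoop.io.Text;

// Hypothetical UDF that trims and lower-cases a string value.
public class UDFNormalize extends UDF {
  private final Text result = new Text();

  // Hive finds this evaluate() signature by reflection at query compile time.
  public Text evaluate(Text input) {
    if (input == null) {
      return null;
    }
    result.set(input.toString().trim().toLowerCase());
    return result;
  }
}

Once packaged into a jar, such a function could be registered from the CLI with ADD JAR and CREATE TEMPORARY FUNCTION.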

Compiler

Parser

TypeChecking

Semantic Analysis

Plan generation

Task generation

Execution Engine

Plan

Operators

UDFs and UDAFs

Compiling and Running Hive

Hive can be compiled against different versions of Hadoop.

Default Mode

From the root of the source tree:

Code Block
ant package

will compile Hive against Hadoop version 0.19.0. Note that:

...

In this particular example, ~/src/hadoop-19 is a checkout of the Hadoop 0.19 branch, which uses 0.19.2-dev as its default version and creates a distribution directory in build/hadoop-0.19.2-dev by default.

Running Hive locally

From Thejas:  

Code Block

export HIVE_OPTS='-hiveconf mapred.job.tracker=local -hiveconf fs.default.name=file:///tmp \
    -hiveconf hive.metastore.warehouse.dir=file:///tmp/warehouse \
    -hiveconf javax.jdo.option.ConnectionURL=jdbc:derby:;databaseName=/tmp/metastore_db;create=true'

Then you can run 'bin/hive', and it will work against your local file system.

Unit tests and debugging

Layout of the unit tests

...

  • Test Queries:
    • queries/clientnegative - This directory contains the query files (.q files) for the negative test cases. These are run through the CLI classes and therefore test the entire query processor stack.
    • queries/clientpositive - This directory contains the query files (.q files) for the positive test cases. These are run through the CLI classes and therefore test the entire query processor stack.
    • queries/positive (Will be deprecated) - This directory contains the query files (.q files) for the positive test cases for the compiler. These only test the compiler and do not run the execution code.
    • queries/negative (Will be deprecated) - This directory contains the query files (.q files) for the negative test cases for the compiler. These only test the compiler and do not run the execution code.
  • Test Results:
    • results/clientnegative - The expected results from the queries in queries/clientnegative.
    • results/clientpositive - The expected results from the queries in queries/clientpositive.
    • results/compiler/errors - The expected results from the queries in queries/negative.
    • results/compiler/parse - The expected Abstract Syntax Tree output for the queries in queries/positive.
    • results/compiler/plan - The expected query plans for the queries in queries/positive.
  • Velocity Templates to Generate the tests:
    • templates/TestCliDriver.vm - Generates the tests from queries/clientpositive.
    • templates/TestNegativeCliDriver.vm - Generates the tests from queries/clientnegative.
    • templates/TestParse.vm - Generates the tests from queries/positive.
    • templates/TestParseNegative.vm - Generates the tests from queries/negative.

Tables in the unit tests

Running unit tests

Run all tests:

Code Block
ant test

Run all positive test queries:

...