Developer Guide

Table of Contents

Code Organization and a brief architecture

...

  • The SerDe, not the DDL, defines the table schema. Some SerDe implementations use the DDL for configuration, but the SerDe can also override that (see the sketch after this list).
  • Column types can be arbitrarily nested arrays, maps, and structures.
  • The callback design of ObjectInspector allows lazy deserialization with CASE/IF or when using complex or nested types.
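
A minimal sketch of the first point (the class name SchemaFromSerDeSketch is made up, and it is shaped like, but does not formally implement, org.apache.hadoop.hive.serde2.Deserializer): the schema Hive sees is whatever ObjectInspector the SerDe returns; the DDL's column names only arrive as the "columns" table property, which the SerDe may use or ignore.

Code Block

import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Properties;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspector;
import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspectorFactory;
import org.apache.hadoop.hive.serde2.objectinspector.primitive.PrimitiveObjectInspectorFactory;

// Hypothetical sketch: the SerDe, not the DDL, is the source of truth for the schema.
public class SchemaFromSerDeSketch {
  private ObjectInspector rowOI;

  // Shaped like Deserializer.initialize(Configuration, Properties).
  public void initialize(Configuration conf, Properties tbl) {
    // Column names from the DDL arrive as the comma-separated "columns" property...
    List<String> names =
        Arrays.asList(tbl.getProperty("columns", "key,value").split(","));
    // ...but the SerDe decides the actual types (here: all strings) and
    // could just as well ignore the property entirely.
    List<ObjectInspector> fieldOIs = new ArrayList<ObjectInspector>();
    for (int i = 0; i < names.size(); i++) {
      fieldOIs.add(PrimitiveObjectInspectorFactory.javaStringObjectInspector);
    }
    rowOI = ObjectInspectorFactory.getStandardStructObjectInspector(names, fieldOIs);
  }

  // Shaped like Deserializer.getObjectInspector(): what this returns is the table schema.
  public ObjectInspector getObjectInspector() {
    return rowOI;
  }
}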

ObjectInspector

Hive uses ObjectInspector to analyze the internal structure of the row object and also the structure of the individual columns.

ObjectInspector provides a uniform way to access complex objects that can be stored in multiple formats in memory, including:

...
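
As a concrete (if hedged) illustration of that uniform access, the sketch below uses the standard struct ObjectInspector; a lazy ObjectInspector would expose a byte-backed row through the same interface. The class name and column names are invented for the example.

Code Block

import java.util.Arrays;

import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspector;
import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspectorFactory;
import org.apache.hadoop.hive.serde2.objectinspector.StructField;
import org.apache.hadoop.hive.serde2.objectinspector.StructObjectInspector;
import org.apache.hadoop.hive.serde2.objectinspector.primitive.PrimitiveObjectInspectorFactory;

public class ObjectInspectorSketch {
  public static void main(String[] args) {
    // A struct ObjectInspector over two string columns.
    StructObjectInspector rowOI =
        ObjectInspectorFactory.getStandardStructObjectInspector(
            Arrays.asList("key", "value"),
            Arrays.<ObjectInspector>asList(
                PrimitiveObjectInspectorFactory.javaStringObjectInspector,
                PrimitiveObjectInspectorFactory.javaStringObjectInspector));

    // For the standard ObjectInspector a row is just a java.util.List; a
    // different ObjectInspector could expose a completely different
    // in-memory layout through the exact same calls below.
    Object row = Arrays.asList("k1", "v1");

    // All field access is a callback into the ObjectInspector, and only the
    // requested field is touched; this is what makes lazy deserialization
    // with CASE/IF and nested types possible.
    StructField valueField = rowOI.getStructFieldRef("value");
    System.out.println(rowOI.getStructFieldData(row, valueField)); // prints v1
  }
}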

From the root of the source tree:

Code Block

ant package

will make Hive compile against Hadoop version 0.19.0. Note that:

...

  • One can specify a custom distribution directory by using:
Code Block

ant -Dtarget.dir=<my-install-dir> package
  • One can specify a version of Hadoop other than 0.19.0 (using 0.17.1 as an example):
Code Block

ant -Dhadoop.version=0.17.1 package
  • One can also compile against a custom version of the Hadoop tree (only release 0.4 and above). This is also useful if running Ivy is problematic (in disconnected mode, for example) but a Hadoop tree is available. Do this by specifying the root of the Hadoop source tree to be used, for example:
Code Block

ant -Dhadoop.root=~/src/hadoop-19/build/hadoop-0.19.2-dev -Dhadoop.version=0.19.2-dev

...

Run Hive from the command line with '$HIVE_HOME/bin/hive', where $HIVE_HOME is typically build/dist under the top-level directory of your Hive repository.

Code Block

$ build/dist/bin/hive

If Hive fails at runtime, try 'ant very-clean package' to delete the Ivy cache before rebuilding.

Running Hive Without a Hadoop Cluster

From Thejas (these options run MapReduce jobs in local mode, keep the warehouse on the local filesystem, and use an embedded Derby metastore under /tmp):

Code Block

export HIVE_OPTS='-hiveconf mapred.job.tracker=local -hiveconf fs.default.name=file:///tmp \
    -hiveconf hive.metastore.warehouse.dir=file:///tmp/warehouse \
    -hiveconf javax.jdo.option.ConnectionURL=jdbc:derby:;databaseName=/tmp/metastore_db;create=true'

...

Running unit tests

Run all tests:

Code Block

ant package test

Run all positive test queries:

Code Block

ant test -Dtestcase=TestCliDriver

Run a specific positive test query:

Code Block

ant test -Dtestcase=TestCliDriver -Dqfile=groupby1.q

...

Run the set of unit tests matching a regex, e.g. partition_wise_fileformat tests 10-16:

Code Block

ant test -Dtestcase=TestCliDriver -Dqfile_regex=partition_wise_fileformat1[0-6]

...

Then, run the test with the query and overwrite the result (useful when you add a new test):

Code Block

ant test -Dtestcase=TestCliDriver -Dqfile=myname.q -Doverwrite=true

Then we can create a patch by running:

Code Block

svn add ql/src/test/queries/clientpositive/myname.q ql/src/test/results/clientpositive/myname.q.out
svn diff > patch.txt

...

The server-side code is distributed and runs on the Hadoop cluster, so debugging server-side Hive code is a little more complicated. In addition to printing to log files using log4j, you can also attach a debugger to a different JVM under unit test (single-machine mode). Below are the steps for debugging server-side code.

  • Compile Hive code with javac.debug=on. From the Hive checkout directory:

    Code Block
    
        > ant -Djavac.debug=on package
    

    If you have already built Hive without javac.debug=on, you can clean the build and then run the above command.

    Code Block
    
        > ant clean  # not necessary if this is the first compile
        > ant -Djavac.debug=on package
    
  • Run ant test with additional options to tell the Java VM that is running Hive server-side code to wait for the debugger to attach. First define some convenient macros for debugging; you can put them in your .bashrc or .cshrc.

    Code Block
    
        > export HIVE_DEBUG_PORT=8000
        > export HIVE_DEBUG="-Xdebug -Xrunjdwp:transport=dt_socket,address=${HIVE_DEBUG_PORT},server=y,suspend=y"
    

    In particular, HIVE_DEBUG_PORT is the port number that the JVM listens on and that the debugger will attach to. Then run the unit test as follows:

    Code Block
    
        > export HADOOP_OPTS=$HIVE_DEBUG
        > ant test -Dtestcase=TestCliDriver -Dqfile=<mytest>.q
    

    The unit test will run until it shows:

    Code Block
    
         [junit] Listening for transport dt_socket at address: 8000
    
  • Now you can use jdb to attach to port 8000 and debug:

    Code Block
    
        > jdb -attach 8000
    

    or, if you are running Eclipse and the Hive projects are already imported, you can debug with Eclipse. Under Eclipse Run -> Debug Configurations, find "Remote Java Application" at the bottom of the left panel. There should be a MapRedTask configuration already. If there is no such configuration, you can create one with the following properties:

  • Name: any task such as MapRedTask
  • Project: the Hive project that you imported.
  • Connection Type: Standard (Socket Attach)
  • Connection Properties:
    • Host: localhost
    • Port: 8000
      Then hit the "Debug" button and Eclipse will attach to the JVM listening on port 8000 and continue running till the end. If you define breakpoints in the source code before hitting the "Debug" button, it will stop there. The rest is the same as debugging client-side Hive.

There is another way of debugging Hive code without going through Ant.
You need to install Hadoop and set the environment variable HADOOP_HOME to its installation directory:

Code Block

    > export HADOOP_HOME=<your hadoop home>
 

Then, start Hive:

Code Block

    >  ./build/dist/bin/hive --debug
 

It will then act similarly to the debugging steps outlined in Debugging Hive code. It is faster since there is no need to compile Hive code
and go through Ant. It can be used to debug both the client side and the server side of Hive. If you want to debug a particular query, start Hive
and perform the steps needed before that query. Then start Hive again in debug mode to debug that query.

Code Block

    >  ./build/dist/bin/hive
    >  perform steps before the query
 
Code Block

    >  ./build/dist/bin/hive --debug
    >  run the query
 

...