Versions Compared

Comment: minor edits: fix capitalization for headings, hive, hadoop, and cli

...

In the rest of the page we use build/dist and <install-dir> interchangeably.

Compile Hive on Hadoop 23

No Format
  $ svn co http://svn.apache.org/repos/asf/hive/trunk hive
  $ cd hive
  $ ant clean package -Dhadoop.version=0.23.3 -Dhadoop-0.23.version=0.23.3 -Dhadoop.mr.rev=23
  $ ant clean package -Dhadoop.version=2.0.0-alpha -Dhadoop-0.23.version=2.0.0-alpha -Dhadoop.mr.rev=23

Running Hive

Hive uses Hadoop, so:

  • you must have Hadoop in your path OR
  • export HADOOP_HOME=<hadoop-install-dir>
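The two alternatives above can be checked before launching Hive. The following is a minimal sketch, not part of Hive or Hadoop: the function name hadoop_is_reachable is hypothetical, and the check assumes the Hadoop launcher lives at bin/hadoop under the install directory.

```shell
# Hypothetical helper (not shipped with Hive): succeeds if Hadoop can be
# located either through PATH or through HADOOP_HOME.
hadoop_is_reachable() {
  if command -v hadoop >/dev/null 2>&1; then
    return 0    # a hadoop executable was found in PATH
  fi
  if [ -n "${HADOOP_HOME:-}" ] && [ -x "$HADOOP_HOME/bin/hadoop" ]; then
    return 0    # HADOOP_HOME points at an install with bin/hadoop (assumed layout)
  fi
  return 1
}

if hadoop_is_reachable; then
  echo "Hadoop found"
else
  echo "Hadoop not found: add it to PATH or export HADOOP_HOME=<hadoop-install-dir>" >&2
fi
```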

...

No Format
  $ export HIVE_HOME=<hive-install-dir>

To use the Hive command line interface (CLI) from the shell:

No Format
  $ $HIVE_HOME/bin/hive

...

To use the HCatalog command line interface (CLI) in Hive release 0.11.0 and later:

...

For more information, see WebHCat Installation in the WebHCat manual.

Configuration Management Overview

  • Hive by default gets its configuration from <install-dir>/conf/hive-default.xml
  • The location of the Hive configuration directory can be changed by setting the HIVE_CONF_DIR environment variable.
  • Configuration variables can be changed by (re-)defining them in <install-dir>/conf/hive-site.xml
  • Log4j configuration is stored in <install-dir>/conf/hive-log4j.properties
  • Hive configuration is an overlay on top of Hadoop – it inherits the Hadoop configuration variables by default.
  • Hive configuration can be manipulated by:
    • Editing hive-site.xml and defining any desired variables (including Hadoop variables) in it
    • From the CLI using the set command (see below)
    • Invoking hive using the syntax:
      • $ bin/hive -hiveconf x1=y1 -hiveconf x2=y2
        this sets the variables x1 and x2 to y1 and y2 respectively
    • Setting the HIVE_OPTS environment variable to "-hiveconf x1=y1 -hiveconf x2=y2" which does the same as above.
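The mechanisms listed above can be combined. The sketch below writes a one-property hive-site.xml into a scratch directory, points HIVE_CONF_DIR at it, and sets HIVE_OPTS. The scratch directory, the choice of hive.querylog.location as the overridden variable, and its value are assumptions made for illustration; x1/y1 and x2/y2 are the placeholders used in the list.

```shell
# Illustrative only: a private configuration directory with a hive-site.xml
# that (re-)defines a single variable.
conf_dir=$(mktemp -d)

cat > "$conf_dir/hive-site.xml" <<'EOF'
<?xml version="1.0"?>
<configuration>
  <property>
    <name>hive.querylog.location</name>
    <value>/tmp/hive-querylogs</value>
  </property>
</configuration>
EOF

# Make Hive read its configuration from this directory.
export HIVE_CONF_DIR="$conf_dir"

# Same effect as passing these -hiveconf flags on each invocation.
export HIVE_OPTS="-hiveconf x1=y1 -hiveconf x2=y2"

echo "HIVE_CONF_DIR=$HIVE_CONF_DIR"
echo "HIVE_OPTS=$HIVE_OPTS"
```

A real configuration directory would normally also hold the other files mentioned above, such as hive-log4j.properties.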

Runtime Configuration

  • Hive queries are executed as map-reduce jobs and, therefore, the behavior of such queries can be controlled by the Hadoop configuration variables.
  • The CLI command 'SET' can be used to set any Hadoop (or Hive) configuration variable. For example:
    No Format
        hive> SET mapred.job.tracker=myhost.mycompany.com:50030;
        hive> SET -v;
    
    The latter shows all the current settings. Without the -v option only the variables that differ from the base Hadoop configuration are displayed.
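The same listing can be produced without an interactive session. This sketch relies on the -e option of the Hive CLI, which runs a quoted query string and exits; that option is not described in this section, and the guard around the call is an assumption so the snippet stays harmless on a machine without Hive.

```shell
# Query string taken from the example above.
settings_query='SET -v;'

# Assumes HIVE_HOME was exported as described earlier on this page.
hive_bin="${HIVE_HOME:-}/bin/hive"

if [ -n "${HIVE_HOME:-}" ] && [ -x "$hive_bin" ]; then
  # Print every current Hadoop and Hive setting, then exit.
  "$hive_bin" -e "$settings_query"
else
  echo "hive launcher not found; would run: \$HIVE_HOME/bin/hive -e '$settings_query'"
fi
```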

...

While this usually points to a map-reduce cluster with multiple nodes, Hadoop also offers a nifty option to run map-reduce jobs locally on the user's workstation. This can be very useful to run queries over small data sets – in such cases local mode execution is usually significantly faster than submitting jobs to a large cluster. Data is accessed transparently from HDFS. Conversely, local mode only runs with one reducer and can be very slow processing larger data sets.

Starting with release 0.7, Hive fully supports local mode execution. To enable this, the user can set the following option:

...

In addition, mapred.local.dir should point to a path that's valid on the local machine (for example /tmp/<username>/mapred/local). (Otherwise, the user will get an exception allocating local disk space.)
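As a rough sketch of a local mode invocation: mapred.job.tracker=local is the setting referred to later on this page, and the scratch path follows the /tmp/<username>/mapred/local example, with <username> taken from `id -un` (an assumption). The command line is only printed here, not executed.

```shell
# Local scratch directory, following the example path in the text.
local_dir="/tmp/$(id -un)/mapred/local"
mkdir -p "$local_dir"    # make sure the path is valid on this machine

# Options for a single Hive invocation in local mode.
local_mode_opts="-hiveconf mapred.job.tracker=local -hiveconf mapred.local.dir=$local_dir"

echo "bin/hive $local_mode_opts"
```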

Starting with release 0.7, Hive also supports a mode to run map-reduce jobs in local mode automatically. The relevant options are hive.exec.mode.local.auto, hive.exec.mode.local.auto.inputbytes.max, and hive.exec.mode.local.auto.tasks.max:

...
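A sketch of how the three automatic local mode options might be passed on the command line. The threshold values are chosen purely for illustration and are not taken from this page; the command line is only printed, not executed.

```shell
# Illustrative thresholds (assumptions made for this sketch).
max_input_bytes=$((128 * 1024 * 1024))   # total input size limit, in bytes
max_tasks=4                              # limit on the number of tasks

auto_local_opts="-hiveconf hive.exec.mode.local.auto=true"
auto_local_opts="$auto_local_opts -hiveconf hive.exec.mode.local.auto.inputbytes.max=$max_input_bytes"
auto_local_opts="$auto_local_opts -hiveconf hive.exec.mode.local.auto.tasks.max=$max_tasks"

echo "bin/hive $auto_local_opts"
```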

Note that there may be differences in the runtime environment of Hadoop server nodes and the machine running the Hive client (because of different JVM versions or different software libraries). This can cause unexpected behavior or errors while running in local mode. Also note that local mode execution is done in a separate child JVM of the Hive client. If the user so wishes, the maximum amount of memory for this child JVM can be controlled via the option hive.mapred.local.mem. By default, it's set to zero, in which case Hive lets Hadoop determine the default memory limits of the child JVM.

...

Hive also stores query logs on a per-session basis in /tmp/<user.name>/, but this location can be configured in hive-site.xml with the hive.querylog.location property.

...

When using local mode (using mapred.job.tracker=local), Hadoop/Hive execution logs are produced on the client machine itself. Starting with release 0.6, Hive uses hive-exec-log4j.properties (falling back to hive-log4j.properties only if it's missing) to determine where these logs are delivered by default. The default configuration file produces one log file per query executed in local mode and stores it under /tmp/<user.name>. The intent of providing a separate configuration file is to enable administrators to centralize execution log capture if desired (on an NFS file server, for example). Execution logs are invaluable for debugging run-time errors.

...

Metadata is in an embedded Derby database whose disk storage location is determined by the Hive configuration variable named javax.jdo.option.ConnectionURL. By default this location is ./metastore_db (see conf/hive-default.xml).

...

Some example queries are shown below. They are available in build/dist/examples/queries.
More are available in the Hive sources at ql/src/test/queries/positive.

...

Note that in all the examples that follow, INSERT (into a Hive table, local directory or HDFS directory) is optional.

...

This streams the data in the map phase through the script /bin/cat (like Hadoop streaming).
Similarly, streaming can be used on the reduce side (please see the Hive Tutorial for examples).

...