Installation and Configuration
You can install a stable release of Hive by downloading a tarball, or you can download the source code and build Hive from that.
Requirements
- Java 1.7 (preferred), Java 1.6.
Note: Hive versions 1.2 onward require Java 1.7 or newer. Hive versions 0.14 to 1.1 work with Java 1.6 as well. Users are strongly advised to start moving to Java 1.8 (see HIVE-8607).
- Hadoop 2.x (preferred), 1.x (not supported by Hive 2.0.0 onward). Hive versions up to 0.13 also supported Hadoop 0.20.x, 0.23.x.
- Hive is commonly used in production Linux and Windows environments. Mac is a commonly used development environment. The instructions in this document are applicable to Linux and Mac. Using it on Windows would require slightly different steps.
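A quick way to sanity-check these requirements is to print the installed versions (assuming java and hadoop are already on your path):
$ java -version
$ hadoop version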
Installing Hive from a Stable Release
...
Building Hive from Source
The Hive GIT repository for the most recent Hive code is located here: git clone https://git-wip-us.apache.org/repos/asf/hive.git (the master branch).
All release versions are in branches named "branch-0.#" or "branch-1.#" or the upcoming "branch-2.#", with the exception of release 0.8.1 which is in "branch-0.8-r2". Any branches with other names are feature branches for works-in-progress. See Understanding Hive Branches for details.
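For example, to work from a release branch rather than master, you can check it out after cloning (the branch name below is illustrative; pick the release you need):
$ cd hive
$ git checkout branch-1.2 //illustrative release branch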
As of 0.13, Hive is built using Apache Maven.
Compile Hive on master
To build the current Hive code from the master branch:
$ git clone https://git-wip-us.apache.org/repos/asf/hive.git
$ cd hive
$ mvn clean package -Pdist [-DskipTests -Dmaven.javadoc.skip=true]
$ cd packaging/target/apache-hive-{version}-SNAPSHOT-bin/apache-hive-{version}-SNAPSHOT-bin
$ ls
LICENSE
NOTICE
README.txt
RELEASE_NOTES.txt
bin/ (all the shell scripts)
lib/ (required jar files)
conf/ (configuration files)
examples/ (sample input and query files)
hcatalog/ (hcatalog installation)
scripts/ (upgrade scripts for hive-metastore)
...
If building Hive source using Maven (mvn), we will refer to the directory "/packaging/target/apache-hive-{version}-SNAPSHOT-bin/apache-hive-{version}-SNAPSHOT-bin" as <install-dir> for the rest of the page.
Compile Hive on branch-1
In branch-1, Hive supports both Hadoop 1.x and 2.x. You will need to specify which version of Hadoop to build against via a Maven profile. To build against Hadoop 1.x use the profile hadoop-1; for Hadoop 2.x use hadoop-2. For example, to build against Hadoop 1.x, the above mvn command becomes:
$ mvn clean package -Phadoop-1,dist
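Similarly, to build branch-1 against Hadoop 2.x, use the hadoop-2 profile mentioned above:
$ mvn clean package -Phadoop-2,dist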
Compile Hive Prior to 0.13 on Hadoop 0.20
...
- you must have Hadoop in your path OR export HADOOP_HOME=<hadoop-install-dir>
In addition, you must use the HDFS commands below to create /tmp and /user/hive/warehouse (aka hive.metastore.warehouse.dir) and set them chmod g+w in HDFS before you can create a table in Hive.
Commands to perform this setup:
$ $HADOOP_HOME/bin/hadoop fs -mkdir /tmp
$ $HADOOP_HOME/bin/hadoop fs -mkdir /user/hive/warehouse
$ $HADOOP_HOME/bin/hadoop fs -chmod g+w /tmp
$ $HADOOP_HOME/bin/hadoop fs -chmod g+w /user/hive/warehouse
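Optionally, you can confirm the directories and their group-write permission with a listing (a simple check, not required by the setup):
$ $HADOOP_HOME/bin/hadoop fs -ls / //shows the permissions on /tmp
$ $HADOOP_HOME/bin/hadoop fs -ls /user/hive //shows the permissions on /user/hive/warehouse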
...
$ export HIVE_HOME=<hive-install-dir>
Running Hive CLI
To use the Hive command line interface (CLI) from the shell:
$ $HIVE_HOME/bin/hive
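Besides the interactive shell, the CLI can also execute a single statement or a script file directly (standard hive flags; the file name below is illustrative):
$ $HIVE_HOME/bin/hive -e 'SHOW TABLES;'
$ $HIVE_HOME/bin/hive -f /path/to/script.sql //illustrative script path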
Running HiveServer2 and Beeline
Starting from Hive 2.1, we need to run the schematool command below as an initialization step. For example, we can use "derby" as the db type.
$ $HIVE_HOME/bin/schematool -dbType <db type> -initSchema
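For example, to initialize the schema with the embedded Derby metastore mentioned above:
$ $HIVE_HOME/bin/schematool -dbType derby -initSchema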
HiveServer2 (introduced in Hive 0.11) has its own CLI called Beeline. HiveCLI is now deprecated in favor of Beeline, as it lacks the multi-user, security, and other capabilities of HiveServer2. To run HiveServer2 and Beeline from shell:
...
- Hive configuration can be manipulated by:
- Editing hive-site.xml and defining any desired variables (including Hadoop variables) in it
- Using the set command (see next section)
- Invoking Hive (deprecated), Beeline or HiveServer2 using the syntax:
- $ bin/hive --hiveconf x1=y1 --hiveconf x2=y2 //this sets the variables x1 and x2 to y1 and y2 respectively
- $ bin/hiveserver2 --hiveconf x1=y1 --hiveconf x2=y2 //this sets server-side variables x1 and x2 to y1 and y2 respectively
- $ bin/beeline --hiveconf x1=y1 --hiveconf x2=y2 //this sets client-side variables x1 and x2 to y1 and y2 respectively
- Setting the HIVE_OPTS environment variable to "--hiveconf x1=y1 --hiveconf x2=y2", which does the same as above (see the example after this list).
...
- Hive queries are executed using map-reduce queries and, therefore, the behavior of such queries can be controlled by the Hadoop configuration variables.
The HiveCLI (deprecated) and Beeline command 'SET' can be used to set any Hadoop (or Hive) configuration variable. For example:
beeline> SET mapred.job.tracker=myhost.mycompany.com:50030;
beeline> SET -v;
The latter shows all the current settings. Without the -v option only the variables that differ from the base Hadoop configuration are displayed.
...
Starting with release 0.7, Hive fully supports local mode execution. To enable this, the user can set the following option:
hive> SET mapreduce.framework.name=local;
In addition, mapred.local.dir should point to a path that's valid on the local machine (for example /tmp/<username>/mapred/local). (Otherwise, the user will get an exception allocating local disk space.)
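Putting these two settings together, a local mode session might begin as follows (the directory is illustrative and must be valid and writable on the local machine):
hive> SET mapreduce.framework.name=local;
hive> SET mapred.local.dir=/tmp/<username>/mapred/local;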
...
Note that there may be differences in the runtime environment of Hadoop server nodes and the machine running the Hive client (because of different jvm versions or different software libraries). This can cause unexpected behavior/errors while running in local mode. Also note that local mode execution is done in a separate, child jvm (of the Hive client). If the user so wishes, the maximum amount of memory for this child jvm can be controlled via the option hive.mapred.local.mem. By default, it's set to zero, in which case Hive lets Hadoop determine the default memory limits of the child jvm.
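For instance, to cap the child jvm's memory for local mode (the value below is illustrative; how it is interpreted may vary by release, so check your version's documentation):
hive> SET hive.mapred.local.mem=2048; //illustrative value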
Hive Logging
Hive uses log4j for logging. By default logs are not emitted to the console by the CLI. The default logging level is WARN for Hive releases prior to 0.13.0. Starting with Hive 0.13.0, the default logging level is INFO.
The logs are stored in the directory /tmp/<user.name>:
/tmp/<user.name>/hive.log
Note: In local mode, prior to Hive 0.13.0 the log file name was ".log" instead of "hive.log". This bug was fixed in release 0.13.0 (see HIVE-5528 and HIVE-5676).
...
If the user wishes, the logs can be emitted to the console by adding the arguments shown below:
bin/hive --hiveconf hive.root.logger=INFO,console //for HiveCLI (deprecated)
bin/hiveserver2 --hiveconf hive.root.logger=INFO,console
Alternatively, the user can change the logging level only by using:
bin/hive --hiveconf hive.root.logger=INFO,DRFA //for HiveCLI (deprecated)
bin/hiveserver2 --hiveconf hive.root.logger=INFO,DRFA
Another option for logging is TimeBasedRollingPolicy (applicable for Hive 1.1.0 and above, HIVE-9001) by providing the DAILY option as shown below:
bin/hive --hiveconf hive.root.logger=INFO,DAILY //for HiveCLI (deprecated)
bin/hiveserver2 --hiveconf hive.root.logger=INFO,DAILY
Note that setting hive.root.logger via the 'set' command does not change logging properties since they are determined at initialization time.
Hive also stores query logs on a per Hive session basis in /tmp/<user.name>/, but this can be configured in hive-site.xml with the hive.querylog.location property. Starting with Hive 1.1.0, EXPLAIN EXTENDED output for queries can be logged at the INFO level by setting the hive.log.explain.output property to true.
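For example, to enable that output for the current session (property name as given above):
hive> SET hive.log.explain.output=true;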
Logging during Hive execution on a Hadoop cluster is controlled by Hadoop configuration. Usually Hadoop will produce one log file per map and reduce task stored on the cluster machine(s) where the task was executed. The log files can be obtained by clicking through to the Task Details page from the Hadoop JobTracker Web UI.
When using local mode (using mapreduce.framework.name=local), Hadoop/Hive execution logs are produced on the client machine itself. Starting with release 0.6, Hive uses the hive-exec-log4j.properties (falling back to hive-log4j.properties only if it's missing) to determine where these logs are delivered by default. The default configuration file produces one log file per query executed in local mode and stores it under /tmp/<user.name>. The intent of providing a separate configuration file is to enable administrators to centralize execution log capture if desired (on an NFS file server, for example). Execution logs are invaluable for debugging run-time errors.
...
Error logs are very useful to debug problems. Please send them with any bugs (of which there are many!) to hive-dev@hadoop.apache.org.
From Hive 2.1.0 onwards (with HIVE-13027), Hive uses Log4j2's asynchronous logger by default. Setting hive.async.log.enabled to false will disable asynchronous logging and fall back to synchronous logging. Asynchronous logging can give a significant performance improvement as logging will be handled in a separate thread that uses the LMAX disruptor queue for buffering log messages. Refer to https://logging.apache.org/log4j/2.x/manual/async.html for benefits and drawbacks.
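For example, to switch HiveServer2 back to synchronous logging at startup (reusing the --hiveconf syntax shown earlier):
$ bin/hiveserver2 --hiveconf hive.async.log.enabled=false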
HiveServer2 Logs
HiveServer2 operation logs are available to clients starting in Hive 0.14. See HiveServer2 Logging for configuration.
Audit Logs
Audit logs are logged from the Hive metastore server for every metastore API invocation.
An audit log has the function and some of the relevant function arguments logged in the metastore log file. It is logged at the INFO level of log4j, so you need to make sure that the logging at the INFO level is enabled (see HIVE-3505). The name of the log entry is "HiveMetaStore.audit".
Audit logs were added in Hive 0.7 for secure client connections (HIVE-1948) and in Hive 0.10 for non-secure connections (HIVE-3277; also see HIVE-2797).
Perf Logger
In order to obtain the performance metrics via the PerfLogger, you need to set DEBUG level logging for the PerfLogger class (HIVE-12675). This can be achieved by setting the following in the log4j properties file.
log4j.logger.org.apache.hadoop.hive.ql.log.PerfLogger=DEBUG
If the logger level has already been set to DEBUG at root via hive.root.logger, the above setting is not required to see the performance logs.
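On Hive 2.x, which uses Log4j2, the equivalent configuration in hive-log4j2.properties would look roughly like the following (Log4j2 properties syntax; a sketch, so verify against your release):
loggers = PerfLogger
logger.PerfLogger.name = org.apache.hadoop.hive.ql.log.PerfLogger
logger.PerfLogger.level = DEBUG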
DDL Operations
The Hive DDL operations are documented in Hive Data Definition Language.
Creating Hive Tables
hive> CREATE TABLE pokes (foo INT, bar STRING);
...