
...

DISCLAIMER: Hive has only been tested on Unix (Linux) and Mac systems using Java 1.6 for now,
although it may very well work on other similar platforms. It does not work on Cygwin.

...

Requirements

  • Java 1.6
  • Hadoop 0.20.x.

Installing Hive from a Stable Release

Start by downloading the most recent stable release of Hive from one of the Apache download mirrors (see Hive Releases).

...

In addition, you must create /tmp and /user/hive/warehouse
(aka hive.metastore.warehouse.dir) and make them group-writable (chmod g+w) in
HDFS before you can create a table in Hive.

Commands to perform this setup:

No Format
  $ $HADOOP_HOME/bin/hadoop fs -mkdir       /tmp
  $ $HADOOP_HOME/bin/hadoop fs -mkdir       /user/hive/warehouse
  $ $HADOOP_HOME/bin/hadoop fs -chmod g+w   /tmp
  $ $HADOOP_HOME/bin/hadoop fs -chmod g+w   /user/hive/warehouse

You may find it useful, though it's not necessary, to set HIVE_HOME:

No Format
  $ export HIVE_HOME=<hive-install-dir>

...

For more information, see HCatalog Installation from Tarball and HCatalog CLI in the HCatalog manual.

Running WebHCat (Templeton)

...

For more information, see WebHCat Installation in the WebHCat manual.

Configuration management overview

  • Hive by default gets its configuration from <install-dir>/conf/hive-default.xml
  • The location of the Hive configuration directory can be changed by setting the HIVE_CONF_DIR environment variable.
  • Configuration variables can be changed by (re-)defining them in <install-dir>/conf/hive-site.xml
  • Log4j configuration is stored in <install-dir>/conf/hive-log4j.properties
  • Hive configuration is an overlay on top of hadoop

...

  • it inherits the hadoop configuration variables by default.
  • Hive configuration can be manipulated by:
    • Editing hive-site.xml and defining any desired variables (including hadoop variables) in it
    • From the cli using the set command (see below)
    • Invoking hive using the syntax:
      • $ bin/hive -hiveconf x1=y1 -hiveconf x2=y2
        this sets the variables x1 and x2 to y1 and y2 respectively
    • Setting the HIVE_OPTS environment variable to "-hiveconf x1=y1 -hiveconf x2=y2" which does the same as above.
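
For example, a sketch of the environment-variable form (the property and value chosen here are illustrative):

No Format
  $ export HIVE_OPTS='-hiveconf mapred.reduce.tasks=2'
  $ bin/hive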

Runtime configuration

  • Hive queries are executed using map-reduce queries and, therefore, the behavior
    of such queries can be controlled by the hadoop configuration variables.
  • The cli command 'SET' can be used to set any hadoop (or hive) configuration variable. For example:
    No Format
    
        hive> SET mapred.job.tracker=myhost.mycompany.com:50030;
        hive> SET -v;
    
    The latter shows all the current settings. Without the -v option, only the
    variables that differ from the base hadoop configuration are displayed.

Hive, Map-Reduce and Local-Mode

...

While this usually points to a map-reduce cluster with multiple nodes, Hadoop also offers a nifty option to run map-reduce jobs locally on the user's workstation. This can be very useful to run queries over small data sets - in such cases local mode execution is usually significantly faster than submitting jobs to a large cluster. Data is accessed transparently from HDFS. Conversely, local mode only runs with one reducer and can be very slow processing larger data sets.
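
For example, a session can force all map-reduce jobs to run locally (a sketch; mapred.job.tracker is the pre-YARN Hadoop variable referenced later in this section):

No Format
  hive> SET mapred.job.tracker=local;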

...

In addition, mapred.local.dir should point to a path that's valid on the local machine (for example /tmp/<username>/mapred/local). Otherwise, the user will get an exception allocating local disk space.

Starting with release 0.7, Hive also supports a mode to run map-reduce jobs in local-mode automatically. The relevant options are hive.exec.mode.local.auto, hive.exec.mode.local.auto.inputbytes.max, and hive.exec.mode.local.auto.tasks.max:

No Format
  hive> SET hive.exec.mode.local.auto=false;

Note that this feature is disabled by default. If enabled, Hive analyzes the size of each map-reduce job in a query and may run it locally if the following thresholds are satisfied:

...

Hive uses log4j for logging. By default logs are not emitted to the
console by the CLI. The default logging level is WARN and the logs are stored in the folder:

  • /tmp/<user.name>/hive.log

If the user wishes, the logs can be emitted to the console by adding
the arguments shown below:
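
A minimal sketch of such an invocation (the INFO level is illustrative; hive.root.logger is the property discussed below):

No Format
  $ bin/hive -hiveconf hive.root.logger=INFO,console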

...

Note that setting hive.root.logger via the 'set' command does not
change logging properties since they are determined at initialization time.

...

When using local mode (using mapred.job.tracker=local), Hadoop/Hive execution logs are produced on the client machine itself. Starting with release 0.6, Hive uses the hive-exec-log4j.properties (falling back to hive-log4j.properties only if it's missing) to determine where these logs are delivered by default. The default configuration file produces one log file per query executed in local mode and stores it under /tmp/<user.name>. The intent of providing a separate configuration file is to enable administrators to centralize execution log capture if desired (on an NFS file server, for example). Execution logs are invaluable for debugging run-time errors.

Error logs are very useful to debug problems. Please send them with any bugs (of which there are many!) to hive-dev@hadoop.apache.org.

DDL Operations

Creating Hive Tables

No Format
  hive> CREATE TABLE pokes (foo INT, bar STRING);

creates a table called pokes with two columns, the first being an integer and the other a string.

No Format
  hive> CREATE TABLE invites (foo INT, bar STRING) PARTITIONED BY (ds STRING);

creates a table called invites with two columns and a partition column
called ds. The partition column is a virtual column. It is not part
of the data itself but is derived from the partition that a
particular dataset is loaded into.

By default, tables are assumed to be of text input format and the
delimiters are assumed to be ^A (ctrl-a).

Browsing through Tables

No Format
  hive> SHOW TABLES;

lists all the tables.

No Format
  hive> SHOW TABLES '.*s';

lists all the tables that end with 's'. The pattern matching follows Java regular
expressions. Check out this link for documentation: http://java.sun.com/javase/6/docs/api/java/util/regex/Pattern.html.

No Format
  hive> DESCRIBE invites;

shows the list of columns.

Altering and Dropping Tables

Table names can be changed and columns can be added or replaced:

No Format
  hive> ALTER TABLE events RENAME TO 3koobecaf;
  hive> ALTER TABLE pokes ADD COLUMNS (new_col INT);
  hive> ALTER TABLE invites ADD COLUMNS (new_col2 INT COMMENT 'a comment');
  hive> ALTER TABLE invites REPLACE COLUMNS (foo INT, bar STRING, baz INT COMMENT 'baz replaces new_col2');

Note that REPLACE COLUMNS replaces all existing columns and only changes the table's schema, not the data. The table must use a native SerDe. REPLACE COLUMNS can also be used to drop columns from the table's schema:

No Format
  hive> ALTER TABLE invites REPLACE COLUMNS (foo INT COMMENT 'only keep the first column');

Dropping tables:

No Format
  hive> DROP TABLE pokes;

...

Metadata is in an embedded Derby database whose disk storage location is determined by the
hive configuration variable named javax.jdo.option.ConnectionURL. By default
this location is ./metastore_db (see conf/hive-default.xml).

Right now, in the default configuration, this metadata can only be seen by
one user at a time.

Metastore can be stored in any database that is supported by JPOX. The
location and the type of the RDBMS can be controlled by the two variables
javax.jdo.option.ConnectionURL and javax.jdo.option.ConnectionDriverName.
Refer to JDO (or JPOX) documentation for more details on supported databases.
The database schema is defined in JDO metadata annotations file package.jdo
at src/contrib/hive/metastore/src/model.
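
As an illustration, a hive-site.xml sketch that points the metastore at a MySQL database (the host, database name, and driver class here are assumptions; any JPOX-supported database works):

No Format
  <!-- illustrative values; substitute your own RDBMS host, database, and driver -->
  <property>
    <name>javax.jdo.option.ConnectionURL</name>
    <value>jdbc:mysql://dbhost/hive_metastore?createDatabaseIfNotExist=true</value>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionDriverName</name>
    <value>com.mysql.jdbc.Driver</value>
  </property>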

...

If you want to run the metastore as a network server so it can be accessed
from multiple nodes try HiveDerbyServerMode.

...
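
A sketch of such a load (the sample file path is illustrative):

No Format
  hive> LOAD DATA LOCAL INPATH './examples/files/kv1.txt' OVERWRITE INTO TABLE pokes;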

Loads a file that contains two columns separated by ctrl-a into pokes table.
'LOCAL' signifies that the input file is on the local file system. If 'LOCAL'
is omitted then it looks for the file in HDFS.

The keyword 'OVERWRITE' signifies that existing data in the table is deleted.
If the 'OVERWRITE' keyword is omitted, data files are appended to existing data sets.

...

  • NO verification of data against the schema is performed by the load command.
  • If the file is in hdfs, it is moved into the Hive-controlled file system namespace.
    The root of the Hive directory is specified by the option hive.metastore.warehouse.dir
    in hive-default.xml. We advise users to create this directory before
    trying to create tables via Hive.

...
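
For instance, a pair of partitioned loads might look like this (a sketch; file paths and partition values are illustrative):

No Format
  hive> LOAD DATA LOCAL INPATH './examples/files/kv2.txt' OVERWRITE INTO TABLE invites PARTITION (ds='2008-08-15');
  hive> LOAD DATA LOCAL INPATH './examples/files/kv3.txt' OVERWRITE INTO TABLE invites PARTITION (ds='2008-08-08');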

The two LOAD statements above load data into two different partitions of the table
invites. Table invites must be created as partitioned by the key ds for this to succeed.

...

Some example queries are shown below. They are available in build/dist/examples/queries.
More are available in the hive sources at ql/src/test/queries/positive.

SELECTS and FILTERS

No Format
  hive> SELECT a.foo FROM invites a WHERE a.ds='2008-08-15';

selects column 'foo' from all rows of partition ds=2008-08-15 of the invites table. The results are not
stored anywhere, but are displayed on the console.

Note that in all the examples that follow, INSERT (into a hive table, local
directory or HDFS directory) is optional.

...
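
For example, a sketch of writing a partition out to an HDFS directory (the output path is illustrative):

No Format
  hive> INSERT OVERWRITE DIRECTORY '/tmp/hdfs_out' SELECT a.* FROM invites a WHERE a.ds='2008-08-15';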

selects all rows from partition ds=2008-08-15 of the invites table into an HDFS directory. The result data
is in files (depending on the number of mappers) in that directory.
NOTE: partition columns, if any, are selected by the use of *. They can also
be specified in the projection clauses.

...

No Format
  hive> INSERT OVERWRITE LOCAL DIRECTORY '/tmp/local_out' SELECT a.* FROM pokes a;

selects all rows from pokes table into a local directory.

No Format
  hive> INSERT OVERWRITE TABLE events SELECT a.* FROM profiles a;
  hive> INSERT OVERWRITE TABLE events SELECT a.* FROM profiles a WHERE a.key < 100;
  hive> INSERT OVERWRITE LOCAL DIRECTORY '/tmp/reg_3' SELECT a.* FROM events a;
  hive> INSERT OVERWRITE DIRECTORY '/tmp/reg_4' SELECT a.invites, a.pokes FROM profiles a;
  hive> INSERT OVERWRITE DIRECTORY '/tmp/reg_5' SELECT COUNT(*) FROM invites a WHERE a.ds='2008-08-15';
  hive> INSERT OVERWRITE DIRECTORY '/tmp/reg_5' SELECT a.foo, a.bar FROM invites a;
  hive> INSERT OVERWRITE LOCAL DIRECTORY '/tmp/sum' SELECT SUM(a.pc) FROM pc1 a;

selects the sum of a column. The avg, min, or max can also be used. Note that for versions of Hive which don't include HIVE-287, you'll need to use COUNT(1) in place of COUNT(*).

...
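
A sketch of such a map-phase streaming query (the table and column names follow the earlier examples; /bin/cat simply passes rows through):

No Format
  hive> FROM invites a INSERT OVERWRITE TABLE events SELECT TRANSFORM(a.foo, a.bar) AS (oof, rab) USING '/bin/cat' WHERE a.ds > '2008-08-09';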

This streams the data in the map phase through the script /bin/cat (like hadoop streaming).
Similarly, streaming can be used on the reduce side (please see the Hive Tutorial for examples).

Simple Example Use Cases

MovieLens User Ratings

...

The format of Apache weblog is customizable, while most webmasters use the default.
For default Apache weblog, we can create a table with the following command.

More about RegexSerDe can be found here: http://issues.apache.org/jira/browse/HIVE-662.

No Format
add jar ../build/contrib/hive_contrib.jar;

CREATE TABLE apachelog (
  host STRING,
  identity STRING,
  user STRING,
  time STRING,
  request STRING,
  status STRING,
  size STRING,
  referer STRING,
  agent STRING)
ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.RegexSerDe'
WITH SERDEPROPERTIES (
  "input.regex" = "([^]*) ([^]*) ([^]*) (-|\\[^\\]*\\]) ([^ \"]*|\"[^\"]*\") (-|[0-9]*) (-|[0-9]*)(?: ([^ \"]*|\".*\") ([^ \"]*|\".*\"))?",
  "output.format.string" = "%1$s %2$s %3$s %4$s %5$s %6$s %7$s %8$s %9$s"
)
STORED AS TEXTFILE;
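
Once the table exists, log data can be loaded and queried like any other table. A sketch (the log file path is illustrative; COUNT(1) is used per the HIVE-287 note above):

No Format
  hive> LOAD DATA LOCAL INPATH '/var/log/httpd/access_log' INTO TABLE apachelog;
  hive> SELECT host, COUNT(1) FROM apachelog GROUP BY host;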