...

Next you need to unpack the tarball. This will result in the creation of a subdirectory named hive-x.y.z:

  $ tar -xzvf hive-x.y.z.tar.gz

Set the environment variable HIVE_HOME to point to the installation directory:

  $ cd hive-x.y.z
  $ export HIVE_HOME=$(pwd)

Finally, add $HIVE_HOME/bin to your PATH:

  $ export PATH=$HIVE_HOME/bin:$PATH

...

The Hive SVN repository is located here: http://svn.apache.org/repos/asf/hive/trunk

  $ svn co http://svn.apache.org/repos/asf/hive/trunk hive
  $ cd hive
  $ ant clean package
  $ cd build/dist
  $ ls
  README.txt
  bin/ (all the shell scripts)
  lib/ (required jar files)
  conf/ (configuration files)
  examples/ (sample input and query files)

...

Compile Hive against Hadoop 0.23

  $ svn co http://svn.apache.org/repos/asf/hive/trunk hive
  $ cd hive
  $ ant clean package -Dhadoop.version=0.23.3 -Dhadoop-0.23.version=0.23.3 -Dhadoop.mr.rev=23
  $ ant clean package -Dhadoop.version=2.0.0-alpha -Dhadoop-0.23.version=2.0.0-alpha -Dhadoop.mr.rev=23

...

Commands to perform this setup:

  $ $HADOOP_HOME/bin/hadoop fs -mkdir       /tmp
  $ $HADOOP_HOME/bin/hadoop fs -mkdir       /user/hive/warehouse
  $ $HADOOP_HOME/bin/hadoop fs -chmod g+w   /tmp
  $ $HADOOP_HOME/bin/hadoop fs -chmod g+w   /user/hive/warehouse
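
You can verify the directories and their permissions with a listing, for example:

  $ $HADOOP_HOME/bin/hadoop fs -ls /tmp /user/hive/warehouse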

You may find it useful, though it's not necessary, to set HIVE_HOME:

  $ export HIVE_HOME=<hive-install-dir>

To use the Hive command line interface (CLI) from the shell:

  $ $HIVE_HOME/bin/hive
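
The CLI can also run a single statement non-interactively with the -e option, for example:

  $ $HIVE_HOME/bin/hive -e 'SHOW TABLES;'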

Configuration management overview

...

  • The CLI command 'SET' can be used to set any Hadoop (or Hive) configuration variable. For example:
    hive> SET mapred.job.tracker=myhost.mycompany.com:50030;
    hive> SET -v;

...

The Hive compiler generates map-reduce jobs for most queries. These jobs are then submitted to the map-reduce cluster indicated by the variable:

  mapred.job.tracker

While this usually points to a map-reduce cluster with multiple nodes, Hadoop also offers a nifty option to run map-reduce jobs locally on the user's workstation. This can be very useful for running queries over small data sets; in such cases local mode execution is usually significantly faster than submitting jobs to a large cluster. Data is accessed transparently from HDFS. Conversely, local mode runs with only one reducer and can be very slow when processing larger data sets.

Starting with release 0.7, Hive fully supports local mode execution. To enable it, set the following option:

  hive> SET mapred.job.tracker=local;

...

Starting with release 0.7, Hive also supports a mode that runs map-reduce jobs in local mode automatically. The relevant option is:

  hive> SET hive.exec.mode.local.auto=false;
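
Note that this setting is disabled (false) by default. To have Hive decide automatically whether to run a job in local mode, set it to true:

  hive> SET hive.exec.mode.local.auto=true;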

...

Creating Hive tables and browsing through them

  hive> CREATE TABLE pokes (foo INT, bar STRING);

Creates a table called pokes with two columns, the first being an integer and the other a string.

  hive> CREATE TABLE invites (foo INT, bar STRING) PARTITIONED BY (ds STRING);

...

By default, tables are assumed to be of text input format and the
delimiters are assumed to be ^A (ctrl-a).
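
The delimiter can be overridden when the table is created; a minimal sketch (the table name pokes_csv is illustrative):

  hive> CREATE TABLE pokes_csv (foo INT, bar STRING) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';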

  hive> SHOW TABLES;

lists all the tables

  hive> SHOW TABLES '.*s';

lists all the tables that end with 's'. The pattern matching follows Java regular
expressions. See http://java.sun.com/javase/6/docs/api/java/util/regex/Pattern.html for documentation.

  hive> DESCRIBE invites;

shows the list of columns

As for altering tables, table names can be changed and additional columns can be added:

  hive> ALTER TABLE pokes ADD COLUMNS (new_col INT);
  hive> ALTER TABLE invites ADD COLUMNS (new_col2 INT COMMENT 'a comment');
  hive> ALTER TABLE events RENAME TO 3koobecaf;

Dropping tables:

  hive> DROP TABLE pokes;
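
In later releases, DROP TABLE IF EXISTS can be used so the statement does not fail when the table is absent:

  hive> DROP TABLE IF EXISTS pokes;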

Metadata Store

...

Loading data from flat files into Hive:

  hive> LOAD DATA LOCAL INPATH './examples/files/kv1.txt' OVERWRITE INTO TABLE pokes;

...

  • NO verification of data against the schema is performed by the load command.
  • If the file is in HDFS, it is moved into the Hive-controlled file system namespace.
    The root of the Hive directory is specified by the option hive.metastore.warehouse.dir
    in hive-default.xml. We advise users to create this directory before
    trying to create tables via Hive.
  hive> LOAD DATA LOCAL INPATH './examples/files/kv2.txt' OVERWRITE INTO TABLE invites PARTITION (ds='2008-08-15');
  hive> LOAD DATA LOCAL INPATH './examples/files/kv3.txt' OVERWRITE INTO TABLE invites PARTITION (ds='2008-08-08');

The two LOAD statements above load data into two different partitions of the table
invites. Table invites must be created as partitioned by the key ds for this to succeed.
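
The partitions created by the loads can be listed with:

  hive> SHOW PARTITIONS invites;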

  hive> LOAD DATA INPATH '/user/myname/kv2.txt' OVERWRITE INTO TABLE invites PARTITION (ds='2008-08-15');

...

Some example queries are shown below. They are available in build/dist/examples/queries.
More are available in the Hive sources at ql/src/test/queries/positive.

SELECTS and FILTERS

  hive> SELECT a.foo FROM invites a WHERE a.ds='2008-08-15';

...

Note that in all the examples that follow, INSERT (into a Hive table, local
directory or HDFS directory) is optional.

  hive> INSERT OVERWRITE DIRECTORY '/tmp/hdfs_out' SELECT a.* FROM invites a WHERE a.ds='2008-08-15';

...

Partitioned tables must always have a partition selected in the WHERE clause of the statement.

  hive> INSERT OVERWRITE LOCAL DIRECTORY '/tmp/local_out' SELECT a.* FROM pokes a;

Selects all rows from the pokes table into a local directory.

  hive> INSERT OVERWRITE TABLE events SELECT a.* FROM profiles a;
  hive> INSERT OVERWRITE TABLE events SELECT a.* FROM profiles a WHERE a.key < 100;
  hive> INSERT OVERWRITE LOCAL DIRECTORY '/tmp/reg_3' SELECT a.* FROM events a;
  hive> INSERT OVERWRITE DIRECTORY '/tmp/reg_4' SELECT a.invites, a.pokes FROM profiles a;
  hive> INSERT OVERWRITE DIRECTORY '/tmp/reg_5' SELECT COUNT(*) FROM invites a WHERE a.ds='2008-08-15';
  hive> INSERT OVERWRITE DIRECTORY '/tmp/reg_5' SELECT a.foo, a.bar FROM invites a;
  hive> INSERT OVERWRITE LOCAL DIRECTORY '/tmp/sum' SELECT SUM(a.pc) FROM pc1 a;

Sum of a column. avg, min and max can also be used, as shown below. Note that for versions of Hive which don't include HIVE-287, you'll need to use COUNT(1) in place of COUNT(*).
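
An illustrative query combining these aggregates:

  hive> SELECT AVG(a.foo), MIN(a.foo), MAX(a.foo) FROM invites a WHERE a.ds='2008-08-15';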

GROUP BY

  hive> FROM invites a INSERT OVERWRITE TABLE events SELECT a.bar, count(*) WHERE a.foo > 0 GROUP BY a.bar;
  hive> INSERT OVERWRITE TABLE events SELECT a.bar, count(*) FROM invites a WHERE a.foo > 0 GROUP BY a.bar;

Note that for versions of Hive which don't include HIVE-287, you'll need to use COUNT(1) in place of COUNT(*).

JOIN

  hive> FROM pokes t1 JOIN invites t2 ON (t1.bar = t2.bar) INSERT OVERWRITE TABLE events SELECT t1.bar, t1.foo, t2.foo;
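
Outer joins are also supported; an illustrative variant that keeps rows from pokes with no match in invites:

  hive> FROM pokes t1 LEFT OUTER JOIN invites t2 ON (t1.bar = t2.bar) INSERT OVERWRITE TABLE events SELECT t1.bar, t1.foo, t2.foo;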

MULTITABLE INSERT

  FROM src
  INSERT OVERWRITE TABLE dest1 SELECT src.* WHERE src.key < 100
  INSERT OVERWRITE TABLE dest2 SELECT src.key, src.value WHERE src.key >= 100 and src.key < 200
  INSERT OVERWRITE TABLE dest3 PARTITION(ds='2008-04-08', hr='12') SELECT src.key WHERE src.key >= 200 and src.key < 300
  INSERT OVERWRITE LOCAL DIRECTORY '/tmp/dest4.out' SELECT src.value WHERE src.key >= 300;

STREAMING

  hive> FROM invites a INSERT OVERWRITE TABLE events SELECT TRANSFORM(a.foo, a.bar) AS (oof, rab) USING '/bin/cat' WHERE a.ds > '2008-08-09';

...

First, create a table with tab-delimited text file format:

  CREATE TABLE u_data (
    userid INT,
    movieid INT,
    rating INT,
    unixtime STRING)
  ROW FORMAT DELIMITED
  FIELDS TERMINATED BY '\t'
  STORED AS TEXTFILE;

Then, download and extract the data files:

  wget http://www.grouplens.org/system//sites/www.grouplens.org/external_files/data/ml-data.tar+0.gz
  tar xvzf ml-data.tar+0.gz

And load it into the table that was just created:

  LOAD DATA LOCAL INPATH 'ml-data/u.data'
  OVERWRITE INTO TABLE u_data;

Count the number of rows in table u_data:

  SELECT COUNT(*) FROM u_data;

...

Now we can do some complex data analysis on the table u_data:

Create weekday_mapper.py:

  import sys
  import datetime

  # For each tab-separated input record, replace the unix timestamp
  # with the day of the week (1-7) and emit the transformed record.
  for line in sys.stdin:
    line = line.strip()
    userid, movieid, rating, unixtime = line.split('\t')
    weekday = datetime.datetime.fromtimestamp(float(unixtime)).isoweekday()
    print '\t'.join([userid, movieid, rating, str(weekday)])

Use the mapper script:

  CREATE TABLE u_data_new (
    userid INT,
    movieid INT,
    rating INT,
    weekday INT)
  ROW FORMAT DELIMITED
  FIELDS TERMINATED BY '\t';

  add FILE weekday_mapper.py;

  INSERT OVERWRITE TABLE u_data_new
  SELECT
    TRANSFORM (userid, movieid, rating, unixtime)
    USING 'python weekday_mapper.py'
    AS (userid, movieid, rating, weekday)
  FROM u_data;

  SELECT weekday, COUNT(*)
  FROM u_data_new
  GROUP BY weekday;

...

More about RegexSerDe can be found at http://issues.apache.org/jira/browse/HIVE-662.

  add jar ../build/contrib/hive_contrib.jar;

  CREATE TABLE apachelog (
    host STRING,
    identity STRING,
    user STRING,
    time STRING,
    request STRING,
    status STRING,
    size STRING,
    referer STRING,
    agent STRING)
  ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.RegexSerDe'
  WITH SERDEPROPERTIES (
    "input.regex" = "([^ ]*) ([^ ]*) ([^ ]*) (-|\\[[^\\]]*\\]) ([^ \"]*|\"[^\"]*\") (-|[0-9]*) (-|[0-9]*)(?: ([^ \"]*|\".*\") ([^ \"]*|\".*\"))?",
    "output.format.string" = "%1$s %2$s %3$s %4$s %5$s %6$s %7$s %8$s %9$s"
  )
  STORED AS TEXTFILE;
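
Once the table exists, Apache log data can be loaded and queried like any other table; a minimal sketch, where the input path is hypothetical:

  LOAD DATA LOCAL INPATH '/path/to/access.log' INTO TABLE apachelog;
  SELECT host, request FROM apachelog LIMIT 5;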