Table of Contents

DISCLAIMER: Hive has only been tested on unix (linux) and mac systems using Java 1.6 for now, although it may very well work on other similar platforms. It does not work on Cygwin.

Most of our testing has been on Hadoop 0.20, so we advise running it against this version even though it may compile/work against other versions.

Hive introduction videos

From Cloudera:

  • Hive Introduction Video
  • Hive Tutorial Video

Installation and Configuration

Requirements

  • Java 1.6
  • Hadoop 0.17.x to 0.20.x

Installing Hive from a Stable Release

Start by downloading the most recent stable release of Hive from one of the Apache download mirrors (see Hive Releases). Next you need to unpack the tarball. This will result in the creation of a subdirectory named hive-x.y.z:

{code}
  $ tar -xzvf hive-x.y.z.tar.gz
{code}

Set the environment variable HIVE_HOME to point to the installation directory:

{code}
  $ cd hive-x.y.z
  $ export HIVE_HOME=$(pwd)
{code}

Finally, add $HIVE_HOME/bin to your PATH:

{code}
  $ export PATH=$HIVE_HOME/bin:$PATH
{code}
Building Hive from Source

The Hive SVN repository is located here: http://svn.apache.org/repos/asf/hive/trunk
{code}
  $ svn co http://svn.apache.org/repos/asf/hive/trunk hive
  $ cd hive
  $ ant clean package
  $ cd build/dist
  $ ls
  README.txt
  bin/ (all the shell scripts)
  lib/ (required jar files)
  conf/ (configuration files)
  examples/ (sample input and query files)
{code}

In the rest of the page, we use build/dist and <install-dir> interchangeably.

Running Hive

Hive uses hadoop, which means:

  • you must have hadoop in your path, OR
  • export HADOOP_HOME=<hadoop-install-dir>
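For example, a minimal sketch (the hadoop install path below is hypothetical):

{code}
  $ export HADOOP_HOME=/usr/local/hadoop   # hypothetical install location
  $ export PATH=$HADOOP_HOME/bin:$PATH     # or simply make sure hadoop is in your path
{code}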

In addition, you must create /tmp and /user/hive/warehouse (aka hive.metastore.warehouse.dir) and set them chmod g+w in HDFS before a table can be created in Hive. Commands to perform this setup:

{code}
  $ $HADOOP_HOME/bin/hadoop fs -mkdir       /tmp
  $ $HADOOP_HOME/bin/hadoop fs -mkdir       /user/hive/warehouse
  $ $HADOOP_HOME/bin/hadoop fs -chmod g+w   /tmp
  $ $HADOOP_HOME/bin/hadoop fs -chmod g+w   /user/hive/warehouse
{code}

I also find it useful but not necessary to set HIVE_HOME:

{code}
  $ export HIVE_HOME=<hive-install-dir>
{code}

To use the hive command line interface (cli) from the shell:

{code}
  $ $HIVE_HOME/bin/hive
{code}
Configuration management overview

  • Hive default configuration is stored in <install-dir>/conf/hive-default.xml
  • Configuration variables can be changed by (re-)defining them in <install-dir>/conf/hive-site.xml
  • The location of the Hive configuration directory can be changed by setting the HIVE_CONF_DIR environment variable.
  • Log4j configuration is stored in <install-dir>/conf/hive-log4j.properties
  • Hive configuration is an overlay on top of hadoop - meaning the hadoop configuration variables are inherited by default.
  • Hive configuration can be manipulated by:
      • Editing hive-site.xml and defining any desired variables (including hadoop variables) in it
      • From the cli using the set command (see below)
      • By invoking hive using the syntax $ bin/hive -hiveconf x1=y1 -hiveconf x2=y2; this sets the variables x1 and x2 to y1 and y2 respectively
      • By setting the HIVE_OPTS environment variable to "-hiveconf x1=y1 -hiveconf x2=y2", which does the same as above
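For example, the following two invocations are equivalent (hive.exec.scratchdir and mapred.reduce.tasks are just illustrative variables):

{code}
  $ bin/hive -hiveconf hive.exec.scratchdir=/tmp/mydir -hiveconf mapred.reduce.tasks=32

  $ export HIVE_OPTS="-hiveconf hive.exec.scratchdir=/tmp/mydir -hiveconf mapred.reduce.tasks=32"
  $ bin/hive
{code}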

Runtime configuration

  • Hive queries are executed using map-reduce queries and, therefore, the behavior of such queries can be controlled by the hadoop configuration variables.
  • The cli command 'SET' can be used to set any hadoop (or hive) configuration variable. For example:

{code}
    hive> SET mapred.job.tracker=myhost.mycompany.com:50030;
    hive> SET -v;
{code}

The latter shows all the current settings. Without the -v option, only the variables that differ from the base hadoop configuration are displayed.

Hive, Map-Reduce and Local-Mode

Hive compiler generates map-reduce jobs for most queries. These jobs are then submitted to the Map-Reduce cluster indicated by the variable:

{code}
  mapred.job.tracker
{code}

While this usually points to a map-reduce cluster with multiple nodes, Hadoop also offers a nifty option to run map-reduce jobs locally on the user's workstation. This can be very useful to run queries over small data sets - in such cases local mode execution is usually significantly faster than submitting jobs to a large cluster. Data is accessed transparently from HDFS. Conversely, local mode only runs with one reducer and can be very slow processing larger data sets.

Starting v-0.7, Hive fully supports local mode execution. To enable this, the user can enable the following option:

{code}
  hive> SET mapred.job.tracker=local;
{code}

In addition, mapred.local.dir should point to a path that's valid on the local machine (for example /tmp/<username>/mapred/local). (Otherwise, the user will get an exception allocating local disk space.)
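For instance (substituting your own user name for myuser):

{code}
  hive> SET mapred.local.dir=/tmp/myuser/mapred/local;
{code}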

Starting v-0.7, Hive also supports a mode to run map-reduce jobs in local-mode automatically. The relevant options are:

{code}
  hive> SET hive.exec.mode.local.auto=false;
{code}

Note that this feature is disabled by default. If enabled, Hive analyzes the size of each map-reduce job in a query and may run it locally if the following thresholds are satisfied:

  • The total input size of the job is lower than hive.exec.mode.local.auto.inputbytes.max (128MB by default)
  • The total number of map-tasks is less than hive.exec.mode.local.auto.tasks.max (4 by default)
  • The total number of reduce tasks required is 1 or 0.
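For example, to turn automatic local mode on and tighten the input-size threshold (the 50000000 byte value is only an illustration):

{code}
  hive> SET hive.exec.mode.local.auto=true;
  hive> SET hive.exec.mode.local.auto.inputbytes.max=50000000;
{code}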

So for queries over small data sets, or for queries with multiple map-reduce jobs where the input to subsequent jobs is substantially smaller (because of reduction/filtering in the prior job), jobs may be run locally.

Note that there may be differences in the runtime environment of hadoop server nodes and the machine running the hive client (because of different jvm versions or different software libraries). This can cause unexpected behavior/errors while running in local mode. Also note that local mode execution is done in a separate, child jvm (of the hive client). If the user so wishes, the maximum amount of memory for this child jvm can be controlled via the option hive.mapred.local.mem. By default, it's set to zero, in which case Hive lets Hadoop determine the default memory limits of the child jvm.

Error Logs

Hive uses log4j for logging. By default logs are not emitted to the console by the CLI. The default logging level is WARN and the logs are stored in the folder:

  • /tmp/<user.name>/hive.log
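To follow the log while reproducing a problem (substituting your own user name in the path):

{code}
  $ tail -f /tmp/$USER/hive.log
{code}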

If the user wishes, the logs can be emitted to the console by adding the arguments shown below:

  • bin/hive -hiveconf hive.root.logger=INFO,console

Alternatively, the user can change the logging level only by using:

  • bin/hive -hiveconf hive.root.logger=INFO,DRFA

Note that setting hive.root.logger via the 'set' command does not change logging properties since they are determined at initialization time.

Logging during Hive execution on a Hadoop cluster is controlled by Hadoop configuration. Usually Hadoop will produce one log file per map and reduce task, stored on the cluster machine(s) where the task was executed. The log files can be obtained by clicking through to the Task Details page from the Hadoop JobTracker Web UI.

When using local mode (using mapred.job.tracker=local), Hadoop/Hive execution logs are produced on the client machine itself.

Starting v-0.6, Hive uses the hive-exec-log4j.properties (falling back to hive-log4j.properties only if it's missing) to determine where these logs are delivered by default. The default configuration file produces one log file per query executed in local mode and stores it under /tmp/<user.name>. The intent of providing a separate configuration file is to enable administrators to centralize execution log capture if desired (on an NFS file server, for example). Execution logs are invaluable for debugging run-time errors.

Error logs are very useful to debug problems. Please send them with any bugs (of which there are many!) to hive-dev@hadoop.apache.org.

DDL Operations

Creating Hive tables and browsing through them

{code}
  hive> CREATE TABLE pokes (foo INT, bar STRING);  
{code}

Creates a table called pokes with two columns, the first being an integer and the other a string.

{code}
  hive> CREATE TABLE invites (foo INT, bar STRING) PARTITIONED BY (ds STRING);  
{code}

Creates a table called invites with two columns and a partition column called ds. The partition column is a virtual column. It is not part of the data itself but is derived from the partition that a particular dataset is loaded into. By default, tables are assumed to be of text input format and the delimiters are assumed to be ^A (ctrl-a).

{code}
  hive> SHOW TABLES;
{code}

lists all the tables.

{code}
  hive> SHOW TABLES '.*s';
{code}

lists all the tables that end with 's'. The pattern matching follows Java regular expressions. Check out this link for documentation: http://java.sun.com/javase/6/docs/api/java/util/regex/Pattern.html
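Since the pattern is a full Java regular expression, a prefix match works too; for instance, assuming the tables created above exist:

{code}
  hive> SHOW TABLES 'inv.*';
{code}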

{code}
hive> DESCRIBE invites;
{code}

shows the list of columns.

As for altering tables, table names can be changed and additional columns can be added:

{code}
  hive> ALTER TABLE pokes ADD COLUMNS (new_col INT);
  hive> ALTER TABLE invites ADD COLUMNS (new_col2 INT COMMENT 'a comment');
  hive> ALTER TABLE events RENAME TO 3koobecaf;
{code}

Dropping tables:

{code}
  hive> DROP TABLE pokes;
{code}
Metadata Store

Metadata is in an embedded Derby database whose disk storage location is determined by the hive configuration variable named javax.jdo.option.ConnectionURL. By default (see conf/hive-default.xml), this location is ./metastore_db.

Right now, in the default configuration, this metadata can only be seen by one user at a time.

Metastore can be stored in any database that is supported by JPOX. The location and the type of the RDBMS can be controlled by the two variables javax.jdo.option.ConnectionURL and javax.jdo.option.ConnectionDriverName. Refer to JDO (or JPOX) documentation for more details on supported databases. The database schema is defined in JDO metadata annotations file package.jdo at src/contrib/hive/metastore/src/model.

In the future, the metastore itself can be a standalone server. If you want to run the metastore as a network server so it can be accessed from multiple nodes, try HiveDerbyServerMode.
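As a sketch, the Derby storage location can be overridden at invocation time (the path below is hypothetical):

{code}
  $ bin/hive -hiveconf javax.jdo.option.ConnectionURL='jdbc:derby:;databaseName=/home/me/metastore_db;create=true'
{code}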

DML Operations

Loading data from flat files into Hive:

{code}
  hive> LOAD DATA LOCAL INPATH './examples/files/kv1.txt' OVERWRITE INTO TABLE pokes; 
{code}

Loads a file that contains two columns separated by ctrl-a into pokes table. 'local' signifies that the input file is on the local file system. If 'local' is omitted then it looks for the file in HDFS. The keyword 'overwrite' signifies that existing data in the table is deleted. If the 'overwrite' keyword is omitted, data files are appended to existing data sets.

NOTES:

  • NO verification of data against the schema is performed by the load command.
  • If the file is in hdfs, it is moved into the Hive-controlled file system namespace. The root of the Hive directory is specified by the option hive.metastore.warehouse.dir in hive-default.xml. We advise users to create this directory before trying to create tables via Hive.

{code}
  hive> LOAD DATA LOCAL INPATH './examples/files/kv2.txt' OVERWRITE INTO TABLE invites PARTITION (ds='2008-08-15');
  hive> LOAD DATA LOCAL INPATH './examples/files/kv3.txt' OVERWRITE INTO TABLE invites PARTITION (ds='2008-08-08');
{code}

The two LOAD statements above load data into two different partitions of the table invites. Table invites must be created as partitioned by the key ds for this to succeed.

{code}
  hive> LOAD DATA INPATH '/user/myname/kv2.txt' OVERWRITE INTO TABLE invites PARTITION (ds='2008-08-15');
{code}

The above command will load data from an HDFS file/directory to the table. Note that loading data from HDFS will result in moving the file/directory. As a result, the operation is almost instantaneous.

SQL Operations

Example Queries

Some example queries are shown below. They are available in build/dist/examples/queries. More are available in the hive sources at ql/src/test/queries/positive.

SELECTS and FILTERS

{code}
  hive> SELECT a.foo FROM invites a WHERE a.ds='2008-08-15';
{code}

selects column 'foo' from all rows of partition ds=2008-08-15 of the invites table. The results are not stored anywhere, but are displayed on the console. Note that in all the examples that follow, INSERT (into a hive table, local directory or HDFS directory) is optional.

{code}
  hive> INSERT OVERWRITE DIRECTORY '/tmp/hdfs_out' SELECT a.* FROM invites a WHERE a.ds='2008-08-15';
{code}

selects all rows from partition ds=2008-08-15 of the invites table into an HDFS directory. The result data is in files (depending on the number of mappers) in that directory. NOTE: partition columns, if any, are selected by the use of *. They can also be specified in the projection clauses. Partitioned tables must always have a partition selected in the WHERE clause of the statement.

{code}
  hive> INSERT OVERWRITE LOCAL DIRECTORY '/tmp/local_out' SELECT a.* FROM pokes a;
{code}

Selects all rows from pokes table into a local directory.

{code}
  hive> INSERT OVERWRITE TABLE events SELECT a.* FROM profiles a;
  hive> INSERT OVERWRITE TABLE events SELECT a.* FROM profiles a WHERE a.key < 100; 
  hive> INSERT OVERWRITE LOCAL DIRECTORY '/tmp/reg_3' SELECT a.* FROM events a;
  hive> INSERT OVERWRITE DIRECTORY '/tmp/reg_4' select a.invites, a.pokes FROM profiles a;
  hive> INSERT OVERWRITE DIRECTORY '/tmp/reg_5' SELECT COUNT(*) FROM invites a WHERE a.ds='2008-08-15';
  hive> INSERT OVERWRITE DIRECTORY '/tmp/reg_5' SELECT a.foo, a.bar FROM invites a;
  hive> INSERT OVERWRITE LOCAL DIRECTORY '/tmp/sum' SELECT SUM(a.pc) FROM pc1 a;
{code}

Sum of a column. avg, min, max can also be used. Note that for versions of Hive which don't include HIVE-287, you'll need to use COUNT(1) in place of COUNT(*).

GROUP BY

{code}
  hive> FROM invites a INSERT OVERWRITE TABLE events SELECT a.bar, count(*) WHERE a.foo > 0 GROUP BY a.bar;
  hive> INSERT OVERWRITE TABLE events SELECT a.bar, count(*) FROM invites a WHERE a.foo > 0 GROUP BY a.bar;
{code}

Note that for versions of Hive which don't include HIVE-287, you'll need to use COUNT(1) in place of COUNT(*).

JOIN

{code}
  hive> FROM pokes t1 JOIN invites t2 ON (t1.bar = t2.bar) INSERT OVERWRITE TABLE events SELECT t1.bar, t1.foo, t2.foo;
{code}
MULTITABLE INSERT

{code}
  FROM src
  INSERT OVERWRITE TABLE dest1 SELECT src.* WHERE src.key < 100
  INSERT OVERWRITE TABLE dest2 SELECT src.key, src.value WHERE src.key >= 100 and src.key < 200
  INSERT OVERWRITE TABLE dest3 PARTITION(ds='2008-04-08', hr='12') SELECT src.key WHERE src.key >= 200 and src.key < 300
  INSERT OVERWRITE LOCAL DIRECTORY '/tmp/dest4.out' SELECT src.value WHERE src.key >= 300;
{code}
STREAMING

{code}
  hive> FROM invites a INSERT OVERWRITE TABLE events SELECT TRANSFORM(a.foo, a.bar) AS (oof, rab) USING '/bin/cat' WHERE a.ds > '2008-08-09';
{code}

This streams the data in the map phase through the script /bin/cat (like hadoop streaming). Similarly, streaming can be used on the reduce side (please see the Hive Tutorial or examples).
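As a rough sketch of the reduce-side shape, reusing the invites and events tables from above (treat this as illustrative rather than a tested recipe):

{code}
  hive> FROM (
      >   FROM invites a
      >   MAP a.foo, a.bar
      >   USING '/bin/cat'
      >   AS oof, rab
      >   CLUSTER BY oof) mapped
      > INSERT OVERWRITE TABLE events
      >   REDUCE mapped.oof, mapped.rab
      >   USING '/bin/cat'
      >   AS mfoo, mbar;
{code}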

Simple Example Use Cases

MovieLens User Ratings

First, create a table with tab-delimited text file format:

{code}
CREATE TABLE u_data (
  userid INT,
  movieid INT,
  rating INT,
  unixtime STRING)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE;
{code}

Then, download and extract the data files:

{code}
wget http://www.grouplens.org/system/files/ml-data.tar+0.gz
tar xvzf ml-data.tar+0.gz
{code}

And load it into the table that was just created:

{code}
LOAD DATA LOCAL INPATH 'ml-data/u.data'
OVERWRITE INTO TABLE u_data;
{code}

Count the number of rows in table u_data:

{code}
SELECT COUNT(*) FROM u_data;
{code}

Note that for versions of Hive which don't include HIVE-287, you'll need to use COUNT(1) in place of COUNT(*).

Now we can do some complex data analysis on the table u_data:

Create weekday_mapper.py:

{code}
import sys
import datetime

for line in sys.stdin:
  line = line.strip()
  userid, movieid, rating, unixtime = line.split('\t')
  weekday = datetime.datetime.fromtimestamp(float(unixtime)).isoweekday()
  print '\t'.join([userid, movieid, rating, str(weekday)])
{code}
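The script can be sanity-checked from the shell before wiring it into Hive; the sample row below follows the u.data format (userid, movieid, rating, unixtime):

{code}
$ echo -e '196\t242\t3\t881250949' | python weekday_mapper.py
{code}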

Use the mapper script:

{code}
CREATE TABLE u_data_new (
  userid INT,
  movieid INT,
  rating INT,
  weekday INT)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t';

add FILE weekday_mapper.py;

INSERT OVERWRITE TABLE u_data_new
SELECT
  TRANSFORM (userid, movieid, rating, unixtime)
  USING 'python weekday_mapper.py'
  AS (userid, movieid, rating, weekday)
FROM u_data;

SELECT weekday, COUNT(*)
FROM u_data_new
GROUP BY weekday;
{code}

Note that if you're using Hive 0.5.0 or earlier you will need to use COUNT(1) in place of COUNT(*).

Apache Weblog Data

The format of Apache weblog is customizable, while most webmasters use the default. For default Apache weblog, we can create a table with the following command.

More about RegexSerDe can be found here: http://issues.apache.org/jira/browse/HIVE-662

{code}
add jar ../build/contrib/hive_contrib.jar;

CREATE TABLE apachelog (
  host STRING,
  identity STRING,
  user STRING,
  time STRING,
  request STRING,
  status STRING,
  size STRING,
  referer STRING,
  agent STRING)
ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.RegexSerDe'
WITH SERDEPROPERTIES (
  "input.regex" = "([^]*) ([^]*) ([^]*) (-|\\[^\\]*\\]) ([^ \"]*|\"[^\"]*\") (-|[0-9]*) (-|[0-9]*)(?: ([^ \"]*|\".*\") ([^ \"]*|\".*\"))?",
  "output.format.string" = "%1$s %2$s %3$s %4$s %5$s %6$s %7$s %8$s %9$s"
)
STORED AS TEXTFILE;
{code}
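Once log data has been loaded, the parsed fields can be queried like ordinary columns; a small illustrative query (assuming data is already in the table):

{code}
  hive> SELECT status, COUNT(1) FROM apachelog GROUP BY status;
{code}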