Configuring Hive

A number of configuration variables in Hive can be used by the administrator to change the behavior for their installations and user sessions. These variables can be configured in any of the following ways, shown in the order of preference:

  • Using the set command in the CLI for setting session-level values for the configuration variable for all statements subsequent to the set command. e.g.

        set hive.exec.scratchdir=/tmp/mydir;

    sets the scratch directory (which is used by Hive to store temporary output and plans) to /tmp/mydir for all subsequent statements.

  • Using the -hiveconf option on the CLI for the entire session. e.g.

        bin/hive -hiveconf hive.exec.scratchdir=/tmp/mydir

  • In hive-site.xml. This is used for setting values for the entire Hive configuration. e.g.

        <property>
          <name>hive.exec.scratchdir</name>
          <value>/tmp/mydir</value>
          <description>Scratch space for Hive jobs</description>
        </property>
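The hive-site.xml snippet above is plain Hadoop-style configuration XML, so it can be inspected with any XML parser. The following sketch (illustrative only, not part of Hive) reads such a fragment with Python's standard library; the embedded snippet mirrors the example above rather than a real installation's file:

```python
import xml.etree.ElementTree as ET

# A hive-site.xml fragment in the Hadoop configuration format shown above.
# In a real installation this would be read from conf/hive-site.xml.
SNIPPET = """
<configuration>
  <property>
    <name>hive.exec.scratchdir</name>
    <value>/tmp/mydir</value>
    <description>Scratch space for Hive jobs</description>
  </property>
</configuration>
"""

def load_properties(xml_text):
    """Return a dict of name -> value for every <property> element."""
    root = ET.fromstring(xml_text)
    props = {}
    for prop in root.findall("property"):
        props[prop.findtext("name")] = prop.findtext("value")
    return props

print(load_properties(SNIPPET))  # {'hive.exec.scratchdir': '/tmp/mydir'}
```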

hive-default.xml.template contains the default values for various configuration variables that come prepackaged in a Hive distribution. In order to override any of the values, create hive-site.xml instead and set the value in that file as shown above.

Please note that this template file is not used by Hive at all (as of Hive 0.9.0) and so it might be out of date or out of sync with the actual values. The canonical list of configuration options is now managed only in the HiveConf java class.

hive-default.xml.template is located in the conf directory in your installation root, and hive-site.xml should also be created in the same directory.

The administrative configuration variables are listed below.
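The order of preference described above can be modeled as a lookup chain: a session-level set overrides a -hiveconf option, which overrides hive-site.xml, which overrides the built-in defaults. The sketch below expresses that with Python's ChainMap; the variable values are made-up examples, not Hive's actual defaults resolution code:

```python
from collections import ChainMap

# Lowest to highest priority, per the order of preference described above.
defaults = {"hive.exec.scratchdir": "/tmp/hive-<username>"}  # built-in default
site_xml = {"hive.exec.scratchdir": "/data/hive/scratch"}    # hive-site.xml (example value)
hiveconf = {}                                                # bin/hive -hiveconf options
session = {"hive.exec.scratchdir": "/tmp/mydir"}             # set command in this session

# ChainMap searches its maps left to right, so list the highest priority first.
effective = ChainMap(session, hiveconf, site_xml, defaults)

print(effective["hive.exec.scratchdir"])  # /tmp/mydir -- the session-level set wins
```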

Temporary Folders

Hive uses temporary folders both on the machine running the Hive client and on the default HDFS instance. These folders are used to store per-query temporary/intermediate data sets and are normally cleaned up by the Hive client when the query is finished. However, in cases of abnormal Hive client termination, some data may be left behind. The configuration details are as follows:

  • On the HDFS cluster this is set to /tmp/hive-<username> by default and is controlled by the configuration variable hive.exec.scratchdir
  • On the client machine, this is hardcoded to /tmp/<username>
...

Note that when writing data to a table/partition, Hive will first write to a temporary location on the target table's filesystem (using hive.exec.scratchdir as the temporary location) and then move the data to the target table. This applies in all cases - whether tables are stored in HDFS (the normal case) or in file systems like S3 or even NFS.
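The write-to-scratch-then-move pattern described above can be sketched in miniature. This is an illustration of the general technique, not Hive's actual implementation, and the atomicity of the final rename only holds when the temporary location and the target are on the same filesystem:

```python
import os
import tempfile

def write_via_scratch(target_path, data, scratch_dir):
    """Write data to a temp file in scratch_dir, then move it to target_path.

    Mirrors the pattern above: readers of target_path never observe a
    half-written file, because the final step is a single rename.
    """
    fd, tmp_path = tempfile.mkstemp(dir=scratch_dir)
    with os.fdopen(fd, "w") as f:
        f.write(data)
    os.replace(tmp_path, target_path)  # atomic on the same filesystem

scratch = tempfile.mkdtemp()
target = os.path.join(scratch, "part-00000")
write_via_scratch(target, "row1\nrow2\n", scratch)
with open(target) as f:
    print(f.read())
```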

Log Files

The Hive client produces log and history files on the client machine. Please see Error Logs for configuration details.

Derby Server Mode

Derby is the default database for the Hive metastore (Metadata Store). To run Derby as a network server for multiple users, see Hive Using Derby in Server Mode.

Configuration Variables

Broadly, the configuration variables for Hive administration are categorized into the groups below.

Also see Hive Configuration Properties in the Language Manual for non-administrative configuration variables.

Hive Configuration Variables


|| Variable Name || Description || Default Value ||
| hive.ddl.output.format | The data format to use for DDL output (e.g. DESCRIBE table). One of "text" (for human readable text) or "json" (for a json object). (as of [HIVE-2822|https://issues.apache.org/jira/browse/HIVE-2822]) | text |
| hive.exec.script.wrapper | Wrapper around any invocations to the script operator, e.g. if this is set to python, the script passed to the script operator will be invoked as python <script command>. If the value is null or not set, the script is invoked as <script command>. | null |
| hive.exec.plan | | null |
| hive.exec.scratchdir | This directory is used by Hive to store the plans for different map/reduce stages for the query as well as to store the intermediate outputs of these stages. | /tmp/<user.name>/hive (Hive 0.8.0 and earlier); /tmp/hive-<user.name> (as of Hive 0.8.1) |
| hive.exec.local.scratchdir | This directory is used for temporary files when Hive runs in local mode. (as of [HIVE-1577|https://issues.apache.org/jira/browse/HIVE-1577]) | /tmp/<user.name> |
| hive.exec.submitviachild | Determines whether the map/reduce jobs should be submitted through a separate jvm in the non-local mode. | false - By default jobs are submitted through the same jvm as the compiler |
| hive.exec.script.maxerrsize | Maximum number of serialization errors allowed in a user script invoked through TRANSFORM or MAP or REDUCE constructs. | 100000 |
| hive.exec.compress.output | Determines whether the output of the final map/reduce job in a query is compressed or not. | false |
| hive.exec.compress.intermediate | Determines whether the output of the intermediate map/reduce jobs in a query is compressed or not. | false |
| hive.jar.path | The location of hive_cli.jar that is used when submitting jobs in a separate jvm. | |
| hive.aux.jars.path | The location of the plugin jars that contain implementations of user defined functions and serdes. | |
| hive.partition.pruning | A strict value for this variable indicates that an error is thrown by the compiler in case no partition predicate is provided on a partitioned table. This is used to protect against a user inadvertently issuing a query against all the partitions of the table. | nonstrict |
| hive.map.aggr | Determines whether map-side aggregation is on or not. | true |
| hive.join.emit.interval | | 1000 |
| hive.map.aggr.hash.percentmemory | | (float)0.5 |
| hive.default.fileformat | Default file format for the CREATE TABLE statement. Options are TextFile, SequenceFile, RCFile, and Orc. | TextFile |
| hive.merge.mapfiles | Merge small files at the end of a map-only job. | true |
| hive.merge.mapredfiles | Merge small files at the end of a map-reduce job. | false |
| hive.merge.size.per.task | Size of merged files at the end of the job. | 256000000 |
| hive.merge.smallfiles.avgsize | When the average output file size of a job is less than this number, Hive will start an additional map-reduce job to merge the output files into bigger files. This is only done for map-only jobs if hive.merge.mapfiles is true, and for map-reduce jobs if hive.merge.mapredfiles is true. | 16000000 |
| hive.querylog.enable.plan.progress | Whether to log the plan's progress every time a job's progress is checked. These logs are written to the location specified by hive.querylog.location. (as of [HIVE-3230|https://issues.apache.org/jira/browse/HIVE-3230]) | true |
| hive.querylog.location | Directory where structured Hive query logs are created. One file per session is created in this directory. If this variable is set to an empty string, structured logs will not be created. | /tmp/<user.name> |
| hive.querylog.plan.progress.interval | The interval to wait between logging the plan's progress, in milliseconds. If there is a whole-number percentage change in the progress of the mappers or the reducers, the progress is logged regardless of this value. The actual interval will be the ceiling of (this value divided by the value of hive.exec.counters.pull.interval) multiplied by the value of hive.exec.counters.pull.interval; i.e. if it does not divide evenly by the value of hive.exec.counters.pull.interval, it will be logged less frequently than specified. This only has an effect if hive.querylog.enable.plan.progress is set to true. (as of [HIVE-3230|https://issues.apache.org/jira/browse/HIVE-3230]) | 60000 |
| hive.stats.autogather | A flag to gather statistics automatically during the INSERT OVERWRITE command. (as of [HIVE-1361|https://issues.apache.org/jira/browse/HIVE-1361]) | true |
| hive.stats.dbclass | The default database that stores temporary Hive statistics. Valid values are hbase and jdbc, while jdbc should have a specification of the database to use, separated by a colon (e.g. jdbc:mysql). (as of [HIVE-1361|https://issues.apache.org/jira/browse/HIVE-1361]) | jdbc:derby |
| hive.stats.dbconnectionstring | The default connection string for the database that stores temporary Hive statistics. (as of [HIVE-1361|https://issues.apache.org/jira/browse/HIVE-1361]) | jdbc:derby:;databaseName=TempStatsStore;create=true |
| hive.stats.jdbcdriver | The JDBC driver for the database that stores temporary Hive statistics. (as of [HIVE-1361|https://issues.apache.org/jira/browse/HIVE-1361]) | org.apache.derby.jdbc.EmbeddedDriver |
| hive.stats.reliable | Whether queries will fail because stats cannot be collected completely accurately. If this is set to true, reading/writing from/into a partition may fail because the stats could not be computed accurately. (as of [HIVE-1653|https://issues.apache.org/jira/browse/HIVE-1653]) | false |
| hive.enforce.bucketing | If enabled, enforces inserts into bucketed tables to also be bucketed. | false |
| hive.variable.substitute | Substitutes variables in Hive statements which were previously set using the set command, system variables or environment variables. See [HIVE-1096|https://issues.apache.org/jira/browse/HIVE-1096] for details. (as of Hive 0.7.0) | true |
| hive.variable.substitute.depth | The maximum number of replacements the substitution engine will do. (as of [HIVE-2021|https://issues.apache.org/jira/browse/HIVE-2021]) | 40 |
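The rounding rule in the hive.querylog.plan.progress.interval entry above is easier to see as arithmetic: the effective interval is the requested value rounded up to a multiple of hive.exec.counters.pull.interval. A small illustration (the numeric values below are examples, not Hive defaults):

```python
import math

def effective_interval(plan_progress_interval, counters_pull_interval):
    """ceil(progress / pull) * pull, per the table entry above."""
    return math.ceil(plan_progress_interval / counters_pull_interval) * counters_pull_interval

# 60000 is not a multiple of 7000, so progress would be logged every
# 63000 ms -- less frequently than requested.
print(effective_interval(60000, 7000))   # 63000
print(effective_interval(60000, 10000))  # 60000 (divides evenly)
```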

Hive Metastore Configuration Variables

Please see the Admin Manual's section on the Metastore for details.

For security configuration (Hive 0.10 and later), see the Hive Metastore Security section in the Language Manual's Configuration Properties.

Hive Configuration Variables Used to Interact with Hadoop

|| Variable Name || Description || Default Value ||
| hadoop.bin.path | The location of the hadoop script, which is used to submit jobs to Hadoop when submitting through a separate jvm. | $HADOOP_HOME/bin/hadoop |
| hadoop.config.dir | The location of the configuration directory of the Hadoop installation. | $HADOOP_HOME/conf |
| fs.default.name | | file:/// |
| map.input.file | | null |
| mapred.job.tracker | The URL to the jobtracker. If this is set to local then map/reduce is run in the local mode. | local |
| mapred.reduce.tasks | The number of reducers for each map/reduce stage in the query plan. | 1 |
| mapred.job.name | The name of the map/reduce job. | null |
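As the mapred.job.tracker entry above notes, the literal value local acts as a switch to local-mode execution. A trivial sketch of that check (illustrative only, not Hive's code):

```python
def runs_locally(mapred_job_tracker):
    """Map/reduce runs in local mode iff the jobtracker setting is 'local'."""
    return mapred_job_tracker == "local"

print(runs_locally("local"))                # True
print(runs_locally("jt.example.com:8021"))  # False
```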

Hive Variables Used to Pass Run Time Information

|| Variable Name || Description || Default Value ||
| hive.session.id | The id of the Hive session. | |
| hive.query.string | The query string passed to the map/reduce job. | |
| hive.query.planid | The id of the plan for the map/reduce stage. | |
| hive.jobname.length | The maximum length of the jobname. | 50 |
| hive.table.name | The name of the Hive table. This is passed to the user scripts through the script operator. | |
| hive.partition.name | The name of the Hive partition. This is passed to the user scripts through the script operator. | |
| hive.alias | The alias being processed. This is also passed to the user scripts through the script operator. | |

Configuring HCatalog and WebHCat

For information about configuring HCatalog and WebHCat, see:

  • HCatalog Installation from Tarball
  • WebHCat Configuration