{toc:maxLevel=4}


h2. Configuring Hive

A number of configuration variables in Hive can be used by the administrator to change the behavior for their installations and user sessions. These variables can be configured in any of the following ways, shown in the order of preference:
 * Using the {{set}} command in the CLI to set session-level values of a configuration variable for all statements subsequent to the {{set}} command. e.g.
{noformat}
  set hive.exec.scratchdir=/tmp/mydir;
{noformat}
  sets the scratch directory (which is used by Hive to store temporary output and plans) to {{/tmp/mydir}} for all subsequent statements.
 * Using the {{-hiveconf}} option on the CLI for the entire session. e.g.
{noformat}
  bin/hive -hiveconf hive.exec.scratchdir=/tmp/mydir
{noformat}
 * In {{hive-site.xml}}. This is used for setting values for the entire Hive configuration. e.g.
{noformat}
  <property>
    <name>hive.exec.scratchdir</name>
    <value>/tmp/mydir</value>
    <description>Scratch space for Hive jobs</description>
  </property>
{noformat}
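The order of preference above could be exercised with a short shell sketch. It is only an illustration: it assumes the {{hive}} launcher is on the {{PATH}} (the examples above invoke it as {{bin/hive}}), and {{/tmp/otherdir}} is a made-up path. The {{-hiveconf}} value applies to the whole session and takes precedence over {{hive-site.xml}}; the later {{set}} statement takes precedence over both for the statements that follow it. A {{set}} with only a variable name prints the value currently in effect.

```shell
#!/bin/sh
# Sketch only: assumes the hive launcher is on PATH; /tmp/otherdir is a made-up path.

# Session-level value, passed once when the CLI starts.
SESSION_CONF="hive.exec.scratchdir=/tmp/mydir"

# Statement-level value: the first set overrides the session value for all
# later statements; the second set (name only) prints the value now in effect.
STATEMENTS="set hive.exec.scratchdir=/tmp/otherdir; set hive.exec.scratchdir;"

if command -v hive >/dev/null 2>&1; then
  hive -hiveconf "$SESSION_CONF" -e "$STATEMENTS"
else
  echo "hive not found on PATH; would run:"
  echo "  hive -hiveconf $SESSION_CONF -e \"$STATEMENTS\""
fi
```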

{{hive-default.xml.template}} contains the default values for various configuration variables that come prepackaged in a Hive distribution. To override any of the values, create {{hive-site.xml}} instead and set the value in that file as shown above. Please note that this template file is not used by Hive at all (as of Hive 0.9.0), so it might be out of date or out of sync with the actual values. The canonical list of configuration options is now managed only in the {{HiveConf}} Java class.

{{hive-default.xml.template}} is located in the {{conf}} directory in your installation root. {{hive-site.xml}} should also be created in the same directory.
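A complete {{hive-site.xml}} wraps its {{<property>}} elements in a single {{<configuration>}} root element, as Hadoop-style configuration files do. The sketch below creates such a file. It assumes {{HIVE_HOME}} points at the installation root (falling back to the current directory if it is unset), and it refuses to overwrite an existing file.

```shell
#!/bin/sh
# Sketch only: HIVE_HOME is assumed to point at the Hive installation root.
CONF_DIR="${HIVE_HOME:-$PWD}/conf"
SITE_FILE="$CONF_DIR/hive-site.xml"

mkdir -p "$CONF_DIR"

if [ -e "$SITE_FILE" ]; then
  echo "$SITE_FILE already exists; leaving it untouched"
else
  # Only the values you want to override need to appear here.
  cat > "$SITE_FILE" <<'EOF'
<?xml version="1.0"?>
<configuration>
  <property>
    <name>hive.exec.scratchdir</name>
    <value>/tmp/mydir</value>
    <description>Scratch space for Hive jobs</description>
  </property>
</configuration>
EOF
  echo "wrote $SITE_FILE"
fi
```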

The administrative configuration variables are listed [below|#Configuration Variables].


h3. Temporary Folders
Hive uses temporary folders both on the machine running the Hive client and the default HDFS instance. These folders are used to store per-query temporary/intermediate data sets and are normally cleaned up by the Hive client when the query is finished. However, in cases of abnormal Hive client termination, some data may be left behind. The configuration details are as follows:
 * On the HDFS cluster this is set to _/tmp/hive-<username>_ by default and is controlled by the configuration variable _hive.exec.scratchdir_
 * On the client machine, this is hardcoded to _/tmp/<username>_

Note that when writing data to a table/partition, Hive will first write to a temporary location on the target table's filesystem (using hive.exec.scratchdir as the temporary location) and then move the data to the target table. This applies in all cases - whether tables are stored in HDFS (normal case) or in file systems like S3 or even NFS.
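After an abnormal client termination it can help to look at what is left in both locations. The sketch below only lists the directories and deletes nothing. It assumes the default locations described above, that the current OS user is the user who ran Hive, and that the {{hadoop}} command is on the {{PATH}}.

```shell
#!/bin/sh
# Sketch only: assumes the default scratch locations and that the current
# OS user is the one who ran the Hive client.
USER_NAME="$(id -un)"
HDFS_SCRATCH="/tmp/hive-$USER_NAME"   # default of hive.exec.scratchdir
LOCAL_SCRATCH="/tmp/$USER_NAME"       # client-side temporary folder

if command -v hadoop >/dev/null 2>&1; then
  # Lists leftover per-query directories on the default HDFS instance.
  hadoop fs -ls "$HDFS_SCRATCH" || echo "nothing found at $HDFS_SCRATCH"
else
  echo "hadoop not found on PATH; skipping $HDFS_SCRATCH"
fi

if [ -d "$LOCAL_SCRATCH" ]; then
  ls -l "$LOCAL_SCRATCH"
else
  echo "no local scratch directory at $LOCAL_SCRATCH"
fi
```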

h3. Log Files
The Hive client produces logs and history files on the client machine. Please see [Error Logs|GettingStarted#Error Logs] for configuration details.

h3. Derby Server Mode

[Derby|http://db.apache.org/derby/] is the default database for the Hive metastore ([Metadata Store|GettingStarted#Metadata Store]). To run Derby as a network server for multiple users, see [Hive Using Derby in Server Mode|HiveDerbyServerMode].


h3. Configuration Variables

Broadly the configuration variables for Hive administration are categorized into:

{toc-zone|location=top}

Also see [Hive Configuration Properties|Configuration Properties] in the [Language Manual|LanguageManual] for non-administrative configuration variables.

h4. Hive Configuration Variables

|| Variable Name || Description || Default Value ||
| hive.ddl.output.format | The data format to use for DDL output (e.g. {{DESCRIBE table}}). One of "text" (for human-readable text) or "json" (for a JSON object). (as of Hive [0.9.0|https://issues.apache.org/jira/browse/HIVE-2822]) | text |
| hive.exec.script.wrapper | Wrapper around any invocations to the script operator, e.g. if this is set to python, the script passed to the script operator will be invoked as {{python <script command>}}. If the value is null or not set, the script is invoked as {{<script command>}}. | null |
| hive.exec.plan | | null |
| hive.exec.scratchdir | This directory is used by Hive to store the plans for different map/reduce stages for the query as well as to store the intermediate outputs of these stages. | /tmp/<user.name>/hive (Hive 0.8.0 and earlier) \\ /tmp/hive-<user.name> (as of Hive 0.8.1) |
| hive.exec.local.scratchdir | This directory is used for temporary files when Hive runs in local mode. (as of Hive [0.10.0|https://issues.apache.org/jira/browse/HIVE-1577]) | /tmp/<user.name> |
| hive.exec.submitviachild | Determines whether the map/reduce jobs should be submitted through a separate JVM in non-local mode. | false - By default jobs are submitted through the same JVM as the compiler |
| hive.exec.script.maxerrsize | Maximum number of serialization errors allowed in a user script invoked through {{TRANSFORM}} or {{MAP}} or {{REDUCE}} constructs. | 100000 |
| hive.exec.compress.output | Determines whether the output of the final map/reduce job in a query is compressed or not. | false |
| hive.exec.compress.intermediate | Determines whether the output of the intermediate map/reduce jobs in a query is compressed or not. | false |
| hive.jar.path | The location of hive_cli.jar that is used when submitting jobs in a separate JVM. | |
| hive.aux.jars.path | The location of the plugin jars that contain implementations of user defined functions and serdes. | |
| hive.partition.pruning | A strict value for this variable indicates that an error is thrown by the compiler in case no partition predicate is provided on a partitioned table. This is used to protect against a user inadvertently issuing a query against all the partitions of the table. | nonstrict |
| hive.map.aggr | Determines whether the map side aggregation is on or not. | true |
| hive.join.emit.interval | | 1000 |
| hive.map.aggr.hash.percentmemory | | (float)0.5 |
| hive.default.fileformat | Default file format for CREATE TABLE statement. Options are TextFile, SequenceFile, RCFile, and Orc. | TextFile |
| hive.merge.mapfiles | Merge small files at the end of a map-only job. | true |
| hive.merge.mapredfiles | Merge small files at the end of a map-reduce job. | false |
| hive.merge.size.per.task | Size of merged files at the end of the job. | 256000000 |
| hive.merge.smallfiles.avgsize | When the average output file size of a job is less than this number, Hive will start an additional map-reduce job to merge the output files into bigger files. This is only done for map-only jobs if hive.merge.mapfiles is true, and for map-reduce jobs if hive.merge.mapredfiles is true. | 16000000 |
| hive.querylog.enable.plan.progress | Whether to log the plan's progress every time a job's progress is checked. These logs are written to the location specified by {{hive.querylog.location}}. (as of Hive [0.10|https://issues.apache.org/jira/browse/HIVE-3230]) | true |
| hive.querylog.location | Directory where structured Hive query logs are created. One file per session is created in this directory. If this variable is set to an empty string, the structured log will not be created. | /tmp/<user.name> |
| hive.querylog.plan.progress.interval | The interval to wait between logging the plan's progress, in milliseconds. If there is a whole number percentage change in the progress of the mappers or the reducers, the progress is logged regardless of this value. The actual interval will be the ceiling of (this value divided by the value of {{hive.exec.counters.pull.interval}}) multiplied by the value of {{hive.exec.counters.pull.interval}}, i.e. if this value is not evenly divisible by the value of {{hive.exec.counters.pull.interval}}, progress will be logged less frequently than specified. This only has an effect if {{hive.querylog.enable.plan.progress}} is set to {{true}}. (as of Hive [0.10|https://issues.apache.org/jira/browse/HIVE-3230]) | 60000 |
| hive.stats.autogather | A flag to gather statistics automatically during the INSERT OVERWRITE command. (as of Hive [0.7.0|https://issues.apache.org/jira/browse/HIVE-1361]) | true |
| hive.stats.dbclass | The default database that stores temporary Hive statistics. Valid values are {{hbase}} and {{jdbc}}, where {{jdbc}} should specify the database to use, separated by a colon (e.g. {{jdbc:mysql}}). (as of Hive [0.7.0|https://issues.apache.org/jira/browse/HIVE-1361]) | jdbc:derby |
| hive.stats.dbconnectionstring | The default connection string for the database that stores temporary Hive statistics. (as of Hive [0.7.0|https://issues.apache.org/jira/browse/HIVE-1361]) | jdbc:derby:;databaseName=TempStatsStore;create=true |
| hive.stats.jdbcdriver | The JDBC driver for the database that stores temporary Hive statistics. (as of Hive [0.7.0|https://issues.apache.org/jira/browse/HIVE-1361]) | org.apache.derby.jdbc.EmbeddedDriver |
| hive.stats.reliable | Whether queries will fail because stats cannot be collected completely accurately. If this is set to true, reading/writing from/into a partition may fail because the stats could not be computed accurately. (as of Hive [0.10.0|https://issues.apache.org/jira/browse/HIVE-1653]) | false |
| hive.enforce.bucketing | If enabled, enforces inserts into bucketed tables to also be bucketed. | false |
| hive.variable.substitute | Substitutes variables in Hive statements which were previously set using the {{set}} command, system variables or environment variables. See [HIVE-1096|https://issues.apache.org/jira/browse/HIVE-1096] for details. (as of Hive 0.7.0) | true |
| hive.variable.substitute.depth | The maximum replacements the substitution engine will do. (as of Hive [0.10.0|https://issues.apache.org/jira/browse/HIVE-2021]) | 40 |
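The rounding described for {{hive.querylog.plan.progress.interval}} can be checked with a little integer arithmetic. The function name {{effective_interval}} and the sample values below are illustrative only: 60000 is the default from the table, while the pull interval of 1000 is an assumed value for {{hive.exec.counters.pull.interval}}.

```shell
#!/bin/sh
# effective_interval INTERVAL_MS PULL_MS
# Prints ceil(INTERVAL_MS / PULL_MS) * PULL_MS using integer arithmetic.
effective_interval() {
  interval_ms=$1
  pull_ms=$2
  echo $(( (interval_ms + pull_ms - 1) / pull_ms * pull_ms ))
}

effective_interval 60000 1000   # divides evenly: prints 60000
effective_interval 2500 1000    # rounded up to the next multiple: prints 3000
```

In the second call the requested 2500 ms is not a multiple of the pull interval, so progress would be logged every 3000 ms, which is less often than requested.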


h4. Hive Metastore Configuration Variables

Please see the [Admin Manual's section on the Metastore|AdminManual MetastoreAdmin] for details.

For security configuration (Hive 0.10 and later), see the [Hive Metastore Security section|https://cwiki.apache.org/confluence/display/Hive/Configuration+Properties#ConfigurationProperties-HiveMetastoreSecurity] in the Language Manual's [Configuration Properties|https://cwiki.apache.org/confluence/display/Hive/Configuration+Properties].

h4. Hive Configuration Variables Used to Interact with Hadoop

|*Variable Name*|*Description*|*Default Value*|
|hadoop.bin.path|The location of the Hadoop script which is used to submit jobs to Hadoop when submitting through a separate JVM.|$HADOOP_HOME/bin/hadoop|
|hadoop.config.dir|The location of the configuration directory of the Hadoop installation.|$HADOOP_HOME/conf|
|fs.default.name| |file:///|
|map.input.file| |null|
|mapred.job.tracker|The URL of the jobtracker. If this is set to local then map/reduce is run in local mode.|local|
|mapred.reduce.tasks|The number of reducers for each map/reduce stage in the query plan.|1|
|mapred.job.name|The name of the map/reduce job.|null|
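These Hadoop variables can be set through the same mechanisms as the Hive variables. As a sketch, assuming the {{hive}} launcher is on the {{PATH}} and using illustrative values, a session could be started in local mode with a single reducer like this:

```shell
#!/bin/sh
# Sketch only: assumes the hive launcher is on PATH; the values are illustrative.
JOB_TRACKER="local"   # "local" runs map/reduce in local mode
REDUCE_TASKS=1        # reducers per map/reduce stage

# The name-only set statements print the values the session actually sees.
CHECK="set mapred.job.tracker; set mapred.reduce.tasks;"

if command -v hive >/dev/null 2>&1; then
  hive -hiveconf "mapred.job.tracker=$JOB_TRACKER" \
       -hiveconf "mapred.reduce.tasks=$REDUCE_TASKS" \
       -e "$CHECK"
else
  echo "hive not found on PATH; would start it with:"
  echo "  -hiveconf mapred.job.tracker=$JOB_TRACKER -hiveconf mapred.reduce.tasks=$REDUCE_TASKS"
fi
```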


h4. Hive Variables Used to Pass Run Time Information

|*Variable Name*|*Description*|*Default Value*|
|hive.session.id|The id of the Hive Session.| |
|hive.query.string|The query string passed to the map/reduce job.| |
|hive.query.planid|The id of the plan for the map/reduce stage.| |
|hive.jobname.length|The maximum length of the jobname.|50|
|hive.table.name|The name of the Hive table. This is passed to the user scripts through the script operator.| |
|hive.partition.name|The name of the Hive partition. This is passed to the user scripts through the script operator.| |
|hive.alias|The alias being processed. This is also passed to the user scripts through the script operator.| |

{toc-zone}


h2. Configuring HCatalog and WebHCat
For information about configuring HCatalog and WebHCat, see:

* [HCatalog Installation from Tarball|HCatalog InstallHCat]
* [WebHCat Configuration|WebHCat Configure]