Versions Compared

Comment: add versions & reorder hive.mapjoin.* parameters

...

For local mode, memory of the mappers/reducers.

hive.

...

  • Default Value: 0.3
  • Added In: Hive 0.7.0

...

map

...

hive.map.aggr.hash.force.flush.memory.threshold

...

How many values for each key in the map-joined table should be cached in memory.

hive.mapjoin.followby.map.aggr.hash.percentmemory
  • Default Value: 0.3
  • Added In: Hive 0.7.0

Portion of total memory to be used by the map-side group aggregation hash table when the group by is followed by a map join.

hive.smalltable.filesize
hive.mapjoin.smalltable.filesize
  • Default Value: 25000000
  • Added In: Hive 0.7.0 with HIVE-1642: hive.smalltable.filesize (replaced by hive.mapjoin.smalltable.filesize in Hive 0.8.1)
  • Added In: Hive 0.8.1 with HIVE-2499: hive.mapjoin.smalltable.filesize

The threshold for the input file size of the small tables; if the file size is smaller than this threshold, Hive will try to convert the common join into a map join.
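The size check described above can be sketched in a few lines of Python. This is only an illustration of the comparison, not Hive's planner code; the function name `should_try_map_join` is hypothetical, and the default threshold is the documented default value.

```python
def should_try_map_join(small_table_bytes: int,
                        threshold_bytes: int = 25_000_000) -> bool:
    """Return True when the small table's input size is below the threshold.

    threshold_bytes stands in for hive.mapjoin.smalltable.filesize
    (hypothetical helper, not part of Hive).
    """
    return small_table_bytes < threshold_bytes


# A 10 MB table is below the default threshold; a 40 MB table is not.
print(should_try_map_join(10_000_000))
print(should_try_map_join(40_000_000))
```

Raising the threshold makes more joins eligible for conversion, at the cost of holding larger tables in memory.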

hive.mapjoin.localtask.max.memory.usage
  • Default Value: 0.90
  • Added In: Hive 0.7.0

How much memory the local task can use to hold key/value pairs in the in-memory hash table. If the local task's memory usage exceeds this number, the local task is aborted, which means the data of the small table is too large to be held in memory.

hive.mapjoin.followby.gby.localtask.max.memory.usage
  • Default Value: 0.55
  • Added In: Hive 0.7.0

How much memory the local task can use to hold key/value pairs in the in-memory hash table when the map join is followed by a group by. If the local task's memory usage exceeds this number, the local task is aborted, which means the data of the small table is too large to be held in memory.

hive.mapjoin.check.memory.rows

The number of rows processed after which memory usage is checked.
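How the memory limit and the row-count check interval could work together is sketched below in Python. It assumes the limit is interpreted as a fraction of available memory (which the defaults 0.90 and 0.55 suggest); the function `load_small_table` and the `memory_usage` callback are hypothetical and do not correspond to Hive internals.

```python
from typing import Callable, Dict, Iterable, List, Tuple


def load_small_table(rows: Iterable[Tuple[str, str]],
                     memory_usage: Callable[[], float],
                     max_memory_usage: float = 0.90,
                     check_memory_rows: int = 100000) -> Dict[str, List[str]]:
    """Build an in-memory hash table, checking memory every N rows.

    max_memory_usage stands in for hive.mapjoin.localtask.max.memory.usage
    and check_memory_rows for hive.mapjoin.check.memory.rows.
    memory_usage is an assumed callback returning the used fraction (0..1).
    """
    table: Dict[str, List[str]] = {}
    for processed, (key, value) in enumerate(rows, start=1):
        table.setdefault(key, []).append(value)
        # Only pay for the memory check once every check_memory_rows rows.
        if processed % check_memory_rows == 0 and memory_usage() > max_memory_usage:
            raise MemoryError(f"aborting local task after {processed} rows")
    return table


rows = [("k1", "a"), ("k2", "b"), ("k1", "c")]
table = load_small_table(rows, memory_usage=lambda: 0.40, check_memory_rows=2)
print(table)
```

A smaller check interval detects an oversized table sooner but spends more time measuring memory.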

hive.optimize.skewjoin
  • Default Value: false
  • Added In: Hive 0.6.0

Whether to enable skew join optimization.  (Also see hive.optimize.skewjoin.compiletime.)

hive.skewjoin.key
  • Default Value: 100000
  • Added In: Hive 0.6.0

Determine whether a join key is skewed. If the join operator sees more than the specified number of rows with the same key, the key is treated as a skew join key.
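The threshold rule above can be illustrated with a short Python sketch. It counts keys in a batch rather than in a streaming join operator, so it is a simplification; `find_skew_keys` is a hypothetical helper, and the default threshold mirrors the documented default of hive.skewjoin.key.

```python
from collections import Counter
from typing import Hashable, Iterable, Set


def find_skew_keys(join_keys: Iterable[Hashable],
                   threshold: int = 100000) -> Set[Hashable]:
    """Return the keys that appear in more than `threshold` rows."""
    counts = Counter(join_keys)
    return {key for key, rows_seen in counts.items() if rows_seen > threshold}


# Tiny threshold so the effect is visible on a handful of rows.
keys = ["a", "b", "a", "a", "c", "b"]
print(find_skew_keys(keys, threshold=2))
```

Note that a key must exceed the threshold, not merely reach it, to be treated as skewed in this sketch.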

hive.skewjoin.mapjoin.map.tasks
  • Default Value: 10000
  • Added In: Hive 0.6.0

Determine the number of map tasks used in the follow-up map join job for a skew join. It should be used together with hive.skewjoin.mapjoin.min.split for fine-grained control.

hive.skewjoin.mapjoin.min.split
  • Default Value: 33554432
  • Added In: Hive 0.6.0

Determine the maximum number of map tasks used in the follow-up map join job for a skew join by specifying the minimum split size. It should be used together with hive.skewjoin.mapjoin.map.tasks for fine-grained control.

hive.optimize.skewjoin.compiletime

The main difference between this parameter and hive.optimize.skewjoin is that this parameter uses the skew information stored in the metastore to optimize the plan at compile time itself. If there is no skew information in the metadata, this parameter will not have any effect.
Both hive.optimize.skewjoin.compiletime and hive.optimize.skewjoin should be set to true. (Ideally, hive.optimize.skewjoin should be renamed as hive.optimize.skewjoin.runtime, but for backward compatibility that has not been done.)

If the skew information is correctly stored in the metadata, hive.optimize.skewjoin.compiletime will change the query plan to take care of it, and hive.optimize.skewjoin will be a no-op.

...

  • Default Value: false
  • Added In: Hive 0.10.0 with HIVE-3276

Whether to remove the union and push the operators between union and the filesink above union. This avoids an extra scan of the output by union. This is independently useful for union queries, and especially useful when hive.optimize.skewjoin.compiletime is set to true, since an extra union is inserted.

The merge is triggered if either hive.merge.mapfiles or hive.merge.mapredfiles is set to true. If the user has set hive.merge.mapfiles to true and hive.merge.mapredfiles to false, the idea was that the number of reducers is small, so the number of files is small anyway. However, this optimization may increase the number of files by a large margin, so Hive merges aggressively.

hive.mapred.supports.subdirectories
  • Default Value: false
  • Added In: Hive 0.10.0 with HIVE-3276

Whether the version of Hadoop which is running supports sub-directories for tables/partitions. Many Hive optimizations can be applied if the Hadoop version supports sub-directories for tables/partitions. This support was added by MAPREDUCE-1501.

hive.mapred.mode
  • Default Value: nonstrict
  • Added In: Hive 0.3.0

The mode in which the Hive operations are being performed. In strict mode, some risky queries are not allowed to run.

hive.exec.script.maxerrsize
  • Default Value: 100000
  • Added In: Hive 0.2.0

Maximum number of bytes a script is allowed to emit to standard error (per map-reduce task). This prevents runaway scripts from filling log partitions to capacity.

hive.exec.script.allow.partial.consumption
  • Default Value: false
  • Added In: Hive 0.5.0

When enabled, this option allows a user script to exit successfully without consuming all the data from the standard input.

hive.script.operator.id.env.var
  • Default Value: HIVE_SCRIPT_OPERATOR_ID
  • Added In: Hive 0.5.0

Name of the environment variable that holds the unique script operator ID in the user's transform function (the custom mapper/reducer that the user has specified in the query).
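Inside a user transform script, the operator ID could be read as in the following Python sketch. It assumes the default variable name HIVE_SCRIPT_OPERATOR_ID is in effect; the helper `script_operator_id` and the sample value "SCR_3" are hypothetical and used only for illustration.

```python
import os
from typing import Mapping, Optional


def script_operator_id(env: Optional[Mapping[str, str]] = None,
                       var_name: str = "HIVE_SCRIPT_OPERATOR_ID") -> Optional[str]:
    """Look up the script operator ID from the environment.

    var_name should match the value of hive.script.operator.id.env.var;
    env defaults to the real process environment.
    """
    source = os.environ if env is None else env
    return source.get(var_name)


# Simulated environment; "SCR_3" is a made-up example value.
print(script_operator_id({"HIVE_SCRIPT_OPERATOR_ID": "SCR_3"}))
```

If the property is changed from its default, the script has to look up the new variable name instead.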

hive.exec.compress.output

...

create a separate plan for skewed keys for the tables in the join. This is based on the skewed keys stored in the metadata. At compile time, the plan is broken into different joins: one for the skewed keys, and the other for the remaining keys. And then, a union is performed for the two joins generated above. So unless the same skewed key is present in both the joined tables, the join for the skewed key will be performed as a map-side join.

The main difference between this parameter and hive.optimize.skewjoin is that this parameter uses the skew information stored in the metastore to optimize the plan at compile time itself. If there is no skew information in the metadata, this parameter will not have any effect.
Both hive.optimize.skewjoin.compiletime and hive.optimize.skewjoin should be set to true. (Ideally, hive.optimize.skewjoin should be renamed as hive.optimize.skewjoin.runtime, but for backward compatibility that has not been done.)

If the skew information is correctly stored in the metadata, hive.optimize.skewjoin.compiletime will change the query plan to take care of it, and hive.optimize.skewjoin will be a no-op.

hive.optimize.union.remove
  • Default Value: false
  • Added In: Hive 0.

...

This controls whether the final outputs of a query (to a local/HDFS file or a Hive table) are compressed. The compression codec and other options are determined from the Hadoop configuration variables mapred.output.compress*.

hive.exec.compress.intermediate
  • Default Value: false
  • Added In: Hive 0.2.0

This controls whether intermediate files produced by Hive between multiple map-reduce jobs are compressed. The compression codec and other options are determined from Hadoop configuration variables mapred.output.compress*.

hive.exec.parallel

Whether to remove the union and push the operators between union and the filesink above union. This avoids an extra scan of the output by union. This is independently useful for union queries, and especially useful when hive.optimize.skewjoin.compiletime is set to true, since an extra union is inserted.

The merge is triggered if either hive.merge.mapfiles or hive.merge.mapredfiles is set to true. If the user has set hive.merge.mapfiles to true and hive.merge.mapredfiles to false, the idea was that the number of reducers is small, so the number of files is small anyway. However, this optimization may increase the number of files by a large margin, so Hive merges aggressively.

hive.mapred.supports.subdirectories
  • Default Value: false
  • Added In: Hive 0.10.0 with HIVE-3276

Whether the version of Hadoop which is running supports sub-directories for tables/partitions. Many Hive optimizations can be applied if the Hadoop version supports sub-directories for tables/partitions. This support was added by MAPREDUCE-1501.

hive.mapred.mode
  • Default Value: nonstrict
  • Added In: Hive 0.3.0

The mode in which the Hive operations are being performed. In strict mode, some risky queries are not allowed to run.

hive.exec.script.maxerrsize
  • Default Value: 100000
  • Added In: Hive 0.2.0

Maximum number of bytes a script is allowed to emit to standard error (per map-reduce task). This prevents runaway scripts from filling log partitions to capacity.

hive.exec.script.allow.partial.consumption
  • Default Value: false
  • Added In: Hive 0.5.0

When enabled, this option allows a user script to exit successfully without consuming all the data from the standard input.

hive.script.operator.id.env.var
  • Default Value: HIVE_SCRIPT_OPERATOR_ID
  • Added In: Hive 0.5.0

Name of the environment variable that holds the unique script operator ID in the user's transform function (the custom mapper/reducer that the user has specified in the query).

hive.task.progress
  • Default Value: false
  • Added In: Hive 0.5.0
  • Removed in: Hive 0.13.0 with HIVE-4518

Whether Hive should periodically update task progress counters during execution. Enabling this allows task progress to be monitored more closely in the job tracker, but may impose a performance penalty. This flag is automatically set to true for jobs with hive.exec.dynamic.partition set to true.

hive.exec.compress.output
  • Default Value: false
  • Added In: Hive 0.2.0

This controls whether the final outputs of a query (to a local/HDFS file or a Hive table) are compressed. The compression codec and other options are determined from the Hadoop configuration variables mapred.output.compress*.

hive.exec.compress.intermediate
  • Default Value: false
  • Added In: Hive 0.2.0

This controls whether intermediate files produced by Hive between multiple map-reduce jobs are compressed. The compression codec and other options are determined from Hadoop configuration variables mapred.output.compress*.

hive.exec.parallel
  • Default Value: false
  • Added In: Hive 0.5.0

Whether to execute jobs in parallel.

hive.exec.parallel.thread.number
  • Default Value: 8
  • Added In: Hive 0.6.0

How many jobs at most can be executed in parallel.

hive.exec.rowoffset
  • Default Value: false
  • Added In: Hive 0.8.0

Whether to provide the row offset virtual column.

hive.task.progress
  • Default Value: false
  • Added In: Hive 0.5.0
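  • Removed in: Hive 0.13.0 with HIVE-4518

Whether Hive should periodically update task progress counters during execution. Enabling this allows task progress to be monitored more closely in the job tracker, but may impose a performance penalty. This flag is automatically set to true for jobs with hive.exec.dynamic.partition set to true.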

hive.mergejob.maponly
  • Default Value: true
  • Added In:

Try to generate a map-only job for merging files if CombineHiveInputFormat is supported.

hive.merge.size.per.task
  • Default Value: 256000000
  • Added In:

Size of merged files at the end of the job.

hive.merge.smallfiles.avgsize
  • Default Value: 16000000
  • Added In:

When the average output file size of a job is less than this number, Hive will start an additional map-reduce job to merge the output files into bigger files. This is only done for map-only jobs if hive.merge.mapfiles is true, and for map-reduce jobs if hive.merge.mapredfiles is true.
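The merge rule above can be sketched in Python as follows. This is an illustration of the decision only, under the assumption that the average is taken over all output files of the job; `should_merge_outputs` is a hypothetical helper, and its defaults mirror the documented defaults of hive.merge.mapfiles, hive.merge.mapredfiles, and hive.merge.smallfiles.avgsize.

```python
from typing import Sequence


def should_merge_outputs(output_file_sizes: Sequence[int],
                         is_map_only: bool,
                         merge_mapfiles: bool = True,
                         merge_mapredfiles: bool = False,
                         avg_size_threshold: int = 16_000_000) -> bool:
    """Decide whether an extra merge job would be started for a job's outputs."""
    if not output_file_sizes:
        return False
    # Map-only jobs are governed by merge_mapfiles, map-reduce jobs by merge_mapredfiles.
    enabled = merge_mapfiles if is_map_only else merge_mapredfiles
    if not enabled:
        return False
    average = sum(output_file_sizes) / len(output_file_sizes)
    return average < avg_size_threshold


small_files = [1_000_000] * 10
print(should_merge_outputs(small_files, is_map_only=True))
print(should_merge_outputs(small_files, is_map_only=False))
```

With the defaults, ten 1 MB files from a map-only job qualify for merging, while the same files from a map-reduce job do not unless hive.merge.mapredfiles is enabled.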

hive.mapjoin.smalltable.filesize
  • Default Value: 25000000
  • Added In:

The threshold for the input file size of the small tables; if the file size is smaller than this threshold, it will try to convert the common join into map join.

hive.mapjoin.localtask.max.memory.usage
  • Default Value: 0.90
  • Added In:

How much memory the local task can use to hold key/value pairs in the in-memory hash table. If the local task's memory usage exceeds this number, the local task is aborted, which means the data of the small table is too large to be held in memory.

hive.mapjoin.followby.gby.localtask.max.memory.usage
  • Default Value: 0.55
  • Added In:

How much memory the local task can use to hold key/value pairs in the in-memory hash table when the map join is followed by a group by. If the local task's memory usage exceeds this number, the local task is aborted, which means the data of the small table is too large to be held in memory.

hive.mapjoin.check.memory.rows
  • Default Value: 100000
  • Added In:

The number of rows processed after which memory usage is checked.

hive.counters.group.name
  • Default Value: HIVE
  • Added In: Hive 0.13.0 with HIVE-4518

Counter group name for counters used during query execution. The counter group is used for internal Hive variables (CREATED_FILE, FATAL_ERROR, and so on).

hive.exec.pre.hooks
  • Default Value: (empty)
  • Added In: Hive 0.4.0

Comma-separated list of pre-execution hooks to be invoked for each statement. A pre-execution hook is specified as the name of a Java class which implements the org.apache.hadoop.hive.ql.hooks.ExecuteWithHookContext interface.

hive.exec.post.hooks
  • Default Value: (empty)
  • Added In: Hive 0.5.0

Comma-separated list of post-execution hooks to be invoked for each statement. A post-execution hook is specified as the name of a Java class which implements the org.apache.hadoop.hive.ql.hooks.ExecuteWithHookContext interface.

hive.exec.failure.hooks
  • Default Value: (empty)
  • Added In: Hive 0.8.0

Comma-separated list of on-failure hooks to be invoked for each statement. An on-failure hook is specified as the name of a Java class which implements the org.apache.hadoop.hive.ql.hooks.ExecuteWithHookContext interface.

hive.merge.mapfiles
  • Default Value: true
  • Added In:

Merge small files at the end of a map-only job.

hive.merge.mapredfiles
  • Default Value: false
  • Added In:

Merge small files at the end of a map-reduce job.

hive.mergejob.maponly
  • Default Value: true
  • Added In:

Try to generate a map-only job for merging files if CombineHiveInputFormat is supported.

hive.merge.size.per.task
  • Default Value: 256000000
  • Added In:

Size of merged files at the end of the job.

hive.merge.smallfiles.avgsize
  • Default Value: 16000000
  • Added In:

When the average output file size of a job is less than this number, Hive will start an additional map-reduce job to merge the output files into bigger files. This is only done for map-only jobs if hive.merge.mapfiles is true, and for map-reduce jobs if hive.merge.mapredfiles is true.

hive.heartbeat.interval
  • Default Value: 1000
  • Added In:

...