If true, the raw data size is collected when analyzing tables.

hive.client.stats.publishers

Default Value: (empty)
Added In: Hive 0.8.1 with HIVE-2446 (patch 2)

Comma-separated list of statistics publishers to be invoked on counters on each job. A client stats publisher is specified as the name of a Java class which implements the org.apache.hadoop.hive.ql.stats.ClientStatsPublisher interface.

hive.client.stats.counters

Default Value: (empty)
Added In: Hive 0.8.1 with HIVE-2446 (patch 2)

Subset of counters that should be of interest for hive.client.stats.publishers (when one wants to limit their publishing). Non-display names should be used.

hive.stats.reliable

Default Value: false
Added In: Hive 0.10.0 with HIVE-1653
New Behavior In: Hive 0.13.0 with HIVE-3777

Whether queries will fail because statistics cannot be collected completely accurately. If this is set to true, reading/writing from/into a partition or unpartitioned table may fail because the statistics could not be computed accurately. If it is set to false, the operation will succeed.

In Hive 0.13.0 and later, if hive.stats.reliable is false and statistics could not be computed correctly, the operation can still succeed and update the statistics but it sets a partition property "areStatsAccurate" to false. If the application needs accurate statistics, they can then be obtained in the background.
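The two outcomes described above can be modeled with a small sketch. This is a hypothetical Python model, not Hive code: finish_operation, StatsCollectionError, and the dictionary standing in for partition properties are names invented for illustration. Only hive.stats.reliable and the "areStatsAccurate" property come from the description.

```python
class StatsCollectionError(Exception):
    """Hypothetical error: statistics could not be computed accurately."""


def finish_operation(stats_accurate, stats_reliable, partition_props):
    """Model the outcome of a read/write for a given hive.stats.reliable value.

    stats_accurate:  True if statistics were computed accurately.
    stats_reliable:  the value of hive.stats.reliable.
    partition_props: dict standing in for the partition's properties.
    """
    if stats_accurate:
        # Nothing special to do when statistics were collected correctly.
        return partition_props
    if stats_reliable:
        # hive.stats.reliable=true: the operation may fail.
        raise StatsCollectionError("statistics could not be computed accurately")
    # hive.stats.reliable=false (Hive 0.13.0 and later): the operation
    # succeeds, but the partition is flagged so that accurate statistics
    # can be recomputed in the background.
    updated = dict(partition_props)
    updated["areStatsAccurate"] = "false"
    return updated
```

An application that needs accurate statistics could scan for partitions flagged this way and re-analyze them later.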

hive.stats.ndv.error

Default Value: 20.0
Added In: Hive 0.10 with HIVE-1362 (patch 10)

Standard error allowed for NDV estimates, expressed in percentage. This provides a tradeoff between accuracy and compute cost. A lower value for the error indicates higher accuracy and a higher compute cost. (NDV means number of distinct values.)
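To make the accuracy/cost tradeoff concrete, here is a rough sketch. It assumes a Flajolet-Martin style (PCSA) estimator, whose standard error is approximately 0.78 / sqrt(m) for m bit vectors. Whether Hive maps the percentage to bit vectors in exactly this way is an assumption, and bit_vectors_for_error is a hypothetical helper name.

```python
import math


def bit_vectors_for_error(error_percent, constant=0.78):
    """Smallest m such that constant / sqrt(m) <= error_percent / 100.

    Assumes the PCSA approximation of the standard error; this is an
    illustration, not the mapping Hive necessarily uses.
    """
    if error_percent <= 0:
        raise ValueError("error_percent must be positive")
    error = error_percent / 100.0
    return math.ceil((constant / error) ** 2)


# A lower allowed error needs more bit vectors, hence more compute.
default_cost = bit_vectors_for_error(20.0)
strict_cost = bit_vectors_for_error(5.0)
```

Under this assumption, tightening the error from 20% to 5% multiplies the number of bit vectors by roughly sixteen, since the cost grows with the inverse square of the error.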

hive.stats.collect.tablekeys

Default Value: false
Added In: Hive 0.10 with HIVE-3501

Whether join and group by keys on tables are derived and maintained in the QueryPlan. This is useful to identify how tables are accessed and to determine if they should be bucketed.

hive.stats.collect.scancols

Default Value: false
Added In: Hive 0.11 with HIVE-3940

Whether column accesses are tracked in the QueryPlan. This is useful to identify how tables are accessed and to determine if there are wasted columns that can be trimmed.

hive.stats.key.prefix.max.length

Default Value: 200 (Hive 0.11 and 0.12) or 150 (Hive 0.13 and later)
Added In: Hive 0.11 with HIVE-3750

Determines whether a hash of the key is used when the prefix of the key used for intermediate statistics collection exceeds a certain length. If the value is < 0, hashing is never used; if the value is >= 0, hashing is used only when the key prefix's length exceeds that value. The key prefix is defined as everything preceding the task ID in the key. For counter type statistics, the length is capped by mapreduce.job.counters.group.name.max, which is 128 by default.
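The rule above can be sketched as follows. This is a minimal illustration, not Hive's implementation: stats_key_prefix is a hypothetical function name, and the use of an MD5 hex digest is an assumption, since the description does not name the hash function.

```python
import hashlib


def stats_key_prefix(prefix, max_length):
    """Return the key prefix to use for intermediate statistics.

    max_length models hive.stats.key.prefix.max.length:
      < 0  -> hashing is never used
      >= 0 -> the prefix is hashed only if it is longer than max_length
    """
    if max_length < 0:
        return prefix
    if len(prefix) > max_length:
        # Assumed hash: MD5 hex digest (fixed 32 characters).
        return hashlib.md5(prefix.encode("utf-8")).hexdigest()
    return prefix
```

Because the digest has a fixed length, a hashed prefix stays short no matter how deep the partition path is.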

hive.stats.key.prefix.reserve.length

Default Value: 24
Added In: Hive 0.13 with HIVE-6229

Reserved length for the postfix of the statistics key. Currently only meaningful for counter type statistics, which should keep the length of the full statistics key smaller than the maximum length configured by hive.stats.key.prefix.max.length. For counter type statistics, it should be bigger than the length of the LB spec, if one exists.

hive.stats.max.variable.length

Default Value: 100
Added In: Hive 0.13 with HIVE-5369

If the length of a variable length data type cannot be determined, this length will be used.

To estimate the size of data flowing through operators in Hive/Tez (for reducer estimation etc.), average row size is multiplied by the total number of rows coming out of each operator. Average row size is computed from the average column size of all columns in the row. In the absence of column statistics, this value will be used for variable length columns (like string, bytes, etc.). For fixed length columns, their corresponding Java equivalent sizes are used (float -- 4 bytes, double -- 8 bytes, etc.).
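The estimate described above can be sketched in a few lines. This is a simplified model, not Hive's code: estimate_data_size and the type tables are hypothetical, and the int and bigint sizes are the Java int and long sizes added for illustration alongside the float and double sizes given in the text.

```python
# Fixed length types and their Java equivalent sizes in bytes.
FIXED_SIZES = {"int": 4, "bigint": 8, "float": 4, "double": 8}

# Variable length types that fall back to hive.stats.max.variable.length.
VARIABLE_TYPES = {"string", "binary"}


def estimate_data_size(column_types, num_rows, max_variable_length=100):
    """Estimate operator output size when column statistics are absent."""
    avg_row_size = 0
    for col_type in column_types:
        if col_type in VARIABLE_TYPES:
            avg_row_size += max_variable_length
        else:
            avg_row_size += FIXED_SIZES[col_type]
    # Average row size multiplied by the number of rows.
    return avg_row_size * num_rows


# Row of (int, string, double) with the default of 100 bytes per string.
size = estimate_data_size(["int", "string", "double"], 1000)
```

With the default value, each string column contributes 100 bytes per row, so tables with many short strings are overestimated until column statistics are available.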

hive.stats.list.num.entries

Default Value: 10
Added In: Hive 0.13 with HIVE-5369

To estimate the size of data flowing through operators in Hive/Tez (for reducer estimation etc.), average row size is multiplied by the total number of rows coming out of each operator. Average row size is computed from the average column size of all columns in the row. In the absence of column statistics, the average number of entries/values for variable length complex columns like list can be specified using this configuration property.

hive.stats.map.num.entries

Default Value: 10
Added In: Hive 0.13 with HIVE-5369

To estimate the size of data flowing through operators in Hive/Tez (for reducer estimation etc.), average row size is multiplied by the total number of rows coming out of each operator. Average row size is computed from the average column size of all columns in the row. In the absence of column statistics, the average number of entries/values for variable length complex columns like map can be specified using this configuration property.
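As a rough illustration of how hive.stats.list.num.entries and hive.stats.map.num.entries feed into the column size, consider the sketch below. The function names are hypothetical, and the assumption that a complex column's size is simply the entry count times the per-entry size is a simplification that ignores any container overhead.

```python
def estimate_list_size(element_size, num_entries=10):
    """Assumed size of a list column: entries times element size."""
    return num_entries * element_size


def estimate_map_size(key_size, value_size, num_entries=10):
    """Assumed size of a map column: entries times (key + value) size."""
    return num_entries * (key_size + value_size)


# list<int> with the default of 10 entries, 4 bytes per int.
list_bytes = estimate_list_size(4)

# map<string,double>: 100 bytes per string key (the default of
# hive.stats.max.variable.length) plus 8 bytes per double value.
map_bytes = estimate_map_size(100, 8)
```

Raising either property makes the optimizer assume larger rows for tables with list or map columns, which in turn raises the estimated data size.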

hive.stats.map.parallelism

Default Value: 1
Added In: Hive 0.13 with HIVE-5369

The Hive/Tez optimizer estimates the data size flowing through each of the operators. For the GROUPBY operator, map-side parallelism needs to be known to accurately compute the data size. By default, this value is set to 1 since the optimizer is not aware of the number of mappers at compile time. This Hive config can be used to specify the number of mappers to be used for data size computation of the GROUPBY operator.

hive.stats.fetch.partition.stats

Default Value: true
Added In: Hive 0.13 with HIVE-6298

Annotation of the operator tree with statistics information requires partition level basic statistics like number of rows, data size and file size. Partition statistics are fetched from the metastore. Fetching partition statistics for each needed partition can be expensive when the number of partitions is high. This flag can be used to disable fetching of partition statistics from the metastore. When this flag is disabled, Hive will make calls to the filesystem to get file sizes and will estimate the number of rows from the row schema.

hive.stats.fetch.column.stats

Default Value: false
Added In: Hive 0.13 with HIVE-5898

Annotation of the operator tree with statistics information requires column statistics. Column statistics are fetched from the metastore. Fetching column statistics for each needed column can be expensive when the number of columns is high. This flag can be used to disable fetching of column statistics from the metastore.

hive.stats.join.factor

Default Value: (float) 1.1
Added In: Hive 0.13 with HIVE-5921

The Hive/Tez optimizer estimates the data size flowing through each of the operators. The JOIN operator uses column statistics to estimate the number of rows flowing out of it and hence the data size. In the absence of column statistics, this factor determines the number of rows flowing out of the JOIN operator.
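One plausible reading of this fallback, for a two-way join, is that the largest input row count is scaled by the factor. The sketch below makes that assumption explicit. It is not taken from Hive's source, and estimate_join_rows is a hypothetical name.

```python
def estimate_join_rows(input_row_counts, join_factor=1.1):
    """Assumed fallback: scale the largest input by hive.stats.join.factor.

    input_row_counts: row counts of the join inputs (two-way join assumed).
    """
    if not input_row_counts:
        raise ValueError("at least one input is required")
    return int(max(input_row_counts) * join_factor)


# Joining a 1000-row input with a 200-row input at the default factor.
rows_out = estimate_join_rows([1000, 200])
```

Under this reading, a factor above 1.0 makes the optimizer assume that the join slightly expands its largest input.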

hive.stats.deserialization.factor

Default Value: (float) 1.0
Added In: Hive 0.13 with HIVE-5921

The Hive/Tez optimizer estimates the data size flowing through each of the operators. In the absence of basic statistics like number of rows and data size, file size is used to estimate the number of rows and data size. Since files in tables/partitions are serialized (and optionally compressed), the number of rows and data size cannot be reliably determined from file size alone. The file size is multiplied by this factor to account for serialization and compression.

hive.stats.avg.row.size

Default Value: 10000
Added In: Hive 0.13 with HIVE-5921

In the absence of table/partition statistics, average row size will be used to estimate the number of rows/data size.
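Taken together with hive.stats.deserialization.factor, the fallback can be sketched as below. This is a simplified model under stated assumptions: estimate_from_file_size is a hypothetical name, and dividing the estimated data size by the average row size to get a row count is an assumed reading of the two descriptions.

```python
def estimate_from_file_size(file_size_bytes,
                            deserialization_factor=1.0,
                            avg_row_size=10000):
    """Estimate (data_size, num_rows) when basic statistics are missing.

    deserialization_factor models hive.stats.deserialization.factor.
    avg_row_size models hive.stats.avg.row.size (in bytes).
    """
    # Scale the on-disk size to account for serialization/compression.
    data_size = int(file_size_bytes * deserialization_factor)
    # Assumed: row count is the data size divided by the average row size.
    num_rows = data_size // avg_row_size
    return data_size, num_rows


# 50 MB of compressed files assumed to expand 2x when deserialized.
data_size, num_rows = estimate_from_file_size(50_000_000, 2.0)
```

With a default average row size of 10000 bytes, the row count for tables with narrow rows is heavily underestimated until statistics are gathered with ANALYZE TABLE.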

hive.compute.query.using.stats

...
