...
hive.spark.optimize.shuffle.serde
- Default Value: false
- Added In: Hive 3.0.0 with HIVE-15104
If this is set to true, Hive on Spark will register custom serializers for data types in shuffle. This should result in less shuffled data.
hive.merge.sparkfiles
- Default Value: false
- Added In: Hive 1.1.0 with HIVE-7810
Merge small files at the end of a Spark DAG Transformation.
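A hedged sketch of how this might be enabled in a session, together with the standard small-file merge companion settings (the byte values are illustrative, not recommendations):

```sql
-- Merge small files produced at the end of a Spark DAG
SET hive.merge.sparkfiles=true;
-- Companion settings that control when and how merging happens (illustrative values)
SET hive.merge.smallfiles.avgsize=16000000;  -- trigger a merge pass if average output file is below ~16 MB
SET hive.merge.size.per.task=256000000;      -- target size of merged files (~256 MB)
```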
hive.spark.session.timeout
- Default Value: 30 minutes
- Added In: Hive 4.0.0 with HIVE-14162
Amount of time the Spark Remote Driver should wait for a Spark job to be submitted before shutting down. If no Spark job is launched within this window, the Spark Remote Driver shuts down, releasing any resources it has been holding. The tradeoff is that any new Hive-on-Spark queries run in the same session must wait for a new Spark Remote Driver to start up; the benefit is that for long-running Hive sessions, the Spark Remote Driver doesn't hold onto resources unnecessarily. Minimum value is 30 minutes.
hive.spark.session.timeout.period
- Default Value: 60 seconds
- Added In: Hive 4.0.0 with HIVE-14162
How frequently to check for idle Spark sessions. Minimum value is 60 seconds.
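As a sketch, the idle timeout and its check interval might be tuned together per session. This assumes the idle-timeout property is named hive.spark.session.timeout (per the HiveConf definitions for HIVE-14162) and uses Hive's time-value syntax (e.g. 60s, 30min, 1h); the values shown are illustrative:

```sql
-- Shut down an idle Spark Remote Driver after 1 hour (minimum is 30 minutes)
SET hive.spark.session.timeout=1h;
-- Check for idle Spark sessions every 5 minutes (minimum is 60 seconds)
SET hive.spark.session.timeout.period=5min;
```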
hive.spark.use.op.stats
- Default Value: true
- Added In: Hive 2.3.0 with HIVE-15796
Whether to use operator stats to determine reducer parallelism for Hive on Spark. If this is false, Hive will use source table stats to determine reducer parallelism for all first-level reduce tasks, and the maximum reducer parallelism from all parents for the remaining (second-level and onward) reduce tasks.
Setting this to false triggers an alternative algorithm for calculating the number of partitions per Spark shuffle. This alternative algorithm typically results in more partitions per shuffle.
hive.spark.use.ts.stats.for.mapjoin
- Default Value: false
- Added In: Hive 2.3.0 with HIVE-15489
If this is set to true, mapjoin optimization in Hive/Spark will use statistics from TableScan operators at the root of operator tree, instead of parent ReduceSink operators of the Join operator. Setting this to true is useful when the operator statistics used for a common join → map join conversion are inaccurate.
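A hedged example of enabling this behavior, shown alongside the usual size threshold that still gates map-join conversion (the threshold value is illustrative):

```sql
-- Use TableScan statistics, rather than parent ReduceSink statistics,
-- when deciding whether a common join can be converted to a map join
SET hive.spark.use.ts.stats.for.mapjoin=true;
-- Conversion is still gated by the map-join size threshold (illustrative ~20 MB)
SET hive.auto.convert.join.noconditionaltask.size=20000000;
```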
hive.spark.use.groupby.shuffle
- Default Value: true
- Added In: Hive 2.3.0 with HIVE-15580
When set to true, use Spark's RDD#groupByKey to perform group bys. When set to false, use Spark's RDD#repartitionAndSortWithinPartitions to perform group bys. While #groupByKey has better performance when running group bys, it can use an excessive amount of memory. Setting this to false may reduce memory usage, but will hurt performance.
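If group-by stages in a session are failing under memory pressure, the property can be flipped per session to trade some performance for bounded memory; a minimal sketch:

```sql
-- Use repartitionAndSortWithinPartitions instead of groupByKey for group-by
-- shuffles: slower, but avoids buffering an entire key's values in memory
SET hive.spark.use.groupby.shuffle=false;
```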
mapreduce.job.reduces
- Default Value: -1 (disabled)
- Added In: Hive 1.1.0 with HIVE-7567
Sets the number of reduce tasks for each Spark shuffle stage (i.e., the number of partitions when performing a Spark shuffle). This is set to -1 by default (disabled); instead, the number of reduce tasks is dynamically calculated based on Hive data statistics. Setting this to a constant value uses the same number of partitions for all Spark shuffle stages.
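A sketch of both modes, with an illustrative partition count:

```sql
-- Pin every Spark shuffle stage to 64 partitions instead of letting Hive
-- derive the count from statistics (64 is an illustrative value)
SET mapreduce.job.reduces=64;

-- Restore the default: derive shuffle parallelism from Hive data statistics
SET mapreduce.job.reduces=-1;
```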
...