...

To configure Hive to use Spark as its execution engine, set the following property to "spark":
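
The property in question is hive.execution.engine, Hive's standard execution-engine selector (valid values include mr, tez, and spark). A minimal session-level example; the same value can be made permanent in hive-site.xml:

    set hive.execution.engine=spark;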

Besides the configuration properties listed in this section, some properties in other sections are also related to Spark:

hive.spark.job.monitor.timeout

...

hive.spark.optimize.shuffle.serde
  • Default Value: false
  • Added In: Hive 3.0.0 with HIVE-15104

If this is set to true, Hive on Spark will register custom serializers for data types in shuffle. This should result in less shuffled data.
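
For example, the optimization can be switched on for the current session:

    set hive.spark.optimize.shuffle.serde=true;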

hive.merge.sparkfiles
  • Default Value: false
  • Added In: Hive 1.1.0 with HIVE-7810

Merge small files at the end of a Spark DAG Transformation.
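
A sketch of enabling the merge step for Spark jobs. The two size thresholds shown are the general hive.merge.* settings documented elsewhere on this page, which also govern when the merge runs; the values are their usual defaults, included only for illustration:

    set hive.merge.sparkfiles=true;
    -- files smaller than this average size trigger an extra merge job
    set hive.merge.smallfiles.avgsize=16000000;
    -- target size of each merged file
    set hive.merge.size.per.task=256000000;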

hive.spark.session.timeout
  • Default Value: 30 minutes
  • Added In: Hive 4.0.0 with HIVE-14162

Amount of time the Spark Remote Driver should wait for a Spark job to be submitted before shutting down. If no Spark job is launched within this period, the Spark Remote Driver shuts down, releasing any resources it has been holding. The tradeoff is that any new Hive-on-Spark queries run in the same session must wait for a new Spark Remote Driver to start up. The benefit is that for long-running Hive sessions, the Spark Remote Driver doesn't hold onto resources unnecessarily. Minimum value is 30 minutes.

hive.spark.session.timeout.period
  • Default Value: 60 seconds
  • Added In: Hive 4.0.0 with HIVE-14162

How frequently to check for idle Spark sessions. Minimum value is 60 seconds.
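
A sketch of tuning both timeout settings for a long-lived session, assuming Hive's usual time-unit suffixes for time-valued properties; in practice these may be configured in hive-site.xml rather than per session:

    -- keep an idle Spark Remote Driver around for two hours instead of 30 minutes
    set hive.spark.session.timeout=2h;
    -- check for idle sessions every five minutes (the minimum is 60 seconds)
    set hive.spark.session.timeout.period=5min;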

hive.spark.use.op.stats
  • Default Value: true
  • Added In: Hive 2.3.0 with HIVE-15796

Whether to use operator statistics to determine reducer parallelism for Hive on Spark. If this is false, Hive uses the source table statistics to determine reducer parallelism for all first-level reduce tasks, and the maximum reducer parallelism from all parents for the remaining (second-level and onward) reduce tasks.

hive.spark.use.ts.stats.for.mapjoin
  • Default Value: false
  • Added In: Hive 2.3.0 with HIVE-15489

If this is set to true, map-join optimization in Hive on Spark uses statistics from the TableScan operators at the root of the operator tree, instead of from the parent ReduceSink operators of the Join operator.

hive.spark.use.groupby.shuffle
  • Default Value: true
  • Added In: Hive 2.3.0 with HIVE-15580

Whether to use Spark's groupByKey shuffle to implement group by. When set to false, Hive uses a sort-based shuffle (repartition and sort within partitions) instead, which can be slower but avoids holding all values for a key in memory at once.

hive.combine.equivalent.work.optimization
  • Default Value: true
  • Added In: Hive 1.1.0

Whether to combine equivalent work objects in the Spark plan so that identical operator subtrees (for example, the same table scan used by a self-join or self-union) are computed only once.

mapreduce.job.reduces
  • Default Value: -1 (disabled)
  • Added In: Hive 1.1.0 with HIVE-7567

Sets the number of reduce tasks for each Spark shuffle stage (i.e., the number of partitions when performing a Spark shuffle). By default this is -1 (disabled), and the number of reduce tasks is instead calculated dynamically based on Hive data statistics. Setting it to a constant value applies the same number of partitions to all Spark shuffle stages.
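
For instance, to override the automatic calculation for a single session (64 is an arbitrary illustrative value):

    -- give every Spark shuffle stage in this session 64 partitions
    set mapreduce.job.reduces=64;
    -- revert to dynamic calculation based on statistics
    set mapreduce.job.reduces=-1;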

Remote Spark Driver

The remote Spark driver is the application launched in the Spark cluster that submits the actual Spark job. It was introduced in HIVE-8528. It is a long-lived application, initialized upon the first query of the current user and running until the user's session is closed. The following properties control the remote communication between the remote Spark driver and the Hive client that spawns it.

...