...

set hive.execution.engine=spark;

Hive on Spark is available from Hive 1.1 onward. It is still under active development in the "spark" and "spark2" branches, and is periodically merged into the "master" branch of Hive.
See HIVE-7292 and its sub-tasks and linked issues.

Hive on Spark was added in HIVE-7292.

Version Compatibility

Hive on Spark is only tested with a specific version of Spark, so a given version of Hive is only guaranteed to work with a specific version of Spark. Other versions of Spark may work with a given version of Hive, but that is not guaranteed. Below is a list of Hive versions and their corresponding compatible Spark versions.

Hive Version	Spark Version
master	2.3.0
3.0.x	2.3.0
2.3.x	2.0.0
2.2.x	1.6.0
2.1.x	1.6.0
2.0.x	1.5.0
1.2.x	1.3.1
1.1.x	1.2.0
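
As a quick illustration of how the table above can be used, here is a minimal sketch of a lookup helper. The function name spark_version_for_hive is hypothetical; the version pairs are copied from the table and nothing else is assumed.

```shell
# Map a Hive release (x.y.z or "master") to the Spark version it was tested with.
# The pairs below are copied from the compatibility table.
spark_version_for_hive() {
  case "$1" in
    master|3.0.*) echo "2.3.0" ;;
    2.3.*)        echo "2.0.0" ;;
    2.2.*|2.1.*)  echo "1.6.0" ;;
    2.0.*)        echo "1.5.0" ;;
    1.2.*)        echo "1.3.1" ;;
    1.1.*)        echo "1.2.0" ;;
    *)            echo "unknown"; return 1 ;;
  esac
}

spark_version_for_hive 2.3.4   # prints 2.0.0
```

A Hive version that is not in the table prints "unknown" and returns a non-zero status, which matches the note that other combinations are not guaranteed.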

Spark Installation

Follow instructions to install Spark:

YARN Mode: http://spark.apache.org/docs/latest/running-on-yarn.html
Standalone Mode: https://spark.apache.org/docs/latest/spark-standalone.html

Hive on Spark uses Spark on YARN mode by default.

For the installation perform the following tasks:

Install Spark (either download pre-built Spark, or build assembly from source).
- Install/build a compatible version. The <spark.version> property in Hive's root pom.xml defines the version of Spark that Hive was built and tested with.
- Install/build a compatible distribution. Each version of Spark has several distributions, corresponding to different versions of Hadoop.
- Once Spark is installed, find and note the <spark-assembly-*.jar> location.
- Note that you must have a version of Spark which does not include the Hive jars, meaning one which was not built with the Hive profile. If you will use Parquet tables, it's recommended to also enable the "parquet-provided" profile; otherwise there could be conflicts in the Parquet dependency. To build a distribution without the Hive jars, use the following command in your Spark repository:
  Prior to Spark 2.0.0:
  Code Block
  language bash
  ./make-distribution.sh --name "hadoop2-without-hive" --tgz "-Pyarn,hadoop-provided,hadoop-2.4,parquet-provided"
  Since Spark 2.0.0:
  Code Block
  language bash
  ./dev/make-distribution.sh --name "hadoop2-without-hive" --tgz "-Pyarn,hadoop-provided,hadoop-2.7,parquet-provided"
  Since Spark 2.3.0:
  Code Block
  language bash
  ./dev/make-distribution.sh --name "hadoop2-without-hive" --tgz "-Pyarn,hadoop-provided,hadoop-2.7,parquet-provided,orc-provided"
Start Spark cluster
- Keep note of the <Spark Master URL>. This can be found in Spark master WebUI.
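
The version check from the first installation step can be sketched as follows. This is a minimal example that assumes the <spark.version> element sits on a single line of pom.xml; the helper name hive_spark_version and the temporary file are hypothetical and only there to make the example self-contained.

```shell
# Print the value of the <spark.version> property from a Hive root pom.xml.
# Assumes the opening and closing tags are on the same line.
hive_spark_version() {
  sed -n 's:.*<spark\.version>\(.*\)</spark\.version>.*:\1:p' "$1" | head -n 1
}

# Demonstration against a throwaway file standing in for Hive's pom.xml.
pom=$(mktemp)
printf '<properties>\n  <spark.version>2.3.0</spark.version>\n</properties>\n' > "$pom"
hive_spark_version "$pom"   # prints 2.3.0
rm -f "$pom"
```

Against a real checkout, the function would be called with the path to the pom.xml in the root of the Hive source tree.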

Configuring YARN

The fair scheduler is required instead of the capacity scheduler. It distributes an equal share of resources to the jobs in the YARN cluster.

yarn.resourcemanager.scheduler.class=org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler
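
The setting above is shown in key=value form. As a rough sketch, assuming it is placed in yarn-site.xml, the helper below prints it as a Hadoop-style property element. The function name site_property is hypothetical.

```shell
# Print one Hadoop-style <property> element for a *-site.xml file.
site_property() {
  printf '<property>\n  <name>%s</name>\n  <value>%s</value>\n</property>\n' "$1" "$2"
}

site_property yarn.resourcemanager.scheduler.class \
  org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler
```

The printed element would go inside the <configuration> element of yarn-site.xml.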

Configuring Hive

To add the Spark dependency to Hive:
- Prior to Hive 2.2.0, link the spark-assembly jar to HIVE_HOME/lib.
- Since Hive 2.2.0, Hive on Spark runs with Spark 2.0.0 and above, which doesn't have an assembly jar.
  - To run with YARN mode (either yarn-client or yarn-cluster), link the following jars to HIVE_HOME/lib.
    - scala-library
    - spark-core
    - spark-network-common
  - To run with LOCAL mode (for debugging only), link the following jars in addition to those above to HIVE_HOME/lib.
    - chill-java chill jackson-module-paranamer jackson-module-scala jersey-container-servlet-core
    - jersey-server json4s-ast kryo-shaded minlog scala-xml spark-launcher
    - spark-network-shuffle spark-unsafe xbean-asm5-shaded
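
The linking step for YARN mode can be sketched as below. This assumes a Spark 2.x layout in which the jars live in $SPARK_HOME/jars and are named <artifact>-<version>.jar or <artifact>_<scala version>-<version>.jar; the function name link_spark_jars is hypothetical. Pass absolute paths so that the symlinks resolve.

```shell
# Symlink the jars needed for YARN mode from Spark into HIVE_HOME/lib.
# Usage: link_spark_jars "$SPARK_HOME" "$HIVE_HOME"
link_spark_jars() {
  spark_home=$1
  hive_home=$2
  for name in scala-library spark-core spark-network-common; do
    for jar in "$spark_home"/jars/"$name"[-_][0-9]*.jar; do
      # If nothing matched, the pattern is left unexpanded and the file does not exist.
      [ -e "$jar" ] || { echo "missing jar: $name" >&2; return 1; }
      ln -sf "$jar" "$hive_home/lib/"
    done
  done
}
```

For LOCAL mode, the additional jar names listed above could be appended to the loop in the same way.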
Configure Hive execution engine to use Spark:
Code Block
set hive.execution.engine=spark;
See the Spark section of Hive Configuration Properties for other properties for configuring Hive and the Remote Spark Driver.

Configure the Spark application settings for Hive (see http://spark.apache.org/docs/latest/configuration.html). This can be done either by adding a "spark-defaults.conf" file with these properties to the Hive classpath, or by setting them in the Hive configuration (hive-site.xml). For instance:

Code Block

set spark.master=<Spark Master URL>;
set spark.eventLog.enabled=true;
set spark.eventLog.dir=<Spark event log folder (must exist)>;
set spark.executor.memory=512m;
set spark.serializer=org.apache.spark.serializer.KryoSerializer;
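
The same properties can be supplied through the "spark-defaults.conf" alternative. The sketch below writes such a file, which holds one whitespace-separated "key value" pair per line. The function name write_spark_defaults is hypothetical, and which directory is on the Hive classpath (for example Hive's conf directory) is an assumption to verify for your installation.

```shell
# Write a spark-defaults.conf with the example properties shown above.
# Usage: write_spark_defaults <dir on Hive classpath> <Spark Master URL> <event log dir>
write_spark_defaults() {
  conf_dir=$1
  master=$2
  event_log_dir=$3   # this folder must already exist
  cat > "$conf_dir/spark-defaults.conf" <<EOF
spark.master                $master
spark.eventLog.enabled      true
spark.eventLog.dir          $event_log_dir
spark.executor.memory       512m
spark.serializer            org.apache.spark.serializer.KryoSerializer
EOF
}
```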

Configuration property details

spark.executor.memory: Amount of memory to use per executor process.
spark.executor.cores: Number of cores per executor.
spark.yarn.executor.memoryOverhead: The amount of off-heap memory (in megabytes) to be allocated per executor when running Spark on YARN. This memory accounts for things like VM overheads, interned strings, and other native overheads. In addition to the executor's memory, the container in which the executor is launched needs some extra memory for system processes, and this is what the overhead is for.
spark.executor.instances: The number of executors assigned to each application.
spark.driver.memory: The amount of memory assigned to the Remote Spark Context (RSC). We recommend 4GB.
spark.yarn.driver.memoryOverhead: We recommend 400 (MB).
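
To make the relationship between the memory and memoryOverhead properties concrete, here is a small worked example: the memory requested for a container is the process memory plus its overhead. The function name container_mb and the 384 MB executor overhead are hypothetical values chosen for illustration; the driver numbers are the recommendations above.

```shell
# Memory (in MB) requested for one YARN container: process memory plus overhead.
container_mb() {
  echo $(( $1 + $2 ))
}

container_mb 4096 400   # driver: 4 GB + 400 MB overhead, prints 4496
container_mb 512 384    # executor: 512 MB + 384 MB overhead (hypothetical), prints 896
```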

Allow YARN to cache the necessary Spark dependency jars on the nodes so that they do not need to be distributed each time an application runs.
- Prior to Hive 2.2.0, upload the spark-assembly jar to an HDFS file (for example: hdfs://xxxx:8020/spark-assembly.jar) and add the following to hive-site.xml:
  Code Block
  <property>
    <name>spark.yarn.jar</name>
    <value>hdfs://xxxx:8020/spark-assembly.jar</value>
  </property>
- Since Hive 2.2.0, upload all jars in $SPARK_HOME/jars to an HDFS folder (for example: hdfs://xxxx:8020/spark-jars) and add the following to hive-site.xml:
  Code Block
  <property>
    <name>spark.yarn.jars</name>
    <value>hdfs://xxxx:8020/spark-jars/*</value>
  </property>
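
The upload for Hive 2.2.0 and later can be sketched with the standard hdfs dfs commands. The function name upload_spark_jars and the HDFS_CMD switch are hypothetical; setting HDFS_CMD to "echo hdfs" only prints the commands, which is how the demonstration line below runs without a cluster. The paths are placeholders.

```shell
# Upload every jar under <spark home>/jars to an HDFS folder.
# HDFS_CMD defaults to "hdfs"; set it to "echo hdfs" for a dry run.
upload_spark_jars() {
  spark_home=$1
  target=$2   # for example hdfs://xxxx:8020/spark-jars
  ${HDFS_CMD:-hdfs} dfs -mkdir -p "$target" &&
  ${HDFS_CMD:-hdfs} dfs -put "$spark_home"/jars/*.jar "$target"/
}

# Dry run: print the commands instead of executing them.
HDFS_CMD="echo hdfs"
upload_spark_jars /opt/spark hdfs://xxxx:8020/spark-jars
unset HDFS_CMD
```

After a real upload, spark.yarn.jars would point at the same folder with a trailing /* as in the property above.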

Configuring Spark

Setting executor memory size is more complicated than simply setting it to be as large as possible. There are several things that need to be taken into consideration:

...
