Hive on Spark: Getting Started

Spark Installation

Follow the instructions to install Spark: https://spark.apache.org/docs/latest/spark-standalone.html.  In particular:

  1. Install Spark (either download a pre-built Spark release, or build the assembly from source).
  2. Start the Spark cluster (both standalone and Spark on YARN are supported).
    • Take note of the <Spark Master URL>.  This can be found in the Spark master WebUI; a standalone example is sketched after this list.
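
A minimal sketch of bringing up a standalone cluster, assuming a single-machine setup under /location/to/sparkHome (the path, host name, and port are placeholders; the scripts are described in the standalone documentation linked above):

    cd /location/to/sparkHome
    # start the standalone master; its WebUI (normally http://<hostname>:8080) shows the <Spark Master URL>
    ./sbin/start-master.sh
    # start a worker and register it with the master
    ./bin/spark-class org.apache.spark.deploy.worker.Worker spark://<hostname>:7077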

Configuring Hive

  1. As Hive on Spark is still in development, currently only a Hive assembly built from the Hive/Spark development branch supports Spark execution.  The development branch is located here: https://github.com/apache/hive/tree/spark.  Check out the branch and build the Hive assembly as described in https://cwiki.apache.org/confluence/display/Hive/HiveDeveloperFAQ (a rough build sketch is shown after this list).
  2. If you download Spark, make sure you use a 1.2.x assembly: http://ec2-50-18-79-139.us-west-1.compute.amazonaws.com/data/spark-assembly-1.2.0-SNAPSHOT-hadoop2.3.0-cdh5.1.2.jar
  3. There are several ways to add the Spark dependency to Hive:

    1. Set the property 'spark.home' to point to the Spark installation:

      hive> set spark.home=/location/to/sparkHome;
    2. Set the spark-assembly jar on the Hive auxpath:

      hive --auxpath /location/to/spark-assembly-*.jar
    3. Add the spark-assembly jar for the current user session:

      hive> add jar /location/to/spark-assembly-*.jar;
    4. Link the spark-assembly jar to HIVE_HOME/lib.

    Please note that options 2 and 3 are not recommended, because they cause Spark to ship the spark-assembly jar to each executor when you run queries.

  4. Configure Hive to use Spark as its execution engine (a quick verification query is sketched after this list):

    hive> set hive.execution.engine=spark;
  5. Configure Spark-application configs for Hive.  See: http://spark.apache.org/docs/latest/configuration.html.  This can be done either by adding a "spark-defaults.conf" file with these properties to the Hive classpath (an example file is sketched after this list), or by setting them in the Hive configuration:

    hive> set spark.master=<Spark Master URL>;
    hive> set spark.eventLog.enabled=true;
    hive> set spark.executor.memory=512m;
    hive> set spark.serializer=org.apache.spark.serializer.KryoSerializer;
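
As a rough sketch of the build in step 1, assuming Maven and a Hadoop 2.x build (the branch name comes from step 1; the Maven profile and flags are assumptions, and the HiveDeveloperFAQ page linked above is the authoritative reference):

    git clone https://github.com/apache/hive.git
    cd hive
    git checkout spark
    # build and locally install the Hive modules; tests are skipped to shorten the build
    mvn clean install -DskipTests -Phadoop-2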
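
For step 5, the same properties can instead be placed in a "spark-defaults.conf" file on the Hive classpath (for example alongside hive-site.xml in the Hive conf directory; any directory on the classpath works). The file uses Spark's whitespace-separated key/value format:

    spark.master                <Spark Master URL>
    spark.eventLog.enabled      true
    spark.executor.memory       512m
    spark.serializer            org.apache.spark.serializer.KryoSerializer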
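
Once the execution engine and the Spark properties are set, a simple query serves as a smoke test (the table name is a placeholder; any existing table works). The query should appear as an application on the Spark master WebUI:

    hive> set hive.execution.engine=spark;
    hive> select count(*) from <some_table>;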

Common Issues (resolved issues are marked in green and will be removed from this list)

  Issue: Error: Could not find or load main class org.apache.spark.deploy.SparkSubmit
  Cause: Spark dependency not correctly set.
  Resolution: Add the Spark dependency to Hive; see Step 3 above.

  Issue: org.apache.spark.SparkException: Job aborted due to stage failure:
         Task 5.0:0 had a not serializable result: java.io.NotSerializableException: org.apache.hadoop.io.BytesWritable
  Cause: Spark serializer not set to Kryo.
  Resolution: Set spark.serializer to org.apache.spark.serializer.KryoSerializer; see Step 5 above.

  Issue: [ERROR] Terminal initialization failed; falling back to unsupported
         java.lang.IncompatibleClassChangeError: Found class jline.Terminal, but interface was expected
  Cause: Hive has upgraded to JLine2, but jline 0.94 exists in the Hadoop lib.
  Resolution:
    1. Delete jline from the Hadoop lib directory (it is only pulled in transitively from ZooKeeper); a sketch is shown after this list of issues.
    2. export HADOOP_USER_CLASSPATH_FIRST=true
    3. If this error occurs during mvn test, perform a mvn clean install on the root project and the itests directory.

  Issue: java.lang.SecurityException: class "javax.servlet.DispatcherType"'s signer information does not match signer information of other classes in the same package at java.lang.ClassLoader.checkCerts(ClassLoader.java:952)
  Cause: Two versions of the servlet-api are on the classpath.
  Resolution:
    1. This should be fixed by HIVE-8905.
    2. Remove servlet-api-2.5.jar from hive/lib.
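
A minimal sketch of the jline workaround above, assuming a Hadoop installation rooted at $HADOOP_HOME (the jar name and location vary by distribution, so the find pattern is an assumption):

    # remove the old jline jar that Hadoop pulls in transitively from ZooKeeper
    find $HADOOP_HOME -name "jline-0.9*.jar" -delete
    # ensure Hive's JLine2 takes precedence over anything left on the Hadoop classpath
    export HADOOP_USER_CLASSPATH_FIRST=true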
