
Hive on Spark: Getting Started

Spark Installation

Follow the instructions at https://spark.apache.org/docs/latest/spark-standalone.html to install Spark.  In particular:

  1. Install Spark (either download a pre-built Spark release, or build the assembly from source).
  2. Start the Spark cluster (master and workers), as sketched below.
    • Make a note of the <Spark Master URL>.  It can be found in the Spark master web UI.
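
The exact commands depend on the Spark version and on whether you downloaded a pre-built release or built one yourself.  The following is a minimal sketch for a single-machine standalone cluster; the archive name and host names are illustrative placeholders, so check the standalone documentation linked above for your version:

    # unpack a pre-built 1.2.x distribution (file name is an example)
    tar -xzf spark-1.2.0-bin-hadoop2.3.tgz
    cd spark-1.2.0-bin-hadoop2.3

    # start the standalone master; the <Spark Master URL> (spark://<host>:7077)
    # is shown at the top of the master web UI (http://<host>:8080 by default)
    ./sbin/start-master.sh

    # start a worker and register it with the master
    ./bin/spark-class org.apache.spark.deploy.worker.Worker spark://<master-host>:7077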

Configuring Hive

  1. As Hive on Spark is still in development, only a Hive assembly built from the Hive/Spark development branch currently supports Spark execution.  The development branch is located here: https://github.com/apache/hive/tree/spark.  Check out the branch and build the Hive assembly as described in https://cwiki.apache.org/confluence/display/Hive/HiveDeveloperFAQ.
  2. If you download Spark, make sure you use a 1.2.x assembly: http://ec2-50-18-79-139.us-west-1.compute.amazonaws.com/data/spark-assembly-1.2.0-SNAPSHOT-hadoop2.3.0-cdh5.1.2.jar
  3. Start Hive with <spark-assembly-*.jar> on the Hive auxpath:

    hive --auxpath /location/to/spark-assembly-*.jar
  4. Configure Hive to use Spark as its execution engine:

    hive> set hive.execution.engine=spark;
  5. Configure Spark application properties for Hive.  See: http://spark.apache.org/docs/latest/configuration.html.  These can be set either by adding a file "spark-defaults.conf" with the properties to the Hive classpath (see the sketch after this list), or by setting them in the Hive configuration:

    hive> set spark.master=<Spark Master URL>;
    hive> set spark.eventLog.enabled=true;
    hive> set spark.executor.memory=512m;
    hive> set spark.serializer=org.apache.spark.serializer.KryoSerializer;
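
Alternatively, a minimal spark-defaults.conf carrying the same properties might look like the following (the master URL is a placeholder; place the file on the Hive classpath):

    spark.master             spark://<master-host>:7077
    spark.eventLog.enabled   true
    spark.executor.memory    512m
    spark.serializer         org.apache.spark.serializer.KryoSerializer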

Common Issues

 

Issue: java.lang.NoSuchMethodError: com.google.common.hash.HashFunction.hashInt(I)Lcom/google/common/hash/HashCode;

Cause: Guava library version conflict between Spark and Hadoop.  See HIVE-7387 and SPARK-2420 for details.

Resolution: Alternatives until this is fixed:

  1. Remove Guava 11 from HADOOP_HOME and replace it with Guava 14.
  2. If you build the Spark assembly manually, apply HIVE-7387-spark.patch to the Spark branch before building.

Issue: org.apache.spark.SparkException: Job aborted due to stage failure: Task 5.0:0 had a not serializable result: java.io.NotSerializableException: org.apache.hadoop.io.BytesWritable

Cause: The Spark serializer is not set to Kryo.

Resolution: Set spark.serializer to org.apache.spark.serializer.KryoSerializer as described above.

Issue: java.lang.NullPointerException at org.apache.hadoop.hive.ql.io.HiveInputFormat.init(HiveInputFormat.java:257) at org.apache.hadoop.hive.ql.io.HiveInputFormat.getRecordReader(HiveInputFormat.java:224)

Cause: Hive is included in the Spark assembly.

Resolution: Either build a version of Spark without the "hive" profile, or unjar the Spark assembly, rm -rf org/apache/hive org/apache/hadoop/hive, and rejar (a shell sketch follows this list).  The fix is in SPARK-2741.

Issue: [ERROR] Terminal initialization failed; falling back to unsupported java.lang.IncompatibleClassChangeError: Found class jline.Terminal, but interface was expected

Cause: Hive has upgraded to JLine2, but jline 0.94 exists in the Hadoop lib directory.

Resolution:

  1. Delete jline from the Hadoop lib directory (it is only pulled in transitively from ZooKeeper).
  2. export HADOOP_USER_CLASSPATH_FIRST=true
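
For the HiveInputFormat NullPointerException above, the unjar/rejar workaround can be sketched as follows; the assembly jar path and the output jar name are illustrative, so substitute the jar you are actually using:

    # unpack the Spark assembly into a scratch directory
    mkdir assembly-tmp
    cd assembly-tmp
    jar -xf /location/to/spark-assembly-1.2.0-SNAPSHOT-hadoop2.3.0-cdh5.1.2.jar
    # remove the Hive classes that conflict with the Hive client
    rm -rf org/apache/hive org/apache/hadoop/hive
    cd ..
    # repack and put the resulting jar on the Hive auxpath instead
    jar -cf spark-assembly-no-hive.jar -C assembly-tmp .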

 

 
