
Hive on Spark: Getting Started

Spark Installation

Follow the instructions to install Spark: https://spark.apache.org/docs/latest/spark-standalone.html.  In particular:

  1. Install Spark (either download pre-built Spark, or build the assembly from source).  
    • Install/build a compatible version.  The <spark.version> property in Hive's root pom.xml defines which version of Spark Hive was built and tested with.
    • Install/build a compatible distribution.  Each version of Spark has several distributions, corresponding to different versions of Hadoop.
    • Once Spark is installed, find and make a note of the <spark-assembly-*.jar> location.
    • If you download pre-built Spark, you will need to replace the Spark 1.0.x assembly with http://ec2-50-18-79-139.us-west-1.compute.amazonaws.com/data/spark-assembly-1.1.0-SNAPSHOT-hadoop2.3.0.jar.
  2. Start the Spark cluster (master and workers); see the shell sketch after this list.
    • Make a note of the <Spark Master URL>.  This can be found in the Spark master WebUI.
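
For reference, a minimal shell sketch of the two steps above.  The paths ~/hive and /usr/local/spark are examples only; adjust them for your environment:

    # Check which Spark version Hive was built and tested with
    grep "<spark.version>" ~/hive/pom.xml

    # Locate the assembly jar in the Spark installation
    find /usr/local/spark -name "spark-assembly-*.jar"

    # Start the standalone master, then a worker on each host listed in conf/slaves.
    # The <Spark Master URL> (spark://<host>:7077 by default) is shown in the
    # master WebUI at http://<host>:8080.
    /usr/local/spark/sbin/start-master.sh
    /usr/local/spark/sbin/start-slaves.sh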

Configuring Hive

  1. As Hive on Spark is still under development, currently only a Hive assembly built from the Hive/Spark development branch supports Spark execution.  The development branch is located at https://github.com/apache/hive/tree/spark.  Check out the branch and build the Hive assembly as described in https://cwiki.apache.org/confluence/display/Hive/HiveDeveloperFAQ.
  2. If you download Spark, make sure you use a 1.1.x assembly: http://ec2-50-18-79-139.us-west-1.compute.amazonaws.com/data/spark-assembly-1.1.0-SNAPSHOT-hadoop2.3.0.jar
  3. Start Hive with <spark-assembly-*.jar> on the Hive auxpath:

    hive --auxpath /location/to/spark-assembly-*.jar
  4. Configure Hive to use Spark as its execution engine:

    hive> set hive.execution.engine=spark;
  5. Configure Spark application settings for Hive.  See: http://spark.apache.org/docs/latest/configuration.html.  This can be done either by adding a file named "spark-defaults.conf" with these properties to the Hive classpath (a sample file is sketched after this list), or by setting them in the Hive configuration:

    hive> set spark.master=<Spark Master URL>;
    hive> set spark.eventLog.enabled=true;
    hive> set spark.executor.memory=512m;
    hive> set spark.serializer=org.apache.spark.serializer.KryoSerializer;
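
Instead of setting these properties at the Hive prompt, the same settings can be placed in a spark-defaults.conf file on the Hive classpath.  A minimal sketch follows; the master URL spark://master-host:7077 is a placeholder for your <Spark Master URL>:

    spark.master                   spark://master-host:7077
    spark.eventLog.enabled         true
    spark.executor.memory          512m
    spark.serializer               org.apache.spark.serializer.KryoSerializer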

Common Issues

 

Issue: java.lang.NoSuchMethodError: com.google.common.hash.HashFunction.hashInt(I)Lcom/google/common/hash/HashCode
Cause: Guava library version conflict between Spark and Hadoop.  See HIVE-7387 and SPARK-2420 for details.
Resolution: Alternatives until this is fixed:
  1. Remove Guava 11 from HADOOP_HOME and replace it with Guava 14.
  2. If you build the Spark assembly manually, apply HIVE-7387-spark.patch to the Spark branch before building.

Issue: org.apache.spark.SparkException: Job aborted due to stage failure: Task 5.0:0 had a not serializable result: java.io.NotSerializableException: org.apache.hadoop.io.BytesWritable
Cause: Spark serializer not set to Kryo.
Resolution: Set spark.serializer to org.apache.spark.serializer.KryoSerializer as described above.

Issue: java.lang.NullPointerException
    at org.apache.hadoop.hive.ql.io.HiveInputFormat.init(HiveInputFormat.java:257)
    at org.apache.hadoop.hive.ql.io.HiveInputFormat.getRecordReader(HiveInputFormat.java:224)
Cause: Hive is included in the Spark assembly.
Resolution: Either build a version of Spark without the "hive" profile, or unjar the Spark assembly, rm -rf org/apache/hive org/apache/hadoop/hive, and rejar (see the sketch below).  The fix is in SPARK-2741.
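
For the last issue, a minimal shell sketch of the unjar/rejar workaround.  The assembly path below is an example; substitute the spark-assembly jar noted during installation:

    mkdir assembly-tmp && cd assembly-tmp
    jar xf /path/to/spark-assembly-1.1.0-SNAPSHOT-hadoop2.3.0.jar
    # Remove the bundled Hive classes that conflict with Hive's own
    rm -rf org/apache/hive org/apache/hadoop/hive
    # Rebuild the jar, reusing the original manifest
    jar cmf META-INF/MANIFEST.MF /path/to/spark-assembly-1.1.0-SNAPSHOT-hadoop2.3.0.jar .
    cd .. && rm -rf assembly-tmp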

 

 
