...

  1. There are several ways to add the Spark dependency to Hive:
    1. Set the property 'spark.home' to point to the Spark installation:

      Code Block
      hive> set spark.home=/location/to/sparkHome;
    2. Define the SPARK_HOME environment variable before starting Hive CLI/HiveServer2:

      Code Block
      languagebash
      export SPARK_HOME=/usr/lib/spark....
    3. Set the spark-assembly jar on the Hive auxpath:

      Code Block
      languagebash
      hive --auxpath /location/to/spark-assembly-*.jar
    4. Add the spark-assembly jar for the current user session:

      Code Block
      hive> add jar /location/to/spark-assembly-*.jar;
    5. Link the spark-assembly jar to HIVE_HOME/lib.
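
      For example, a minimal sketch assuming a single spark-assembly jar under $SPARK_HOME/lib (adjust the paths for your installation):

      Code Block
      languagebash
      # symlink the assembly jar into Hive's lib directory (paths are examples)
      ln -s $SPARK_HOME/lib/spark-assembly-*.jar $HIVE_HOME/lib/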
    Please note that options c and d are not recommended because they cause Spark to ship the spark-assembly jar to each executor when you run queries.

  2. Configure the Hive execution engine to use Spark:

    Code Block
    hive> set hive.execution.engine=spark;
  3. Configure Spark application configs for Hive. See: http://spark.apache.org/docs/latest/configuration.html. This can be done either by adding a file "spark-defaults.conf" with these properties to the Hive classpath, or by setting them in the Hive configuration (hive-site.xml). For instance:

    Code Block
    hive> set spark.master=<Spark Master URL>;
    hive> set spark.eventLog.enabled=true;
    hive> set spark.eventLog.dir=<Spark event log folder (must exist)>;
    hive> set spark.executor.memory=512m;
    hive> set spark.serializer=org.apache.spark.serializer.KryoSerializer;
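
    Equivalently, the same properties can be placed in a "spark-defaults.conf" file on the Hive classpath, as mentioned above. A minimal sketch mirroring the settings in the previous block (the placeholders still need to be filled in):

    Code Block
    spark.master             <Spark Master URL>
    spark.eventLog.enabled   true
    spark.eventLog.dir       <Spark event log folder (must exist)>
    spark.executor.memory    512m
    spark.serializer         org.apache.spark.serializer.KryoSerializer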

    A brief explanation of some of the configuration properties:

    • spark.executor.memory: Amount of memory to use per executor process.
    • spark.executor.cores: Number of cores per executor.
    • spark.yarn.executor.memoryOverhead: The amount of off-heap memory (in megabytes) to be allocated per executor when running Spark on YARN. This memory accounts for things like VM overheads, interned strings, and other native overheads. In addition to the executor's memory, the container in which the executor is launched needs some extra memory for system processes, and this overhead covers that.
    • spark.executor.instances: The number of executors assigned to each application.
    • spark.driver.memory: The amount of memory assigned to the Remote Spark Context (RSC). We recommend 4GB.
    • spark.yarn.driver.memoryOverhead: We recommend 400 (MB).
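
    For illustration of how the executor memory settings combine (the values below are assumptions for this example, not recommendations): with spark.executor.memory=4g and spark.yarn.executor.memoryOverhead=500, each YARN container requested for an executor is roughly 4 GB + 500 MB, about 4.5 GB.

    Code Block
    hive> set spark.executor.memory=4g;
    hive> set spark.yarn.executor.memoryOverhead=500;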

...