
Hive on Spark provides Hive with the ability to utilize Apache Spark as its execution engine.

set hive.execution.engine=spark;

Hive on Spark has been part of Hive since the Hive 1.1 release. It is still under active development in the "spark" and "spark2" branches, and is periodically merged into Hive's "master" branch.
See HIVE-7292, its sub-tasks, and linked issues.

Spark Installation

Follow the instructions to install Spark:

  • YARN Mode: http://spark.apache.org/docs/latest/running-on-yarn.html
  • Standalone Mode: https://spark.apache.org/docs/latest/spark-standalone.html (if you are running Spark in standalone mode)

Hive on Spark supports Spark on YARN mode by default.

For the installation, perform the following tasks:

  1. Install Spark (either download a pre-built Spark, or build the assembly from source).
    • Install/build a compatible version. The <spark.version> property in Hive's root pom.xml defines which version of Spark Hive was built and tested with.
    • Install/build a compatible distribution. Each version of Spark has several distributions, corresponding to different versions of Hadoop.
    • Once Spark is installed, find and take note of the <spark-assembly-*.jar> location.
    • Note that you must use a version of Spark that does not include the Hive jars, that is, one that was not built with the Hive profile. If you will use Parquet tables, it is recommended to also enable the "parquet-provided" profile; otherwise there could be conflicts in the Parquet dependency. To remove the Hive jars from the installation, run the following command under your Spark repository:

      Code Block
      languagebash
      ./make-distribution.sh --name "hadoop2-without-hive" --tgz "-Pyarn,hadoop-provided,hadoop-2.4,parquet-provided"
  2. Start the Spark cluster (both standalone and Spark on YARN are supported).
    • Take note of the <Spark Master URL>. This can be found in the Spark master WebUI. (A minimal standalone example is sketched below.)
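
As an illustration of step 2, here is a minimal sketch of starting a Spark standalone cluster and locating the <Spark Master URL>. The paths are placeholders, and the start-slave.sh invocation assumes a Spark release whose worker script takes the master URL as its argument; adjust to match your Spark version.

Code Block
languagebash
# Example path only; point SPARK_HOME at your actual Spark installation.
export SPARK_HOME=/usr/lib/spark

# Start a standalone master and one worker on this host.
$SPARK_HOME/sbin/start-master.sh
$SPARK_HOME/sbin/start-slave.sh spark://$(hostname):7077

# The <Spark Master URL> (spark://<host>:7077 by default) is also shown at the top
# of the Spark master WebUI, which listens on port 8080 by default.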

Configuring

...

YARN

Instead of the capacity scheduler, the fair scheduler is required. The fair scheduler distributes an equal share of resources among the jobs running in the YARN cluster.

yarn.resourcemanager.scheduler.class=org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler
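
For reference, this property is typically set in yarn-site.xml on the ResourceManager host; a minimal sketch:

Code Block
languagexml
<!-- yarn-site.xml: use the fair scheduler for the YARN ResourceManager. -->
<property>
  <name>yarn.resourcemanager.scheduler.class</name>
  <value>org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler</value>
</property>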

...

  1. There are several ways to add the Spark dependency to Hive:
    1. Set the property 'spark.home' to point to the Spark installation:

      Code Block
      hive> set spark.home=/location/to/sparkHome;
    2. Define the SPARK_HOME environment variable before starting Hive CLI/HiveServer2:

      Code Block
      languagebash
      export SPARK_HOME=/usr/lib/spark....
    3. Link the spark-assembly jar to HIVE_HOME/lib (for example, with a symlink as sketched below).
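
      A minimal sketch of the third option, assuming HIVE_HOME is set and using example paths (substitute the actual <spark-assembly-*.jar> location noted during installation):

      Code Block
      languagebash
      # Example paths only; adjust to your installation.
      ln -s /usr/lib/spark/lib/spark-assembly-*.jar $HIVE_HOME/lib/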

  2. Configure the Hive execution engine to use Spark:

    Code Block
    hive> set hive.execution.engine=spark;

    See the Spark section of Hive Configuration Properties for other properties that configure Hive and the Remote Spark Driver.

  3. Configure the Spark application settings for Hive. See http://spark.apache.org/docs/latest/configuration.html. This can be done either by adding a "spark-defaults.conf" file with these properties to the Hive classpath, or by setting them in the Hive configuration (hive-site.xml). For instance (a spark-defaults.conf sketch follows the property notes below):

    Code Block
    hive> set spark.master=<Spark Master URL>;
    hive> set spark.eventLog.enabled=true;
    hive> set spark.eventLog.dir=<Spark event log folder (must exist)>;
    hive> set spark.executor.memory=512m;
    hive> set spark.serializer=org.apache.spark.serializer.KryoSerializer;
    A little explanation for some of the configuration properties:

    • spark.executor.memory: Amount of memory to use per executor process.
    • spark.executor.cores: Number of cores per executor.
    • spark.yarn.executor.memoryOverhead: The amount of off-heap memory (in megabytes) to be allocated per executor when running Spark on YARN. This is memory that accounts for things like VM overheads, interned strings, and other native overheads. In addition to the executor's memory, the container in which the executor is launched needs some extra memory for system processes, and this is what this overhead is for.
    • spark.executor.instances: The number of executors assigned to each application.
    • spark.driver.memory: The amount of memory assigned to the Remote Spark Context (RSC). We recommend 4GB.
    • spark.yarn.driver.memoryOverhead: We recommend 400 (MB).
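
    Equivalently, the same properties can be placed in a "spark-defaults.conf" file on the Hive classpath, as mentioned above. A minimal sketch, with placeholder values to adapt to your cluster:

    Code Block
    spark.master                  <Spark Master URL>
    spark.eventLog.enabled        true
    spark.eventLog.dir            <Spark event log folder (must exist)>
    spark.executor.memory         512m
    spark.serializer              org.apache.spark.serializer.KryoSerializer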

...

  • More executor memory means that mapjoin optimization can be enabled for more queries.

  • On the other hand, more executor memory becomes unwieldy from a GC perspective.

  • Some experiments show that the HDFS client does not handle concurrent writers well, so it may face race conditions if there are too many executor cores.

The following settings need to be tuned for the cluster; they may also apply to the submission of Spark jobs outside of Hive on Spark:

Property                              Recommendation
spark.executor.cores                  Between 5-7. See the Tuning Details section.
spark.executor.memory                 yarn.nodemanager.resource.memory-mb * (spark.executor.cores / yarn.nodemanager.resource.cpu-vcores)
spark.yarn.executor.memoryOverhead    15-20% of spark.executor.memory
spark.executor.instances              Depends on spark.executor.memory + spark.yarn.executor.memoryOverhead; see the Tuning Details section.

Tuning Details

When running Spark on YARN mode, we generally recommend setting spark.executor.cores to 5, 6 or 7, depending on what the typical node is divisible by. For instance, if yarn.nodemanager.resource.cpu-vcores is 19, then 6 is a better choice (all executors can only have the same number of cores; if we chose 5, each node would still run only 3 executors and 4 cores would be wasted, and if we chose 7, only 2 executors would be used and 5 cores would be wasted). If it is 20, then 5 is a better choice (since this way you'll get 4 executors, and no core is wasted).
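
To make the arithmetic concrete, here is a minimal sketch using hypothetical node resources (20 vcores and 98304 MB of yarn.nodemanager.resource.memory-mb); the numbers are placeholders, not recommendations for your cluster.

Code Block
languagebash
# Hypothetical node resources (placeholders only).
NODE_VCORES=20
NODE_MEMORY_MB=98304            # yarn.nodemanager.resource.memory-mb

EXECUTOR_CORES=5                # 20 / 5 = 4 executors per node, no cores wasted

# spark.executor.memory per the table above:
# yarn.nodemanager.resource.memory-mb * (spark.executor.cores / yarn.nodemanager.resource.cpu-vcores)
EXECUTOR_MEMORY_MB=$(( NODE_MEMORY_MB * EXECUTOR_CORES / NODE_VCORES ))   # 24576 MB

# spark.yarn.executor.memoryOverhead: 15-20% of spark.executor.memory (~17.5% here).
MEMORY_OVERHEAD_MB=$(( EXECUTOR_MEMORY_MB * 175 / 1000 ))                 # 4300 MB

echo "spark.executor.cores=$EXECUTOR_CORES"
echo "spark.executor.memory=${EXECUTOR_MEMORY_MB}m"
echo "spark.yarn.executor.memoryOverhead=$MEMORY_OVERHEAD_MB"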

...

Recommended Configuration

...

See HIVE-9153 for details on these settings.

No Format

mapreduce.input.fileinputformat.split.maxsize=750000000
hive.vectorized.execution.enabled=true

hive.cbo.enable=true
hive.optimize.reducededuplication.min.reducer=4
hive.optimize.reducededuplication=true
hive.orc.splits.include.file.footer=false
hive.merge.mapfiles=true
hive.merge.sparkfiles=false
hive.merge.smallfiles.avgsize=16000000
hive.merge.size.per.task=256000000
hive.merge.orcfile.stripe.level=true
hive.auto.convert.join=true
hive.auto.convert.join.noconditionaltask=true
hive.auto.convert.join.noconditionaltask.size=894435328
hive.optimize.bucketmapjoin.sortedmerge=false
hive.map.aggr.hash.percentmemory=0.5
hive.map.aggr=true
hive.optimize.sort.dynamic.partition=false
hive.stats.autogather=true
hive.stats.fetch.column.stats=true
hive.vectorized.execution.reduce.enabled=false
hive.vectorized.groupby.checkinterval=4096
hive.vectorized.groupby.flush.percent=0.1
hive.compute.query.using.stats=true
hive.limit.pushdown.memory.usage=0.4
hive.optimize.index.filter=true
hive.exec.reducers.bytes.per.reducer=67108864
hive.smbjoin.cache.rows=10000
hive.exec.orc.default.stripe.size=67108864
hive.fetch.task.conversion=more
hive.fetch.task.conversion.threshold=1073741824
hive.fetch.task.aggr=false
mapreduce.input.fileinputformat.list-status.num-threads=5
spark.kryo.referenceTracking=false
spark.kryo.classesToRegister=org.apache.hadoop.hive.ql.io.HiveKey,org.apache.hadoop.io.BytesWritable,org.apache.hadoop.hive.ql.exec.vector.VectorizedRowBatch

See the Spark section of the configuration page for additional properties.

Design documents