...

Follow the instructions to install Spark: https://spark.apache.org/docs/latest/spark-standalone.html (or http://spark.apache.org/docs/latest/running-on-yarn.html, if you are running Spark on YARN).  In particular:

  1. Install Spark (either download a pre-built Spark release, or build the assembly from source).
    • Install/build a compatible version.  Hive root pom.xml's <spark.version> defines what version of Spark it was built/tested with.
    • Install/build a compatible distribution.  Each version of Spark has several distributions, corresponding to different versions of Hadoop.
    • Once Spark is installed, find and keep note of the <spark-assembly-*.jar> location.
    • Note that you must have a version of Spark that does not include the Hive jars, meaning one that was not built with the Hive profile. To produce a Spark build without the Hive jars, simply run the following command under your Spark repository:

      Code Block
      languagebash
      ./make-distribution.sh --name "hadoop2-without-hive" --tgz "-Pyarn,hadoop-provided,hadoop-2.4"
  2. Start Spark cluster (both standalone and Spark on YARN are supported).
    • Keep note of the <Spark Master URL>.  This can be found in the Spark master WebUI.
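
      For example, in standalone mode the master and workers can be started with the scripts shipped with Spark (hostnames and ports below are the standalone defaults):

      Code Block
      languagebash
      # start the standalone master and the workers listed in conf/slaves
      $SPARK_HOME/sbin/start-master.sh
      $SPARK_HOME/sbin/start-slaves.sh
      # the Master URL has the form spark://<master-host>:7077 and is shown in the master WebUI (http://<master-host>:8080)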

...

  1. As Hive on Spark is still in development, currently only a Hive assembly built from the Hive/Spark development branch supports Spark execution.  The development branch is located here: https://github.com/apache/hive/tree/spark.  Check out the branch and build the Hive assembly as described in https://cwiki.apache.org/confluence/display/Hive/HiveDeveloperFAQ.
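
    For example, one way to check out and build the branch (the Maven profiles here follow the Hive Developer FAQ linked above and are an assumption; adjust them to your environment):

    Code Block
    languagebash
    git clone https://github.com/apache/hive.git
    cd hive
    git checkout spark
    # build a Hive distribution against Hadoop 2 without running tests
    mvn clean package -DskipTests -Phadoop-2,dist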
  2. If you download Spark, make sure you use a 1.2.x assembly: http://ec2-50-18-79-139.us-west-1.compute.amazonaws.com/data/spark-assembly-1.2.0-SNAPSHOT-hadoop2.3.0-cdh5.1.2.jar
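
    For example, the assembly referenced above can be downloaded directly (assuming wget is available):

    Code Block
    languagebash
    wget http://ec2-50-18-79-139.us-west-1.compute.amazonaws.com/data/spark-assembly-1.2.0-SNAPSHOT-hadoop2.3.0-cdh5.1.2.jar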
  3. There are several ways to add the Spark dependency to Hive:

    1. Set the property 'spark.home' to point to the Spark installation:

      Code Block
      hive> set spark.home=/location/to/sparkHome;
    2. Define the SPARK_HOME environment variable before starting Hive CLI/HiveServer2:

      Code Block
      languagebash
      export SPARK_HOME=/usr/lib/spark....
    3. Set the spark-assembly jar on the Hive auxpath:

      Code Block
      languagebash
      hive --auxpath /location/to/spark-assembly-*.jar
    4. Add the spark-assembly jar for the current user session:

      Code Block
      hive> add jar /location/to/spark-assembly-*.jar;
    5. Link the spark-assembly jar to HIVE_HOME/lib.
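
      For example, assuming the same placeholder location used above, the assembly jar can be symlinked into Hive's lib directory:

      Code Block
      languagebash
      # replace the placeholder with the actual path to the assembly jar
      ln -s /location/to/spark-assembly-*.jar $HIVE_HOME/lib/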

    Please note that options 3 and 4 are not recommended because they cause Spark to ship the spark-assembly jar to each executor when you run queries.

  4. Configure the Hive execution engine to use Spark:

    Code Block
    hive> set hive.execution.engine=spark;
  5. Configure Spark-application configs for Hive.  See: http://spark.apache.org/docs/latest/configuration.html.  This can be done either by adding a file "spark-defaults.conf" with these properties to the Hive classpath, or by setting them in the Hive configuration:

    Code Block
    hive> set spark.master=<Spark Master URL>;
    hive> set spark.eventLog.enabled=true;
    hive> set spark.executor.memory=512m;
    hive> set spark.serializer=org.apache.spark.serializer.KryoSerializer;
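
    Equivalently, these properties can be placed in a spark-defaults.conf file on the Hive classpath (the values below simply mirror the example above; the Master URL is a placeholder):

    Code Block
    spark.master=<Spark Master URL>
    spark.eventLog.enabled=true
    spark.executor.memory=512m
    spark.serializer=org.apache.spark.serializer.KryoSerializer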

    A little explanation for some of the configuration properties:

    • spark.executor.memory: Amount of memory to use per executor process.
    • spark.executor.cores: Number of cores per executor.
    • spark.yarn.executor.memoryOverhead: The amount of off-heap memory (in megabytes) to be allocated per executor, when running Spark on YARN. This is memory that accounts for things like VM overheads, interned strings, and other native overheads.  In addition to the executor's memory, the container in which the executor is launched needs some extra memory for system processes, and this is what this overhead is for.

    • spark.executor.instances: The number of executors assigned to each application.

    Setting executor memory size is more complicated than simply setting it to be as large as possible. There are several things that need to be taken into consideration:
    • More executor memory means mapjoin optimization can be enabled for more queries.

    • On the other hand, more executor memory becomes unwieldy from a GC perspective.

    • Some experiments show that the HDFS client doesn't handle concurrent writers well, so it may face race conditions if there are too many executor cores.

    When running Spark in YARN mode, we generally recommend setting spark.executor.cores to 5, 6 or 7, depending on what the typical node is divisible by. For instance, if yarn.nodemanager.resource.cpu-vcores is 19, then 6 is a better choice (all executors can only have the same number of cores; here if we chose 5, each node could run only 3 executors and 4 cores would be wasted, and if we chose 7, only 2 executors would fit and 5 cores would be wasted). If it's 20, then 5 is a better choice (since this way you'll get 4 executors, and no core is wasted).
     

    For spark.executor.memory, we recommend calculating yarn.nodemanager.resource.memory-mb * (spark.executor.cores / yarn.nodemanager.resource.cpu-vcores) and then splitting the result between spark.executor.memory and spark.yarn.executor.memoryOverhead. Usually it's recommended to set spark.yarn.executor.memoryOverhead to around 15-20% of that total memory.

    After you've decided how much memory each executor receives, you need to decide how many executors will be allocated to queries. In the GA release, Spark dynamic executor allocation will be supported; however, for this beta only static resource allocation can be used. Based on the physical memory in each node and the configuration of spark.executor.memory and spark.yarn.executor.memoryOverhead, you will need to choose the number of instances and set spark.executor.instances.

     
    Now for a real-world example. Assume 10 nodes with 64GB of memory per node and 12 virtual cores, i.e. yarn.nodemanager.resource.cpu-vcores=12. One node will be used as the master, so the cluster will have 9 slave nodes. We'll configure spark.executor.cores to 6. Given 64GB of RAM, yarn.nodemanager.resource.memory-mb will be 50GB. We'll determine the amount of memory for each executor as follows: 50GB * (6/12) = 25GB. We'll assign 20% to spark.yarn.executor.memoryOverhead, or 5120 (MB), and 80% to spark.executor.memory, or 20g.

    On this 9-node cluster we'll have two executors per host. As such, you can configure spark.executor.instances somewhere between 2 and 18; a value of 18 would utilize the entire cluster.
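
    Putting the example together, the resulting settings would look like this (the values are taken directly from the sizing above; using 18 instances assumes you want queries to be able to use the entire cluster):

    Code Block
    hive> set spark.executor.cores=6;
    hive> set spark.executor.memory=20g;
    hive> set spark.yarn.executor.memoryOverhead=5120;
    hive> set spark.executor.instances=18;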

     

Common Issues (Green are resolved, will be removed from this list)

...

No Format
# see HIVE-9153
mapreduce.input.fileinputformat.split.maxsize=750000000
hive.vectorized.execution.enabled=true

hive.cbo.enable=true
hive.optimize.reducededuplication.min.reducer=4
hive.optimize.reducededuplication=true
hive.orc.splits.include.file.footer=false
hive.merge.mapfiles=true
hive.merge.sparkfiles=false
hive.merge.smallfiles.avgsize=16000000
hive.merge.size.per.task=256000000
hive.merge.orcfile.stripe.level=true
hive.auto.convert.join=true
hive.auto.convert.join.noconditionaltask=true
hive.auto.convert.join.noconditionaltask.size=894435328
hive.optimize.bucketmapjoin.sortedmerge=false
hive.map.aggr.hash.percentmemory=0.5
hive.map.aggr=true
hive.optimize.sort.dynamic.partition=false
hive.stats.autogather=true
hive.stats.fetch.column.stats=true
hive.vectorized.execution.reduce.enabled=false
hive.vectorized.groupby.checkinterval=4096
hive.vectorized.groupby.flush.percent=0.1
hive.compute.query.using.stats=true
hive.limit.pushdown.memory.usage=0.4
hive.optimize.index.filter=true
hive.exec.reducers.bytes.per.reducer=67108864
hive.smbjoin.cache.rows=10000
hive.exec.orc.default.stripe.size=67108864
hive.fetch.task.conversion=more
hive.fetch.task.conversion.threshold=1073741824
hive.fetch.task.aggr=false
mapreduce.input.fileinputformat.list-status.num-threads=5
spark.serializer=org.apache.spark.serializer.KryoSerializer

...

Design documents