Reducing Build Times

Spark's default build strategy is to assemble a jar that includes all of its dependencies. This can be cumbersome when doing iterative development. When developing locally, it is possible to create an assembly jar that includes all of Spark's dependencies once, and then rebuild only Spark itself when making changes.

Fast Local Builds
$ sbt/sbt clean assembly # Create a normal assembly
$ ./bin/spark-shell # Use spark with the normal assembly
$ export SPARK_PREPEND_CLASSES=true
$ ./bin/spark-shell # Now it's using compiled classes
# ... do some local development ... #
$ sbt/sbt compile
# ... do some local development ... #
$ sbt/sbt compile
$ unset SPARK_PREPEND_CLASSES
$ ./bin/spark-shell # Back to normal, using Spark classes from the assembly jar
 
# You can also use ~ to let sbt do incremental builds on file changes without running a new sbt session every time
$ sbt/sbt ~compile

Note: in some earlier versions of Spark, fast local builds used an sbt task called assemble-deps. SPARK-1843 removed assemble-deps and introduced the environment variable described above. For those older versions:

Fast Local Builds
$ sbt/sbt clean assemble-deps
$ sbt/sbt package
# ... do some local development ... #
$ sbt/sbt package
# ... do some local development ... #
$ sbt/sbt package
# ...
 
# You can also use ~ to let sbt do incremental builds on file changes without running a new sbt session every time
$ sbt/sbt ~package

Checking Out Pull Requests

Git provides a mechanism for fetching remote pull requests into your own local repository. This is useful when reviewing code or testing patches locally. If you haven't yet cloned the Spark Git repository, use the following command:

$ git clone https://github.com/apache/spark.git
$ cd spark

To enable this feature you'll need to configure the git remote repository to fetch pull request data. Do this by modifying the .git/config file inside your Spark directory. Note that the remote may not be named "origin" if you named it something else when cloning:

.git/config
[remote "origin"]
  url = git@github.com:apache/spark.git
  fetch = +refs/heads/*:refs/remotes/origin/*
  fetch = +refs/pull/*/head:refs/remotes/origin/pr/*   # Add this line

Once you've done this, you can fetch and check out remote pull requests:

# Fetch remote pull requests
$ git fetch origin
# Checkout a remote pull request
$ git checkout origin/pr/112
# Create a local branch from a remote pull request
$ git checkout origin/pr/112 -b new-branch
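
If you only need one pull request and don't want to fetch all of them, you can also fetch a single PR's ref directly into a local branch without editing .git/config. A minimal sketch, using PR #112 and the branch name pr-112 as examples:

# Fetch a single pull request into a local branch
$ git fetch origin pull/112/head:pr-112
$ git checkout pr-112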

Running Individual Tests

Often it is useful to run individual test suites in Maven or sbt.

# sbt
$ sbt/sbt "test-only org.apache.spark.io.CompressionCodecSuite"
$ sbt/sbt "test-only org.apache.spark.io.*"

# Maven
$ mvn test -DwildcardSuites=org.apache.spark.io.CompressionCodecSuite
$ mvn test -DwildcardSuites=org.apache.spark.io.*
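
To narrow things down further, you can pass arguments through to ScalaTest; for example, in sbt the -z flag runs only tests whose names contain a given substring. A sketch, where "compression" is just an example substring:

# sbt: run only tests in the suite whose names contain "compression"
$ sbt/sbt "test-only org.apache.spark.io.CompressionCodecSuite -- -z compression"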

Generating Dependency Graphs

$ # sbt
$ sbt/sbt dependency-tree

$ # Maven
$ mvn -DskipTests install
$ mvn dependency:tree
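
The full dependency tree can be very large; with Maven you can filter it to the artifacts you care about. A sketch, where the com.google.guava pattern is just an example:

$ # Maven, showing only paths that lead to a particular dependency
$ mvn dependency:tree -Dincludes=com.google.guava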

Running Build Targets For Individual Projects

$ # sbt
$ sbt/sbt assembly/assembly
$ # Maven
$ mvn package -DskipTests -pl assembly
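
If the project you're building depends on other Spark modules you've modified, Maven's -am (also-make) flag builds those required modules too; in sbt, any task can be scoped to a single project. A sketch, assuming the project name "core" matches the sbt project ids defined in the Spark build:

$ # Maven: build assembly plus the modules it depends on
$ mvn package -DskipTests -pl assembly -am
$ # sbt: compile only the core project
$ sbt/sbt core/compile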

Building Spark in IntelliJ IDEA

Many Spark developers use IntelliJ for day-to-day development and testing. Importing Spark into IntelliJ requires a few special steps due to the complexity of the Spark build.

  1. Download IntelliJ and install the Scala plug-in for IntelliJ.
  2. Go to "File -> Import Project", locate the spark source directory, and select "Maven Project".
  3. Click through to the profiles selection, and select the following profiles: yarn, scala-2.10, hadoop-2.4, hive-thriftserver, hive-0.13.1. Click through to create the project.
  4. At the top of the leftmost pane, make sure the "Project/Packages" selector is set to "Packages".
  5. Right click on any package and click “Open Module Settings” - you will be able to modify any of the modules here.
  6. A few of the modules need to be modified slightly from the default import.
    1. Add sources to the following modules (under the “Sources” tab, add a source on the right):
      1. spark-hive: add v0.13.1/src/main/scala
      2. spark-hive-thriftserver: add v0.13.1/src/main/scala
      3. spark-repl: add scala-2.10/src/main/scala and scala-2.10/src/test/scala
    2. For spark-yarn, click “Add content root” and navigate in the filesystem to the yarn/common directory of Spark.