The Apache Spark team welcomes all types of contributions, whether they be bug reports, documentation, or new patches.
Reporting Issues
If you'd like to report a bug in Spark or ask for a new feature, open an issue on the Apache Spark JIRA. For general usage help, you should email the user mailing list.
Contributing Code
We prefer to receive contributions in the form of GitHub pull requests. Start by opening an issue for your change on the Spark Project JIRA (first search to see whether an issue already exists). For code reviews, we use the github.com/apache/spark repository.
Please follow the steps below to propose a contribution:
- Break your work into small, single-purpose patches if possible. It’s much harder to merge in a large change with a lot of disjoint features.
- Review the criteria for inclusion of patches.
- Create an issue for your patch on the Spark Project JIRA.
- If you are proposing a larger change, attach a design document to your JIRA first (example) and email the dev mailing list to discuss it.
- Submit the patch as a GitHub pull request. For a tutorial, see the GitHub guides on forking a repo and sending a pull request. Name your pull request after the JIRA issue and include the Spark module or [WIP] if relevant. NOTE: If you do not reference a JIRA in the title, you may not be credited in our release notes, since our credits are generated from JIRA.
- Follow the Spark Code Style Guide. Before sending in your pull request, you can run ./dev/lint-scala and ./dev/lint-python to validate the style.
- Make sure that your code passes the automated tests (see Automated Testing below).
- Add new tests for your code. We use ScalaTest for testing. Just add a new Suite in core/src/test, or methods to an existing Suite (see the example suite below).
- Update the documentation (in the docs folder) if you add a new feature or configuration parameter.
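As an illustration only, a minimal new suite added under core/src/test might look like the sketch below; the package placement and the suite, test, and assertion contents are hypothetical and should be adapted to whatever you are actually testing:

package org.apache.spark

import org.scalatest.FunSuite

// Hypothetical example of a new ScalaTest suite: one focused Suite class
// with small, descriptive test cases.
class MyFeatureSuite extends FunSuite {

  test("my feature computes the expected sum") {
    // Replace with assertions that exercise your new code path.
    assert(Seq(1, 2, 3).sum === 6)
  }
}

Existing suites in core/src/test are the best reference for naming conventions and shared test utilities.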
If you’d like to report a bug but don’t have time to fix it, you can still post it to our issue tracker, or email the mailing list.
Tip: Use descriptive names in your pull requests
SPARK-123: Add some feature to Spark
[STREAMING] SPARK-123: Add some feature to Spark streaming
[MLLIB] [WIP] SPARK-123: Some potentially useful feature for MLLib
Criteria for Inclusion or Rejection of Patches
When Spark committers consider a patch for merging, we take several factors into account. Certain types of patches will be reviewed and merged almost immediately: patches that address correctness issues in Spark, are small, and/or benefit a large number of users are likely to get a lot of attention. Other patches might take more time to review. In a small number of cases, patches are rejected. Patches might be rejected for the following reasons:
- Correctness concerns: If a patch touches a lot of code and it is difficult to verify its correctness, it might be rejected.
- User space functionality: If a patch is adding features that could exist in a third-party package rather than Spark itself, we sometimes encourage users to publish utilities in their own library. This is especially true for large standalone modules.
- Too complex: Spark aims to keep a simple, maintainable codebase. If features are very complex relative to their benefit, they may be rejected.
- Regressing behavior: If a patch regresses behavior that is implicitly or explicitly depended on by users, it might be rejected on this basis.
- Introducing new APIs: Patches that propose new public or experimental APIs must meet a high bar in Spark due to our API compatibility guidelines.
- Not applicable to enough users: Optimizations or features might be rejected on the basis of being too esoteric and not useful to a broad enough audience.
- Introduction of dependencies: Due to the complex nature of Spark, we are conservative about introducing new dependencies. If patches add new dependencies to Spark, they may not be merged.
Small patches are almost never rejected, so it's a good strategy for new contributors to start with small patches. Keep in mind that Spark committers act as volunteers; patches with major correctness issues might be rejected without significant review, since such review is very costly in terms of time. If this happens, consider contributing smaller patches or simpler features first, building up your confidence and abilities over time.
Code Review Process
Community code review is Spark's fundamental quality assurance process. When reviewing a patch, your goal should be to help streamline the committing process by giving committers confidence that the patch has been verified by an additional party. You are encouraged to (politely) give the author technical feedback to identify areas for improvement or potential bugs.
If you feel a patch is ready for inclusion in Spark, indicate this to committers with a comment such as: "I think this patch looks good". Spark uses the LGTM convention for indicating the strongest level of technical sign-off on a patch: simply comment with the word "LGTM". An LGTM is a strong statement with specific semantics. It should be interpreted as follows: "I've looked at this thoroughly and take as much ownership as if I wrote the patch myself". If you comment LGTM, you will be expected to help with bugs or follow-up issues on the patch. Judicious use of LGTMs is a great way to gain credibility as a reviewer with the broader community.
Reviewers are also welcome to argue against the inclusion of a feature or patch; simply indicate this in the comments.
Contributing New Algorithms to MLlib
While a rich set of algorithms is an important goal for MLlib, scaling the project requires that maintainability, consistency, and code quality come first. New algorithms should:
- Be widely known
- Be used and accepted (academic citations and concrete use cases can help justify this)
- Be highly scalable
- Be well documented
- Have APIs consistent with other algorithms in MLlib that accomplish the same thing (see the sketch after this list)
- Come with a reasonable expectation of developer support.
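As a hedged sketch only (MyAlgorithm and MyModel below are hypothetical names, not existing classes), "consistent APIs" generally means following the pattern used by existing MLlib algorithms: an object exposing a train method over an RDD of LabeledPoint that returns a model with a predict method.

import org.apache.spark.rdd.RDD
import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.mllib.regression.LabeledPoint

// Hypothetical model type mirroring the predict-style interface of
// existing MLlib models.
class MyModel(val weights: Vector) extends Serializable {
  def predict(features: Vector): Double = ??? // real prediction logic goes here
}

// Hypothetical entry point following the train(...) convention of
// existing MLlib algorithms.
object MyAlgorithm {
  def train(data: RDD[LabeledPoint], numIterations: Int): MyModel = ???
}

Matching the existing conventions for input types, parameter names, and returned models makes a new algorithm much easier to review and maintain.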
Automated Testing
Spark comes with a fairly comprehensive suite of unit, functional, and integration tests. All pull requests are automatically tested on Jenkins, currently hosted by the Berkeley AMPLab.
To run the whole suite of tests (along with the code style and binary compatibility checks), run ./dev/run-tests.
Starter Tasks
If you are new to Spark and want to contribute, you can browse through the list of starter tasks on our JIRA. These tasks are typically small and simple, and are excellent problems to get you ramped up.
Documentation
If you'd like to contribute documentation, there are two ways:
- To have us add a link to an external tutorial you wrote, simply email the developer mailing list.
- To modify the built-in documentation, edit the Markdown source files in Spark's docs directory and send a patch against the Spark GitHub repository. The README file in docs explains how to build the documentation locally to test your changes (see the example below).
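For example, assuming the Jekyll-based setup described in that README (check the README for the current, authoritative commands), building the HTML locally is roughly:
cd docs
SKIP_API=1 jekyll build
Setting SKIP_API=1 skips the slower API documentation build while you iterate on the Markdown.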
Development Discussions
To keep up to date with the latest discussions, join the developer mailing list.
IDE Setup
IntelliJ
While many of the Spark developers use SBT or Maven on the command line, the most common IDE we use is IntelliJ IDEA. You can get the community edition for free (Apache committers can get free IntelliJ Ultimate Edition licenses) and install the JetBrains Scala plugin from Preferences > Plugins.
To create a Spark project for IntelliJ:
- Download IntelliJ and install the Scala plug-in for IntelliJ.
- Go to "File -> Import Project", locate the spark source directory, and select "Maven Project".
- In the Import wizard, it's fine to leave settings at their defaults. However, it is usually useful to enable "Import Maven projects automatically", since changes to the project structure will automatically update the IntelliJ project.
- As documented in Building Spark, some build configurations require specific profiles to be enabled. The same profiles that are enabled with -P[profile name] when building on the command line may be enabled on the Profiles screen in the Import wizard. For example, if developing for Hadoop 2.4 with YARN support, enable the yarn and hadoop-2.4 profiles (the equivalent command-line build is shown below). These selections can be changed later by accessing the "Maven Projects" tool window from the View menu and expanding the Profiles section.
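For reference only (the profiles shown are just the yarn and hadoop-2.4 examples from above; consult Building Spark for the profiles and properties that apply to your environment), the equivalent command-line Maven build looks something like:
mvn -Pyarn -Phadoop-2.4 -DskipTests clean package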
Other tips:
- "Rebuild Project" can fail the first time the project is compiled, because generate source files are not automatically generated. Try clicking the "Generate Sources and Update Folders For All Projects" button in the "Maven Projects" tool window to manually generate these sources.
- Compilation may fail with an error like "scalac: bad option: -P:/home/jakub/.m2/repository/org/scalamacros/paradise_2.10.4/2.0.1/paradise_2.10.4-2.0.1.jar". If so, go to Preferences > Build, Execution, Deployment > Scala Compiler and clear the "Additional compiler options" field. It will work then, although the option will come back when the project reimports. If you try to build any of the projects that use quasiquotes (e.g., sql), then you will need to make that jar a compiler plugin (just below "Additional compiler options"). Otherwise you will see errors like:
/Users/irashid/github/spark/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/CodeGenerator.scala
Error:(147, 9) value q is not a member of StringContext
Note: implicit class Evaluate2 is not applicable here because it comes after the application point and it lacks an explicit result type
      q"""
      ^
Eclipse
Eclipse can be used to develop and test Spark. The following configuration is known to work:
- Eclipse Juno
- Scala IDE v 3.0.3
- ScalaTest
Scala IDE can be installed via Help | Eclipse Marketplace... by searching for "Scala IDE". Remember to include ScalaTest as a Scala IDE plugin. To install ScalaTest after installing Scala IDE, follow these steps:
- Select Help | Install New Software
- Select http://download.scala-ide.org... in the "Work with" combo box
- Expand Scala IDE plugins, select ScalaTest for Scala IDE, and install it
SBT can create Eclipse .project and .classpath files. To create these files for each Spark sub-project, use this command:
sbt/sbt eclipse
To import a specific project, e.g. spark-core, select File | Import | Existing Projects into Workspace. Do not select "Copy projects into workspace". Importing all Spark sub-projects at once is not recommended.
ScalaTest can execute unit tests by right-clicking a source file and selecting Run As | Scala Test.
If Java memory errors occur, it might be necessary to increase the settings in eclipse.ini in the Eclipse install directory. Increase the following setting as needed:
--launcher.XXMaxPermSize 256M
ScalaTest Issues
If the following error occurs when running ScalaTest:
An internal error occurred during: "Launching XYZSuite.scala". java.lang.NullPointerException
it is due to an incorrect Scala library in the classpath. To fix it, right-click the project, select Build Path | Configure Build Path, and then:
- Add Library | Scala Library
- Remove scala-library-2.10.4.jar - lib_managed\jars
If you see the error "Could not find resource path for Web UI: org/apache/spark/ui/static", it is due to a classpath issue (some classes were probably not compiled). To fix this, it is sufficient to run a test from the command line:
sbt/sbt "test-only org.apache.spark.rdd.SortingSuite"