This guide documents the best way to make various types of contribution to Apache Spark, including what is required before submitting a code change.

Contributing to Spark doesn't just mean writing code. Helping new users on the mailing list, testing releases, and improving documentation are also welcome. In fact, proposing significant code changes usually requires first gaining experience and credibility within the community by helping in other ways. This is also a guide to becoming an effective contributor.

Contributing by Helping Other Users

The Apache Spark team welcomes all types of contributions, whether they be bug reports, providing help to new users, documentation, or code patches.

Reporting, Answering, and Triaging Issues

The Spark community has two platforms for discussing user issues and requirements:

  • The user@spark.apache.org mailing list, where users ask questions and discuss problems with using Spark
  • The JIRA issue tracker at https://issues.apache.org/jira/browse/SPARK, where bugs and improvements are reported and tracked

A great way to contribute to Spark is to help answer user questions on the user@spark.apache.org mailing list. There are always many new Spark users; taking a few minutes to help answer a question is a very valuable community service! Helping to investigate, isolate, and reproduce bugs reported on the JIRA issue tracker is another great way to get more familiar with Spark components, and a good first step towards contributing code to those components.


Contributors should subscribe to this list and follow it in order to keep up to date on what's happening in Spark. Answering questions is an excellent and visible way to help the community, which also demonstrates your expertise.

Contributing by Testing Releases

Spark's release process is community-oriented, and members of the community can vote on new releases on the dev@spark.apache.org mailing list. Spark users are invited to subscribe to this list to receive announcements (e.g. the Spark 1.3.1 release vote), test their workloads on newer releases, and provide feedback on any performance or correctness issues found in the newer release. This type of testing is a valuable and greatly appreciated contribution.

Contributing by Reviewing Changes

Changes to Spark source code are proposed, reviewed, and committed via GitHub pull requests (described later); we prefer to receive contributions in this form. Start by opening an issue for your change on the Spark Project JIRA (and make sure to search first for an existing issue). For code reviews, we use the https://github.com/apache/spark repository.

Please follow the steps below to propose a contribution:

  1. Break your work into small, single-purpose patches if possible. It’s much harder to merge in a large change with a lot of disjoint features.
  2. Review the criteria for inclusion of patches.
  3. Create an issue for your patch on the Spark Project JIRA.
  4. If you are proposing a larger change, attach a design document to your JIRA first (example) and email the dev mailing list to discuss it.
  5. Submit the patch as a GitHub pull request. For a tutorial, see the GitHub guides on forking a repo and sending a pull request. Name your pull request with the JIRA number and include the Spark module or WIP if relevant. NOTE: if you do not reference a JIRA in the title, you may not be credited in our release notes, since our credits are generated by JIRA.

  6. Follow the Spark Code Style Guide. Before sending in your pull request, you can run ./dev/lint-scala and ./dev/lint-python to validate the style (see the example after this list).
  7. Make sure that your code passes the automated tests (see Automated Testing below).
  8. Add new tests for your code. We use ScalaTest for testing. Just add a new Suite in core/src/test, or methods to an existing Suite.
  9. Update the documentation (in the docs folder) if you add a new feature or configuration parameter.
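
A minimal sketch of the branch, style-check, and test steps from items 5 through 7 above (the branch name and JIRA number are illustrative only):

Code Block
# create a branch for your change (name is illustrative)
git checkout -b SPARK-123-add-some-feature
# validate code style before opening the pull request
./dev/lint-scala
./dev/lint-python
# run the automated tests (see Automated Testing below)
./dev/run-tests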

If you’d like to report a bug but don’t have time to fix it, you can still post it to our issue tracker, or email the mailing list.

Tip: Use descriptive names in your pull requests

SPARK-123: Add some feature to Spark

[STREAMING] SPARK-123: Add some feature to Spark streaming

[MLLIB] [WIP] SPARK-123: Some potentially useful feature for MLLib

 

Coding Style Guide and Interface Design

Please follow the Spark Code Style Guide for coding style.

Please also read this presentation about interface design.

Criteria for Inclusion or Rejection of Patches

When Spark committers consider a patch for merging, we take several factors into account. Certain types of patches will be reviewed and merged almost instantly: patches that address correctness issues in Spark, are small, and/or benefit a wide number of users are likely to get a lot of attention. Other patches might take more time to review. In a small number of cases, patches are rejected. Patches might be rejected for the following reasons:

  1. Correctness concerns: If a patch touches a lot of code and it is difficult to verify its correctness, it might be rejected.
  2. User space functionality: If a patch is adding features that could exist in a third-party package rather than Spark itself, we sometimes encourage users to publish utilities in their own library. This is especially true for large standalone modules.
  3. Too complex: Spark desires to have a maintainable and simple codebase. If features are very complex relative to their benefit, they may be rejected.
  4. Regressing behavior: If a patch regresses behavior that is implicitly or explicitly depended on by users, it might be rejected on this basis.
  5. Introducing new APIs: Patches that propose new public or experimental APIs must meet a high bar in Spark due to our API compatibility guidelines.
  6. Not applicable to enough users: Optimizations or features might be rejected on the basis of being too esoteric and not useful to a broad enough audience.
  7. Introduction of dependencies: Due to the complex nature of Spark, we are conservative about introducing new dependencies. If patches add new dependencies to Spark, they may not be merged.

Small patches are almost never rejected, so it's a good strategy for new contributors to start with small patches. Keep in mind that Spark committers act as volunteers - patches with major correctness issues might be rejected without significant review, since such review is very costly in terms of time. If this happens, consider finding smaller patches or simpler features to contribute, then building up your confidence and abilities over time.

Code Review Process

Changes are proposed and reviewed as pull requests at https://github.com/apache/spark/pulls. Anyone can view and comment on active changes here. Reviewing others' changes is a good way to learn how the change process works and gain exposure to activity in various parts of the code. You can help by reviewing the changes and asking questions or pointing out issues -- as simple as typos or small issues of style.

See also https://spark-prs.appspot.com for a convenient way to view and filter open PRs.

Contributing Documentation Changes

To have us add a link to an external tutorial you wrote, simply email the developer mailing list.
To modify the built-in documentation, edit the Markdown source files in Spark's docs directory, whose README file shows how to build the documentation locally to test your changes.

The process to propose a doc change is otherwise the same as the process for proposing code changes below. Note that changes to the site outside of docs must be handled manually by committers, since the rest of the http://spark.apache.org/ site is hosted at Apache and not versioned in Github. In these cases, a patch can be attached to a JIRA instead (also described below).

Contributing User Libraries to Spark

Just as Java and Scala applications can access a huge selection of libraries and utilities, none of which are part of Java or Scala themselves, Spark aims to support a rich ecosystem of libraries. Many new useful utilities or features belong outside of Spark rather than in the core. For example: language support probably has to be a part of core Spark, but useful machine learning algorithms can happily exist outside of MLlib.

To that end, large and independent new functionality is often rejected for inclusion in Spark itself, but can and should be hosted as a separate project and repository, and included in the http://spark-packages.org/ collection.

Contributing Bug Reports

Ideally, bug reports are accompanied by a proposed code change to fix the bug. This isn't always possible, as those who discover a bug may not have the experience to fix it. A bug may be reported by creating a JIRA but without creating a pull request (see below).

Bug reports are only useful, however, if they include enough information to understand, isolate, and ideally reproduce the bug. Simply encountering an error does not mean a bug should be reported; as noted below, search JIRA and search and inquire on the mailing lists first. Unreproducible bugs, or simple error reports, may be closed.

It is possible to propose new features as well. These are generally not helpful unless accompanied by detail, such as a design document and/or code change. Large new contributions should consider http://spark-packages.org first (see above), or be discussed on the mailing list first. Feature requests may be rejected, or closed after a long period of inactivity.

Preparing to Contribute Code Changes

Choosing What to Contribute

Spark is an exceptionally busy project, with a new JIRA or pull request every few hours on average. Review can take hours or days of committer time. Everyone benefits if contributors focus on changes that are useful, clear, easy to evaluate, and already pass basic checks.

Sometimes, a contributor will already have a particular new change or bug in mind. If seeking ideas, consult the list of starter tasks in JIRA, or ask the user@spark.apache.org mailing list.

Before proceeding, contributors should evaluate if the proposed change is likely to be relevant, new and actionable:

  • Is it clear that code must change? Proposing a JIRA and pull request is appropriate only when a clear problem or change has been identified. If you are simply having trouble using Spark, use the mailing lists first, rather than filing a JIRA or proposing a change. When in doubt, email user@spark.apache.org first about the possible change.

  • Search the user@spark.apache.org and dev@spark.apache.org mailing list archives for related discussions. Use http://search-hadoop.com/?q=&fc_project=Spark or similar search tools. Often, the problem has been discussed before, with a resolution that doesn't require a code change, or recording what kinds of changes will not be accepted as a resolution.

  • Search JIRA for existing issues: https://issues.apache.org/jira/browse/SPARK 
    Type "spark [search terms]" at the top right search box. If a logically similar issue already exists, then contribute to the discussion on the existing JIRA and pull request first, instead of creating a new one.

  • Is the scope of the change matched to the contributor's level of experience? Anyone is qualified to suggest a typo fix, but refactoring core scheduling logic requires much more understanding of Spark. Some changes require building up experience first (see above).

MLlib-specific Contribution Guidelines

Contributing New Algorithms to MLLib

While a rich set of algorithms is an important goal for MLLib, scaling the project requires that maintainability, consistency, and code quality come first.  New algorithms should:

  • Be widely known

  • Be used and accepted (academic citations and concrete use cases can help justify this)

  • Be highly scalable

  • Be well documented

  • Have APIs consistent with other algorithms in MLLib that accomplish the same thing

  • Come with a reasonable expectation of developer support.

Automated Testing

Spark comes with a fairly comprehensive suite of unit tests, functional tests, and integration tests. All pull requests are automatically tested on Jenkins, currently hosted by the Berkeley AMPLab.

To run the whole suite of tests (along with code style checks and binary compatibility checks), run ./dev/run-tests.
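
For example, from the repository root you can run the full suite, or run a single suite through sbt (the suite name below is illustrative):

Code Block
# full suite: unit tests plus style and binary compatibility checks
./dev/run-tests
# run a single suite through sbt (suite name is illustrative)
build/sbt "test-only org.apache.spark.rdd.SortingSuite"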

Starter Tasks

If you are new to Spark and want to contribute, you can browse through the list of starter tasks on our JIRA. These tasks are typically small and simple, and are excellent problems to get you ramped up.

Documentation

If you'd like to contribute documentation, there are two ways:

  • To have us add a link to an external tutorial you wrote, simply email the developer mailing list
  • To modify the built-in documentation, edit the Markdown source files in Spark's docs directory, and send a patch against the Spark GitHub repository. The README file in docs says how to build the documentation locally to test your changes (see the sketch below).
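
As a hedged sketch of building the docs locally (docs/README is the authoritative reference; the jekyll commands and the SKIP_API flag below are assumptions about the standard setup, not confirmed by this page):

Code Block
cd docs
# build the site without generating the API docs (assumed flag; see docs/README)
SKIP_API=1 jekyll build
# serve the site locally for preview (assumed; typically at http://localhost:4000)
jekyll serve --watch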

Development Discussions

To keep up to date with the latest discussions, join the developer mailing list.

IDE Setup

IntelliJ

While many of the Spark developers use SBT or Maven on the command line, the most common IDE we use is IntelliJ IDEA. You can get the community edition for free (Apache committers can get free IntelliJ Ultimate Edition licenses) and install the JetBrains Scala plugin from Preferences > Plugins.

To create a Spark project for IntelliJ:

  1. Download IntelliJ and install the Scala plug-in for IntelliJ.
  2. Go to "File -> Import Project", locate the Spark source directory, and select "Maven Project".
  3. In the Import wizard, it's fine to leave settings at their default. However, it is usually useful to enable "Import Maven projects automatically", since changes to the project structure will automatically update the IntelliJ project.
  4. As documented in Building Spark, some build configurations require specific profiles to be enabled. The same profiles that are enabled with -P[profile name] on the command line may be enabled on the Profiles screen in the Import wizard. For example, if developing for Hadoop 2.4 with YARN support, enable the yarn and hadoop-2.4 profiles. These selections can be changed later by accessing the "Maven Projects" tool window from the View menu and expanding the Profiles section. (A command-line equivalent is sketched after this list.)
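
For reference, the same profiles map to -P flags when building from the command line. A hedged example (build/mvn is assumed to be the Maven wrapper bundled with Spark, and the profile names depend on the Hadoop version you target):

Code Block
# command-line equivalent of enabling the yarn and hadoop-2.4 profiles
build/mvn -Pyarn -Phadoop-2.4 -DskipTests clean package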

Other tips:

  • "Rebuild Project" can fail the first time the project is compiled, because generate source files are not automatically generated. Try clicking the "Generate Sources and Update Folders For All Projects" button in the "Maven Projects" tool window to manually generate these sources.
  • Compilation may fail with an error like "scalac: bad option: -P:/home/jakub/.m2/repository/org/scalamacros/paradise_2.10.4/2.0.1/paradise_2.10.4-2.0.1.jar". If so, go to Preferences > Build, Execution, Deployment > Scala Compiler and clear the "Additional compiler options" field.  It will work then although the option will come back when the project reimports.  If you try to build any of the projects using quasiquotes (eg., sql) then you will need to make that jar a compiler plugin (just below "Additional compiler options").  Otherwise you will see errors like:

 

Code Block
/Users/irashid/github/spark/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/CodeGenerator.scala
Error:(147, 9) value q is not a member of StringContext
 Note: implicit class Evaluate2 is not applicable here because it comes after the application point and it lacks an explicit result type
        q"""
        ^

 

 

Eclipse

Eclipse can be used to develop and test Spark. The following configuration is known to work:

Scala IDE can be installed from Help | Eclipse Marketplace... by searching for "Scala IDE". Remember to include ScalaTest as a Scala IDE plugin. To install ScalaTest after installing Scala IDE, follow these steps:

  • Select Help | Install New Software
  • Select http://download.scala-ide.org... in the "Work with" combo box
  • Expand Scala IDE plugins, select ScalaTest for Scala IDE and install

SBT can create Eclipse .project and .classpath files. To create these files for each Spark sub project, use this command:

Code Block
sbt/sbt eclipse

To import a specific project, e.g. spark-core, select File | Import | Existing Projects into Workspace. Do not select "Copy projects into workspace". Importing all Spark sub projects at once is not recommended.

ScalaTest can execute unit tests by right clicking a source file and selecting Run As | Scala Test.

If Java memory errors occur, it might be necessary to increase the settings in eclipse.ini in the Eclipse install directory. Increase the following setting as needed:

Code Block
--launcher.XXMaxPermSize
256M
ScalaTest Issues

If the following error occurs when running ScalaTest

Code Block
An internal error occurred during: "Launching XYZSuite.scala".
java.lang.NullPointerException

It is due to an incorrect Scala library in the classpath. To fix it, right-click on the project and select Build Path | Configure Build Path:

  • Add Library | Scala Library
  • Remove scala-library-2.10.4.jar - lib_managed\jars

In the event of "Could not find resource path for Web UI: org/apache/spark/ui/static", it's due to a classpath issue (some classes were probably not compiled). To fix this, it is sufficient to run a test from the command line:

Code Block
build/sbt "test-only org.apache.spark.rdd.SortingSuite"
Python Tests

There are some dependencies to run Python tests locally:

The unit tests will try to run with Python 2.6 (the oldest supported version) if it is available. Python 2.6 needs the unittest2 package to run the tests, which can be installed with pip2.6.

NumPy 1.4+ is needed to run the MLlib Python tests, and should also be installed for Python 2.6.
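
An illustrative sketch of installing these dependencies (the exact package manager invocations depend on your platform and Python setup):

Code Block
# unittest2 is only needed when testing under Python 2.6
pip2.6 install unittest2
# NumPy 1.4+ is required for the MLlib Python tests
pip install numpy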

After that, you can run all of the Python unit tests with:

Code Block
python/run-tests
R Tests

To run the SparkR tests you will need to install the R package 'testthat' (run `install.packages('testthat')` from an R shell). You can run just the SparkR tests using the command shown below.
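
A hedged sketch of the two steps (the R/run-tests.sh script name is an assumption about the repository layout and is not confirmed by this page):

Code Block
# install the test dependency from the command line
R -e "install.packages('testthat', repos = 'http://cran.r-project.org')"
# run only the SparkR tests (script name is an assumption)
./R/run-tests.sh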

Code Review Criteria

Before considering how to contribute code, it's useful to understand how code is reviewed, and why changes may be rejected. Simply put, changes that have many or large positives, and few negative effects or risks, are much more likely to be merged, and merged quickly. Risky and less valuable changes are very unlikely to be merged, and may be rejected outright rather than receive iterations of review.

Positives

  • Fixes the root cause of a bug in existing functionality

  • Adds functionality or fixes a problem needed by a large number of users

  • Simple, targeted

  • Maintains or improves consistency across Python, Java, Scala

  • Easily tested; has tests

  • Reduces complexity and lines of code

  • Change has already been discussed and is known to committers

Negatives, Risks

  • Band-aids a symptom of a bug only

  • Introduces complex new functionality, especially an API that needs to be supported

  • Adds complexity that only helps a niche use case

  • Adds user-space functionality that does not need to be maintained in Spark, but could be hosted externally and indexed by http://spark-packages.org 

  • Changes a public API or semantics (rarely allowed)

  • Adds large dependencies

  • Changes versions of existing dependencies

  • Adds a large amount of code

  • Makes lots of modifications in one "big bang" change

Contributing Code Changes

Please review the preceding section before proposing a code change. This section documents how to do so.

Tip

When you contribute code, you affirm that the contribution is your original work and that you license the work to the project under the project's open source license. Whether or not you state this explicitly, by submitting any copyrighted material via pull request, email, or other means you agree to license the material under the project's open source license and warrant that you have the legal authority to do so.

JIRA

Generally, Spark uses JIRA to track logical issues, including bugs and improvements, and uses Github pull requests to manage the review and merge of specific code changes. That is, JIRAs are used to describe what should be fixed or changed, and high-level approaches, and pull requests describe how to implement that change in the project's source code. For example, major design decisions are discussed in JIRA.

  1. Find the existing Spark JIRA that the change pertains to.

    1. Do not create a new JIRA if creating a change to address an existing issue in JIRA; add to the existing discussion and work instead

    2. Look for existing pull requests that are linked from the JIRA, to understand if someone is already working on the JIRA

  2. If the change is new, then it usually needs a new JIRA. However, trivial changes, where what should change is virtually the same as how it should change, do not require a JIRA. Example: "Fix typos in Foo scaladoc"

  3. If required, create a new JIRA:

    1. Provide a descriptive Title. "Update web UI" or "Problem in scheduler" is not sufficient. "Kafka Streaming support fails to handle empty queue in YARN cluster mode" is good.

    2. Write a detailed Description. For bug reports, this should ideally include a short reproduction of the problem. For new features, it may include a design document.

    3. Set required fields:

      1. Issue Type. Generally, Bug, Improvement and New Feature are the only types used in Spark.

      2. Priority. Set to Major or below; higher priorities are generally reserved for committers to set. JIRA tends to unfortunately conflate "size" and "importance" in its Priority field values. Their meaning is roughly:

        1. Blocker: pointless to release without this change as the release would be unusable to a large minority of users

        2. Critical: a large minority of users are missing important functionality without this, and/or a workaround is difficult

        3. Major: a small minority of users are missing important functionality without this, and there is a workaround

        4. Minor: a niche use case is missing some support, but it does not affect usage or is easily worked around

        5. Trivial: a nice-to-have change but unlikely to be any problem in practice otherwise 

      3. Component

      4. Affects Version. For Bugs, assign at least one version that is known to exhibit the problem or need the change

    4. Do not set the following fields:

      1. Fix Version. This is assigned by committers only when resolved.

      2. Target Version. This is assigned by committers to indicate a PR has been accepted for possible fix by the target version.

    5. Do not include a patch file; pull requests are used to propose the actual change. (Changes to the Spark site, outside of docs/, must use a patch since these are not hosted in Github.)

  4. If the change is a large change, consider inviting discussion on the issue at dev@spark.apache.org first before proceeding to implement the change.

Pull Request

  1. Fork the Github repository at http://github.com/apache/spark if you haven't already

  2. Clone your fork, create a new branch, and push commits to the branch (see the sketch after this list).

  3. Consider whether documentation or tests need to be added or updated as part of the change, and add them as needed.

  4. Run all tests with ./dev/run-tests to verify that the code still compiles, passes tests, and passes style checks.
    If style checks fail, review the Spark Code Style Guide  

  5. Open a pull request against the master branch of apache/spark. (Only in special cases would the PR be opened against other branches.)

    1. The PR title should be of the form [SPARK-xxxx] [COMPONENT] Title, where SPARK-xxxx is the relevant JIRA number, COMPONENT is one of the PR categories shown at https://spark-prs.appspot.com/ and Title may be the JIRA's title or a more specific title describing the PR itself.

    2. If the pull request is still a work in progress, and so is not ready to be merged, but needs to be pushed to Github to facilitate review, then add [WIP] after the component.

    3. Consider identifying committers or other contributors who have worked on the code being changed. Find the file(s) in Github and click "Blame" to see a line-by-line annotation of who changed the code last. You can add @username in the PR description to ping them immediately.

    4. Please state that the contribution is your original work and that you license the work to the project under the project's open source license.

  6. The related JIRA, if any, will be marked as "In Progress" and your pull request will automatically be linked to it. There is no need to be the Assignee of the JIRA to work on it, though you are welcome to comment that you have begun work.

  7. The Jenkins automatic pull request builder will test your changes

    1. If it is your first contribution, Jenkins will wait for confirmation before building your code and post "Can one of the admins verify this patch?"

    2. A committer can authorize testing with a comment like "ok to test"

    3. A committer can automatically allow future pull requests from a contributor to be tested with a comment like "Jenkins, add to whitelist"

  8. After about 1.5 hours, Jenkins will post the results of the test to the pull request, along with a link to the full results on Jenkins.

  9. Watch for the results, and investigate and fix failures promptly

    1. Fixes can simply be pushed to the same branch from which you opened your pull request

    2. Jenkins will automatically re-test when new commits are pushed

    3. If the tests failed for reasons unrelated to the change (e.g. Jenkins outage), then a committer can request a re-test with "Jenkins, retest this please". Ask if you need a test restarted.
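
A hedged sketch of steps 1 through 5 above, using placeholder names (your-username, the branch name, and the JIRA number are illustrative only):

Code Block
# clone your fork and track the apache repository
git clone https://github.com/your-username/spark.git
cd spark
git remote add upstream https://github.com/apache/spark.git
# create a branch, make your changes, then compile and test them
git checkout -b SPARK-1234-fix-scheduler-npe
./dev/run-tests
# push the branch to your fork, then open a pull request against apache/spark master on GitHub
git push origin SPARK-1234-fix-scheduler-npe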

The Review Process

  • Other reviewers, including committers, may comment on the changes and suggest modifications. Changes can be added by simply pushing more commits to the same branch.

  • Lively, polite, rapid technical debate is encouraged from everyone in the community. The outcome may be a rejection of the entire change.

  • Reviewers can indicate that a change looks suitable for merging with a comment such as: "I think this patch looks good". Spark uses the LGTM convention for indicating the strongest level of technical sign-off on a patch: simply comment with the word "LGTM". It specifically means: "I've looked at this thoroughly and take as much ownership as if I wrote the patch myself". If you comment LGTM you will be expected to help with bugs or follow-up issues on the patch. Consistent, judicious use of LGTMs is a great way to gain credibility as a reviewer with the broader community.

  • Sometimes, other changes will be merged which conflict with your pull request's changes. The PR can't be merged until the conflict is resolved. This can be resolved with "git fetch origin" followed by "git merge origin/master" and resolving the conflicts by hand, then pushing the result to your branch (see the sketch after this list).

  • Try to be responsive to the discussion rather than letting days pass between replies.
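
A sketch of the conflict-resolution flow described above (the remote and branch names are assumptions that depend on how your clone is configured; the remote you fetch from must point at the apache/spark repository):

Code Block
# bring in the latest upstream changes
git fetch origin
git merge origin/master
# resolve the conflicts by hand, then stage, commit, and push to your pull request branch
git add path/to/conflicted/file
git commit
git push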

Closing Your Pull Request / JIRA

  • If a change is accepted, it will be merged and the pull request will automatically be closed, along with the associated JIRA if any

    • Note that in the rare case you are asked to open a pull request against a branch other than master, you will have to close the pull request manually

    • The JIRA will be Assigned to the primary contributor to the change as a way of giving credit. If the JIRA isn't closed and/or Assigned promptly, comment on the JIRA.

  • If your pull request is ultimately rejected, please close it promptly

    • ... because committers can't close PRs directly

    • Pull requests will be automatically closed by an automated process at Apache after about a week if a committer has made a comment like "mind closing this PR?" This means that the committer is specifically requesting that it be closed.

  • If a pull request has gotten little or no attention, consider improving the description or the change itself and ping likely reviewers again after a few days. Consider proposing a change that's easier to include, like a smaller and/or less invasive change.

  • If it has been reviewed but not taken up after weeks, even after soliciting review from the most relevant reviewers, or if it has met with neutral reactions, the outcome may be considered a "soft no". It is helpful to withdraw and close the PR in this case.

  • If a pull request is closed because it is deemed not the right approach to resolve a JIRA, then leave the JIRA open. However if the review makes it clear that the issue identified in the JIRA is not going to be resolved by any pull request (not a problem, won't fix) then also resolve the JIRA.
