This document details the steps required to cut a Spark release. This was last updated on 12/27/15 for the 1.6.0 release.

Table of Contents

Prerequisites

...

Git Push Access. You will need push access to https://git-wip-us.apache.org/repos/asf/spark.git. Additionally, make sure your git username and email are set on the machine you plan to run the release on.

Code Block
languagebash
$ git config --global user.name <your name>
$ git config --global user.email <your email>

Background

The release manager role in Spark means you are responsible for a few different things:

  1. Preparing for release candidates: (a) cutting a release branch (b) informing the community of timing (c) working with component leads to clean up JIRA (d) making code changes in that branch with necessary version updates.
  2. Running the voting process for a release: (a) creating release candidates using automated tooling (b) calling votes and triaging issues
  3. Finalizing and posting a release: (a) updating the Spark website (b) writing release notes (c) announcing the release 

Preparing Spark for Release

The main step towards preparing a release is to create a release branch. This is done via the standard git branching mechanism and should be announced to the community once the branch is created. It is also good to set up Jenkins jobs for the release branch once it is cut to ensure tests are passing (consult Josh Rosen and shane knapp for help with this).
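
For example, cutting and publishing the release branch amounts to something like the following (a sketch; the branch name is illustrative, substitute the one for your release):

Code Block
languagebash
# Cut the release branch from an up-to-date master
$ git checkout master && git pull
$ git checkout -b branch-1.1
# Publish the branch to the Apache repository
$ git push origin branch-1.1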

Next, ensure that all Spark versions are correct in the code base on the release branch (see this example commit). You should grep through the codebase to find all instances of the version string (a grep sketch follows the list below). Some known places to change are:

  • SparkContext. Search for VERSION (only for branch 1.x)
  • Maven build. Ensure that the version in all the pom.xml files is <SPARK-VERSION>-SNAPSHOT (e.g. 1.1.1-SNAPSHOT). This will be changed to <SPARK-VERSION> (e.g. 1.1.1) automatically by Maven when cutting the release. Note that there are a few exceptions that should just use <SPARK-VERSION>, namely yarn/alpha/pom.xml and extras/java8-tests/pom.xml. These modules are not published as artifacts.
  • Spark REPLs. Look for the Spark ASCII art in SparkILoopInit.scala for the Scala shell and in shell.py for the Python REPL.
  • Docs. Search for VERSION in docs/_config.yml
  • Spark EC2 scripts. Update default Spark version and mapping between Spark and Shark versions.
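
For instance, a quick way to locate hard-coded version strings is something like the following (a sketch; the version string and paths are illustrative):

Code Block
languagebash
# Find remaining occurrences of the development version string
$ grep -R -n --exclude-dir=.git "1.1.1-SNAPSHOT" .
# Also look for the bare version string in the docs and EC2 scripts
$ grep -R -n "1.1.1" docs ec2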

Finally, update CHANGES.txt with this script in the Spark repository. CHANGES.txt captures all the patches that have made it into this release candidate since the last release.

Code Block
languagebash
$ export SPARK_HOME=<your Spark home>
$ cd spark
# Update release versions
$ vim dev/create-release/generate-changelist.py
$ dev/create-release/generate-changelist.py

This produces a CHANGES.txt.new that should be a superset of the existing CHANGES.txt. Replace the old CHANGES.txt with the new one (see this example commit).
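
Concretely, the replacement step is something like the following (a sketch; review the diff before committing):

Code Block
languagebash
# Overwrite the old changelist with the generated one and commit the result
$ mv CHANGES.txt.new CHANGES.txt
$ git diff --stat CHANGES.txt
$ git commit -m "Update CHANGES.txt for <SPARK-VERSION>" CHANGES.txt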

Cutting a Release Candidate

If this is not the first RC, then make sure that the JIRA issues that have been solved since the last RC are marked as FIXED in this release version.

  • A possible protocol for this is to mark such issues as FIXED in the next maintenance release. E.g. if you are cutting the RC for 1.0.2, mark such issues as FIXED in 1.0.3.
  • When cutting a new RC, find all the issues that are marked as FIXED for the next maintenance release, and change them to the current release.
  • Verify from the git log whether they actually made it into the new RC or not (a sketch of this check follows the list).
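
A minimal sketch of that git log check, assuming the previous RC was tagged v1.0.2-rc1 and SPARK-XXXX is the issue in question:

Code Block
languagebash
# List the commits that are new since the previous RC and look for the JIRA number
$ git log --oneline v1.0.2-rc1..HEAD | grep "SPARK-XXXX"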

The process of cutting a release candidate has been automated via the Berkeley Jenkins instance. There are Jenkins jobs that can tag a release candidate and create various packages based on that candidate. The recommended process is to ask the previous release manager to walk you through the Jenkins jobs.

...

Create a GPG Key

You will need a GPG key to sign your artifacts (http://apache.org/dev/release-signing). If you are using the provided AMI, this is already installed. Otherwise, you can get it through sudo apt-get install gnupg on Ubuntu or from http://gpgtools.org on Mac OS X.

Code Block
languagebash
## CREATING A KEY
 
# Create new key. Make sure it uses RSA and 4096 bits
# Password is optional. DO NOT SET EXPIRATION DATE!
$ gpg --gen-key

# Confirm that key is successfully created
# If there is more than one key, be sure to set the default
# key through ~/.gnupg/gpg.conf
$ gpg --list-keys

## PUBLISHING THE KEY
# Generate public key to distribute to GPG network
# <KEY_ID> is the 8-character hex string next to "pub 4096R"
$ gpg --output <KEY_ID>.asc --export -a <KEY_ID>

# Copy generated key to Apache web space
# Eventually, key will show up on Apache people page
# (see https://people.apache.org/keys/committer/andrewor14.asc)
$ scp <KEY_ID>.asc <USER>@people.apache.org:~/
# Distribute the public key to a key server
$ gpg --send-key <KEY_ID>

 
# Log into http://id.apache.org and add your key fingerprint.
# To generate a key fingerprint:
$ gpg --fingerprint

# Add your key file to the Spark KEYS file
$ svn co https://dist.apache.org/repos/dist/release/spark && cd spark
$ (gpg --list-sigs <EMAIL> && gpg --armor --export <KEY_ID>) >> KEYS
$ svn commit -m "Adding key to Spark KEYS file"

(Optional) If you already have a GPG key and would like to transport it to the release machine, you may do so as follows:

Code Block
languagebash
# === On host machine ===
# Identify the KEY_ID of the selected key
$ gpg --list-keys

# Export the secret key and transfer it
$ gpg --output pubkey.gpg --export <KEY_ID>
$ gpg --output - --export-secret-key <KEY_ID> |
cat pubkey.gpg - | gpg --armor --output key.asc --symmetric --cipher-algo AES256
$ scp key.asc <USER>@<release machine hostname>:~/

# === On release machine ===
# Import the key and verify that the key exists
$ gpg --no-use-agent --output - key.asc | gpg --import
$ gpg --list-keys
$ rm key.asc

Set up Maven Password

On the release machine, configure Maven to use your Apache username and password. Your ~/.m2/settings.xml should contain the following:

Code Block
languagexml
<settings xmlns="http://maven.apache.org/SETTINGS/1.0.0"
  xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
  xsi:schemaLocation="http://maven.apache.org/SETTINGS/1.0.0 
         http://maven.apache.org/xsd/settings-1.0.0.xsd">
<servers>
  <server>
    <id>apache.snapshots.https</id>
    <username>YOUR USERNAME</username>
    <password>PASSWORD</password>
  </server>
  <server>
    <id>apache.releases.https</id>
    <username>YOUR USERNAME</username>
    <password>PASSWORD</password>
  </server>
</servers>
</settings>

Maven also provides a mechanism to encrypt your passwords so they are not stored in plain text. You will need to create an additional ~/.m2/settings-security.xml to store your master password (see http://maven.apache.org/guides/mini/guide-encryption.html). Note that in other steps you are still required to specify your password in plain text.
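
A minimal sketch of how this could be set up, assuming a reasonably recent Maven 3 (older versions take the password as a command-line argument instead of prompting for it):

Code Block
languagebash
# Generate an encrypted master password and place the output inside
# <settingsSecurity><master>...</master></settingsSecurity> in ~/.m2/settings-security.xml
$ mvn --encrypt-master-password

# Encrypt your Apache password and use the output as the <password>
# value in ~/.m2/settings.xml
$ mvn --encrypt-password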

Preparing Spark for Release

First, check if there are outstanding blockers for your target version on JIRA. If there are none, make sure the unit tests pass. Note that the Maven tests are highly dependent on the run environment. It’s a good idea to verify that they have been passing in Jenkins before spending hours trying to fix them yourself.

Code Block
languagebash
$ git clone https://git-wip-us.apache.org/repos/asf/spark.git -b branch-1.1
$ cd spark
$ sbt/sbt clean assembly test

# Ensure MAVEN_OPTS is set with at least 3G of JVM memory, e.g.
$ export MAVEN_OPTS="-Xmx3g -XX:MaxPermSize=512m"
$ mvn -DskipTests clean package
$ mvn test

Additionally, check for dead links in the documentation.

Code Block
languagebash
$ cd spark/docs
$ jekyll serve --watch   # leave this running; run linkchecker from a separate terminal
$ sudo apt-get install linkchecker
$ linkchecker -r 2 http://localhost:4000 --no-status --no-warnings

Cutting a Release Candidate

The process of cutting a release candidate has been automated via this script found in the Spark repository. First, run the following preliminary steps:

Code Block
languagebash
# This step is important to avoid confusion later
# when the script clones Spark with the generated tag
$ mv spark release-spark

# The distributions are packaged with Java 6 while
# the docs are built with Java 7 for nicer formatting
$ export JAVA_HOME=<Java 6 home>
$ export JAVA_7_HOME=<Java 7 home>

# Verify that the version on each tool is up-to-date
$ sbt --version # 0.13.5+
$ mvn --version # 3.0.4+
$ jekyll --version # 1.4.3+
$ git --version # 1.7+
$ $JAVA_HOME/bin/java -version # 1.6.x
$ $JAVA_7_HOME/bin/java -version # 1.7.x

It is highly recommended that you understand the contents of the script before proceeding. This script uses the Maven release plugin and can be broken down into four steps. In the likely event that one of the steps fails, you may restart from the step that failed instead of running the whole script again.

  1. Run mvn release:prepare. This updates all pom.xml versions and cuts a new tag (e.g. 1.1.1-rc1). If this step is successful, you will find the remote tag here. You will also find the following commit pushed in your name in the release branch: [maven-release-plugin] prepare release v1.1.1-rc1 (see this example commit).
  2. Run mvn release:perform. This builds Spark from the tag cut in the previous step using the spark/release.properties produced. If this step is successful, you will find the following commit pushed in your name in the release branch, but NOT in the release tag: [maven-release-plugin] prepare for the next development iteration (see this example commit). You will also find that the release.properties file is now removed.
  3. Package binary distributions. This runs the make-distribution.sh script for each distribution in parallel. If this step is successful, you will find the archive, signing key, and checksum information for each distribution in the directory in which the create-release.sh script is run. You should NOT find a sub-directory named after one of the distributions as these should be removed. In case of failure, use the binary-release-*.log files generated to determine the cause. In the re-run, you may skip the previous steps and re-make only the distributions that failed by commenting out part of the script.
  4. Compile documentation. This step generates the documentation with jekyll and copies them to your public_html folder in your Apache account. If this step is successful, you should be able to browse the docs under http://people.apache.org/~<USER> (see this example link).

Finally, run the script after filling in the variables at the top of the script. The information here is highly sensitive, so BE CAREFUL TO NOT ACCIDENTALLY CHECK THESE CHANGES IN! The GPG passphrase is the one you used when generating the key.

Code Block
languagebash
$ cd .. # just so we don’t clone Spark in Spark
$ vim release-spark/dev/create-release/create-release.sh
$ release-spark/dev/create-release/create-release.sh

On a c3.4xlarge machine in us-west-2, this process is expected to take 2 - 4 hours. After the script has completed, you must find the open staging repository in Apache Nexus to which the artifacts were uploaded, and close the staging repository. Wait a few minutes for the closing to succeed. Now all staged artifacts are public!
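
Before moving on, it is worth spot-checking one of the generated packages by hand (a sketch; the file names are illustrative):

Code Block
languagebash
# Verify the GPG signature of one of the binary distributions
$ gpg --verify spark-1.1.1-bin-hadoop2.4.tgz.asc spark-1.1.1-bin-hadoop2.4.tgz
# Compute a checksum locally and compare it against the published checksum file
$ md5sum spark-1.1.1-bin-hadoop2.4.tgz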

(Optional) In the event that you need to roll back the entire process and start again, you will need to run the following steps. This is necessary if, for instance, you used a faulty GPG key, new blockers arise, or the vote failed.

Code Block
languagebash
$ git tag -d <the new tag> # e.g. v1.1.1-rc1
$ git push origin :<the new tag>
$ git revert <perform release commit hash> # see this commit
$ git revert <prepare release commit hash> # see this commit
$ git push origin <release branch> # e.g. branch-1.1

Audit the Release Candidate

The process of auditing the release has been automated via this script found in the Spark repository. First, find the staging repository in Apache Nexus to which the artifacts were uploaded (see this example repository). Configure the script by filling in the required variables at the top. This must be run from the directory that hosts the script.

Code Block
languagebash
# The script must be run from the audit-release directory
$ cd release-spark/dev/audit-release
$ vim audit-release.py
$ ./audit-release.py

The release auditor will test example builds against the staged artifacts, verify signatures, and check for common mistakes made when cutting a release. This is expected to finish in less than an hour.

Note that it is entirely possible for the dependency requirements of the applications to be outdated. It is reasonable to continue with the current release candidate if small changes to the applications (such as adding a repository) are sufficient to fix the test failures (see this example commit for changes in build.sbt files). Also, there is a known issue with the "Maven application" test in which the build fails but the test actually succeeds. This has been failing since 1.1.0.

Call a Vote on the Release Candidate

...