Build Lead (status: Work In Progress Jan 2022)

See the dev mailing list thread here.

The Build Lead role is inspired by the "Build Baron" role used at MongoDB (see whitepaper section 3.2 here). Their role began as a triage role for performance regressions and change point analysis. Ours starts from triaging test failures and database correctness issues, and may expand to cover performance regression and change point triage in the future.

Rotation

The Build Lead role is a volunteer role with weekly rotations.

Date Range	Name	Email	#cassandra-dev Slack handle

Tools

Butler: dashboard of historical test failures, with per-test build history, failure details, and links to JIRA tickets (see trunk here)

OpenTestFailures kanban board: shows all JIRA tickets labeled as test failures

ASF Jenkins C* CI: the CI system whose results Butler pulls and displays

CircleCI: optional, paid testing infrastructure (paying buys parallelism; see .circleci/generate.sh for details on profiles and usage)

Workflow

Weekly:

Enter: handoff call w/previous build lead
Exit: handoff to next build lead
Coordinate with release manager if any releases are happening that week

Daily:

Check Butler for new test failures that don't yet have a JIRA ticket (i.e. Butler test failures without a JIRA link)
Create JIRA tickets for new failures and link them to the failure entries in Butler
Assign each test failure JIRA ticket to whoever introduced a new failing test or, if it is clear, broke an existing stable test
Ask in the #cassandra-dev Slack channel for volunteers to take any new test failures that we can't easily attribute to a specific change
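The first daily check amounts to filtering Butler's failure list down to entries with no JIRA link. Butler's report feature doesn't work with the Apache JIRA project, so the sketch below assumes you have copied the failing rows out by hand; the FailureEntry record, its fields, and the placeholder ticket key are hypothetical and do not reflect Butler's actual data model.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class FailureEntry:
    """Hypothetical record for one failing-test row copied out of Butler."""
    branch: str
    test_name: str
    jira_key: Optional[str] = None  # hypothetical field; None or "" means no JIRA link yet


def needs_ticket(entries: List[FailureEntry]) -> List[FailureEntry]:
    """Return failures with no JIRA link, sorted by branch then test name."""
    unlinked = [e for e in entries if not e.jira_key]
    return sorted(unlinked, key=lambda e: (e.branch, e.test_name))


if __name__ == "__main__":
    rows = [
        FailureEntry("trunk", "TestCQLNodes2RF1_Upgrade_current_4_0_x_To_indev_trunk"),
        FailureEntry("trunk", "HypotheticalLinkedTest", "CASSANDRA-00000"),  # placeholder key
    ]
    for entry in needs_ticket(rows):
        print(f"{entry.branch}: {entry.test_name} needs a JIRA ticket")
```

Each entry that gets printed is a candidate for a new ticket. Create the ticket, then link it to that failure in Butler.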

Details

Using Butler:

Butler can currently only show the latest test results and link failures to existing JIRA tickets. As of 15 Dec 2021, the "Report selected failures" feature does not work with the Apache JIRA project. The recommended workflow for the Build Lead is as follows:

Check for new failures on each branch's details page, using the "detailed history" link in the bottom right.
Look for failing tests without a JIRA link (for example, "TestCQLNodes2RF1_Upgrade_current_4_0_x_To_indev_trunk" at the time of writing).
For failing tests without a linked ticket, the workflow depends on where the commit landed and on the type of failure:
1. Single commit on trunk:
  1. If intermittent, create a new JIRA ticket for the failure with "intermittent failure" in the summary, and link it in Butler
  2. If consistent, git revert the SHA that introduced the failure, re-open the original JIRA ticket, and leave a note for the original assignee about the breakage.
2. Commit on older LTS branch w/merge commits:
  1. If intermittent, create a new JIRA ticket for the failure with "intermittent failure" in the summary, and link it in Butler
  2. If consistent, create a new JIRA ticket for the failure, link it in Butler, set the assignee to the person who introduced the failure, and notify them in a comment on the JIRA ticket
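The decision table above can be written as a small lookup. This can be handy if you script a daily triage checklist. The sketch is hypothetical: the enum and function names are illustrative and not part of any Cassandra or Butler tooling. It shows that intermittent failures get the same handling on every branch, and that only a consistent failure from a single trunk commit is reverted.

```python
from enum import Enum


class Origin(Enum):
    """Where the commit that introduced the failure landed (illustrative names)."""
    TRUNK_SINGLE_COMMIT = "single commit on trunk"
    OLDER_BRANCH_MERGE = "commit on older LTS branch with merge commits"


class FailureKind(Enum):
    INTERMITTENT = "intermittent"
    CONSISTENT = "consistent"


def triage_action(origin: Origin, kind: FailureKind) -> str:
    """Map (where the commit landed, how the test fails) to the workflow's action."""
    if kind is FailureKind.INTERMITTENT:
        # Identical on every branch: a new ticket flagged as intermittent.
        return ("create a new JIRA ticket with 'intermittent failure' in the summary "
                "and link it in Butler")
    if origin is Origin.TRUNK_SINGLE_COMMIT:
        return ("git revert the offending SHA, re-open the original JIRA ticket, "
                "and leave a note for the original assignee")
    return ("create a new JIRA ticket, link it in Butler, assign it to the person "
            "who introduced the failure, and notify them in a JIRA comment")


if __name__ == "__main__":
    print(triage_action(Origin.OLDER_BRANCH_MERGE, FailureKind.CONSISTENT))
```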

Notes:

Link failures to JIRA using the "Link selected failures" button.
Create new failure tickets in the ASF C* JIRA.
Loop failing tests locally for a number of iterations to determine whether the failure is consistent or intermittent, using tools/dev/ci-test-loop (PENDING CONTRIBUTION), which relies on tools/dev/ci-test (PENDING CONTRIBUTION). If the failure is intermittent, say so in the summary of the JIRA ticket you create for it.
CI on Jenkins runs on every commit. For a consistently failing test (more than one failed run in Butler), it should therefore be immediately clear which commit introduced the failure.
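The ci-test-loop and ci-test scripts are still pending contribution. Until they land, the minimal Python sketch below approximates both checks from these notes under stated assumptions. First, it runs an arbitrary test command N times and labels the outcome. Second, since Jenkins builds every commit, it picks out the commit at which a consistently failing test went red. The function names, the (sha, passed) history format, and the stand-in command are all hypothetical. They are not the interface of the pending tools or of Butler.

```python
import subprocess
import sys
from typing import Optional, Sequence, Tuple


def loop_test(cmd: Sequence[str], iterations: int) -> int:
    """Run `cmd` `iterations` times and return how many runs exited non-zero."""
    failures = 0
    for _ in range(iterations):
        result = subprocess.run(cmd, stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)
        if result.returncode != 0:
            failures += 1
    return failures


def classify(failures: int, iterations: int) -> str:
    """Label a looped test by how many of its runs failed."""
    if failures == 0:
        return "not reproduced"
    if failures == iterations:
        return "consistent"
    return "intermittent"  # mention this in the JIRA ticket summary


def first_failing_commit(history: Sequence[Tuple[str, bool]]) -> Optional[str]:
    """Given (sha, passed) pairs for one test, oldest build first, return the SHA
    that starts the current streak of failures, or None if the latest build passed.

    Assumes one build per commit. If every build in `history` failed, the real
    culprit may be older than this window; the result is only the earliest failure seen.
    """
    culprit: Optional[str] = None
    for sha, passed in history:
        if passed:
            culprit = None  # a green build resets the streak
        elif culprit is None:
            culprit = sha  # first red build after a green one
    return culprit


if __name__ == "__main__":
    # Stand-in command that always fails; replace it with whatever runs the failing test locally.
    cmd = [sys.executable, "-c", "raise SystemExit(1)"]
    runs = 5
    print(classify(loop_test(cmd, runs), runs))
```

Only treat the SHA from first_failing_commit as the culprit when classify reports "consistent". For an intermittent test, a single red build says little about which commit caused it.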
