See dev mailing list thread here.
The Build Lead role is inspired by the "Build Baron" role used at MongoDB (see whitepaper section 3.2 here). While their role began as a triage role for performance regressions and change point analysis, ours starts from triaging test failures and database correctness, and may evolve to cover performance regression and change point triage in the future.
Rotation
The Build Lead role is a volunteer role with weekly rotations.
Date Range | Name | Email | #cassandra-dev slack |
---|---|---|---|
1/24 - 1/28 | Josh McKenzie | jmckenzie@apache.org | jmckenzie |
1/31 - 2/4 | Brandon Williams | brandonwilliams@apache.org | driftx |
2/7 - 2/11 | Aleksei Zotov | azotcsit@gmail.com | azotcsit |
2/14 - 2/18 | Ekaterina Dimitrova | e.dimitrova@gmail.com | e.dimitrova |
Tools
Butler: dashboard of historical test failures and per-test build history failure details w/JIRA links (see trunk here)
OpenTestFailures kanban board: board showing all labeled test failure JIRA tickets
ASF Jenkins C* CI: source data pulled by Butler
CircleCI: testing infrastructure with an optional paid tier (paying buys parallelism; see .circleci/generate.sh for details on profiles and usage)
Workflow
Weekly:
- Enter: handoff w/previous build lead
- Exit: handoff to next build lead
- Coordinate with release manager if any releases are happening that week
Daily:
- Check if there are new test failures in Butler that don't yet exist in JIRA (i.e. butler test failures w/out a JIRA link)
- Create JIRA tickets for new failures and link them to the failure entries in Butler
- Assign the test failure JIRA ticket to whoever introduced a new failing test or, if it is clear, broke an existing stable test
- Ask in the #cassandra-dev slack channel for volunteers to take any new test failures that we can't trivially attribute to a commit
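The first daily step (finding failures that have no JIRA link yet) can be sketched as a simple filter. This is only an illustration: the `Failure` record, its fields, and the sample entries are hypothetical, and nothing here reflects how Butler actually stores or exposes its data.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Failure:
    """One failing test as it might be copied out of Butler (hypothetical shape)."""
    test: str
    branch: str
    jira: Optional[str] = None  # linked ticket key, or None when no link exists


def unlinked(failures: List[Failure]) -> List[Failure]:
    """Return the failures that still need a JIRA ticket created and linked."""
    return [f for f in failures if not f.jira]


# Placeholder data for illustration only; not real test names or ticket keys.
todays_failures = [
    Failure("ExampleTest.testA", "trunk", jira="CASSANDRA-0000"),
    Failure("ExampleTest.testB", "trunk"),
    Failure("ExampleTest.testC", "cassandra-4.0"),
]

for failure in unlinked(todays_failures):
    print(f"{failure.branch}: {failure.test} needs a JIRA ticket")
```

Each entry printed here corresponds to one ticket to create and link during the daily pass.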
Details
Using Butler:
Butler's functionality is currently limited to viewing the current test results and linking failures to existing JIRA tickets; the "Report selected failures" functionality does not work with the Apache JIRA project (as of ). The recommended workflow as Build Lead is as follows:
- Check for new failures on the details page for each branch, in the bottom right under "detailed history".
- Look for failing tests without a JIRA link, e.g. "TestCQLNodes2RF1_Upgrade_current_4_0_x_To_indev_trunk".
- For failing tests without a linked item we have a couple of workflows, depending on where the commit occurred and what type of failure it is:
  - Single commit on trunk:
    - If intermittent, create a new JIRA ticket w/"intermittent failure" in the summary for the failure and link it in Butler
    - If consistent, git revert the SHA that introduced the failure, re-open the original JIRA ticket, and leave a note for the original assignee about the breakage they introduced.
  - Commit on older LTS branch w/merge commits:
    - If intermittent, create a new JIRA ticket w/"intermittent failure" in the summary for the failure and link it in Butler
    - If consistent, create a new JIRA ticket for the failure, link it in Butler, set the assignee to the individual who introduced the failure, and notify them in the comments on the JIRA ticket
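The branching in the workflow above can be written down as a small lookup, which may make the four cases easier to scan. This is a sketch for illustration only: the `Commit` enum and `triage` function are hypothetical names, and the returned strings simply restate the manual steps; nothing here talks to JIRA, Butler, or git.

```python
from enum import Enum
from typing import List


class Commit(Enum):
    """Where the offending commit landed (hypothetical names)."""
    TRUNK_SINGLE = "single commit on trunk"
    OLDER_WITH_MERGES = "commit on older branch with merge commits"


def triage(commit: Commit, consistent: bool) -> List[str]:
    """Return the manual steps for an unlinked failing test."""
    if not consistent:
        # Intermittent failures are handled the same way on every branch.
        return [
            'create a new JIRA ticket with "intermittent failure" in the summary',
            "link the ticket in Butler",
        ]
    if commit is Commit.TRUNK_SINGLE:
        return [
            "git revert the SHA that introduced the failure",
            "re-open the original JIRA ticket",
            "leave a note for the original assignee about the breakage",
        ]
    return [
        "create a new JIRA ticket for the failure",
        "link the ticket in Butler",
        "assign the ticket to the individual who introduced the failure",
        "notify them in the comments on the ticket",
    ]


for step in triage(Commit.TRUNK_SINGLE, consistent=True):
    print("-", step)
```

Note that only the consistent case differs by branch: a revert is straightforward for a single commit on trunk, while a commit that was merged forward from an older branch goes back to its author through a new ticket.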
Notes:
- Link failures to JIRA via the "Link selected failures" button.
- Create new failure tickets in the ASF C* JIRA.
- Loop failing tests locally for a number of iterations using tools/dev/ci-test-loop (PENDING CONTRIBUTION), which relies on tools/dev/ci-test (PENDING CONTRIBUTION), to determine whether the failure is consistent or intermittent. If intermittent, reflect that in the summary of the JIRA ticket created for the failure.
- CI on Jenkins runs on every commit, so for consistently failing tests (> 1 run failed on Butler) it should be immediately clear which commit introduced the failure.
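Until the looping tools mentioned above are contributed, the idea of repeating a test to tell a consistent failure from an intermittent one can be sketched generically. This is not the interface of tools/dev/ci-test-loop, which is not shown here; `classify` and `command_runner` are hypothetical helpers, and the command you pass in is whatever you already use to run the single test locally.

```python
import subprocess
from typing import Callable, Sequence


def classify(run_once: Callable[[], bool], iterations: int = 10) -> str:
    """Run a test repeatedly and label the outcome.

    run_once returns True when the test passes. The result is "passing" when
    no run failed, "consistent" when every run failed, else "intermittent".
    """
    failures = sum(1 for _ in range(iterations) if not run_once())
    if failures == 0:
        return "passing"
    if failures == iterations:
        return "consistent"
    return "intermittent"


def command_runner(cmd: Sequence[str]) -> Callable[[], bool]:
    """Wrap a shell command so that exit code 0 counts as a pass."""
    def run_once() -> bool:
        result = subprocess.run(
            cmd, stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL
        )
        return result.returncode == 0
    return run_once
```

For example, `classify(command_runner(cmd), iterations=20)` with `cmd` set to your local single-test invocation gives a label to put in the ticket summary. A small iteration count can miss rare failures, so a "passing" result here does not prove the test is stable.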