See dev mailing list thread here.
The Build Lead role is inspired by the "Build Baron" role used at MongoDB (see whitepaper section 3.2 here). While their role began as a triage role for performance regressions and change point analysis, ours starts from triaging test failures and database correctness, and may evolve to cover performance regression and change point triage in the future.
Rotation
The Build Lead role is a volunteer role with weekly rotations.
Date Range | Name | Email | #cassandra-dev slack |
---|---|---|---|
- | Josh McKenzie | jmckenzie@apache.org | jmckenzie |
- | Brandon Williams | brandonwilliams@apache.org | driftx |
- | Aleksei Zotov | azotcsit@apache.org | azotcsit |
- | Ekaterina Dimitrova | e.dimitrova@gmail.com | e.dimitrova |
- | Frank Guerrero | frankgh@apache.org | frankgh |
- | Marcus Eriksson | marcuse@apache.org | marcuse |
- | Josh McKenzie | jmckenzie@apache.org | jmckenzie |
- | Stefan Miklosovic | smiklosovic@apache.org | smiklosovic |
Tools
Butler: dashboard of historical test failures and per-test build history failure details w/JIRA links (see trunk here)
OpenTestFailures kanban board: board showing all labeled test failure JIRA tickets
ASF Jenkins C* CI: source data pulled by Butler
CircleCI: optional paid testing infrastructure (you pay for parallelism; see .circleci/generate.sh for details on profiles and usage)
ASF Infra:
- Status page
- JIRA
- Note: you will need a C* committer in order to open tickets in ASF Infra. Ping in #cassandra-dev on https://the-asf.slack.com for one if needed.
Workflow
Weekly:
- Enter: handoff w/previous build lead
- Exit: handoff to next build lead
- Coordinate with release manager if any releases are happening that week
Daily:
- Check if there are new test failures in Butler that don't yet exist in JIRA (i.e. butler test failures w/out a JIRA link)
- Create JIRA tickets for new failures and link them to the failure entries in Butler
- Assign the test failure JIRA to whoever introduced a new failing test or, if clear, broke an existing stable test
- Ask in the #cassandra-dev slack channel for volunteers to take any new test failures that we can't trivially attribute to a commit or author
- (Optional): run a hires config against trunk / other desired branches on CircleCI, confirm tickets exist for any failures, and create tickets for those that have none
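The first daily step amounts to filtering the failure list for entries without a JIRA link. A minimal sketch of that filter follows. The `Failure` record shape, its field names, and the sample values are hypothetical, since this page does not describe how Butler exposes its data; the ticket key is a placeholder.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Failure:
    # Hypothetical record shape; Butler's real data model may differ.
    test_name: str
    branch: str
    jira_key: Optional[str] = None  # None means no JIRA ticket is linked yet


def untriaged(failures: List[Failure]) -> List[Failure]:
    """Return failures that have no JIRA link, sorted by branch and test name."""
    missing = [f for f in failures if not f.jira_key]
    return sorted(missing, key=lambda f: (f.branch, f.test_name))


if __name__ == "__main__":
    # Made-up sample data for illustration only.
    sample = [
        Failure("ExampleTestA", "trunk", "CASSANDRA-XXXXX"),
        Failure("ExampleTestB", "trunk"),
    ]
    for failure in untriaged(sample):
        print(f"{failure.branch}: {failure.test_name} needs a JIRA ticket")
```

Anything this returns is a candidate for the ticket creation and assignment steps listed above.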
Details
Creating JIRAs
- Create a JIRA ticket with summary: "Test Failures: <suite> <class_name>"
- Set component to the matching "Test/<suite>" component
- Fill out the description with a mention of the class name and the number of failures at the time of ticket creation
- In comments, add details of the failure with a link to the failing run, plus the output of the test in a JIRA {code} block, as CI results aren't preserved forever
- After creation, update the ticket to Bug Category "Correctness", "Test Failure"
When we close out all failures for a test class across all branches, we close out the JIRA. If another failure comes up on that class, we can re-open.
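The ticket conventions above can be sketched as a pair of small helpers that build the text to paste into JIRA. This is only an illustration: the function names and the dictionary layout are assumptions, and nothing here talks to the JIRA API.

```python
def build_ticket(suite: str, class_name: str, failure_count: int) -> dict:
    """Build the summary, component, and description for a test failure ticket."""
    return {
        "summary": f"Test Failures: {suite} {class_name}",
        "component": f"Test/{suite}",
        "description": (
            f"{class_name}: {failure_count} failure(s) at time of ticket creation."
        ),
    }


def build_comment(run_url: str, test_output: str) -> str:
    """Build a comment linking the failing run and preserving the test output.

    The output is wrapped in a JIRA {code} block because CI results
    are not kept forever.
    """
    return (
        f"Failing run: {run_url}\n"
        "{code}\n"
        f"{test_output.rstrip()}\n"
        "{code}"
    )
```

For example, `build_ticket("unit", "ExampleTest", 3)` yields the summary `Test Failures: unit ExampleTest` and the component `Test/unit`; both values are made up here and should match a real suite and component.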
Using butler:
Butler functionality is currently limited to viewing the current test results and linking failures to existing JIRA tickets; the "Report selected failures" functionality does not work with the Apache JIRA project (as of ). The recommended workflow as Build Lead is as follows:
- Check for new failures on the details page for each branch in the bottom right where it says detailed history:
- Look for failing tests without a JIRA link; in the following example see the top test "TestCQLNodes2RF1_Upgrade_current_4_0_x_To_indev_trunk":
- For failing tests without a linked item we have a couple workflows depending on where the commit occurred as well as what type of failure it is:
- Single commit on trunk:
- If intermittent, create a new JIRA ticket w/"intermittent failure" in the summary for the failure and link it in Butler
- If consistent, git revert the SHA that introduced the failure, re-open the original JIRA ticket, and leave a note for the original assignee about the breakage they introduced.
- Commit on older LTS branch w/merge commits:
- If intermittent, create a new JIRA ticket w/"intermittent failure" in the summary for the failure and link it in Butler
- If consistent, create a new JIRA ticket for the failure, link it in Butler, and set assignee to the individual that introduced the failure and notify them in the comments in the JIRA ticket
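The branching above can be summarized as a small decision function. This is a sketch of the rules as written on this page, not a tool that exists in the repository; `CommitKind` and `triage_steps` are hypothetical names.

```python
from enum import Enum
from typing import List


class CommitKind(Enum):
    TRUNK_SINGLE_COMMIT = "single commit on trunk"
    OLDER_BRANCH_MERGE = "commit on older branch with merge commits"


def triage_steps(kind: CommitKind, consistent: bool) -> List[str]:
    """Return the Build Lead's steps for a failing test without a JIRA link."""
    if not consistent:
        # Intermittent failures are handled the same way in both cases.
        return [
            'Create a new JIRA ticket with "intermittent failure" in the summary',
            "Link the ticket in Butler",
        ]
    if kind is CommitKind.TRUNK_SINGLE_COMMIT:
        return [
            "git revert the SHA that introduced the failure",
            "Re-open the original JIRA ticket",
            "Leave a note for the original assignee about the breakage",
        ]
    return [
        "Create a new JIRA ticket for the failure",
        "Link the ticket in Butler",
        "Assign the ticket to whoever introduced the failure",
        "Notify them in the ticket comments",
    ]
```

Note that only a consistent failure from a single commit on trunk leads to a revert; every other case results in a new ticket.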
Notes:
- Link failures to JIRA via the "Link selected failures" button:
- Create new failure tickets in the ASF C* JIRA.
- Loop failing tests locally for a number of iterations using tools/dev/ci-test-loop (PENDING CONTRIBUTION), which relies on tools/dev/ci-test (PENDING CONTRIBUTION), to determine whether a failure is consistent or intermittent. If intermittent, reflect that in the summary of the JIRA ticket created for the failure.
- CI on Jenkins runs on every commit, so for consistently failing tests (more than one failed run in Butler) it should be immediately clear which commit introduced the failure.
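Since the loop scripts mentioned above are still pending contribution, the idea behind them can be sketched generically: run a test command a fixed number of times and classify the result. The `classify` function below is a hypothetical stand-in, not the interface of tools/dev/ci-test-loop, and the command to run is supplied by the caller.

```python
import subprocess
from typing import List


def classify(command: List[str], iterations: int = 10) -> str:
    """Run a test command repeatedly and classify its failure behavior.

    Returns "passing" if no run fails, "consistent" if every run fails,
    and "intermittent" otherwise. A non-zero exit code counts as a failure.
    """
    if iterations < 1:
        raise ValueError("iterations must be at least 1")
    failures = 0
    for _ in range(iterations):
        result = subprocess.run(
            command,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
        )
        if result.returncode != 0:
            failures += 1
    if failures == 0:
        return "passing"
    if failures == iterations:
        return "consistent"
    return "intermittent"
```

A "passing" result after a small number of iterations does not prove the test is stable; a rarely failing test may need many more runs to reproduce.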