[ This document is a draft + sketch of ideas. It is located in the "discussion" section of this wiki to indicate that it is an active draft – not a document that has been voted on, achieved consensus, or in any way official. ]


Introduction:

This document outlines a series of metrics that may be useful for measuring release quality and quantifying progress during the testing / validation phase of the Apache Cassandra 4.0 release cycle.

The goal of this document is to think through what we should consider measuring to quantify our progress testing and validating Apache Cassandra 4.0. This document explicitly does not discuss release criteria – though metrics may be a useful input to a discussion on that topic.


Metric: Build / Test Health (produced via CI, recorded in Confluence)

Bread-and-butter metrics intended to capture baseline build health and flakiness in the test suite, presented as a time series to show how they change from build to build and release to release. A sketch of how per-build flakiness could be computed follows the list below.

In particular, we may consider measuring:

– Pass / fail metrics for unit tests
– Pass / fail metrics for dtests
– Flakiness stats for unit and dtests
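
As a rough illustration (not a proposal for specific tooling), the sketch below assumes we re-run the suites several times against the same build and have each run’s results available as a simple mapping of test name to pass/fail; the input format and script name are hypothetical. It computes the overall pass rate and the set of flaky tests – tests that both passed and failed across runs of the same build.

    # flakiness.py -- hypothetical sketch: compute pass rate and flaky tests
    # from repeated runs of a suite against the same build.
    from collections import defaultdict

    def summarize(runs):
        """runs: list of dicts mapping test name -> True (pass) / False (fail)."""
        outcomes = defaultdict(list)
        for run in runs:
            for test, passed in run.items():
                outcomes[test].append(passed)

        total = sum(len(run) for run in runs)
        passes = sum(sum(results) for results in outcomes.values())
        # A test is "flaky" if it both passed and failed across runs of one build.
        flaky = sorted(t for t, results in outcomes.items()
                       if any(results) and not all(results))
        return {"pass_rate": passes / total if total else 0.0,
                "flaky_tests": flaky}

    if __name__ == "__main__":
        runs = [{"test_compaction": True, "test_paging": True},
                {"test_compaction": True, "test_paging": False}]
        print(summarize(runs))
        # {'pass_rate': 0.75, 'flaky_tests': ['test_paging']}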


Metric: “Found Bug” Count by Methodology (sourced via JQL, reported in Confluence)

These are intended to help us understand the efficacy of each methodology being applied. We might consider annotating bugs found in JIRA with the methodology that produced them; that annotation could then feed a JQL query and be reported on the Confluence dev wiki. A sketch of one possible way to automate this appears after the list of methodologies below.

As we reach a Pareto-optimal level of investment in a methodology, we’d expect to see its found-bug rate taper. As we achieve higher quality across the board, we’d expect to see a tapering in found-bug counts across all methodologies. In the event that one or two approaches are outliers, this could indicate the utility of doubling down on those particular forms of testing.

We might consider reporting “Found By” counts for methodologies such as:

– Property-based / fuzz testing
– Replay testing
– Upgrade / Diff testing
– Performance testing
– Shadow traffic
– Unit/dtest coverage of new areas
– Source audit


Metric: “Found Bug” Count by Subsystem/Component (sourced via JQL, reported in Confluence)

Similar to “found by,” but “found where.” These metrics help us understand which components or subsystems of the database we’re finding issues in. In the event that a particular area stands out as “hot,” we’ll have the quantitative feedback we need to support investment there. Tracking these counts over time, along with their first derivative (the rate at which they’re found), also helps us make statements regarding progress in various subsystems. Though we can’t prove a negative (“no bugs have been found, therefore there are no bugs”), we gain confidence as the rate of newly found bugs decreases relative to the effort we’re putting in.

We might consider reporting “Found In” counts for components as enumerated in JIRA, such as:
– Auth
– Build
– Compaction
– Compression
– Core
– CQL
– Distributed Metadata
– …and so on.


Metric: “Found Bug” Count by Severity (sourced via JQL, reported in Confluence)

Similar to “found by” and “found where,” but “how bad?” These metrics help us understand the severity of the issues we encounter. As build quality improves, we would expect to see a decrease in the severity of the issues identified. A high rate of critical issues identified late in the release cycle would be cause for concern, though it might be expected earlier in the cycle.

These could be sourced from the “Priority” field in JIRA:
– Trivial
– Minor
– Major
– Critical
– Blocker

While “priority” doesn’t map directly to “severity,” it may be a useful proxy. Alternatively, we could introduce a label intended to represent severity if we’d like to make that distinction explicit.


Metric: Performance Tests

Performance tests tell us “how fast” (and “how expensive”). There are many metrics we could capture here, and a variety of workloads they could be sourced from.

I’ll refrain from proposing a particular methodology or reporting structure, since many have already thought about this. From a reporting perspective, I’m inspired by Mozilla’s “arewefastyet.com”, which reports the performance of their JavaScript engine relative to Chrome’s: https://arewefastyet.com/win10/overview

Having this sort of feedback on a build-by-build basis would help us catch regressions, quantify improvements, and provide a baseline against 3.0 and 3.x.
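
As a minimal sketch of the kind of build-over-build check such reporting could drive (the data format, metric names, and 5% threshold below are placeholders, not a proposed methodology):

    # perf_check.py -- hypothetical sketch: flag regressions in a candidate build
    # against a baseline (e.g. the previous 4.0 build, or 3.0 / 3.x).
    THRESHOLD = 0.05  # flag anything more than 5% worse than the baseline

    def regressions(baseline, candidate, higher_is_better=("ops/s",)):
        """baseline/candidate: dicts of (workload, metric) -> measured value."""
        flagged = []
        for key, base in baseline.items():
            if key not in candidate or base == 0:
                continue
            workload, metric = key
            change = (candidate[key] - base) / base
            # For throughput-style metrics a drop is bad; for latency a rise is bad.
            worse = -change if metric in higher_is_better else change
            if worse > THRESHOLD:
                flagged.append((workload, metric, change))
        return flagged

    if __name__ == "__main__":
        baseline  = {("read", "ops/s"): 52000, ("read", "p99_ms"): 9.8}
        candidate = {("read", "ops/s"): 47000, ("read", "p99_ms"): 10.1}
        for workload, metric, change in regressions(baseline, candidate):
            print("REGRESSION {} {}: {:+.1%}".format(workload, metric, change))
        # REGRESSION read ops/s: -9.6%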


Metric: Code Coverage (and other Static Analysis techniques)

It may also be useful to publish metrics from CI on code coverage by package/class/method/branch. These might not be useful metrics for “quality” (the relationship between code coverage and quality is tenuous).

However, it would be useful to quantify the trend over time between releases, and to source a “to-do” list for important but poorly-covered areas of the project.
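
As a sketch, if coverage were produced by a tool like JaCoCo (whose XML report includes per-package line counters), something like the following could extract per-package line coverage and sort it worst-first to seed that to-do list. The report path is a placeholder for wherever CI publishes the report; this isn’t tied to any existing build target.

    # coverage_report.py -- hypothetical sketch: per-package line coverage
    # from a JaCoCo-style XML report, sorted worst-first.
    import xml.etree.ElementTree as ET

    def package_line_coverage(report_path):
        root = ET.parse(report_path).getroot()
        coverage = {}
        for package in root.iter("package"):
            for counter in package.findall("counter"):  # package-level counters
                if counter.get("type") == "LINE":
                    missed = int(counter.get("missed"))
                    covered = int(counter.get("covered"))
                    if missed + covered:
                        coverage[package.get("name")] = covered / (missed + covered)
        return coverage

    if __name__ == "__main__":
        # Placeholder path; adjust to wherever the CI job writes the report.
        report = package_line_coverage("build/jacoco/report.xml")
        for name, pct in sorted(report.items(), key=lambda kv: kv[1]):
            print("{:6.1%}  {}".format(pct, name))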


Others:

There are likely more things we could measure. We won’t want to drown ourselves in metrics (or the work required to gather them), but others not described here may be worth considering.


Convergence Across Metrics:

The thesis of this document is that improvement in each of these areas is correlated with an increase in quality, and that improvement across all of them is correlated with an increase in overall release quality. Tracking metrics like these provides the quantitative foundation for assessing progress, setting goals, and defining criteria. In that sense, they’re not an end – but a beginning.
