4.0 Quality: Components and Test Plans

The overarching goal of the 4.0 release is for Cassandra 4.0 to be in a state where major users would run it in production as soon as it is cut. To gain this confidence, there are various ongoing testing efforts covering correctness, performance, and ease of use. On this page we coordinate those efforts and identify the blockers for each subsystem that must be resolved before we can release 4.0.

Tracking

We will track progress in Jira by tagging high-level components with the 4.0-QA Jira label. As a community, we can see our progress via a simple Jira query:

Jira Query for Tracking Progress

For each component we strive to have both shepherds and contributors involved. Shepherds should be committers or knowledgeable component owners. They are responsible for driving their blocking tickets to completion, and ideally they also help set testing standards and ensure a high standard of quality in their claimed area. Contributors have signed up to help verify that subsystem by running tests or contributing fixes.

If you are interested in contributing to testing 4.0, please add your name as a contributor and get involved in the tracking ticket and in the dev list/IRC discussions for that component.

Targeted Components / Subsystems

We've tried to collect the major components and subsystems that need to work properly for 4.0 to be a great release. If you think something is missing, please add it. Better yet, volunteer to contribute to testing it!

✅ Internode Messaging

In 4.0 we're getting a new Netty-based internode communication system (CASSANDRA-8457). As internode messaging is vital to the correctness and performance of the database, we should make sure that all forms of internode messaging (TLS, compressed, low latency, high latency, etc.) function correctly.

Shepherds: Benedict Elliott Smith, Aleksey Yeshchenko, Jason Brown

Tracking Tickets: [Jira issue links unavailable]

Contributors: Vinay Chella, Jordan West, Dinesh Joshi, Joey Lynch, Sumanth Pasupuleti, Benedict Elliott Smith, Aleksey Yeshchenko

Current Status: Planned work toward validating the stability and performance of internode messaging in Apache Cassandra 4.0 is nominally complete. Minor improvements and bug fixes may follow if issues are identified during the alpha/beta/RC cycle. A few remaining perf tests are expected, tracked as subtasks of [Jira issue link unavailable].

The test plan for internode messaging changes is located at 4.0 Internode Messaging Test Plan. Note especially the "Randomised Testing" section, with tests implemented under test/burn/oac/net. Contributors have exercised these changes via the burn suite, dedicating over 16,000 cumulative core-hours to the validation implemented in Verifier.

✅ Test Infrastructure / Automation: Diff Testing

Diff testing is a form of model-based testing in which two clusters are exhaustively compared to assert that their contents are identical. To support Apache Cassandra 4.0 validation, contributors have developed cassandra-diff. This is a Spark application that distributes the token range over a configurable number of Spark executors, then parallelizes randomized forward and reverse reads with varying paging sizes to read and compare every row present in the cluster, persisting a record of mismatches for investigation. This methodology has been instrumental in identifying data loss, data corruption, and incorrect response issues introduced in early Cassandra 3.0 releases.

cassandra-diff and associated documentation can be found at: https://github.com/apache/cassandra-diff. Contributors are encouraged to run diff tests against clusters they manage and report issues to ensure workload diversity across the project.
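
The comparison loop described above can be illustrated with a minimal, single-process sketch. This is not cassandra-diff's code or API: the two "clusters" are plain Python dicts keyed by token, the signed 64-bit token bounds are an assumption borrowed from Murmur3Partitioner, and every function name (split_token_range, read_paged, diff_slice, diff_clusters) is hypothetical. Mismatches are returned as a list rather than persisted.

```python
import random

# Assumed signed 64-bit token space (Murmur3Partitioner-style); illustration only.
MIN_TOKEN, END_TOKEN = -2**63, 2**63   # END_TOKEN is exclusive
_MISSING = object()                    # marks a row absent from one side


def split_token_range(splits):
    """Divide the token space into `splits` contiguous half-open [start, end) slices."""
    width = 2**64 // splits
    bounds = [MIN_TOKEN + i * width for i in range(splits)] + [END_TOKEN]
    return list(zip(bounds[:-1], bounds[1:]))


def read_paged(cluster, start, end, page_size, reverse=False):
    """Yield (token, row) pairs in [start, end), fetched one page at a time."""
    tokens = sorted((t for t in cluster if start <= t < end), reverse=reverse)
    for i in range(0, len(tokens), page_size):
        for t in tokens[i:i + page_size]:
            yield t, cluster[t]


def diff_slice(source, target, start, end, rng):
    """Read one slice from both sides with a random direction and random
    page sizes, and return the tokens whose rows differ or are missing."""
    reverse = rng.random() < 0.5
    src = dict(read_paged(source, start, end, rng.randint(1, 8), reverse))
    dst = dict(read_paged(target, start, end, rng.randint(1, 8), reverse))
    return sorted(t for t in src.keys() | dst.keys()
                  if src.get(t, _MISSING) != dst.get(t, _MISSING))


def diff_clusters(source, target, splits=4, seed=0):
    """Compare every slice in turn; a real job would hand slices to executors."""
    rng = random.Random(seed)
    mismatches = []
    for start, end in split_token_range(splits):
        mismatches.extend(diff_slice(source, target, start, end, rng))
    return mismatches


if __name__ == "__main__":
    source = {-5: "a", 10: "b", 2**62: "c"}
    target = {-5: "a", 10: "B"}          # one changed row, one missing row
    print(diff_clusters(source, target))
```

Because each slice is compared independently, the per-slice function is the natural unit to parallelize, which is the role the Spark executors play in the description above.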

✅ System Tables and Internal Schema

Shepherd: Aleksey Yeshchenko

This task covers a review of and minor bug fixes to local and distributed system keyspaces. Planned work in this area is now complete.

Issues identified and resolved included: [four Jira issue links unavailable]

⏳ Source Audit and Performance Testing: Streaming

Shepherds: Aleksey Yeshchenko, Dinesh Joshi

ETA: Dec 31, 2019

This task covers an audit of the streaming implementation in Apache Cassandra 4.0. In this release, contributors have implemented full-SSTable streaming to improve performance and reduce memory pressure. Because the internode messaging changes implemented in CASSANDRA-15066 were adjacent to streaming, a review of the streaming implementation itself seemed desirable. Prior work also covered performance testing of full-SSTable streaming.

Two remaining issues with partial streaming of compressed SSTables are being addressed, with a patch pending ([Jira issue link unavailable]). One additional item is in flight to address a minor issue with backpressure and the threading model; this change will be localized and very small in scope.

Further work is not essential to unblock streaming for Beta/GA, though small improvements may follow if bugs or performance issues are identified during later-stage testing.

⏳WIP: Test Infrastructure / Automation: "Harry"

Shepherds: Alex Petrov, Benedict Elliott Smith


Harry is a component for fuzz testing and verification of Apache Cassandra clusters at scale. Harry makes it possible to run tests that efficiently validate the state of both dense nodes (to test the local read-write path) and large clusters (to test the distributed read-write path). Harry defines a model that holds the state of the database, generators that produce reproducible, pseudo-random schemas, mutations, and queries, and a validator that asserts the correctness of the model following execution of generated traffic. See CASSANDRA-15348 for additional details.

Development of Harry is currently in progress. Once it is complete, contributors envision its black-box model and verifier acting as a test to which compute power can be dedicated indefinitely. Harry's generators and model are also useful for writing targeted property-based tests. Python-based dtests are good candidates for migration from Python/Byteman to in-JVM dtests paired with Harry's model and generators.
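
To make the model / generator / validator split concrete, here is a toy sketch of the idea. It is not Harry's code or API: ToyTable is a hypothetical in-memory stand-in for the system under test, and the names generate_ops, Model, and run_and_validate, as well as the key-space sizes, are invented for illustration.

```python
import random

PARTITIONS, CLUSTERINGS = 4, 8   # deliberately tiny key space so operations collide


class ToyTable:
    """Hypothetical system under test: partition key -> {clustering key: value}."""

    def __init__(self):
        self._data = {}

    def write(self, pk, ck, value):
        self._data.setdefault(pk, {})[ck] = value

    def delete(self, pk, ck):
        self._data.get(pk, {}).pop(ck, None)

    def read(self, pk):
        return sorted(self._data.get(pk, {}).items())


def generate_ops(seed, count):
    """Generator: the same seed always reproduces the same operation sequence."""
    rng = random.Random(seed)
    for _ in range(count):
        kind = rng.choice(["write", "write", "delete"])
        pk, ck = rng.randrange(PARTITIONS), rng.randrange(CLUSTERINGS)
        value = rng.randrange(1000) if kind == "write" else None
        yield kind, pk, ck, value


class Model:
    """Model: an independent record of what the table should contain."""

    def __init__(self):
        self.expected = {}

    def apply(self, op):
        kind, pk, ck, value = op
        if kind == "write":
            self.expected[(pk, ck)] = value
        else:
            self.expected.pop((pk, ck), None)

    def rows(self, pk):
        return sorted((ck, v) for (p, ck), v in self.expected.items() if p == pk)


def run_and_validate(table, seed, count=200):
    """Validator: apply generated traffic to table and model, then compare."""
    model = Model()
    for op in generate_ops(seed, count):
        kind, pk, ck, value = op
        if kind == "write":
            table.write(pk, ck, value)
        else:
            table.delete(pk, ck)
        model.apply(op)
    for pk in range(PARTITIONS):
        assert table.read(pk) == model.rows(pk), f"seed={seed} pk={pk} diverged"
    return True


if __name__ == "__main__":
    print(run_and_validate(ToyTable(), seed=1))
```

Since a failure message carries the seed, any divergence can be replayed exactly. That reproducibility is what makes it practical to keep such a test running on spare compute for as long as desired.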

✅ Local Read/Write Path: IndexInfo (CASSANDRA-11206)

Shepherd: Jordan West

Users upgrading from Cassandra 3.0.x to trunk will pick up CASSANDRA-11206 in the process. Contributors to 4.0 testing and validation have allocated time to verifying these changes via a source audit and the implementation of property-based tests (currently underway). The majority of planned work here is complete, with a final set of perf tests in progress. No correctness issues were identified via the source audit and randomized testing. Minor cleanup and refactoring may follow, but any such changes are expected to be small in scope.

Issues identified and resolved included: [Jira issue link unavailable]

⏳WIP: Local Read/Write Path: Upgrade and Diff Test

Shepherd: Yifan Cai

Execution of upgrade and diff tests via cassandra-diff has proven to be one of the most effective approaches toward identifying issues with the local read/write path. The issues found include instances of data loss, data corruption, data resurrection, incorrect responses to queries, incomplete responses, and others. Upgrade and diff tests can be executed concurrently with fault injection (such as host or network failure), as well as during mixed-version scenarios (such as upgrading half of the instances in a cluster, and running upgradesstables on only half of the upgraded instances).

Upgrade and diff tests are expected to continue through the release cycle, and are a great way for contributors to gain confidence in the correctness of the database under their own workloads.

Local Read/Write Path: Other Areas

Testing in this area refers to the local read/write path (StorageProxy, ColumnFamilyStore, Memtable, SSTable reading/writing, etc.). We are still finding numerous bugs and issues with the 3.0 storage engine rewrite (CASSANDRA-8099). For 4.0 we want to ensure that we thoroughly cover the local read/write path with techniques such as property-based testing, fuzzing, and a source audit.

Shepherd: Aleksey Yeshchenko

Tracking Ticket: TBD

Contributors: Sam Tunnicliffe, Blake Eggleston

Distributed Read/Write Path: Coordination, Replication, and Read Repair

Testing in this area focuses on non-node-local aspects of the read-write path: coordination, replication, read repair, etc.

Shepherd: TBD

Tracking Ticket: TBD

Contributors: Blake Eggleston

Repair

We aim for 4.0 to have the first fully functioning incremental repair solution (CASSANDRA-9143)! Furthermore, we aim to verify that all types of repair (full range, sub range, incremental) function as expected, and to ensure that community tools such as Reaper work. CASSANDRA-3200 adds an experimental option to reduce the amount of data streamed during repair; we should write more tests and see how it works with big nodes.

Shepherd: Blake Eggleston

Tracking Ticket: TBD

Contributors: Marcus Eriksson, Vinay Chella

Compaction

Alongside the local and distributed read/write paths, we'll also want to validate compaction. CASSANDRA-6696 introduced substantial changes/improvements that require testing (esp. JBOD).

Shepherd: Marcus Eriksson

Tracking Ticket: [Jira issue link unavailable]

Contributors: Jordan West

Metrics

In past releases we've unknowingly broken metrics integrations and introduced performance regressions in metrics collection and reporting. In 4.0 we strive to avoid both. Metrics should work well!

Shepherd: Romain Hardouin

Tracking Ticket: [Jira issue link unavailable]

Contributors: TBD

Tooling: Bundled / First-Party

Test plans should cover bundled first-party tooling and CLIs such as nodetool, cqlsh, and new tools supporting full query and audit logging (CASSANDRA-13983, CASSANDRA-12151).

Shepherd: Sam Tunnicliffe

Tracking Ticket: TBD

Contributors: Vinay Chella

Tooling: External Ecosystem

Many users of Apache Cassandra employ open source tooling to automate Cassandra configuration, runtime management, and repair scheduling. Prior to release, we need to confirm that popular third-party tools such as Reaper, Priam, etc. function properly.

Shepherd: Sam Tunnicliffe

Tracking Ticket: TBD

Contributors: Sumanth Pasupuleti (Priam)

Test Frameworks, Tooling, Infrastructure / Automation

This area refers to contributions to test frameworks/tooling (e.g., dtests, QuickTheories, CASSANDRA-14821), and automation enabling those tools to be applied at scale (e.g., replay testing via Spark-based replay of captured FQL logs).

Shepherd: Jordan West

Tracking Ticket: TBD

Contributors: Add your name!

Cluster Setup and Maintenance

We want 4.0 to be easy for users to set up out of the box and just work. This means low friction when users download the Cassandra package and start running it. For example, users should be able to easily configure and start new 4.0 clusters and have tokens distributed evenly. Another example is packaging: it should be easy to install Cassandra on all supported platforms and have Cassandra use standard platform integrations.
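
As one way to make "tokens distributed evenly" checkable, the sketch below computes each node's share of the ring from its token and reports the imbalance. It assumes a single token per node and the signed 64-bit token space of Murmur3Partitioner; it is not a Cassandra tool, and the helper names (even_tokens, ownership, imbalance) are hypothetical.

```python
RING_SIZE = 2**64     # assumed: number of tokens in a signed 64-bit token space
MIN_TOKEN = -2**63


def even_tokens(node_count):
    """One evenly spaced token per node across the ring."""
    return [MIN_TOKEN + i * RING_SIZE // node_count for i in range(node_count)]


def ownership(tokens):
    """Fraction of the ring owned by each token, taking a node to own the
    span from the previous token (wrapping around) up to its own token."""
    ordered = sorted(tokens)
    fractions = {}
    for i, token in enumerate(ordered):
        previous = ordered[i - 1]       # i == 0 wraps to the largest token
        span = (token - previous) % RING_SIZE
        if len(ordered) == 1:
            span = RING_SIZE            # a lone node owns the whole ring
        fractions[token] = span / RING_SIZE
    return fractions


def imbalance(tokens):
    """Largest share divided by smallest share; 1.0 means perfectly even."""
    shares = ownership(tokens).values()
    return max(shares) / min(shares)


if __name__ == "__main__":
    print(ownership(even_tokens(4)))
    print(imbalance([MIN_TOKEN, 0, 1]))   # a badly skewed three-node ring
```

A setup test could assert that the imbalance of a freshly started cluster stays below a chosen threshold. With multiple tokens per node, the shares would first need to be summed per node.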

Shepherd: TBD

Tracking Ticket: TBD

Contributors: Add your name!

Platforms / Runtimes

CASSANDRA-9608 introduces support for Java 11. We'll want to verify that Cassandra under Java 11 meets expectations of stability.

Shepherd: TBD

Tracking Ticket: TBD

Contributors: Add your name!

Cluster Upgrade

We've historically had numerous bugs concerning upgrading clusters from one version to another. Let's establish the supported upgrade paths and ensure that users can safely perform those upgrades in production.

Shepherd: Ariel Weisberg

Tracking Ticket: TBD

Contributors: Tommy Stendahl

Documentation

Many sections of our documentation are incomplete or wrong. Let's deliver a 4.0 release that is not only functional but also well documented.

Shepherds: Dinesh Joshi, Joey Lynch

Tracking Ticket: CASSANDRA-15353

Contributors: Joey Lynch, Jon Haddad, Deepak Vohra

Features / Substantial Changes

Transient Replication

Transient Replication is an experimental implementation of witness replicas included in Apache Cassandra 4.0 (CASSANDRA-14697). As this feature is experimental, the focus of testing and validation in this release will be toward ensuring that its implementation doesn't negatively impact non-transient use cases.

Shepherd: TBD

Tracking Ticket: CASSANDRA-14697

Contributors: TBD
