CEP-15: General Purpose Transactions

Audience: All Cassandra Users and Developers
User Impact: Support for fast general purpose transactions

Motivation

Users must expend significant effort to modify their database consistently while maintaining scalability. Even simple transactions involving more than one partition may become complex and error prone, as a distributed state machine must be built atop the database. Conversely, packing all of the state into one partition is not scalable.

Performance also remains an issue, despite recent Paxos improvements: latency is still twice its theoretical minimum over the wide area network, and suffers particularly badly under contention.

This work aims to improve Cassandra to support fast general purpose transactions. That is, those that may operate over any set of keys in the database atomically, modifying their contents at-once, with any action conditional on the existing contents of any key.

Goals

General purpose transactions (may operate over any keys in the database at once)
Strict-serializable isolation
Optimal latency: one wide area round-trip for all transactions under normal conditions
Optimal failure tolerance: latency and performance should be unaffected by any minority of replica failures
Scalability: there should be no bottleneck introduced
Should have no intrinsic limit to transaction complexity
Must work on commodity hardware

Prior Work

Existing databases solve this problem by using either a global leader (FaunaDB, FoundationDB) or by a complex combination of multiple leaders including a transaction log and per-key leaders (DynamoDB, CockroachDB, YugaByte). The academic literature has also outlined leaderless approaches that have not yet been utilised.

Global Leader
This category of approach is simple and correct but introduces a scalability bottleneck that would be irreconcilable with the size of many Cassandra clusters.

Multiple Leaders
DynamoDB, CockroachDB and YugaByte utilise variants of this approach. It appears to be a very complex setup. The latency that can be achieved is unclear, but is unlikely to be better than two round-trips in the general case, since leaders are not guaranteed to be co-located (and nor are clients with leaders). Importantly these approaches appear to require either specialised hardware clocks or provide only serializable isolation, and may well have throughput restrictions to provide any expectation of strict-serializable isolation even without a strong guarantee of maintaining it. They also suffer from aborts. They do offer optimal failure tolerance and are scalable.

Janus
Strict-serializable isolation may be easily extended to cross-shard transactions using EPaxos-derived approaches, as was first demonstrated by Janus. In its original formulation Janus is fundamentally a more limited variant of EPaxos that was expanded to the cross-shard case. It has terrible failure properties, requiring every replica to agree to take the fast path. Fortunately, all leaderless protocols based on the intuitions of EPaxos can likely be modified to support cross-shard transactions, so we may assess all such protocols.

EPaxos
While Janus has poor failure properties, the EPaxos protocol can be extended in a similar fashion while achieving almost the same failure properties - we must only increase the quorum size by one, or else utilise a proxy coordinator for each shard. EPaxos is the first leaderless protocol to demonstrate a fast path, however it suffers from high contention as every fast path participant must have witnessed every dependency in order for the fast path to be agreed.

Strict-serializable isolation is not a property of the basic EPaxos protocol, however it is possible to support it by responding to clients only once a transaction’s dependencies are committed. This is an acceptable but unnecessary limitation, as we will later demonstrate.

Latency of EPaxos is good if there are no conflicts, but like other leaderless protocols its failure properties are poor: the fast path will be disabled once a quarter of replicas are unreachable, so that the system will incur significantly higher load and slower transactions at this point, creating a risk of cascading failure.

Caesar
This protocol builds on the intuitions of EPaxos but instead of unanimously agreeing dependencies, it instead unanimously agrees an execution timestamp. This confers a lower conflict rate, as each voting replica may witness a different history - we only require that some super-majority sees the transaction’s timestamp before any higher timestamp. The use of a timestamp confers additional benefits we will outline below.

Caesar builds a set of dependent transactions that are equivalent to those of EPaxos in the aftermath of agreeing an execution timestamp, where the dependencies are consistent with the timestamp execution order but permit commutative operations (mostly reads with respect to other reads) to not interfere with each other, so that they may be performed concurrently.

However, Caesar has downsides: transactions may livelock indefinitely (though this is implausible, in practice this may lead to latency spikes), and its worst case latency is three wide area round-trips. It has the same suboptimal failure properties as other leaderless protocols.

Tempo
Tempo improves upon Caesar for latency, guaranteeing two-round agreement without caveat. Unfortunately it suffers from a read scalability bottleneck due to its inability to exploit transaction commutativity, and has particularly poor failure properties.

Space shortcuts

Page tree

Motivation

Goals

Prior Work