
These existing leaderless approaches achieve low latency for uncontended keys so long as a super-majority of replicas remains healthy. However, none has optimal failure characteristics: latency still suffers under contention, and the failure of as few as a quarter of replicas forces the system onto slow-path consensus, harming latency and increasing cluster load.

We propose a new leaderless timestamp protocol, Accord, with better properties than Caesar or Tempo, for which we have derived formal proofs of correctness (to be published at a later date). We also propose two new techniques: a Reorder Buffer, which guarantees one round-trip consensus with a timestamp protocol, and Fast Path Electorates, which permit a leaderless fast path to remain accessible under the worst tolerated failures.

We propose incubating Accord within the project as a standalone library, designed for integration into Cassandra with the sole purpose of supporting these goals. A prototype has been developed [link] whose correctness has been demonstrated against Jepsen.io’s Maelstrom tool and a similar in-tree facility, but it remains incomplete and is not ready for production use. We propose incorporating it into the project under a new repository, to be developed alongside the necessary integrations and improvements to Cassandra.

Reorder Buffer
This technique is very simple: the cluster determines (through some secondary mechanism) the maximum clock skew between any two nodes in a replica set, and the point-to-point latencies. Timestamp proposals are then buffered at replicas for a period equal to this clock skew plus the longest point-to-point latency (minus the latency between the transaction’s coordinator and the receiving replica). This means a message is only processed once any other coordinator’s message that might conflict must have arrived (unless that coordinator’s process was unhealthy and stalled, for instance). Messages are then processed in timestamp order, guaranteeing that they can be acknowledged for fast path participation. By performing this on every replica, a coordinator is guaranteed to obtain a fast path quorum.
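As a minimal sketch, the buffering rule might look as follows; the class and field names here (ReorderBuffer, maxClockSkewMillis, coordinatorLatencyMillis) are illustrative assumptions, not the prototype’s API:

```java
import java.util.PriorityQueue;

// Illustrative reorder buffer sketch; names and structure are assumptions,
// not the prototype's actual API.
class ReorderBuffer
{
    static class Proposal implements Comparable<Proposal>
    {
        final long timestamp;                // timestamp proposed by the coordinator
        final long arrivalMillis;            // local time at which the proposal arrived
        final long coordinatorLatencyMillis; // measured latency from this proposal's coordinator

        Proposal(long timestamp, long arrivalMillis, long coordinatorLatencyMillis)
        {
            this.timestamp = timestamp;
            this.arrivalMillis = arrivalMillis;
            this.coordinatorLatencyMillis = coordinatorLatencyMillis;
        }

        public int compareTo(Proposal that) { return Long.compare(this.timestamp, that.timestamp); }
    }

    final long maxClockSkewMillis;    // established by some secondary mechanism
    final long maxPointToPointMillis; // longest latency between any coordinator and this replica
    final PriorityQueue<Proposal> buffered = new PriorityQueue<>(); // ordered by timestamp

    ReorderBuffer(long maxClockSkewMillis, long maxPointToPointMillis)
    {
        this.maxClockSkewMillis = maxClockSkewMillis;
        this.maxPointToPointMillis = maxPointToPointMillis;
    }

    void add(Proposal proposal) { buffered.add(proposal); }

    // Release proposals only once any conflicting message must have arrived:
    // hold each for the clock skew plus the longest point-to-point latency,
    // less the latency already incurred reaching this replica. Releasing in
    // timestamp order lets the replica acknowledge every released proposal
    // for fast path participation.
    void processReady(long nowMillis)
    {
        while (!buffered.isEmpty())
        {
            Proposal next = buffered.peek();
            long holdMillis = maxClockSkewMillis + maxPointToPointMillis - next.coordinatorLatencyMillis;
            if (nowMillis - next.arrivalMillis < holdMillis)
                break;
            process(buffered.poll());
        }
    }

    void process(Proposal proposal) { /* preaccept in timestamp order */ }
}
```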

This approach can be modified further to await only those messages from coordinators that require our participation to reach fast path agreement (i.e. in heterogeneous setups where some coordinator is farther away but able to reach a quorum nearer to itself, it may be possible to wait only on messages from nearer coordinators that likely depend on our participation for their fast path).

Fast Path Electorate
We define an electorate to be those replicas that may vote on a decision. Ordinarily this is the full replica set, but for fast path decisions it is possible to exclude some replicas, so that the electorate is a proper subset of the full replica set. In this case fewer votes are needed to reach fast-path consensus: specifically, for every two nodes removed from the electorate, one fewer vote is needed. This is based on the observation that fast path quorums are sized to simultaneously intersect any other fast path quorum and any recovery quorum (typically a simple majority); this is where the well-known 3/4-of-replicas quorum size derives from. (EPaxos reduces this by one by requiring that the coordinator is a replica, but does so in a manner that makes cross-shard transactions exceptionally complex and a reorder buffer unworkable, so we discard that particular optimisation; this new approach makes it redundant.)
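To make the arithmetic concrete, here is a minimal sketch of the sizing rule; the formula is our reading of the intersection requirements above, and the names are illustrative:

```java
// Illustrative fast path quorum sizing, derived from the intersection
// requirements above; not copied from the prototype.
final class FastPathSizing
{
    // Any two fast path quorums of size f drawn from an electorate of size e
    // overlap in at least 2f - e members, and a recovery quorum (a simple
    // majority) excludes at most n - majority replicas. Requiring the fast
    // quorum overlap to survive that exclusion gives 2f - e > n - majority.
    static int fastQuorumSize(int replicas, int electorate)
    {
        int majority = replicas / 2 + 1;
        return (electorate + replicas - majority) / 2 + 1;
    }

    public static void main(String[] args)
    {
        // Full electorate of 4 replicas: the familiar 3/4 fast path quorum.
        System.out.println(fastQuorumSize(4, 4)); // 3
        // Removing two members of a 5-replica electorate saves one vote.
        System.out.println(fastQuorumSize(5, 5)); // 4
        System.out.println(fastQuorumSize(5, 3)); // 3
    }
}
```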

To demonstrate that this approach may always reach consensus under maximal failure, we can first observe that in this eventuality only a single simple quorum is reachable. We may configure the fast path electorate to contain only the members of this one simple quorum, so that all fast and slow path operations must use this same quorum. By definition, this one quorum must overlap with any other simple quorum we might use for recovery, and also by definition overlaps with every other fast path quorum (as there is only one possible fast path quorum in this configuration). Since all of the other nodes are offline, we lose nothing by restricting our fast path agreement to this single quorum, but gain optimal failure properties by being able to do so.
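Concretely, using the sizing sketch above: with five replicas of which only one simple majority of three remains reachable, configuring the electorate to be exactly those three nodes gives a fast path quorum of (3 + 5 − 3)/2 + 1 = 3 votes (integer division), i.e. precisely that majority, so fast path agreement remains available under the worst tolerated failure.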

Accord
In order to exploit the above innovations, we must select a timestamp protocol. To ensure reads are scalable we must also derive dependencies, so Caesar is the only candidate in the literature. Unfortunately Caesar has suboptimal characteristics: ACKs to older transactions may be delayed by newer ones (in theory indefinitely), and three round-trips may be required. Fortunately this can be remedied. Caesar appears to suffer this fate because it assembles a precise set of dependencies, so that only one possible set of dependencies may be committed (and perhaps also because of its recovery protocol). Accord instead assembles an inconsistent set of dependencies. By inconsistent we mean that the set may differ between coordinators, and we are OK with different sets of dependencies being committed to different replicas. We require only that all of these sets are a superset of those transactions that may be committed with a lower timestamp. For example, if transactions a and b conflict and b may commit with a lower timestamp than a, then every dependency set committed for a must contain b; the sets may otherwise differ freely.

Preliminaries
We use hybrid logical clocks that are globally unique; that is, each replica has a unique id that is appended to each logical clock value it produces.
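A minimal sketch of such a timestamp, assuming only what is stated above (a logical clock value with the replica’s unique id appended as a tiebreaker; names are illustrative):

```java
// Illustrative globally unique hybrid logical clock timestamp; field names
// are assumptions, not the prototype's representation.
final class Timestamp implements Comparable<Timestamp>
{
    final long hlc;   // hybrid logical clock value (wall clock merged with a logical counter)
    final int nodeId; // unique replica id, appended so no two replicas produce equal timestamps

    Timestamp(long hlc, int nodeId)
    {
        this.hlc = hlc;
        this.nodeId = nodeId;
    }

    @Override
    public int compareTo(Timestamp that)
    {
        int byClock = Long.compare(this.hlc, that.hlc);
        return byClock != 0 ? byClock : Integer.compare(this.nodeId, that.nodeId);
    }
}
```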

Fast Path
A coordinator C proposes a timestamp t0 to at least a quorum of a fast path electorate. If t0 is larger than all timestamps witnessed for all prior conflicting transactions, t0 is accepted by a replica. If a fast path quorum of responses accepts, the transaction is agreed to execute at t0. Replicas respond with the set of transactions they have witnessed that may execute with a lower timestamp, i.e. those with a lower t0.
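A sketch of the replica-side rule, reusing the illustrative Timestamp above; the method and field names are assumptions, and conflict detection is left as a placeholder:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Illustrative replica-side preaccept; names are assumptions, not the prototype's API.
class Replica
{
    final Map<String, Timestamp> witnessed = new ConcurrentHashMap<>(); // txnId -> highest timestamp witnessed

    static class PreAcceptReply
    {
        final boolean fastPathAccepted;
        final List<String> deps; // conflicting transactions that may execute with a lower timestamp
        PreAcceptReply(boolean accepted, List<String> deps) { this.fastPathAccepted = accepted; this.deps = deps; }
    }

    PreAcceptReply preAccept(String txnId, Timestamp t0)
    {
        boolean accept = true;
        List<String> deps = new ArrayList<>();
        for (Map.Entry<String, Timestamp> prior : witnessed.entrySet())
        {
            if (!conflicts(txnId, prior.getKey()))
                continue;
            // t0 is accepted only if larger than every timestamp witnessed
            // for a prior conflicting transaction; on refusal the replica
            // would also return a higher t (see Slow Path below)
            if (prior.getValue().compareTo(t0) >= 0)
                accept = false;
            else
                deps.add(prior.getKey()); // witnessed with a lower t0: a potential dependency
        }
        witnessed.put(txnId, t0);
        return new PreAcceptReply(accept, deps);
    }

    boolean conflicts(String a, String b) { return true; /* placeholder: key/range intersection */ }
}
```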

Slow Path
If a replica refuses to accept t0, it responds with a t higher than any other it has witnessed. If C fails to reach fast path consensus it takes the highest t it witnessed among its responses, which constitutes a simple Lamport clock value imposing a valid total order. This value is proposed to at least a simple majority of nodes, along with the union of the dependencies received in the preaccept phase. This round’s purpose is only to durably record which of the possible Lamport clock values was selected (multiple valid values might be derived, depending on which responses C received), so that if C crashes a recovery coordinator will pick the same timestamp. The dependencies are included in the proposal solely to facilitate recovery of other transactions that may be incomplete; they are stored on each replica to inform decisions at recovery. As a result, replicas always accept the proposal (unless a newer ballot has been issued by a recovery coordinator to take over the transaction), and once a majority have accepted it the timestamp is durably decided. Replicas respond with a new set of dependencies containing those transactions they have witnessed with a lower t0 than the t they received. The coordinator discards the dependencies it received on the fast path and uses these new dependencies for execution.
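A sketch of the coordinator’s choice between the two paths, again with illustrative names; the replicas’ replies are modelled per the description above:

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Illustrative coordinator-side path selection; names are assumptions.
class Coordinator
{
    static class PreAcceptResponse
    {
        final boolean fastPathAccepted;
        final Timestamp t;      // t0 if accepted, otherwise a t higher than any the replica witnessed
        final Set<String> deps; // conflicting transactions witnessed with a lower t0
        PreAcceptResponse(boolean ok, Timestamp t, Set<String> deps) { this.fastPathAccepted = ok; this.t = t; this.deps = deps; }
    }

    Timestamp decide(Timestamp t0, List<PreAcceptResponse> responses, int fastQuorum)
    {
        long accepts = responses.stream().filter(r -> r.fastPathAccepted).count();
        if (accepts >= fastQuorum)
            return t0; // fast path: agreed to execute at t0

        // Slow path: the highest witnessed t is a Lamport clock value imposing
        // a valid total order; propose it, with the union of the preaccept
        // dependencies, to a simple majority so recovery picks the same timestamp.
        Timestamp t = t0;
        Set<String> deps = new HashSet<>();
        for (PreAcceptResponse r : responses)
        {
            if (r.t.compareTo(t) > 0)
                t = r.t;
            deps.addAll(r.deps);
        }
        accept(t, deps);
        return t;
    }

    // Second round: replicas always accept absent a newer ballot, and reply
    // with the transactions they witnessed with a lower t0 than t, which the
    // coordinator then uses for execution in place of the fast path deps.
    void accept(Timestamp t, Set<String> deps) { /* propose to at least a simple majority */ }
}
```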