Versions Compared

Key

  • This line was added.
  • This line was removed.
  • Formatting was changed.

...

Current state: Under Discussion

Discussion threadhttps://lists.apache.org/thread/h25skwkbdztz9hj2pxtgh39rnjfzckk7

JIRA

Released:

Please keep the discussion on the mailing list rather than commenting on the wiki (wiki discussions get unwieldy fast).

Motivation

Cluster metadata in Cassandra comprises a number of disparate elements including, but not limited to, distributed schema, topology and token ownership. Following the general design principles of Cassandra, the mechanisms for coordinating updates to cluster state have favoured eventual consistency, with probabilisitic delivery via gossip being a prime example. Undoubtedly, this approach has benefits, not least in terms of resilience, particularly in highly fluid distributed environments. However, this is not the reality of most Cassandra deployments, where the total number of nodes is relatively small (i.e. in the low thousands) and the rate of change tends to be low.  

Historically, a significant proportion of issues affecting operators and users of Cassandra have been due, at least in part, to a lack of strongly consistent cluster metadata. In response to this, we propose a design which aims to provide linearizability of metadata changes whilst ensuring that the effects of those changes are made visible to all nodes in a strongly consistent manner. At its core, it is also pluggable, enabling Cassandra-derived projects to supply their own implementations if desired.

Probabilistic propagation of cluster state is suboptimal and not necessary, given the scale that Cassandra typically operates at. Not to mention it is (by definition) unreliable and, since there is no strict order for changes propagated through gossip, there can be no universal "reference" state at any given time. We aim to add a mechanism for establishing the order of state change events, and replace gossip as the primary mechanism of their dissemination in order to enable nodes which are participating in state-changing operations to detect when these intersect with other ongoing operations, such as reads and writes. The first two areas we plan to apply these changes to are Schema and Token Ownership. Establishing global order for the events requires us to add transactional semantics to changes in the state.

Token Ownership

Besides the fact that gossip propagation is slow and unreliable, using gossip for state changes leads to short-term replication factor violations. For example, a gossip-backed bootstrap process allows streaming to be started before there's a guarantee that any coordinator will make a write towards the bootstrapping replica; similarly, during bootstrap it is possible for the effects of the write operation, followed by the read to the same replica set, not to be visible. While a problem with streaming is a relatively minor reduction in durability guarantees, a problem with disagreeing read/write sets is a transient data loss, a consistency violation that likely goes unnoticed on a regular basis. Worse yet, there is no guaranteed upper bound for a difference in how much to coordinators may diverge from each other, which means that a coordinator may try to read from, or write to replicas the replicas that do not own the range at all.

...