...

Each of these mechanisms has its own implementation, yet none of them can guarantee consistency out of the box during network partitioning. This document suggests a design that eliminates duplicated logic and provides robust building blocks for cluster metadata management and cache protocols.

...

  • SWIM [3], an eventually consistent protocol widely used by multiple systems, with a Java implementation available [4].
  • RAPID [5], a novel consistent group membership protocol, with a Java implementation also available [6].

Cluster and node lifecycles

As described in IEP-55, nodes in a cluster rely on the local and distributed metastorages during operation. The metastorages are used as follows:

  • Local metastorage initialization. This step is executed before the very first startup of any Ignite 3.0 node and may be implicit (the local metastorage is initialized with default settings in this case). Upon initialization, the local metastorage contains no information about the distributed metastorage.
  • Once a node starts, the discovery (group membership) protocol starts working to maintain the list of online nodes.
  • Until the distributed metastorage is initialized, the nodes in a cluster remain in a 'limbo' state.
  • The distributed metastorage is initialized either by an administrator command or using a pre-specified set of nodes. During the initialization, an initial metastorage Raft group is created.
  • All nodes record the last known configuration of the metastorage Raft group. If such a configuration has been recorded, the node attempts to communicate with the distributed metastorage using it.
  • The changes in cluster membership do not directly affect the configuration of the distributed metastorage Raft group, nor the configuration of the partitions. Node unavailability reported via the cluster membership should be treated as a hint for other layers, but the Raft group can still periodically contact the unavailable member in an attempt to make progress under partial network partitions (since the SWIM protocol is eventually consistent).
  • All nodes subscribe to the metastorage update feed, which delivers updates to the distributed metastorage as a linearizable, consistent sequence of metastorage modifications. Along with the metastorage Raft group configuration changes, it contains the changes to the cluster configuration (such as the list of caches, their configuration, partition assignments, etc.). Each node reacts to the changes from the distributed metastorage by adjusting its local configuration and performing the corresponding actions (creating/destroying tables, moving partitions, etc.), as illustrated by the sketch after this list.
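
As an illustration of the last step, the sketch below shows how a node could subscribe to the distributed metastorage update feed and apply the changes locally. All interface and method names here are hypothetical and only serve to illustrate the flow; they are not the final Ignite 3.0 API.

// Hypothetical interfaces; the actual Ignite 3.0 metastorage API may differ.
interface MetastorageUpdate {
    long revision();   // Monotonically increasing revision of the distributed metastorage.
    String key();      // Changed configuration key, e.g. a table's partition assignments.
    byte[] value();    // New value, or null if the key was removed.
}

interface MetastorageWatch {
    /** Delivers a linearizable sequence of metastorage modifications starting from the given revision. */
    void listen(long fromRevision, java.util.function.Consumer<MetastorageUpdate> onUpdate);
}

class ConfigurationApplier {
    private final MetastorageWatch watch;
    private volatile long appliedRevision;

    ConfigurationApplier(MetastorageWatch watch, long lastAppliedRevision) {
        this.watch = watch;
        this.appliedRevision = lastAppliedRevision;
    }

    void start() {
        // Resume from the last locally applied revision so that no update is missed or applied twice.
        watch.listen(appliedRevision + 1, update -> {
            applyLocally(update);                 // Create/destroy tables, move partitions, etc.
            appliedRevision = update.revision();
        });
    }

    private void applyLocally(MetastorageUpdate update) {
        // React to the change depending on the key (table list, table configuration, assignments, ...).
    }
}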

...

  • If one or more nodes from the metastorage Raft group go offline, but the Raft group remains available (a majority of the nodes are online), the cluster remains available. The metastorage group can be reconfigured to include a new member (to mitigate the risk of further unavailability) either automatically (similar to baseline auto-adjust) or manually by an administrator command.
  • If so many nodes of the metastorage Raft group go offline that the group loses its quorum (majority), the metastorage becomes unavailable. Further configuration changes to the cluster temporarily fail, but table operations keep working as long as the corresponding partitions are available. Once a majority of the metastorage Raft group becomes available again, changes to the cluster configuration become available as well.
  • If a majority of the metastorage Raft group members is permanently lost, it is impossible to restore the metastorage availability without potential data loss. In this case, the cluster is moved to the recovery mode and the metastorage Raft group is re-initialized using one of the available members as the 'golden source'. These three cases are summarized by the sketch after this list.
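
The sketch below is a simplified model (not production code) that maps the number of online and permanently lost metastorage group members to the three states described in the list above:

enum MetastorageState { AVAILABLE, UNAVAILABLE, NEEDS_RECOVERY }

final class MetastorageAvailability {
    /**
     * Classifies metastorage availability for a Raft group of {@code groupSize} members.
     * A majority (quorum) of members must be online to make progress; if a majority is
     * permanently lost, the group has to be re-initialized in the recovery mode.
     */
    static MetastorageState classify(int groupSize, int onlineMembers, int permanentlyLostMembers) {
        int quorum = groupSize / 2 + 1;

        if (permanentlyLostMembers >= quorum)
            return MetastorageState.NEEDS_RECOVERY; // Majority lost forever: re-init from a 'golden source'.

        if (onlineMembers >= quorum)
            return MetastorageState.AVAILABLE;      // Cluster configuration changes keep working.

        return MetastorageState.UNAVAILABLE;        // Changes fail until a majority comes back online.
    }
}

For example, for a five-node metastorage group the quorum is three, so the group tolerates two offline members and becomes unavailable when three or more are offline.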

Application to caches protocol

Partition Replication

The replication protocol can be used as an abstraction that hides primary-backup replication, so that upper layers work with partition objects regardless of how many nodes the data is replicated to. In this case, the atomic cache operations become trivial CRUD operations replicated by the protocol. Moreover, there is no need to differentiate between atomic and transactional caches, as multi-partition transactions become a set of operations that are applied to partitions (and are, in turn, replicated within each partition).

Additionally, since log replication and operation application are separated, the latency of an update will not depend on the complexity of the operation itself (for example, the number of secondary indexes used for a given table).
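
The sketch below illustrates this abstraction under assumed interface names (they are not the actual Ignite 3.0 replication API): upper layers submit CRUD commands to a partition object, and the replication protocol decides how, and to how many nodes, the command is replicated.

import java.util.concurrent.CompletableFuture;

/** A command that is replicated as an opaque log entry and then applied to the partition state machine. */
interface Command { }

record UpsertCommand(byte[] key, byte[] value) implements Command { }

record RemoveCommand(byte[] key) implements Command { }

/**
 * Hypothetical partition replication facade: callers do not know (or care) how many
 * nodes the command reaches before the returned future completes.
 */
interface ReplicatedPartition {
    /** Completes once the command is committed by the replication protocol (Raft, PacificA, ...). */
    CompletableFuture<Void> replicate(Command cmd);
}

class AtomicPutExample {
    static CompletableFuture<Void> put(ReplicatedPartition partition, byte[] key, byte[] value) {
        // An atomic cache operation becomes a trivial replicated CRUD command.
        return partition.replicate(new UpsertCommand(key, value));
    }
}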

Among others, the replication module provides the following interfaces to the storage layer:

  • A stream of committed operations that are guaranteed to be the same on all nodes in the replication group. Each operation in the stream is assigned a unique, monotonically increasing, gap-free index by the replication module. The stream is durable and can be re-read as many times as needed, as long as the operations have not been compacted by a user/system request. The committed operations can be applied to the storage asynchronously, as long as read operations are properly synchronized with the storage state so that reads observe the state only after the required operations have been applied.
  • A readIndex() operation that returns the most up-to-date committed (durable) operation index in the replication group, providing linearizability. readIndex() performance varies from protocol to protocol: in canonical Raft, this operation involves a round-trip from the current leader to a majority of the group members, while in PacificA the primary node can return the index locally as long as its primary lease is valid. If a node has received and applied an operation with an index at least as large as the one returned by readIndex(), the state can be safely read locally (see the sketch below).
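
The sketch below (hypothetical method names, kept consistent with the sketch in the Partition Replication section) shows how a linearizable local read can be built from readIndex(): obtain the committed index, wait until the local state machine has applied at least that index, and only then read the local state.

import java.util.concurrent.CompletableFuture;

/** Hypothetical view of the replication module exposed to the storage layer. */
interface ReplicationGroup {
    /** Index of the most up-to-date committed operation (a round-trip to the majority in canonical Raft). */
    CompletableFuture<Long> readIndex();
}

interface PartitionStorage {
    /** Completes when the local state machine has applied all operations up to the given index. */
    CompletableFuture<Void> waitForAppliedIndex(long index);

    byte[] get(byte[] key);
}

class LinearizableRead {
    static CompletableFuture<byte[]> get(ReplicationGroup group, PartitionStorage storage, byte[] key) {
        return group.readIndex()                                   // 1. Ask for the committed (durable) index.
            .thenCompose(idx -> storage.waitForAppliedIndex(idx)   // 2. Wait until it is applied locally.
                .thenApply(v -> storage.get(key)));                // 3. Now the read can be served from the local state.
    }
}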

Check the IEP-99: Pluggable replication page for details on a pluggable replication protocol.

Atomic protocol

The main idea behind the atomic protocol in Ignite 2.x was performance: users do not always need transactions and the associated coordination overhead.

Nevertheless, the atomic protocol implementation in Ignite 2.x has multiple flaws: no atomicity for batch operations (de facto atomicity per key), best-effort consistency, side effects during retries of non-idempotent operations, and the presence of PartialUpdateException for manual retries when an automatic retry is not possible. Moreover, if a cache was created in ATOMIC mode, no transaction can be run on it due to protocol incompatibility.

It looks like we need only one cache type in Ignite 3, with strong guarantees, for a better user experience.

The new transaction protocol will work faster for single-partition batches, committing as soon as all entries are replicated. This eliminates the performance issues related to coordination overhead and makes the atomic protocol obsolete.
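
As a rough illustration of the single-partition fast path (a sketch reusing the hypothetical Command and ReplicatedPartition interfaces from the Partition Replication section, not the actual IEP-91 design), a batch confined to one partition can be submitted as a single replicated command and acknowledged as soon as the replication module reports it committed:

import java.util.List;
import java.util.concurrent.CompletableFuture;

record BatchUpsertCommand(List<byte[]> keys, List<byte[]> values) implements Command { }

class SinglePartitionBatch {
    /**
     * For a batch confined to a single partition there is no cross-partition coordination:
     * the commit acknowledgment is the replication acknowledgment itself.
     */
    static CompletableFuture<Void> commit(ReplicatedPartition partition,
                                          List<byte[]> keys,
                                          List<byte[]> values) {
        return partition.replicate(new BatchUpsertCommand(keys, values));
    }
}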

Initial data loading

A special case worth mentioning is initial data loading. This scenario must be executed as fast as possible, so a cache can be moved to a special state that allows it to use another protocol with weaker guarantees for initial data loading, while disallowing concurrent transactions during the process.

This is a topic for a separate IEP.

Transactional protocol

The transaction protocol is built on top of partition replication.

IEP-91: Transaction protocol

Data Structures Building Block

...

  1. Raft Consensus Protocol
  2. PacificA Replication Protocol
  3. SWIM Group Membership Protocol
  4. Java implementation of SWIM
  5. RAPID Group Membership Protocol
  6. Java implementation of RAPID


Tickets

Jira issues: project = IGNITE and labels in (iep-61) (ASF JIRA).