ID	IEP-72
Author	Vladislav Pyatkov
Sponsor	Vladislav Pyatkov
Created	12 Apr 2021
Status	DRAFT

For Ignite 3.x, the concept is formulated for the distributed table. The table is the base component that stores and updates data in the cluster. The table provides a basic guarantee of consistency for data writes and reads.

Motivation

Every distributed table structure needs redundant data storage to avoid losing entries when one or more members of the structure go down. Moreover, the data that remains available should always be consistent, even while only a part of the data is available in the structure.

The atomic protocol should make it possible to maintain the data redundancy level and keep the data consistent until all copies of the data are lost.

Interface

Table creation

Table creation requires the following parameters to be specified for the protocol:

Number of partitions - the total number of parts the data is distributed among in the cluster.
Number of replicas - the redundancy level, that is, the desired number of copies of each partition (1 means no redundancy: the cluster holds a single copy of each partition).
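
As an illustration, the two parameters can be captured in a small value object. This is only a sketch: the class `TableParameters`, its fields, and the validation rules are hypothetical and are not part of the Ignite API.

```java
/** Hypothetical holder for the two protocol-related table parameters (not an Ignite class). */
final class TableParameters {
    /** Total number of partitions the table data is split into. */
    final int partitions;

    /** Desired number of copies of each partition (1 means no redundancy). */
    final int replicas;

    TableParameters(int partitions, int replicas) {
        if (partitions < 1) {
            throw new IllegalArgumentException("partitions must be >= 1: " + partitions);
        }
        if (replicas < 1) {
            throw new IllegalArgumentException("replicas must be >= 1: " + replicas);
        }
        this.partitions = partitions;
        this.replicas = replicas;
    }

    /** Returns true when each partition has at least one extra copy. */
    boolean isRedundant() {
        return replicas > 1;
    }
}
```

For example, `new TableParameters(1024, 3)` would describe a table split into 1024 partitions with three copies of each.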

Key value interface

The familiar interface for atomic storage in a table is available through the Key-Value view[1] of a table:

Key Value interface

public interface KeyValueView<K, V>

Batch methods do not have atomicity guarantees; they are added to optimize network communication.

It is an analogue of the Ignite cache interface from Ignite 2.x.
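
The difference between single-key and batch operations can be illustrated with a minimal in-memory sketch. The names `SimpleKeyValueView` and `InMemoryKeyValueView` are hypothetical, and the method set is an assumption: this page shows only the `KeyValueView<K, V>` declaration, not its methods.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical, simplified stand-in for a key-value view; not the Ignite API. */
interface SimpleKeyValueView<K, V> {
    V get(K key);

    void put(K key, V value);

    void putAll(Map<K, V> entries);
}

/** In-memory sketch: each single-key operation is atomic, a batch as a whole is not. */
final class InMemoryKeyValueView<K, V> implements SimpleKeyValueView<K, V> {
    private final Map<K, V> store = new ConcurrentHashMap<>();

    @Override
    public V get(K key) {
        return store.get(key);
    }

    @Override
    public void put(K key, V value) {
        store.put(key, value);
    }

    @Override
    public void putAll(Map<K, V> entries) {
        // Applied entry by entry: a concurrent reader may observe only part of the batch.
        for (Map.Entry<K, V> e : entries.entrySet()) {
            put(e.getKey(), e.getValue());
        }
    }
}
```

In this sketch `putAll` saves the caller from issuing one call per key, but it gives no all-or-nothing guarantee for the whole batch.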

Implementation

Partition consistency

All replicas of a partition will be served by one RAFT group (implemented in IEP-61[2]). All synchronization guarantees between replicas will be provided by the RAFT protocol.

Since RAFT elects its leader on its own, there is no difference between primary and backup replicas: all replicas are equal for the atomic protocol implementation.

Partition distribution

All partitions should be distributed across the cluster as evenly as possible to balance the load. For this purpose we will use an implementation of the Rendezvous affinity function (similar to the one used in Ignite 2.x).

The function is calculated once in the cluster and the result is stored in the Distributed metastorage. All nodes get the distribution and use it locally (no need to recalculate) before the table becomes available.
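
The idea of rendezvous (highest random weight) assignment can be sketched as follows. This is a simplified illustration, not the Ignite implementation: the class `RendezvousSketch`, the string node names, and the bit-mixing weight function are all assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

/** Simplified rendezvous (highest random weight) assignment; illustrative only. */
final class RendezvousSketch {
    /** Mixes a partition number and a node name into a pseudo-random weight. */
    static long weight(int partition, String node) {
        long h = node.hashCode() * 31L + partition;
        // SplitMix64-style finalizer to spread the bits (an arbitrary choice for this sketch).
        h ^= h >>> 30;
        h *= 0xbf58476d1ce4e5b9L;
        h ^= h >>> 27;
        h *= 0x94d049bb133111ebL;
        h ^= h >>> 31;
        return h;
    }

    /** Returns the nodes with the highest weights for the partition, at most {@code replicas} of them. */
    static List<String> assign(int partition, List<String> nodes, int replicas) {
        List<String> sorted = new ArrayList<>(nodes);
        sorted.sort(Comparator
            .comparingLong((String n) -> weight(partition, n))
            .reversed()
            .thenComparing(Comparator.<String>naturalOrder())); // deterministic tie-break
        return new ArrayList<>(sorted.subList(0, Math.min(replicas, sorted.size())));
    }

    /** Computes the assignment of every partition once; the result can then be stored and shared. */
    static List<List<String>> distribution(int partitions, List<String> nodes, int replicas) {
        List<List<String>> result = new ArrayList<>(partitions);
        for (int p = 0; p < partitions; p++) {
            result.add(assign(p, nodes, replicas));
        }
        return result;
    }
}
```

Because each weight depends only on the partition and the node, removing a node changes the assignment only for the partitions that node was serving. The list returned by `distribution` corresponds to the value that would be calculated once and then read by every node.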

Mapping entry to partition

Every table entry will have two parts:

Key part (affinity part) - a set of unique columns. It can be interpreted as the primary key.
Value part - all the other columns.

The key part is used to calculate which partition will store the entry.
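
A minimal sketch of such a mapping is shown below. The hash function is not specified on this page, so the sketch assumes the simplest scheme: a hash of the key columns taken modulo the number of partitions. The class and method names are hypothetical.

```java
import java.util.Arrays;

/** Illustrative key-to-partition mapping; the real hash function may differ. */
final class PartitionMapper {
    /** Maps the key-part column values to a partition number in [0, partitions). */
    static int partitionFor(Object[] keyColumns, int partitions) {
        int hash = Arrays.hashCode(keyColumns);
        // floorMod keeps the result non-negative even for negative hash codes.
        return Math.floorMod(hash, partitions);
    }
}
```

Only the key columns are passed in, so two entries with the same key part always map to the same partition regardless of their value part.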

Flow description

Table creation starts through the public API. At that time the partition distribution is calculated, and it will be available on each node by the time the table is returned to the client code.

Every invocation of the table API determines a set of data entries, each of which is mapped to a partition by its key part. The distribution determines the RAFT group for a specific partition; every partition update is transformed into a RAFT command and applied through the RAFT group API.
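
The flow described above can be sketched end to end with a stub in place of a real RAFT group. All names here (`UpsertCommand`, `StubRaftGroup`, `TableSketch`) are hypothetical, the stub performs no replication or leader election, and the key-to-partition step assumes a plain hash modulo the partition count.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical command describing one partition update. */
final class UpsertCommand {
    final String key;
    final String value;

    UpsertCommand(String key, String value) {
        this.key = key;
        this.value = value;
    }
}

/** Stub standing in for a RAFT group: applies commands in order to local state, no consensus. */
final class StubRaftGroup {
    private final Map<String, String> state = new HashMap<>();
    private final List<UpsertCommand> log = new ArrayList<>();

    synchronized void apply(UpsertCommand cmd) {
        log.add(cmd); // a real group would replicate the log entry before applying it
        state.put(cmd.key, cmd.value);
    }

    synchronized String read(String key) {
        return state.get(key);
    }

    synchronized int logSize() {
        return log.size();
    }
}

/** Routes every table operation to the group that serves the partition of the key. */
final class TableSketch {
    private final StubRaftGroup[] groups;

    TableSketch(int partitions) {
        groups = new StubRaftGroup[partitions];
        for (int i = 0; i < partitions; i++) {
            groups[i] = new StubRaftGroup();
        }
    }

    int partitionOf(String key) {
        return Math.floorMod(key.hashCode(), groups.length);
    }

    StubRaftGroup groupOf(String key) {
        return groups[partitionOf(key)];
    }

    void put(String key, String value) {
        groupOf(key).apply(new UpsertCommand(key, value));
    }

    String get(String key) {
        return groupOf(key).read(key);
    }
}
```

Each `put` touches exactly one group, so updates to keys in different partitions do not need to coordinate with each other in this sketch.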

Links

