
IDIEP-72

Author:
Sponsor:
Created:
Status: DRAFT

For Ignite 3.x the concept is formulated for the distributed table. The table is the base object that allows storing and updating data in the cluster. The base guarantee it provides is the consistency of data entry writes and reads.

Motivation

All distributed table structures are required to store data redundantly to avoid losing entries when one of the members of the structure goes down. Moreover, the data that is available should be consistent at all times, even while only a part of the data is available in the structure.

The atomic protocol should provide the ability to store data redundantly and keep it consistent until all copies of the data are lost from the cluster.

Interface

Table creation

To create a table, you need to specify two parameters for the protocol's purposes (see the hypothetical sketch after this list):

  • Partitions - the number of parts the data will be divided into across the cluster.
  • Replicas - the number of copies of each partition that will be created (1 means the cluster will have only one copy, without redundancy).
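The exact table-creation API is outside the scope of this IEP, so the snippet below is only a hypothetical sketch: the ignite.tables().createTable call and the partitions/replicas builder methods are assumed names, used only to illustrate where the two parameters are supplied.

Table creation (hypothetical sketch)
// Hypothetical sketch only: the real table-creation API is not defined by this IEP.
// It illustrates that both protocol parameters are supplied at table creation time.
Table accounts = ignite.tables().createTable(
    "PUBLIC.accounts",
    tableCfg -> tableCfg
        .partitions(1024)  // the data is split into 1024 parts across the cluster
        .replicas(3)       // each partition is stored in 3 copies (1 = no redundancy)
);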

Key value interface

The familiar interface for atomic storage in a table is available through the Key-Value view [1] of the table:

Key Value interface
public interface KeyValueView<K, V>

Batch methods do not provide atomicity guarantees across keys; they are added only to optimize network communication.

It is an analogue of the Ignite cache interface from Ignite 2.x.
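A minimal usage sketch of the Key-Value view is shown below. The single-key and batch method names (get, put, getAll, putAll) are assumptions used for illustration; see KeyValueView.java [1] for the actual interface.

Key Value view usage (sketch)
import java.util.Map;
import java.util.Set;

// Sketch of working with a Key-Value view of a table; method names are assumed,
// see KeyValueView.java [1] for the actual interface.
void example(KeyValueView<Long, String> kvView) {
    kvView.put(1L, "value-1");   // single-entry write: atomic for this key
    String val = kvView.get(1L); // single-entry read: returns a consistent value

    // Batch calls save network round-trips but are NOT atomic across keys:
    // each entry is still updated atomically on its own partition.
    kvView.putAll(Map.of(2L, "value-2", 3L, "value-3"));
    Map<Long, String> vals = kvView.getAll(Set.of(1L, 2L, 3L));
}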

Implementation

Partition consistency

The replicas of each partition are served by one RAFT group (implemented in IEP-61 [2]). All synchronization guarantees between replicas are provided by the RAFT protocol.

Since RAFT elects the leader on its own, there is no distinction between primary and backup replicas: all replicas are equal from the point of view of the atomic protocol implementation.

Partition distribution

All partitions should be distributed across the cluster as evenly as possible to balance the load. For this purpose we will use an implementation of the Rendezvous affinity function (similar to the one used in Ignite 2.x).

The function is calculated once in the cluster and the result is stored in the distributed metastorage; every node gets the distribution from there and uses it locally (no need to recalculate it) before the table becomes available.
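The sketch below shows the general idea of rendezvous (highest-random-weight) assignment: for every partition, the nodes are ordered by a hash of the (partition, node) pair, and the top replicas nodes become the owners of that partition. It is not the actual Ignite affinity function, only an illustration of the technique.

Rendezvous assignment (illustrative sketch)
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.stream.Collectors;

// Illustration of rendezvous (highest-random-weight) assignment, not the real Ignite code.
public class RendezvousSketch {
    /** Returns, for each partition, the list of nodes that own its replicas. */
    public static List<List<String>> assign(List<String> nodes, int partitions, int replicas) {
        List<List<String>> assignment = new ArrayList<>(partitions);

        for (int p = 0; p < partitions; p++) {
            int part = p;

            // Order nodes by the weight hash(partition, node) and take the top 'replicas' of them.
            List<String> owners = nodes.stream()
                .sorted(Comparator.comparingLong((String n) -> weight(part, n)).reversed())
                .limit(replicas)
                .collect(Collectors.toList());

            assignment.add(owners);
        }

        return assignment;
    }

    /** Mixes the partition number and the node identifier into a pseudo-random weight. */
    private static long weight(int part, String node) {
        long h = part * 0x9E3779B97F4A7C15L ^ node.hashCode();
        h ^= h >>> 33;
        h *= 0xFF51AFD7ED558CCDL;
        return h ^ (h >>> 29);
    }
}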

Mapping entry to partition

Every table entry consists of two parts:

  • Key part (affinity part): a set of columns that uniquely identifies the entry. It can be interpreted as the primary key.
  • Value part: the remaining columns.

The key part is used to calculate which partition will store the entry.
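For illustration, a simple way to map the key part onto a partition is to hash the affinity columns and take the result modulo the number of partitions. The concrete hash function used by Ignite is not specified here; the snippet only shows the principle.

Key to partition mapping (sketch)
import java.util.Arrays;

// Sketch: hash the affinity (key) columns and map the hash onto the configured partitions.
// The concrete hash function is an assumption, not the one used by Ignite.
static int partitionFor(Object[] affinityColumns, int partitions) {
    int hash = Arrays.hashCode(affinityColumns);
    return Math.floorMod(hash, partitions); // non-negative partition index
}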

Flow description

Table creation starts through the public API. During this time the partition distribution is calculated, and it is available on every node by the time the table is returned to the client code.

Every invocation of the table API determines a set of data entries, each of which is mapped to a partition by the key part of the entry. The distribution determines the RAFT group for a specific partition; every partition update is transformed into a RAFT command and applied through the RAFT group API.
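The write path can be summarized by the sketch below. All class and method names in it are illustrative stand-ins, not the real Ignite API: the key is mapped to a partition, the distribution gives the RAFT group serving that partition, and the update is wrapped into a command and applied through that group.

Write path (illustrative sketch)
import java.util.List;
import java.util.concurrent.CompletableFuture;

// Illustrative write path; the names below are stand-ins, not the real Ignite API.
class WritePathSketch<K, V> {
    /** Minimal stand-in for a client of the RAFT group that serves one partition. */
    interface RaftGroup {
        CompletableFuture<Void> run(Object command);
    }

    /** Illustrative replication command carrying a single key-value update. */
    static class UpsertCommand {
        final Object key;
        final Object value;

        UpsertCommand(Object key, Object value) {
            this.key = key;
            this.value = value;
        }
    }

    /** One RAFT group per partition, taken from the calculated distribution. */
    private final List<RaftGroup> groups;

    WritePathSketch(List<RaftGroup> groups) {
        this.groups = groups;
    }

    CompletableFuture<Void> put(K key, V value) {
        int partId = Math.floorMod(key.hashCode(), groups.size()); // key part -> partition
        RaftGroup group = groups.get(partId);                      // distribution -> RAFT group
        return group.run(new UpsertCommand(key, value));           // update is replicated via RAFT
    }
}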

Links

  1. KeyValueView.java
  2. IEP-61 Common Replication Infrastructure

Tickets


