
Clustering and Partitioning 

 

Picture 1.

A cluster in Ignite is a set of interconnected nodes where server nodes (Node 0, Node 1, etc. in Picture 1) are organized in a logical ring and client nodes (applications) connect to them. Both server and client nodes can execute the full range of APIs supported by Ignite. The main difference between the two is that client nodes do not store data.
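As a minimal sketch of this topology, here is how one server node and one client node can be started with Ignite's Java API (the instance names are arbitrary assumptions for the example):

  import org.apache.ignite.Ignite;
  import org.apache.ignite.Ignition;
  import org.apache.ignite.configuration.IgniteConfiguration;

  // Start a server node: it joins the ring and stores data.
  Ignite server = Ignition.start(new IgniteConfiguration()
      .setIgniteInstanceName("server-node"));

  // Start a client node: it joins the topology and can run the same APIs,
  // but it does not store any data.
  Ignite client = Ignition.start(new IgniteConfiguration()
      .setIgniteInstanceName("client-node")
      .setClientMode(true));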

Data stored on the server nodes is represented in the form of key-value pairs. The pairs, in turn, are located in specific partitions which belong to individual Ignite caches, as shown in Picture 2:

Picture 2.
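The key-to-partition mapping is exposed through the public Affinity API. A small sketch (the cache name "myCache" and the key are assumptions for the example):

  import org.apache.ignite.Ignite;
  import org.apache.ignite.cache.affinity.Affinity;

  // Assuming an Ignite instance 'ignite' and an existing cache "myCache".
  Affinity<Integer> affinity = ignite.affinity("myCache");

  // The affinity function deterministically maps every key to a partition.
  int part = affinity.partition(42);
  System.out.println("Key 42 lives in partition " + part);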


To ensure data consistency and comply with the high-availability principle, server nodes are capable of storing both primary and backup copies of data. There is always a primary copy of a partition with all of its key-value pairs in the cluster, and there might be 0 or more backup copies of the same partition depending on the configuration parameters.

Each cluster node (servers and clients) is aware of all primary and backup copies of every partition. This information is collected by the coordinator (the oldest server node) and broadcast to all the nodes via internal partition map exchange messages.

However, all data-related requests/operations (get, put, SQL, etc.) go to primary partitions, except for some read operations when CacheConfiguration.readFromBackup is enabled. If it is an update operation (put, INSERT, UPDATE), then Ignite ensures that both the primary and backup copies are updated and stay in a consistent state.
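For example, the number of backup copies and reads from backups are controlled per cache. A configuration sketch (the cache name and key/value types are assumptions):

  import org.apache.ignite.IgniteCache;
  import org.apache.ignite.cache.CacheAtomicityMode;
  import org.apache.ignite.cache.CacheMode;
  import org.apache.ignite.configuration.CacheConfiguration;

  CacheConfiguration<Integer, String> cfg = new CacheConfiguration<>("myCache");

  cfg.setCacheMode(CacheMode.PARTITIONED);                // data is split into partitions
  cfg.setBackups(1);                                      // 1 backup copy of every partition
  cfg.setReadFromBackup(true);                            // allow reads from backup copies
  cfg.setAtomicityMode(CacheAtomicityMode.TRANSACTIONAL); // required for transactions

  // Assuming an Ignite instance 'ignite'.
  IgniteCache<Integer, String> cache = ignite.getOrCreateCache(cfg);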

Transactions

This section dives into the details of the Ignite transactional protocol. High-level principles and features are described in the Ignite technical documentation.

Two-Phase Commit Protocol

 

A single transaction in a distributed system usually spans several server nodes, which imposes additional requirements for the sake of data consistency. For instance, it is obligatory to detect and handle situations when a transaction was not fully committed due to a partial outage or loss of cluster nodes. Ignite relies on the two-phase commit protocol to handle this and many other situations in order to ensure data consistency cluster-wide.

As the protocol name suggests, a transaction is executed in two phases. The "prepare" phase goes first: 

Picture 3.
 
  1. The transaction coordinator (aka near node, the application that runs a transaction) sends a "prepare" message (step 1) to all primary nodes participating in the transaction.
  2. The primary nodes forward the message to the nodes that hold backup copies of the data, if any (step 2), and acquire all the required data locks.
  3. The backup nodes acknowledge (step 3) to the primary nodes that the locks are acquired.
  4. The primary nodes acknowledge (step 4) to the coordinator that all the required locks are acquired and they are ready to commit the transaction.


Right after that, the transaction coordinator executes the second phase by sending a "commit" message: 

Picture 4.
 

Once the backup and primary copies are updated, the transaction coordinator receives the acknowledgment and considers the transaction finished.

This is how the two-phase commit works in a nutshell. Below we will see how the protocol tolerates failures, how it distinguishes pessimistic and optimistic transactions, and many other details. 
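The following simplified sketch mirrors the message flow above from the coordinator's point of view. All types and methods here are hypothetical, introduced only for illustration; this is not Ignite's internal API:

  import java.util.List;

  // Hypothetical participant interface: a primary node in the transaction.
  interface Participant {
      boolean prepare(long txId); // acquire locks, forward to backups, then acknowledge
      void commit(long txId);     // apply changes on the primary and backup copies
      void rollback(long txId);   // release locks and discard pending changes
  }

  class Coordinator {
      // Phase 1: "prepare" on every primary; phase 2: "commit" or roll back.
      boolean run(List<Participant> primaries, long txId) {
          for (Participant p : primaries) {
              if (!p.prepare(txId)) {                     // a lock could not be acquired
                  primaries.forEach(x -> x.rollback(txId));
                  return false;
              }
          }
          primaries.forEach(p -> p.commit(txId));         // all participants acknowledged
          return true;
      }
  }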

Near Node and Remote Node

The transaction coordinator is also known as the near node among Ignite community members and committers. The transaction coordinator (near node) initiates a transaction, tracks its state, sends the "prepare" and "commit" messages, and orchestrates the overall transaction process. Usually, the coordinator is a client node that connects your application to the cluster. The application calls methods such as txStart(), cache.put()/get(), and tx.commit(), and the client node takes care of the rest, as shown below:

Picture 5.
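On the application side, such a transaction typically looks like the sketch below (the cache name is an assumption); the node that executes this code plays the near-node role:

  import org.apache.ignite.Ignite;
  import org.apache.ignite.IgniteCache;
  import org.apache.ignite.transactions.Transaction;

  // Assuming an Ignite instance 'ignite' and a transactional cache "myCache".
  IgniteCache<Integer, String> cache = ignite.cache("myCache");

  try (Transaction tx = ignite.transactions().txStart()) {
      String value = cache.get(1);
      cache.put(1, value + "-updated");
      tx.commit(); // kicks off the "prepare" and "commit" phases described above
  }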

In addition to the transaction coordinator, the transactional protocol defines remote nodes: server nodes that keep a part of the data being accessed or updated inside the transaction. Internally, every server node maintains a distributed hash table (DHT) for the partitions it owns. The DHT helps to look up partition owners (primaries and backups) efficiently from any cluster node, including the transaction coordinator. Note that the data itself is stored in pages that are arranged in a B+Tree (refer to the memory architecture documentation for more details).
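The owner lookup is also available to applications through the Affinity API; a sketch (the cache name and key are assumptions):

  import java.util.Collection;
  import org.apache.ignite.cache.affinity.Affinity;
  import org.apache.ignite.cluster.ClusterNode;

  // Assuming an Ignite instance 'ignite' and an existing cache "myCache".
  Affinity<Integer> affinity = ignite.affinity("myCache");

  // Returns the primary node first, followed by the backup nodes.
  Collection<ClusterNode> owners = affinity.mapKeyToPrimaryAndBackups(42);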

Locking Modes and Isolation Levels


This section elaborates on the locking modes and isolation levels supported in Ignite. The high-level overview is given in the Ignite technical documentation.

Pessimistic Locking
 

In multi-user applications, different users can modify the same data simultaneously. To deal with reads and updates of the same data sets happening in parallel, transactional subsystems of products such as Ignite implement optimistic and pessimistic locking. In the pessimistic mode, an application acquires locks for all the data it plans to change and applies the changes only after all the locks are owned exclusively. In the optimistic mode, lock acquisition is postponed to a later phase, when the transaction is already being committed.

Lock acquisition time also depends on the isolation level. Let's start with a review of the isolation levels in conjunction with the pessimistic mode.

 

In the pessimistic mode with the read committed isolation level, the locks are acquired before the changes brought by write operations (such as put or putAll) are applied, as shown below:

 
Picture 6.

However, Ignite never obtains locks for read operations (such as get or getAll) in this mode, which might be inappropriate for some use cases. To ensure the locks are acquired prior to every read or write operation, use either the repeatable read or serializable isolation level in Ignite:
Picture 7.
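Concurrency mode and isolation level are chosen when the transaction is started. A sketch of a pessimistic repeatable read transaction (the cache is an assumption, as in the earlier examples):

  import org.apache.ignite.IgniteCache;
  import org.apache.ignite.transactions.Transaction;
  import static org.apache.ignite.transactions.TransactionConcurrency.PESSIMISTIC;
  import static org.apache.ignite.transactions.TransactionIsolation.REPEATABLE_READ;

  // Assuming an Ignite instance 'ignite' and a transactional cache "myCache".
  IgniteCache<Integer, String> cache = ignite.cache("myCache");

  try (Transaction tx = ignite.transactions().txStart(PESSIMISTIC, REPEATABLE_READ)) {
      String value = cache.get(1); // the lock on key 1 is acquired by this read
      cache.put(1, value + "!");   // the lock is already held
      tx.commit();                 // locks are released once the transaction finishes
  }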
 

The pessimistic mode holds locks until the transaction is finished, which prevents other transactions from accessing the locked data. Optimistic transactions in Ignite might increase the throughput of an application by lowering contention among transactions, since lock acquisition is moved to a later phase.

Optimistic Locking

In optimistic transactions, locks are acquired on primary nodes during the "prepare" phase, then promoted to backup nodes, and released once the transaction is committed. Depending on the isolation level, if Ignite detects that the version of an entry has changed since the time it was requested by the transaction, then the transaction fails at the "prepare" phase, and it is up to the application to decide whether to restart it.




For the read committed and repeatable read isolation levels, locks are acquired during the "prepare" phase, but no check is performed to verify that the entry version has not changed since the beginning of the transaction. The version check described above is performed only for the serializable isolation level.
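With the serializable level, a conflicting version change surfaces as a TransactionOptimisticException, and the application typically retries. A sketch (the cache name is an assumption):

  import org.apache.ignite.IgniteCache;
  import org.apache.ignite.transactions.Transaction;
  import org.apache.ignite.transactions.TransactionOptimisticException;
  import static org.apache.ignite.transactions.TransactionConcurrency.OPTIMISTIC;
  import static org.apache.ignite.transactions.TransactionIsolation.SERIALIZABLE;

  // Assuming an Ignite instance 'ignite' and a transactional cache "myCache".
  IgniteCache<String, Integer> cache = ignite.cache("myCache");

  while (true) {
      try (Transaction tx = ignite.transactions().txStart(OPTIMISTIC, SERIALIZABLE)) {
          Integer counter = cache.get("counter");
          cache.put("counter", counter == null ? 1 : counter + 1);
          tx.commit(); // the version check happens during the "prepare" phase
          break;       // committed successfully
      }
      catch (TransactionOptimisticException e) {
          // Another transaction changed the entry concurrently - retry.
      }
  }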


 

 



