...

Clustering and Partitioning 

 

Figure 1.

A cluster in Ignite is a set of interconnected nodes where server nodes (Node 0, Node 1, etc. in Figure 1) are organized in a logical ring and client nodes (applications) are connected to them. Both server and client nodes can execute the full range of APIs supported in Ignite. The main difference between the two is that client nodes do not store data.
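For illustration, here is a minimal sketch (assuming Ignite's Java API; the instance names are arbitrary) showing that the only startup difference between the two roles is the client-mode flag:

import org.apache.ignite.Ignite;
import org.apache.ignite.Ignition;
import org.apache.ignite.configuration.IgniteConfiguration;

// Server node: joins the logical ring and stores data.
Ignite server = Ignition.start(new IgniteConfiguration()
    .setIgniteInstanceName("server-node"));

// Client node: runs the same APIs but does not store data.
Ignite client = Ignition.start(new IgniteConfiguration()
    .setIgniteInstanceName("client-node")
    .setClientMode(true));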

Data stored on the server nodes is represented as key-value pairs. The pairs, in their turn, are located in specific partitions which belong to individual Ignite caches, as shown in Figure 2.

Figure 2.


To ensure data consistency and comply with the high-availability principle, server nodes are capable of storing primary as well as backup copies of data. Basically, there is always a primary copy of a partition with all its key-value pairs in the cluster, and there may be 0 or more backup copies of the same partition depending on the configuration parameters.

Every cluster node (server or client) is aware of all primary and backup copies of every partition. This information is collected and broadcast to all the nodes from a coordinator (the oldest server node) via internal partition map exchange messages.

However, all data-related requests and operations (get, put, SQL, etc.) go to primary partitions, except for some read operations when CacheConfiguration.readFromBackup is enabled. If it's an update operation (put, INSERT, UPDATE), then Ignite ensures that both the primary and backup copies are updated and stay in a consistent state.
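Both the number of backup copies and the read-from-backup behavior are controlled through the cache configuration. Here is a minimal sketch (assuming Ignite's Java API; the cache name "accounts" is arbitrary) of a cache with one backup copy per partition, reads allowed from backups, and transactional semantics:

import org.apache.ignite.Ignite;
import org.apache.ignite.IgniteCache;
import org.apache.ignite.Ignition;
import org.apache.ignite.cache.CacheAtomicityMode;
import org.apache.ignite.configuration.CacheConfiguration;

Ignite ignite = Ignition.start();   // or reuse an already running node

CacheConfiguration<Integer, Integer> cfg = new CacheConfiguration<Integer, Integer>("accounts")
    .setBackups(1)                                          // keep 1 backup copy of every partition
    .setReadFromBackup(true)                                // allow reads to be served from backups
    .setAtomicityMode(CacheAtomicityMode.TRANSACTIONAL);    // enable transactional semantics

IgniteCache<Integer, Integer> cache = ignite.getOrCreateCache(cfg);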

Transactions

This section dives into the details of the Ignite transactional protocol. High-level principles and features are described in the Ignite technical documentation.

...

A single transaction in a distributed system usually spans several server nodes, which imposes additional requirements for the sake of data consistency. For instance, it is obligatory to detect and handle situations when a transaction was not fully committed due to a partial outage or the loss of cluster nodes. Ignite relies on the two-phase commit protocol to handle this and many other situations in order to ensure data consistency cluster-wide.

As the protocol name suggests, a transaction is executed in two phases. The "prepare" phase goes first, as shown in Figure 3.

Figure 3.
 
  1. The transaction coordinator (also known as the near node, i.e., the application that runs the transaction) sends a "prepare" message (Step 1) to all primary nodes participating in the transaction.
  2. The primary nodes forward the message to the nodes that hold a backup copy of the data, if any (Step 2), and acquire all the required data locks.
  3. The primary nodes acknowledge (Step 4) that all the required locks are acquired and that they are ready to commit the transaction.


Next, the transaction coordinator executes the second phase by sending a "commit" message, as shown in Figure 4.

Figure 4.
 

Once the backup and primary copies are updated, the transaction coordinator receives an acknowledgement and assumes that the transaction is finished.

This is how the two-phase commit protocol works in a nutshell. Below we will see how the protocol tolerates failures, distinguishes pessimistic and optimistic transactions, and does many other things.
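From the application's perspective, the whole protocol is hidden behind the transaction API: both phases described above run when tx.commit() is invoked. A minimal sketch (assuming a running node referenced by ignite and the transactional "accounts" cache from the earlier sketch, with key 1 preloaded; all of these are illustrative):

import org.apache.ignite.IgniteCache;
import org.apache.ignite.transactions.Transaction;

IgniteCache<Integer, Integer> cache = ignite.cache("accounts");

try (Transaction tx = ignite.transactions().txStart()) {
    Integer balance = cache.get(1);    // read inside the transaction
    cache.put(1, balance + 100);       // update inside the transaction

    tx.commit();    // the "prepare" and "commit" phases are executed here
}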

Near Node and Remote Node

The transaction coordinator is also known as a near node in the Ignite community and among committers. The transaction coordinator (near node) initiates a transaction, tracks its state, sends out "prepare" and "commit" messages, and orchestrates the overall transaction process. Usually, the coordinator is a client node that connects our applications to the cluster. An application triggers tx.call(), cache.put(), cache.get(), and tx.commit() methods and the client node takes care of the rest, as shown in Figure 5.

Figure 5.

In addition to the transaction coordinator, the transaction protocol defines remote nodes, which are server nodes that keep a part of the data being accessed or updated inside the transaction. Internally, every server node maintains a Distributed Hash Table (DHT) for the partitions it owns. The DHT helps to look up partition owners (primary and backups) efficiently from any cluster node, including the transaction coordinator. Note that the data itself is stored in pages that are arranged by a B+Tree (refer to the memory architecture documentation for more details).
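As an illustration of these lookups, partition owners can also be inspected from application code through the public Affinity API. A minimal sketch (the cache name "accounts", the key 42 and the ignite handle are illustrative):

import java.util.Collection;

import org.apache.ignite.cache.affinity.Affinity;
import org.apache.ignite.cluster.ClusterNode;

Affinity<Integer> aff = ignite.affinity("accounts");

int partition = aff.partition(42);             // partition the key maps to
ClusterNode primary = aff.mapKeyToNode(42);    // current primary owner of the key

// Primary and backup owners of the partition, primary first.
Collection<ClusterNode> owners = aff.mapPartitionToPrimaryAndBackups(partition);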

...

This section elaborates on the locking modes and isolation levels supported in Ignite. A high-level overview is also available in the Ignite technical documentation.

...

 

In multi-user applications, different users can modify the same data simultaneously. To deal with reads and updates of the same data sets happening in parallel, transaction subsystems of products such as Ignite implement optimistic and pessimistic locking. In the pessimistic mode, an application acquires locks for all the data it plans to change and applies the changes after all the locks are owned exclusively, while in the optimistic mode lock acquisition is postponed to a later phase, when the transaction is being committed.

Lock acquisition time also depends on the isolation level. Let's start with a review of the isolation levels in conjunction with the pessimistic mode.

 

In the pessimistic and read committed mode, locks are acquired before the changes brought by write operations (such as put or putAll) are applied, as shown in Figure 6.

 
Figure 6.

However, Ignite never obtains locks for read operations (such as get or getAll) in this mode, which might be inappropriate for some use cases. To ensure the locks are acquired prior to every read or write operation, use either the repeatable read or serializable isolation level in Ignite, as shown in Figure 7.

Figure 7.
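A minimal sketch of a pessimistic repeatable read transaction (assuming the ignite and cache handles from the earlier sketches and preloaded keys 1 and 2): here the locks are acquired already on the reads rather than on the writes.

import org.apache.ignite.transactions.Transaction;

import static org.apache.ignite.transactions.TransactionConcurrency.PESSIMISTIC;
import static org.apache.ignite.transactions.TransactionIsolation.REPEATABLE_READ;

try (Transaction tx = ignite.transactions().txStart(PESSIMISTIC, REPEATABLE_READ)) {
    Integer acct1 = cache.get(1);    // lock on key 1 is acquired here
    Integer acct2 = cache.get(2);    // lock on key 2 is acquired here

    cache.put(1, acct1 - 100);       // already locked, no extra lock acquisition
    cache.put(2, acct2 + 100);

    tx.commit();    // locks are released once the transaction is finished
}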
 

The pessimistic mode holds locks until the transaction is finished, which prevents access to the locked data from other transactions. Optimistic transactions in Ignite might increase the throughput of an application by lowering contention among transactions, since lock acquisition is moved to a later phase.

Optimistic Locking

In optimistic transactions, locks are acquired on primary nodes during the "prepare" phase, then promoted to backup nodes and released once the transaction is committed. Depending on the isolation level, if Ignite detects that the version of an entry has changed since the time it was requested by the transaction, then the transaction will fail at the "prepare" phase and it will be up to the application to decide whether to restart the transaction. This is exactly how optimistic and serializable transactions (also known as deadlock-free transactions) work in Ignite, as shown in Figure 8.


Figure 8.
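A minimal sketch of the retry pattern for optimistic serializable transactions (the retry policy, the ignite and cache handles and the key are illustrative): if another transaction changed the version of an entry read here, commit() fails during the "prepare" phase and the application restarts the transaction.

import org.apache.ignite.transactions.Transaction;
import org.apache.ignite.transactions.TransactionOptimisticException;

import static org.apache.ignite.transactions.TransactionConcurrency.OPTIMISTIC;
import static org.apache.ignite.transactions.TransactionIsolation.SERIALIZABLE;

while (true) {
    try (Transaction tx = ignite.transactions().txStart(OPTIMISTIC, SERIALIZABLE)) {
        Integer balance = cache.get(1);
        cache.put(1, balance + 100);

        tx.commit();    // entry versions are validated during the "prepare" phase
        break;          // success
    }
    catch (TransactionOptimisticException e) {
        // A concurrent transaction changed the entry - simply retry.
    }
}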

On the other hand, repeatable read and read committed optimistic transactions never check whether the version of an entry has changed. This mode (Figure 9) might bring extra performance benefits but does not give any atomicity guarantees and, thus, is rarely used in practice.

Figure 9.

Transaction Lifecycle

Now let's review the entire lifecycle of a transaction in Ignite. We will assume that the cluster is stable and no outages happen.

Figure 10.

In Figure 10, the transaction starts once the tx.start() method (Step 1) is called by the application, which results in the creation (Step 2) of a structure for transaction context management (also known as IgniteInternalTx). Additionally, the following happens on the near node side:
  
  • A unique transaction identifier is generated.

  • The start time of the transaction is recorded.

  • The current topology version/state is recorded.

Next, the transaction status is set to "active" and Ignite starts executing the read/write operations that are part of the transaction, following the rules of either the optimistic or the pessimistic mode and the specific isolation level.
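Part of this context is observable through the public Transaction API. A minimal sketch (assuming the ignite handle from the earlier sketches):

import org.apache.ignite.transactions.Transaction;

try (Transaction tx = ignite.transactions().txStart()) {
    System.out.println("id:      " + tx.xid());        // the generated transaction identifier
    System.out.println("started: " + tx.startTime());  // the recorded start time
    System.out.println("state:   " + tx.state());      // ACTIVE at this point

    tx.commit();
}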

Transaction Commit

 

When the application executes the tx.commit() method (Step 9 in Figure 10 above), the near node (transaction coordinator) initiates the two-phase commit protocol by preparing the "prepare" message, enclosing information about the transaction's context into it.

As a part of the "prepare" phase, every primary node receives information about all updated or new key-value pairs and about the order in which the locks have to be acquired (the latter depends on the combination of locking mode and isolation level).

The primary nodes, in their turn, perform the following in response to the "prepare" message:

 
  • Check that the version of the cluster topology recorded in the transaction's context matches the current topology version.

  • Obtain all the required locks.

  • Create a DHT context for the transaction and store all the necessary data therein.

  • Depending on the cache configuration parameters, wait or skip waiting while the backup nodes confirm that the "prepare" phase is over.

  • Inform the near node that it's time to execute the "commit" phase.

After that, the near node sends the "commit" message, waits for an acknowledgment and moves the transaction to the "committed" status.

Transaction Rollback


If the transaction is rolled back (tx.rollback() is called by the application), then, in the pessimistic mode, Ignite releases all the acquired locks and deletes the transaction's context. In the optimistic mode, the locks are acquired only when tx.commit() is called by the application; therefore, Ignite simply cleans out the transaction's context on the transaction coordinator (near node).
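A minimal sketch of an explicit rollback (the business condition, the ignite and cache handles and the key are illustrative):

import org.apache.ignite.transactions.Transaction;

try (Transaction tx = ignite.transactions().txStart()) {
    Integer balance = cache.get(1);

    if (balance < 100)
        tx.rollback();    // abort: any acquired locks are released, the context is cleaned out
    else {
        cache.put(1, balance - 100);
        tx.commit();
    }
}
// If neither commit() nor rollback() was called, close() rolls the transaction back.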

Transaction Timeout

Ignite allows setting a timeout for a transaction. If the transaction's execution time exceeds the timeout, then the transaction will be aborted.

In the pessimistic mode, the timeout is compared to the current total execution time every time an entry lock is acquired and when the "prepare" phase is triggered. In the optimistic mode, the timeout is compared only in the "prepare" phase. Figure 11 shows a summary.

Figure 11.


If the total execution time has exceeded the timeout on any of the participating nodes, then the primary node where the timeout elapsed sets a special flag instructing the transaction coordinator to initiate a transaction cancellation.
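The timeout is set when the transaction is started. A minimal sketch (assuming the ignite and cache handles from the earlier sketches; the 3-second timeout and the size hint of 2 entries are illustrative):

import org.apache.ignite.transactions.Transaction;

import static org.apache.ignite.transactions.TransactionConcurrency.PESSIMISTIC;
import static org.apache.ignite.transactions.TransactionIsolation.REPEATABLE_READ;

// txStart(concurrency, isolation, timeout in ms, expected number of entries).
try (Transaction tx = ignite.transactions().txStart(PESSIMISTIC, REPEATABLE_READ, 3_000, 2)) {
    cache.put(1, 100);
    cache.put(2, 200);

    tx.commit();    // the transaction is aborted with an exception if the timeout has elapsed
}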


Failover and Recovery

 

This section explains how Ignite tackles failover situations or outages that might happen while transactions are being executed.

Backup Node Failure

Figure 12.

The simplest failure scenario to tackle is when a backup node fails during either the "prepare" or the "commit" phase. Nothing has to be done by the Ignite transactional subsystem. Transaction modifications will be applied to the remaining primary and backup nodes (Figure 12), and a new backup node for the missing partitions will be elected after the transaction is over; that node will preload all the up-to-date data from the respective primary node.

Primary Node Failures on Prepare Phase

Figure 13.

 

 

If a primary node fails before or during the "prepare" phase, then the transaction coordinator raises an exception (Figure 13), and it's up to the application to decide what to do next: restart the transaction or process the exception differently.
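A minimal sketch of such application-level handling (the retry policy is illustrative, and the exact exception type depends on the failure; here the generic IgniteException is caught):

import org.apache.ignite.IgniteException;
import org.apache.ignite.transactions.Transaction;

for (int attempt = 1; attempt <= 3; attempt++) {
    try (Transaction tx = ignite.transactions().txStart()) {
        Integer balance = cache.get(1);
        cache.put(1, balance + 100);

        tx.commit();
        break;    // success - stop retrying
    }
    catch (IgniteException e) {
        // A participating primary node may have failed - retry on the new topology.
    }
}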

 

Primary Node Failures on Commit Phase

If a primary node fails after the "prepare" phase, then the transaction coordinator waits for an additional NodeFailureDetection response from the respective backup nodes (Figure 14).

Figure 14.


Once the backup nodes detect the failure, they send a message to the transaction coordinator confirming that they successfully committed the changes and that no data loss happened, because there is still an extra backup copy available for application usage.

Next, the transaction coordinator finishes the transaction, the topology changes (due to the primary node loss) and the cluster elects a new primary node for the partitions that were stored on the failed one.

Transaction Coordinator Failure

Handling of transaction coordinator failures is a bit trickier because every remote node (primary and backup) is aware only of the part of the transaction's context related to it and doesn't know the overall transaction state. It is also possible that some of the nodes have already received the "commit" message while the others haven't, as shown in Figure 15.


Figure 15.

To resolve this situation, the primary nodes exchange their internal statuses with each other to find out the overall transaction state. For instance, if one of the nodes responds that it hasn't received the "commit" message, then the transaction will be rolled back globally.

Transaction Guarantees Schematically



The table below summarizes, for the transaction template that follows, which guarantees hold for each combination of concurrency (X) and isolation (Y):

// Open tx with X concurrency and Y isolation.
tx = txStart(X, Y) {

    // Read data inside tx.
    Account acct1 = cache.get(acctId1);
    Account acct2 = cache.get(acctId2);

    // Make changes to the data.
    acct1.update(100);
    acct2.update(200);

    // Store updated data in cache.
    cache.put(acctId1, acct1);
    cache.put(acctId2, acct2);

    // Commit the tx.
    tx.commit();
}

| Guarantee                                                    | Pessimistic, Read_committed | Pessimistic, Repeatable_read / Serializable | Optimistic, Read_committed | Optimistic, Repeatable_read | Optimistic, Serializable |
|--------------------------------------------------------------|-----------------------------|---------------------------------------------|----------------------------|-----------------------------|--------------------------|
| Inconsistent read possible                                   | Yes                         | No                                          | Yes                        | Yes                         | No*                      |
| Read value cached in tx (subsequent reads return same value) | No                          | Yes                                         | No                         | Yes                         | Yes                      |
| Lock obtained on read                                        | No                          | Yes                                         | No                         | No                          | No                       |
| Lock obtained on write                                       | Yes                         | N/A (prelocked)                             | No                         | No                          | No                       |
| Lock obtained on commit                                      | N/A (prelocked)             | N/A (prelocked)                             | Yes                        | Yes                         | Yes**                    |
| Consistency validated on commit                              | No                          | N/A (prelocked)                             | No                         | No                          | Yes                      |


* Even though an inconsistent read is possible within the transaction logic, it will be validated during the commit phase and the whole transaction will be rolled back.

** Even though the locks are acquired during the commit phase, in OPTIMISTIC SERIALIZABLE mode they are not acquired sequentially; individual locks are acquired in parallel. This means that multiple entries are not locked together and deadlocks are not possible.