Problem Description And Test Case

In Ignite 1.x implementation general reads performed out-of-transaction (such as getAll() or SQL SELECT) do not respect transaction boundaries. This problem is two-fold. First, local node transaction visibility is not atomic with respect to multi-entry read. Committed entry version is made visible immediately after entry is updated. Second, there is no visible version coordination when a read involves multiple nodes. Thus, even if local transaction visibility is made atomic, this does not solve the issue.

The problem can be easily described using a test case. Let's say we have a bank system with a fixed number of accounts and we continuously run random money transfers between random pairs of accounts. In this case the sum of account balances is a system invariant and must be the same for any getAll() or SQL query.

General Approach Overview

The main idea is that every node should store not only the current (last) entry value, but also some number of previous values in order to allow consistent distributed reads. To do this, we need to introduce a separate node role - transaction version coordinators - which will be responsible for assigning a monotonically growing transaction version as well as maintaining versions of in-progress transactions and in-progress reads. The last committed transaction ID and IDs of pending transactions define the versions that should be visible for any subsequent read. The IDs of pending reads defines the value versions that are no longer needed and can be discarded.

Version Coordinator(s)

In the initial version of distributed MVCC we will use single transaction coordinator that will define the global transaction order for all transactions in the cluster. The coordinator may be a dedicated node in the cluster. Upon version coordinator failure a new coordinator should be elected in such a way that the new coordinator will start assigning new versions (tx XIDs as well) that is guaranteed to be greater than all previous transaction versions (using two longs: coordinator version which is a topology major version and starting from zero counter).

TxLog

To be able to determine state of TX, which has changed a particular row, a special structure (TxLog) is introduced. TxLog is a table (can be persistent in case persistence enabled) which contains MVCC version to transaction state mappings.

TxLog is used to keep all the data consistent on cluster crush and recovery:

If a particular Tx has ACTIVE or ROLLED_BACK state on at least one data node it marks as ROLLED_BACK on all nodes.
If a particular Tx has COMMITTED state on at least one data node it marks as COMMITTED on all nodes.
If a particular Tx has PREPARED state on all data nodes and all involved partitions are available it marks as COMMITTED on all nodes.
If a particular Tx has PREPARED state on all data nodes and at least one involved partition is lost (unavailable) it is left in PREPARED state, all the entries are locked for updates until state is changed manually or lost partition become available.

Internal Data Structures Changes

BTree leafs structure is changed as follow:

|           key part          |           |         |
|-----------------------------|  lockVer  |   link  |
| cache_id | hash |  mvccVer  |           |         |

cache_id - cache ID if it is a cache in a cache group
hash - key hash
mvccVer - MVCC version of transaction who created the row
lockVer - MVCC version of transaction who holds a lock on the row
link - link to the data

Rows with the same key are placed from newest to oldest.

Index BTree leafs structure is changed as follow:

|    key part    |
|----------------|
| link | mvccVer |

link - link to the data
mvccVer - XID of transaction who created the row

Data row payload structure is changed as follow:

|              |           |         |            |          |           |             |             |             |
| payload size | next_link | mvccVer | newMvccVer | cache_id | key_bytes | value_bytes | row_version | expire_time |
|              |           |         |            |          |           |             |             |             |

mvccVer - TX id which created this row.
newMvccVer - TX id which updated this row or NA in this is the last row version (need to decide whether the row is visible for current reader).

other fields are obvious.

Locks

During DML or SELECT FOR UPDATE tx aquires locks one by one.

If the row is locked by another tx, current tx saves the context (cursor and current position in it) and register itself as a tx state listener As soon as previous tx is committed or rolled back it fires an event. This means all locks, acquired by this tx, are released. So, waiting on locked row tx is notified and continues locking.

TxLog is used to determine lock state, if tx with XID equal to row 'xid' field (see BTree leafs structure) is active, the row is locked by this TX. All newly created rows have 'xid' field value the same as 'ver' field value. Since, as was described above, rows with the same key are placed from newest to oldest, we can determine lock state checking the first version of row only.

Transactional Protocol Changes

All the changes are written into cache (disk) to be visible for subsequent queries/scans in scope of transaction.

2PC (Two Phase Commit) is used for commit procedure but has no TX entries (all the changes are already in cache), it is needed to keep TxLog consistent on all data nodes (tx participants).

Recovery Protocol Changes

There are several participant roles:

MVCC coordinator
TX coorinator
Primary data node
Backup datanode

Each participant may have several roles at the same time.

So, there are steps to recover each type of participant:

On MVCC coordinator failure:

A new coordinator is elected (the oldest server node, may be some additional filters)
During exchange each node sends its active txs.
The new coordinator merges chunks and restores its active Tx list.
After merge is done it continues versions requests processing.

On TX coordinator failure:

Each tx node sends to MVCC coordinator Tx nodes list.
Local tx state is checked
1. If tx in COMMITTED state it adds a local TxLog record with XID and COMMITTED flag and sends a tx committed message to MVCC coordinator (Tx is remooved from the local active Tx list, all locks are released at this point).
2. If tx in ABORTED state it adds a local TxLog record with XID and ABORTED flag and sends a tx rolled back message to MVCC coordinator (Tx is remooved from the local active Tx list, all locks are released at this point).
3. If tx in ACTIVE state on the node it adds a local TxLog record with XID and ABORTED flag and sends a tx rolled back message to MVCC coordinator (Tx is remooved from the local active Tx list, all locks are released at this point).
Other tx participants are checked.
1. If there is at least one node with COMMITTED state it adds a local TxLog record with XID and COMMITTED flag and sends a tx committed message to MVCC coordinator (Tx is remooved from the local active Tx list, all locks are released at this point).
2. If there is at least one node with ABORTED state it adds a local TxLog record with XID and ABORTED flag and sends a tx rolled back message to MVCC coordinator (Tx is remooved from the local active Tx list, all locks are released at this point).
3. If there is at least one node with tx in ACTIVE state it adds a local TxLog record with XID and ABORTED flag and sends a tx rolled back message to MVCC coordinator (Tx is remooved from the local active Tx list, all locks are released at this point).
When MVCC coordinator recieves messages from all Tx participants, Tx is remooved from the active Tx list, all the changes become visible.

There is a special case when Tx coordinator fales right before tx committed message sending

If MVCC coordinator doesn't recieve Tx nodes list from any Tx participant it broadcasts a Tx check message.
When MVCC coordinator recieves acknowledges from all Tx participants, Tx is remooved from the active Tx list, all the changes become visible.

On primary data node failure

If primary node fails during update we may apply some strategy and whether retry statement ignoring previous changes (using cid) or rollback tx.

if primary node fails during prepare we check whether partitions have not been lost and continue commit procedure in case all partitions are still available.

On loss of partition

A local TxLog record with XID and IN_DOUBT state is added on each tx participant. These records are preserved until all data owners rejoins the cluster. Such txs are considered as active for all readers.

On tx datanode rejoin

In case there is a tx IN_DOUBT state on for rejoining node and all data is awailable now, tx is rolled back and removed from the active Tx list on all nodes.

Read (getAll and SQL)

Each read operation outside active transaction creates a special read only transaction and uses its tx snapshot for versions filtering.

RO transactions are added to active Tx lists on reader (near) node and MVCC coordinator.

On reaer node failure all RO txs are removed from the active Tx list on MVCC coordinator.

Each read operation within active transaction uses its tx snapshot for versions filtering (REPEATABLE_READ semantics).

During get operation the first passing MVCC filter item is returned.

During secondary indexes scans 'ver' field of tree item is checked, if row version is visible (the row was added by current or committed tx) 'xid_max' field of referenced data row is checked - the row considered as visible if it is the last version of row 'xid_max' is NA or ACTIVE or ABORTED or higher than assigned.

During primary indexes scans 'ver' field of tree item is checked, if row version is visible (the row was added by current or committed tx) 'xid_max' field of referenced data row is checked - the row considered as visible if it is the last version of row 'xid_max' is NA or ACTIVE or ABORTED or higher than assigned.

During primary indexes scans 'ver' field of tree item is checked, the first passing MVCC filter item is returned, all next versions of row are skipped.

Update (putAll and DML)

Update consist of next steps:

obitain a lock (write current version into 'xid' field)
add a row with new version.
delete aborted versions of row if exist (may be omitted for performance reasons)
update xid_max of previous committed row if exist
delete previous committed versions (less or equal to cleanup version of tx snapshot) if exist (may be omitted for performance reasons)
delete previous committed versions (less or equal to cleanup version of tx snapshot) if exist from all secondary indexes (may be omitted for performance reasons)
add a row with new version to all secondary indexes.

Delete consists of next steps:

obitain a lock (write current version into 'xid' field)
delete aborted versions of row if exist (may be omitted for performance reasons)
update xid_max of previous committed row if exist
delete previous committed versions (less or equal to cleanup version of tx snapshot) if exist (may be omitted for performance reasons)

Page tree

General Approach Overview

Version Coordinator(s)

TxLog

Internal Data Structures Changes

Locks

Transactional Protocol Changes

Recovery Protocol Changes

On MVCC coordinator failure:

On TX coordinator failure:

There is a special case when Tx coordinator fales right before tx committed message sending

On primary data node failure

On loss of partition

On tx datanode rejoin

Read (getAll and SQL)

Update (putAll and DML)

Cleanup of old versions

Page tree

Distributed MVCC And Transactional SQL Design

General Approach Overview

Version Coordinator(s)

TxLog

Internal Data Structures Changes

Locks

Transactional Protocol Changes

Recovery Protocol Changes

On MVCC coordinator failure:

On TX coordinator failure:

There is a special case when Tx coordinator fales right before tx committed message sending

On primary data node failure

On loss of partition

On tx datanode rejoin

Read (getAll and SQL)

Update (putAll and DML)

Cleanup of old versions