Problem Description And Test Case

In Ignite 1.x implementation general reads performed out-of-transaction (such as getAll() or SQL SELECT) do not respect transaction boundaries. This problem is two-fold. First, local node transaction visibility is not atomic with respect to multi-entry read. Committed entry version is made visible immediately after entry is updated. Second, there is no visible version coordination when a read involves multiple nodes. Thus, even if local transaction visibility is made atomic, this does not solve the issue.

In the initial version of distributed MVCC we will use single transaction coordinator that will define the global transaction order for all transactions in the cluster. The coordinator may be a dedicated node in the cluster. Upon version coordinator failure a new coordinator should be elected in such a way that the new coordinator will start assigning new versions (tx XIDs as well) that is guaranteed to be greater than all previous transaction versions (using two longs: coordinator version which is a topology major version and starting from zero counter).

TxLog

To be able to restore all tx states on cluster restart or coordinator failure determine a state of Tx that created a particular row, a special structure (TxLog) is introduced. TxLog is TxLog is a table (can be persistent in case persistence enabled) which contains XID to transactions states [active, preparing, committed, aborted, etc] mappings.

Only MVCC coordinator has the whole table, other nodes have TxLog subset relaited to node-local TXs.

...

MVCC version to transaction state mappings.

TxLog is used to keep all the data consistent on cluster crush and recovery as well:

If a particular Tx has ACTIVE or ROLLED_BACK state on at least one data node it marks as ROLLED_BACK on all nodes.
If a particular Tx has COMMITTED state on at least one data node it marks as COMMITTED on all nodes.
If a particular Tx has PREPARED state on all data nodes and all involved partitions are available it marks as COMMITTED on all nodes.
If a particular Tx has PREPARED state on all data nodes and at least one involved partition is lost (unavailable) it is left in PREPARED state, all the entries are locked for updates until state is changed manually or lost partition become available.

Internal Data Structures Changes

BTree leafs structure is changed as follow:

Code Block

language	text

|           key part          |       |     |    |        |
|-----------------------------|  xidlockVer  |  flags  |  link  |
| cache_id | hash | ver |mvccVer cid |       |        | |        |

cache_id - cache ID if it is a cache in a cache group
hash - key hash
ver - XID of transaction who mvccVer - MVCC version of transaction which has created the row
xid lockVer - xid of transaction who holds MVCC version of transaction which holds a lock on the rowcid - operation counter, the number of operation in transaction this row was changed by.
flags - allows to fast check whether the row visible or not
link - link to the data

other fields are obvious.

Rows with the same key are placed from newest to oldest.

Index BTree leafs structure is changed as follow:

Code Block

language	text

|     key part     |       |
|------------------| flags |
| link | ver | cidmvccVer |       |

...

link - link to the data
ver mvccVer - XID of transaction who created the rowcid - operation counter, the number of operation in tx this row was changed by.
flags - allows to fast check whether the row visible or not

Data row payload structure is changed as follow:

Code Block

language	text

|              |           |         |            |          |           |             |             |             |
| payload size | cache_idnext_link | mvccVer | xid_minnewMvccVer | xidcache_maxid | key_bytes | value_bytes | row_version | expire_time |
|              |           |         |            |          |           |             |             |             |

xid_min mvccVer - TX id which created this row.
xid_max newMvccVer - TX id which updated this row or NA in this is the last row version (used during secondary index scansneed to decide whether the row is visible for current reader).

other fields are obvious.

Locks

During DML or SELECT FOR UPDATE tx aquires UPDATE statements Tx acquires locks one by one.

If the row is locked by another tx, current tx saves the context (cursor and current position in it) and register itself as a tx Tx state listener. As soon as previous tx Tx is committed or rolled back it fires an event. This means all locks, which are acquired by this txTx, are released. So, waiting on locked row tx Tx is notified and continues locking/writing.

TxLog is used to determine lock state, if tx Tx with XID MVCC version equal to row 'xid' field lock version (see BTree leafs structure) is active, the row is locked by this TX. All newly created rows have lock version the same as its MVCC version, so, all newly created rows are locked by Tx, in scope of which they was created.

'xid' field value the same as 'ver' field value. Since, as was described above, rows with the same key are placed from newest to oldest, we can determine lock state checking the first version of row only.

Transactional Protocol Changes

Commit Protocol Changes

Commit

When a tx is started a new version is assighned and MVCC coordinator adds a local TxLog record with XID and ACTIVE flag
the first change request to a datanode within the transaction produces a local TxLog record with XID and ACTIVE flag at the data node.
at the commit stage each tx node adds a local TxLog record with XID and PREPARED flag and sends an acknowledge to TX coordinator
TX coordinator sends to MVCC coordinator node a tx committed message.
MVCC coordinator adds TxLog record with XID and COMMITTED flag, all the changes become visible.
MVCC coordinator sends to participants a commit acknowledged message, all tx datanodes mark tx as COMMITTED, all resources are released.

Note: since commit acknowledge is processed asynchronously, tx which is not active in tx snapshot but at PREPARED state in local TxLog (during read operation) is considered and marked as COMMITTED.

An error during commit

When a tx is started a new version is assighned and MVCC coordinator adds a local TxLog record with XID and ACTIVE flag
the first change request to a datanode within the transaction produces a local TxLog record with XID and ACTIVE flag at the data node.
at the commit stage each tx node adds a local TxLog record with XID and PREPARED flag and sends an acknowledge to TX coordinator
In case at least one participant does not confirm commit, TX coordinator sends to each participant rollback message.
each tx node adds a local TxLog record with XID and ABORTED flag and sends an acknowledge to TX coordinator.
TX coordinator sends to MVCC coordinator node a tx rolled back message.
MVCC coordinator adds TxLog record with XID and ABORTED flag.
MVCC coordinator sends to participants a rollback acknowledged message, all resources are released.

Rollback

When a tx is started a new version is assighned and MVCC coordinator adds a local TxLog record with XID and ACTIVE flag
the first change request to a datanode within the transaction produces a local TxLog record with XID and ACTIVE flag at the data node.
at the rollback stage each tx node adds a local TxLog record with XID and ABORTED flag and sends an acknowledge to TX coordinator
TX coordinator sends to MVCC coordinator node a tx rolled back message.
MVCC coordinator adds TxLog record with XID and ABORTED flag.
MVCC coordinator sends to participants a rollback acknowledged message, all resources are released.

Recovery Protocol Changes

There are several participant roles:

MVCC coordinator
TX coorinator
Primary data node
Backup datanode

Each participant may have several roles at the same time.

So, there are steps to recover each type of participant:

On MVCC coordinator failure:

A new coordinator is elected (the oldest server node, may be some additional filters)
During exchange each node sends its TxLog
The new coordinator merges all the TxLog chunks and checks all local states for each TX. In case data nodes have state conflicts next rules are used:
1. if there is at least one node with TX in ABORTED state tx rollback message is send to all datanodes and whole TX is marked as ABORTED.
2. if there is at least one node with TX in COMMITTED state whole TX is marked as COMMITTED and commit acknowledged message is send to all datanodes.
3. if all datanodes have TX in PREPARED state whole TX is marked as COMMITTED and commit acknowledged message is send to all datanodes.
4. TX cannot be in COMMITTED and ABORTED state at the same time on different nodes. In case a node cannot mark PREPARED tx as COMMITTED this node has to be forcibly stopped.
After merge is done it continues versions requests processing.

On TX coordinator failure:

A new coordinator is elected (an oldest server tx datanode it becames a Tx coordinator)
1. in case the oldest server tx datanode has already finished the tx and released resources, nothing happens (that means that MVCC coordinator whether started acknowleging or failed and tx will be recovered during MVCC coordinator recovery)
A new coordinator checks other nodes:
1. In case at least one nodes has already finished the tx and released resources it does nothing (that means that MVCC coordinator whether started acknowleging or failed and tx will be recovered during MVCC coordinator recovery)
In case the new coordinator has the transaction in ACTIVE state
1. A tx rollback message is send to all tx data nodes.
2. A tx rolled back message is send to MVCC coordinator node.
3. MVCC coordinator sends to participants a rollback acknowledged message, all resources are released.
In case the new coordinator has the transaction in COMMITTED state
1. A tx committed message is send to MVCC coordinator node.
2. MVCC coordinator sends to participants a commit acknowledged message, all tx datanodes mark tx as COMMITTED, all resources are released.
In case the new coordinator has the transaction in ABORTED state:
1. A tx rollback message is send to all tx data nodes.
2. A tx rolled back message is send to MVCC coordinator node.
3. MVCC coordinator sends to participants a rollback acknowledged message, all resources are released.
In case the new coordinator has the transaction in PREPARED state
1. A new coordinator checks all tx data nodes
2. In case all participants have the tx in PREPARED state or at least one tx data node has the tx in COMMITTED state:
  1. A tx committed message is send to MVCC coordinator node.
  2. MVCC coordinator sends to participants a commit acknowledged message, all tx datanodes mark tx as COMMITTED, all resources are released.
3. In case at least one datanode has tx in ABORTED state:
  1. A tx rollback message is send to all tx data nodes.
  2. A tx rolled back message is send to MVCC coordinator node.
  3. MVCC coordinator sends to participants a rollback acknowledged message, all resources are released.

On primary data node failure

If primary node fails during update we may apply some strategy and whether retry statement ignoring previous changes (using cid) or rollback tx.

if primary node fails during prepare we check whether partitions have not been lost and continue commit procedure in case all partitions are still available.

On loss of partition

We add a special meta record (let's say TxDataLostRecord) with tx, node id and list of lost partitions.

This means that we cannot clean out tx info with XID heigher than the lowest XID from the list.

The transaction during which the partition was lost will be aborted or committed in case PREPARE stage has been already compleeted.

The transaction will be fully finished on partition return or manually (deleting corresponding TxDataLostRecord) in case the partition is lost eternally.

During further rebalances nodes in addition to rows sends TxDataLostRecords and tx state records to finish corresponding transactions properly on failed node rejoin.

On tx datanode rejoin

In case there is a TxDataLostRecord for rejoining node, partitions from the record will not be invalidated or rebalanced.

Corresponding transactions are finished on all nodes (using tx->partitions map from tx metadata), TxDataLostRecords are deleted.

Read (getAll and SQL)

Each read operation outside active transaction creates a special read only transaction and uses its tx snapshot for versions filtering.

Each read operation within active READ_COMMITTED transaction creates a special read only transaction and uses its tx snapshot for versions filtering.

Each read operation within active REPEATABLE_READ transaction uses its tx snapshot for versions filtering.

During get operation the first passing MVCC filter item is returned.

During secondary indexes scans 'ver' field of tree item is checked, if row version is visible (the row was added by current or committed tx) 'xid_max' field of referenced data row is checked - the row considered as visible if it is the last version of row 'xid_max' is NA or ACTIVE or ABORTED or higher than assigned.

During primary indexes scans 'ver' field of tree item is checked, the first passing MVCC filter item is returned, all next versions of row are skipped.

All the changes are written into cache (disk) at once to be visible for subsequent queries/scans in scope of transaction.

Two Phase Commit is used for commit procedure but has no Tx entries (all the changes are already in cache), it is needed just to keep TxLog consistent on all data nodes (Tx participants).

Near Tx node has to to notify Version Coordinator about final Tx state to make changes visible for subsequent reads.

Version Coordinator recovery

When MVCC coordinator node fails, a new one is elected among the live nodes – usually the oldest one.

The main goal of the MVCC coordinator failover is to restore an internal state of the previous coordinator in the new one. The internal state of MVCC coordinator consists of two main parts:

Active transactions list.
Active queries list.

Due to Ignite partition map exchange design all write transactions should be finished before topology version is changed. Therefore there is no need to restore active transactions list on the new coordinator because all old transactions are either committed or rolled back during topology changing.

The only thing we have to do – is to recover the active queries list. We need this list to avoid old versions cleanup when there are any old queries are running over this old data because it could lead to query result inconsistency. When all old queries are done we can safely continue cleanup old versions.

To restore active queries at the new coordinator the MvccQueryTracker object was introduced. Each tracker is associated with a single query. The purpose of the tracker is:

To mark each query with an unique id for a solid query tracking.
To hold a query MVCC snapshot.
To report to the new MVCC coordinator about the associated active query in case of old coordinator failure.
To send acks to the new coordinator when the associated query is completed.

Active queries list recovery on the new coordinator looks as follows:

When old coordinator fails, an exchange process started and the new coordinator is elected.
During this process each node sends a list of active query trackers to the new coordinator.
New coordinator combine all those lists to the global one.
When an old query finishes, the associated query tracker sends an ack to the new coordinator.
Coordinator removes this tracker from the global list when ack is received.
When global list becomes empty, this means that all old queries are done and we do not have to hold old date versions in our store – cleanup process begins.

Read (get and SQL)

Each read operation outside an active transaction or in scope of an optimistic transaction gets or uses a previously received Query Snapshot (which considered as read version for optimistic Tx. Note: optimistic transactions cannot be used in scope of DML operations).

All requested snapshots are tracked on Version Coordinator to prevent cleaning up the rows are read.

All received snapshots are tracked on local node for Version Coordinator recovery needs.

Query Snapshot is used for versions filtering (REPEATABLE_READ semantics).

Each read operation in scope of active pessimistic Tx uses its (transaction) snapshot for versions filtering (REPEATABLE_READ semantics).

On failure the node, which requested a Query Snapshot but not sent QueryDone message to Version Coordinator, such snapshot is removed from active queries map. Rows, which are not visible for all other readers, become available for cleaning up.

The row is considered as visible for read operation when it has visible (COMMITTED and in past) MVCC version (create version) and invisible (ACTIVE or ROLLED_BACK or in future) new MVCC version (update version).

Update (put

...

and DML)

Update consist of next steps:

obitain a obtain a lock (write current version into 'xid' field) lockVer field)
delete aborted versions of row if exist (may be omitted for performance reasons)
delete previous committed versions (less or equal to cleanup version of Tx snapshot) if exist (may be omitted for performance reasons)
delete previous committed versions (less or equal to cleanup version of Tx snapshot) if exist from all secondary indexes (may be omitted for performance reasons)
update new MVCC version of previous committed row if exist (newMvccVer field)
add a row with new version
add a row with new version to all secondary indexes.

Delete consists of next steps:

obtain a lock (write current version into lockVer field)
delete aborted versions of row if exist (may be omitted for performance reasons)update xid_max of previous committed row if exist
delete previous committed versions (less or equal to cleanup version of tx snapshot) if exist (may be omitted for performance reasons)
delete previous committed versions (less or equal to cleanup version of tx snapshot) if exist exist from all secondary indexes (may be omitted for performance reasons)
update new MVCC version of previous committed row if exist (newMvccVer field). The row become invisible after Tx commit

Cleanup of old versions

Invisible for all readers rows are cleaned up by writers, as was described above, or by Vacuum procedure (by analogy with PostgreSQL).

During Vacuum all checked rows (which are still visible for at least one reader) are actualized with TxLog by setting special hint bits (most significant bits in MVCC operation counter) which show the state of Tx that created the row.

After all rows are processed, corresponding TxLog records can be deleted as well.

Page tree

Versions Compared

Old Version 19

New Version Current

Key

TxLog

Internal Data Structures Changes

BTree leafs structure is changed as follow:

Index BTree leafs structure is changed as follow:

Data row payload structure is changed as follow:

Locks

Transactional Protocol Changes

Commit Protocol Changes

Recovery Protocol Changes

On MVCC coordinator failure:

On TX coordinator failure:

On primary data node failure

On loss of partition

On tx datanode rejoin

Read (getAll and SQL)

Version Coordinator recovery

Read (get and SQL)

Update (put

and DML)

Update consist of next steps:

Delete consists of next steps:

Cleanup of old versions

Page tree

Page History

Versions Compared

Old Version 19

New Version Current

Key

TxLog

Internal Data Structures Changes

BTree leafs structure is changed as follow:

Index BTree leafs structure is changed as follow:

Data row payload structure is changed as follow:

Locks

Transactional Protocol Changes

Commit Protocol Changes

Recovery Protocol Changes

On MVCC coordinator failure:

On TX coordinator failure:

On primary data node failure

On loss of partition

On tx datanode rejoin

Read (getAll and SQL)

Version Coordinator recovery

Read (get and SQL)

Update (put

and DML)

Update consist of next steps:

Delete consists of next steps:

Cleanup of old versions