Problem Description And Test Case

In Ignite 1.x implementation general reads performed out-of-transaction (such as getAll() or SQL SELECT) do not respect transaction boundaries. This problem is two-fold. First, local node transaction visibility is not atomic with respect to multi-entry read. Committed entry version is made visible immediately after entry is updated. Second, there is no visible version coordination when a read involves multiple nodes. Thus, even if local transaction visibility is made atomic, this does not solve the issue.

The problem can be easily described using a test case. Let's say we have a bank system with a fixed number of accounts and we continuously run random money transfers between random pairs of accounts. In this case the sum of account balances is a system invariant and must be the same for any getAll() or SQL query.

General Approach Overview

The main idea is that every node should store not only the current (last) entry value, but also some number of previous values in order to allow consistent distributed reads. To do this, we need to introduce a separate node role - transaction version coordinators - which will be responsible for assigning a monotonically growing transaction version as well as maintaining versions of in-progress transactions and in-progress reads. The last committed transaction ID and IDs of pending transactions define the versions that should be visible for any subsequent read. The IDs of pending reads defines the value versions that are no longer needed and can be discarded.

Version Coordinator(s)

In the initial version of distributed MVCC we will use single transaction coordinator that will define the global transaction order for all transactions in the cluster. The coordinator may be a dedicated node in the cluster. Upon version coordinator failure a new coordinator should be elected in such a way that the new coordinator will start assigning new versions (tx XIDs as well) that is guaranteed to be greater than all previous transaction versions (using two longs: coordinator version which is a topology major version and starting from zero counter). To be able to restore all tx states on cluster restart or coordinator failure a special structure (TxLog) is introduced. TxLog is a table (can be persistent in case persistence enabled) which contains XID to transactions states [active, preparing, committed, aborted, etc] mappings.

Only MVCC coordinator has the whole table, other nodes have TxLog subset relaited to node-local TXs.

On MVCC coordinator failure new coordinator collects and merges all TxLog subsets from other nodes, after that it starts. At this time MVCC counter cannot be assigned or acknowleged, so that all new and committing TXs are waiting for the operation is completed.

Internal Data Structures Changes

BTree leafs structure is changed as follow:

|           key part          |       |         |        |
|-----------------------------|  xid  |  flags  |  link  |
| cache_id | hash | ver | cid |       |         |        |

cache_id - cache ID if it is a cache in a cache group
hash - key hash
ver - XID of transaction who created the row
xid - xid of transaction who holds a lock on the row
cid - operation counter, the number of operation in transaction this row was changed by.
flags - allows to fast check whether the row visible or not
link - link to the data

Rows with the same key are placed from newest to oldest.

Index BTree leafs structure is changed as follow:

|     key part     |       |
|------------------| flags |
| link | ver | cid |       |

link - link to the data
ver - XID of transaction who created the row
cid - operation counter, the number of operation in tx this row was changed by.
flags - allows to fast check whether the row visible or not

Data row payload structure is changed as follow:

|              |          |         |         |           |             |             |             |
| payload size | cache_id | xid_min | xid_max | key_bytes | value_bytes | row_version | expire_time |
|              |          |         |         |           |             |             |             |

xid_min - TX id which created this row.
xid_max - TX id which updated this row or NA in this is the last row version (used during secondary index scans).

other fields are obvious.

During scans 'ver' field is checked, if row version is visible (the row was added by current or committed tx) 'xid_max' field of referenced data row is checked - the row considered as visible if it is the last version of row ('xid_max' is NA)

Locks

During DML or SELECT FOR UPDATE tx aquires locks one by one.

If the row is locked by another tx, current tx saves the context (cursor and current position in it) and register itself as a tx state listener As soon as previous tx is committed or rolled back it fires an event. This means all locks, acquired by this tx, are released. So, waiting on locked row tx is notified and continues locking.

TxLog is used to determine lock state, if tx with XID equal to row 'xid' field (see BTree leafs structure) is active, the row is locked by this TX. All newly created rows have 'xid' field value the same as 'ver' field value. Since, as was described above, rows with the same key are placed from newest to oldest, we can determine lock state checking the first version of row only.

Transactional Protocol Changes

Commit Protocol Changes

Commit

Wen a tx is started a new version is assighned and MVCC coordinator adds a local TxLog record with XID and ACTIVE flag
the first change request to a datanode within the transaction produces a local TxLog record with XID and ACTIVE flag at the data node.
at the commit stage each tx node adds a local TxLog record with XID and LOCALLY_COMMITTED flag and sends an acknowledge to TX coordinator
TX coordinator sends to MVCC coordinator node a tx committed message.
MVCC coordinator adds TxLog record with XID and COMMITTED flag, all the changes become visible.
TX coordinator sends to participants a commit acknowledged message, nodes asyncronously mark tx as COMMITTED.

An error during commit

Wen a tx is started a new version is assighned and MVCC coordinator adds a local TxLog record with XID and ACTIVE flag
the first change request to a datanode within the transaction produces a local TxLog record with XID and ACTIVE flag at the data node.
at the commit stage each tx node adds a local TxLog record with XID and LOCALLY_COMMITTED flag and sends an acknowledge to TX coordinator
In case at least one participant does not confirm commit, TX coordinator sends to each participant rollback message
each tx node adds a local TxLog record with XID and ABORTED flag and sends an acknowledge to TX coordinator, all the locks became released.
TX coordinator sends to MVCC coordinator node a tx rolled back message.
MVCC coordinator adds TxLog record with XID and ABORTED flag.

Rollback

Wen a tx is started a new version is assighned and MVCC coordinator adds a local TxLog record with XID and ACTIVE flag
the first change request to a datanode within the transaction produces a local TxLog record with XID and ACTIVE flag at the data node.
at the rollback stage each tx node adds a local TxLog record with XID and ABORTED flag and sends an acknowledge to TX coordinator
TX coordinator sends to MVCC coordinator node a tx rolled back message.
MVCC coordinator adds TxLog record with XID and ABORTED flag.

In case MVCC coordinator fails, newly assighned coordinator gets all TxLog subsets and checks that all LOCALLY_COMMITTED tx participants (at least one node from each primary to backup partition mapping) are alive and mark TX as COMMITTED (and sends to participants a commit acknowledged message, nodes asyncronously mark tx as COMMITTED) or LOCALLY_COMMITTED otherwise.

In case there is at least one node with tx in COMMITTED or ABORTED state the transaction marked as COMMITTED or ABORTED on MVCC coordinator and all tx nodes without tx node map check.

At an disconnected node rejoin time the LOCALLY_COMMITTED txs (at mvcc coordinator or rejoining node) are rechecked.

LOCALLY_COMMITTED txs are in active state for all other participants, they retain all aquired locks and in case there are no quorum (one of the members failed permanently) has to be resolved manually.

Recovery Protocol Changes

Read (getAll and SQL changes)

Page tree

Distributed MVCC And Transactional SQL Design