Versions Compared

Key

  • This line was added.
  • This line was removed.
  • Formatting was changed.

Problem Description And Test Case

In Ignite 1.x implementation general reads performed out-of-transaction (such as getAll() or SQL SELECT) do not respect transaction boundaries. This problem is two-fold. First, local node transaction visibility is not atomic with respect to multi-entry read. Committed entry version is made visible immediately after entry is updated. Second, there is no visible version coordination when a read involves multiple nodes. Thus, even if local transaction visibility is made atomic, this does not solve the issue.

...

In the initial version of distributed MVCC we will use single transaction coordinator that will define the global transaction order for all transactions in the cluster. The coordinator may be a dedicated node in the cluster. Upon version coordinator failure a new coordinator should be elected in such a way that the new coordinator will start assigning new versions (tx XIDs as well) that is guaranteed to be greater than all previous transaction versions (using two longs: coordinator version which is a topology major version and starting from zero counter). 

TxLog

To be able to restore all tx states on cluster restart or coordinator failure determine state of TX, which has changed a particular row, a special structure (TxLog) is introduced. TxLog is  TxLog is a table (can be persistent in case persistence enabled) which contains XID to transactions states [active, preparing, committed, aborted, etc] mappings.

Only MVCC coordinator has the whole table, other nodes have TxLog subset relaited to node-local TXs.

...

MVCC version to transaction state mappings.

TxLog is used to keep all the data consistent on cluster crush and recovery:

  • If a particular Tx has ACTIVE or ROLLED_BACK state on at least one data node it marks as ROLLED_BACK on all nodes. 
  • If a particular Tx has COMMITTED state on at least one data node it marks as COMMITTED on all nodes.
  • If a particular Tx has PREPARED state on all data nodes and all involved partitions are available it marks as COMMITTED on all nodes.
  • If a particular Tx has PREPARED state on all data nodes and at least one involved partition is lost (unavailable) it is left in PREPARED state, all the entries are locked for updates until state is changed manually or lost partition become available.

Internal Data Structures Changes

...

Code Block
languagetext
|           key part          |       |       |  |        |
|-----------------------------|  xidlockVer  |  flags  |  link  |
| cache_id | hash | ver |mvccVer cid |       |    |     |        |

 

cache_id - cache ID if it is a cache in a cache group
hash - key hash
ver mvccVer - XID MVCC version of transaction who created the row
xid lockVer - xid of MVCC version of transaction who holds a lock on the row
cid - operation counter, the number of operation in transaction this row was changed by.
flags - allows to fast check whether the row visible or not
link - link to the data

Rows with the same key are placed from newest to oldest.

Index BTree leafs structure is changed as follow:

Code Block
languagetext
|     key part     |       |
|------------------| flags |
| link | ver | cidmvccVer |       |

 

link - link to the data
ver mvccVer - XID of transaction who created the row
cid - operation counter, the number of operation in tx this row was changed by.
flags - allows to fast check whether the row visible or not

Data row payload structure is changed as follow:

Code Block
languagetext
|              |           |         |            |          |           |             |             |             |
| payload size | next_link | xid_minmvccVer | xid_maxnewMvccVer | cache_id | key_bytes | value_bytes | row_version | expire_time |
|              |           |         |            |          |           |             |             |             |

 

xid_min mvccVer - TX id which created this row.
xid_max newMvccVer - TX id which updated this row or NA in this is the last row version (used during secondary index scansneed to decide whether the row is visible for current reader).

other fields are obvious.

...

Transactional Protocol Changes

Commit Protocol Changes

Commit

  1. When a Tx is started a new version is assighned and MVCC coordinator adds a record with XID to its active Tx list.
  2. The first change request to a datanode within the transaction the Tx is added to the local active Tx list.
  3. TX coordinator sends to each datanode node a tx prepare message.
  4. Each tx node adds a local TxLog record with XID and PREPARED flag and sends an acknowledge to Tx coordinator.
  5. TX coordinator sends to each datanode node a tx finish message, 
  6. Each tx node adds a local TxLog record with XID and COMMITTED flag and sends an acknowledge to Tx coordinator (Tx is remooved from the local active Tx list, all locks are released at this point).
  7. TX coordinator sends to MVCC coordinator Tx committed message, Tx is remooved from the active Tx list, all the changes become visible.

An error during commit

  1. When a tx is started a new version is assighned and MVCC coordinator adds a record with XID to its active Tx list.
  2. The first change request to a datanode within the transaction the Tx is added to the local active Tx list.
  3. TX coordinator sends to each datanode node a tx prepare message.
  4. Each tx node adds a local TxLog record with XID and PREPARED flag and sends an acknowledge to Tx coordinator.
  5. In case at least one participant does not confirm commit, TX coordinator sends to each participant a tx rollback message.
  6. Each tx node adds a local TxLog record with XID and ABORTED flag and sends an acknowledge to TX coordinator (Tx is remooved from the local active Tx list, all locks are released at this point).
  7. TX coordinator sends to MVCC coordinator node a tx rolled back message, Tx is remooved from the active Tx list.

Rollback

...

All the changes are written into cache (disk) to be visible for subsequent queries/scans in scope of transaction.

2PC (Two Phase Commit) is used for commit procedure but has no TX entries (all the changes are already in cache), it is needed to keep TxLog consistent on all data nodes (tx participants)

...

.

Recovery Protocol Changes

...