Problem Description And Test Case
In Ignite 1.x implementation general reads performed out-of-transaction (such as getAll() or SQL SELECT) do not respect transaction boundaries. This problem is two-fold. First, local node transaction visibility is not atomic with respect to multi-entry read. Committed entry version is made visible immediately after entry is updated. Second, there is no visible version coordination when a read involves multiple nodes. Thus, even if local transaction visibility is made atomic, this does not solve the issue.
...
In the initial version of distributed MVCC we will use single transaction coordinator that will define the global transaction order for all transactions in the cluster. The coordinator may be a dedicated node in the cluster. Upon version coordinator failure a new coordinator should be elected in such a way that the new coordinator will start assigning new versions (tx XIDs as well) that is guaranteed to be greater than all previous transaction versions (using two longs: coordinator version which is a topology major version and starting from zero counter).
To be able to restore all tx states on cluster restart or coordinator failure determine a state of Tx that created a particular row, a special structure (TxLog) is introduced. TxLog is TxLog is a table (can be persistent in case persistence enabled) which contains XID to transactions states [active, preparing, committed, aborted, etc] mappings.
Only MVCC coordinator has the whole table, other nodes have TxLog subset relaited to node-local TXs.
...
MVCC version to transaction state mappings.
TxLog is used to keep all the data consistent on cluster crush and recovery as well:
Code Block | ||
---|---|---|
| ||
| key part | | | | | |-----------------------------| xidlockVer | flags | link | | cache_id | hash | ver |mvccVer cid | | | | | |
cache_id - cache ID if it is a cache in a cache group
hash - key hash
ver - XID of transaction who mvccVer - MVCC version of transaction which has created the row
xid lockVer - xid of transaction who holds MVCC version of transaction which holds a lock on the rowcid - operation counter, the number of operation in transaction this row was changed by.
flags - allows to fast check whether the row visible or not
link - link to the data
other fields are obvious.
Rows with the same key are placed from newest to oldest.
Code Block | ||
---|---|---|
| ||
| key part | | |------------------| flags | | link | ver | cid | mvccVer | |
...
link - link to the data
ver mvccVer - XID of transaction who created the row
cid - operation counter, the number of operation in tx this row was changed by.
flags - allows to fast check whether the row visible or not
Code Block | ||
---|---|---|
| ||
| | | | | | | | | | | payload size | cache_idnext_link | mvccVer | xid_minnewMvccVer | xidcache_maxid | key_bytes | value_bytes | row_version | expire_time | | | | | | | | | | | |
xid_min mvccVer - TX id which created this row.
xid_max newMvccVer - TX id which updated this row or NA in this is the last row version (used during secondary index scansneed to decide whether the row is visible for current reader).
other fields are obvious.
During scans 'ver' field is checked, if row version is visible (the row was added by current or committed tx) 'xid_max' field of referenced data row is checked - the row considered as visible if it is the last version of row ('xid_max' is NA)
During DML or SELECT FOR UPDATE tx aquires UPDATE statements Tx acquires locks one by one.
If the row is locked by another tx, current tx saves the context (cursor and current position in it) and register itself as a tx Tx state listener. As soon as previous tx Tx is committed or rolled back it fires an event. This means all locks, which are acquired by this txTx, are released. So, waiting on locked row tx Tx is notified and continues locking/writing.
TxLog is used to determine lock state, if tx Tx with XID MVCC version equal to row 'xid' field lock version (see BTree leafs structure) is active, the row is locked by this TX. All newly created rows have lock version the same as its MVCC version, so, all newly created rows are locked by Tx, in scope of which they was created.
'xid' field value the same as 'ver' field value. Since, as was described above, rows with the same key are placed from newest to oldest, we can determine lock state checking the first version of row only.
Commit
An error during commit
Rollback
In case MVCC coordinator fails, newly assighned coordinator gets all TxLog subsets and checks that all LOCALLY_COMMITTED tx participants (at least one node from each primary to backup partition mapping) are alive and mark TX as COMMITTED (and sends to participants a commit acknowledged message, nodes asyncronously mark tx as COMMITTED) or LOCALLY_COMMITTED otherwise.
In case there is at least one node with tx in COMMITTED or ABORTED state the transaction marked as COMMITTED or ABORTED on MVCC coordinator and all tx nodes without tx node map check.
LOCALLY_COMMITTED txs are in active state for all other participants, they retain all aquired locks and in case there are no quorum (one of the members failed permanently) has to be resolved manually.
Since all the changes are written right after lock is aquired, all the participants are ready to commit changes after each successful data change.
We can use next rules for tx recovery on TX coordinator failure:
...
All the changes are written into cache (disk) at once to be visible for subsequent queries/scans in scope of transaction.
Two Phase Commit is used for commit procedure but has no Tx entries (all the changes are already in cache), it is needed just to keep TxLog consistent on all data nodes (Tx participants).
Near Tx node has to to notify Version Coordinator about final Tx state to make changes visible for subsequent reads.
When MVCC coordinator node fails, a new one is elected among the live nodes – usually the oldest one.
The main goal of the MVCC coordinator failover is to restore an internal state of the previous coordinator in the new one. The internal state of MVCC coordinator consists of two main parts:
Due to Ignite partition map exchange design all write transactions should be finished before topology version is changed. Therefore there is no need to restore active transactions list on the new coordinator because all old transactions are either committed or rolled back during topology changing.
The only thing we have to do – is to recover the active queries list. We need this list to avoid old versions cleanup when there are any old queries are running over this old data because it could lead to query result inconsistency. When all old queries are done we can safely continue cleanup old versions.
To restore active queries at the new coordinator the MvccQueryTracker object was introduced. Each tracker is associated with a single query. The purpose of the tracker is:
Active queries list recovery on the new coordinator looks as follows:
Each read operation outside an active transaction or in scope of an optimistic transaction gets or uses a previously received Query Snapshot (which considered as read version for optimistic Tx. Note: optimistic transactions cannot be used in scope of DML operations).
All requested snapshots are tracked on Version Coordinator to prevent cleaning up the rows are read.
All received snapshots are tracked on local node for Version Coordinator recovery needs.
Query Snapshot is used for versions filtering (REPEATABLE_READ semantics).
Each read operation in scope of active pessimistic Tx uses its (transaction) snapshot for versions filtering (REPEATABLE_READ semantics).
On failure the node, which requested a Query Snapshot but not sent QueryDone message to Version Coordinator, such snapshot is removed from active queries map. Rows, which are not visible for all other readers, become available for cleaning up.
The row is considered as visible for read operation when it has visible (COMMITTED and in past) MVCC version (create version) and invisible (ACTIVE or ROLLED_BACK or in future) new MVCC version (update version).
Invisible for all readers rows are cleaned up by writers, as was described above, or by Vacuum procedure (by analogy with PostgreSQL).
During Vacuum all checked rows (which are still visible for at least one reader) are actualized with TxLog by setting special hint bits (most significant bits in MVCC operation counter) which show the state of Tx that created the row.
After all rows are processed, corresponding TxLog records can be deleted as well.
Related documents:
View file name 2017.mvcc.vldb.pdf height 250 View file name concurrency-distributed-databases.pdf height 250 View file name p209-yu.pdf height 250 View file name rethink-mvcc.pdf height 250
Related threads:
Suggestion to improve deadlock detection