...

TxLog

To determine the state of the Tx that created a particular row, a special structure (TxLog) is introduced. TxLog is a table (persistent when persistence is enabled) that maps MVCC versions to transaction states.

TxLog is also used to keep all the data consistent on cluster crash and recovery:

  • If a particular Tx has ACTIVE or ROLLED_BACK state on at least one data node, it is marked as ROLLED_BACK on all nodes.
  • If a particular Tx has COMMITTED state on at least one data node, it is marked as COMMITTED on all nodes.
  • If a particular Tx has PREPARED state on all data nodes and all involved partitions are available, it is marked as COMMITTED on all nodes.
  • If a particular Tx has PREPARED state on all data nodes and at least one involved partition is lost (unavailable), it is left in PREPARED state; all its entries stay locked for updates until the state is changed manually or the lost partition becomes available.
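The recovery rules above can be read as a small decision function. The following is a minimal sketch, not Ignite code: `TxState`, `TxRecoverySketch`, and `resolve` are hypothetical names introduced for illustration. It also assumes that COMMITTED never coexists with ACTIVE or ROLLED_BACK for the same Tx (which a two-phase commit would guarantee), so COMMITTED is simply checked first.

```java
import java.util.Collection;

/** Transaction states named in the recovery rules (hypothetical enum, not an Ignite type). */
enum TxState { ACTIVE, PREPARED, COMMITTED, ROLLED_BACK }

/** Hypothetical helper that applies the four recovery rules to one transaction. */
final class TxRecoverySketch {
    private TxRecoverySketch() {}

    /**
     * @param statesOnNodes state of the Tx as recorded in TxLog on each data node
     * @param allPartitionsAvailable true if every partition involved in the Tx is available
     * @return the state the Tx should end up with on all nodes
     */
    static TxState resolve(Collection<TxState> statesOnNodes, boolean allPartitionsAvailable) {
        if (statesOnNodes.isEmpty())
            throw new IllegalArgumentException("Tx state must be known on at least one node");

        // Rule 2: COMMITTED on at least one node -> COMMITTED everywhere.
        // Assumption: this cannot coexist with ACTIVE/ROLLED_BACK, so the check order is safe.
        if (statesOnNodes.contains(TxState.COMMITTED))
            return TxState.COMMITTED;

        // Rule 1: ACTIVE or ROLLED_BACK on at least one node -> ROLLED_BACK everywhere.
        if (statesOnNodes.contains(TxState.ACTIVE) || statesOnNodes.contains(TxState.ROLLED_BACK))
            return TxState.ROLLED_BACK;

        // Only PREPARED is left. Rule 3: commit if all partitions are available.
        // Rule 4: otherwise stay PREPARED until manual resolution or partition return.
        return allPartitionsAvailable ? TxState.COMMITTED : TxState.PREPARED;
    }
}
```

For example, a Tx that is PREPARED on every node resolves to COMMITTED when all partitions are available and stays PREPARED when one of them is lost.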

...

|           key part          |           |         |
|-----------------------------|  lockVer  |   link  |
| cache_id | hash |  mvccVer  |           |         |

...


mvccVer - MVCC version of transaction which has created the row
lockVer - MVCC version of transaction which holds a lock on the row

...

|    key part    |
|----------------|
| link | mvccVer |

...


link - link to the data
mvccVer - XID of the transaction that created the row

...

|              |           |         |            |          |           |             |             |             |
| payload size | next_link | mvccVer | newMvccVer | cache_id | key_bytes | value_bytes | row_version | expire_time |
|              |           |         |            |          |           |             |             |             |

 


mvccVer - TX id which created this row.
newMvccVer - TX id which updated this row, or NA if this is the last row version (needed to decide whether the row is visible to the current reader).
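To illustrate how the two fields could drive a visibility decision, here is a minimal sketch. It is not the Ignite implementation: `QuerySnapshotSketch`, `RowVisibilitySketch`, and the `NA` constant are hypothetical, versions are simplified to plain `long` values, and every version that is not in the snapshot's active set is assumed to be committed (a real check would also consult TxLog to rule out rolled-back transactions).

```java
import java.util.Set;

/** Hypothetical reader snapshot: an upper version bound plus the Txs still active when it was taken. */
final class QuerySnapshotSketch {
    final long upTo;
    final Set<Long> activeTxs;

    QuerySnapshotSketch(long upTo, Set<Long> activeTxs) {
        this.upTo = upTo;
        this.activeTxs = activeTxs;
    }

    /** A version is seen if it is not newer than the snapshot and was not active at snapshot time. */
    boolean sees(long ver) {
        return ver <= upTo && !activeTxs.contains(ver);
    }
}

final class RowVisibilitySketch {
    /** Hypothetical marker meaning "no newer version of this row exists". */
    static final long NA = 0L;

    private RowVisibilitySketch() {}

    /**
     * A row version is visible if the reader sees the Tx that created it
     * and does not see a Tx that replaced it.
     */
    static boolean isVisible(long mvccVer, long newMvccVer, QuerySnapshotSketch snap) {
        boolean createdVisible = snap.sees(mvccVer);
        boolean replacedVisible = newMvccVer != NA && snap.sees(newMvccVer);

        return createdVisible && !replacedVisible;
    }
}
```

Under these assumptions, a row created by version 5 and updated by a transaction that was still active when the snapshot was taken remains visible to that reader, because the update is not yet part of the snapshot.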

...

The near Tx node has to notify the Version Coordinator about the final Tx state to make changes visible for subsequent reads.

Version Coordinator recovery

When the MVCC coordinator node fails, a new one is elected among the live nodes, usually the oldest one.

The main goal of the MVCC coordinator failover is to restore an internal state of the previous coordinator in the new one. The internal state of MVCC coordinator consists of two main parts:

  • Active transactions list.
  • Active queries list.

Due to the design of Ignite partition map exchange, all write transactions must finish before the topology version changes. Therefore, there is no need to restore the active transactions list on the new coordinator: all old transactions are either committed or rolled back during the topology change.

The only thing we have to do is recover the active queries list. We need this list to avoid cleaning up old versions while old queries are still running over that data, because doing so could make query results inconsistent. When all old queries are done, we can safely resume cleaning up old versions.

To restore active queries at the new coordinator the MvccQueryTracker object was introduced. Each tracker is associated with a single query. The purpose of the tracker is:

  • To mark each query with a unique id for reliable query tracking.
  • To hold a query MVCC snapshot.
  • To report to the new MVCC coordinator about the associated active query in case of old coordinator failure.
  • To send acks to the new coordinator when the associated query is completed.

Active queries list recovery on the new coordinator looks as follows:

  1. When the old coordinator fails, an exchange process starts and the new coordinator is elected.
  2. During this process, each node sends a list of its active query trackers to the new coordinator.
  3. The new coordinator combines all those lists into a global one.
  4. When an old query finishes, the associated query tracker sends an ack to the new coordinator.
  5. The coordinator removes this tracker from the global list when the ack is received.
  6. When the global list becomes empty, all old queries are done and we do not have to hold old data versions in our store, so the cleanup process begins.
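The steps above can be sketched as a small bookkeeping class on the new coordinator. This is an illustration only: `ActiveQueriesRecoverySketch` and its methods are hypothetical names, tracker ids are simplified to globally unique `long` values, and the sketch assumes that a node's report always arrives before any ack for a tracker listed in it.

```java
import java.util.Collection;
import java.util.HashSet;
import java.util.Set;

/** Hypothetical coordinator-side bookkeeping for queries that started under the old coordinator. */
final class ActiveQueriesRecoverySketch {
    /** Global list of old queries that are still running (steps 2 and 3). */
    private final Set<Long> pendingTrackers = new HashSet<>();

    /** Set once every node has sent its list, so an early empty set is not mistaken for "done". */
    private boolean allNodesReported;

    /** Steps 2-3: merge one node's list of active query tracker ids into the global list. */
    synchronized void onNodeReport(Collection<Long> activeTrackerIds) {
        pendingTrackers.addAll(activeTrackerIds);
    }

    /** Called when the exchange process has collected reports from all nodes. */
    synchronized void onAllNodesReported() {
        allNodesReported = true;
    }

    /** Steps 4-5: an old query finished, so its tracker is removed from the global list. */
    synchronized void onQueryAck(long trackerId) {
        pendingTrackers.remove(trackerId);
    }

    /** Step 6: old versions may be cleaned up once all reports are in and no old query remains. */
    synchronized boolean cleanupAllowed() {
        return allNodesReported && pendingTrackers.isEmpty();
    }
}
```

The `allNodesReported` flag is a design choice of this sketch: without it, the global list would also look empty before the first report arrives, and cleanup could start too early.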

Read (get and SQL)

Each read operation outside an active transaction, or in the scope of an optimistic transaction, gets a Query Snapshot or uses a previously received one (which is considered the read version for an optimistic Tx). Note: optimistic transactions cannot be used in the scope of DML operations.

...

After all rows are processed, corresponding TxLog records can be deleted as well. 

Related documents:

  • 2017.mvcc.vldb.pdf
  • concurrency-distributed-databases.pdf
  • p209-yu.pdf
  • rethink-mvcc.pdf

Related threads:

Historical rebalance

Suggestion to improve deadlock detection