...

In the initial version of distributed MVCC we will use a single transaction coordinator that defines the global transaction order for all transactions in the cluster. The coordinator may be a dedicated node in the cluster. Upon version coordinator failure a new coordinator should be elected in such a way that it starts assigning new versions (and tx XIDs) that are guaranteed to be greater than all previous transaction versions. This can be easily implemented by defining a transaction version as a tuple (TV, LV), where TV is the cluster topology version, which is set to 1 on the first node and incremented on each topology change, and LV is the coordinator-local version, which is set to 1 when a coordinator is elected and incremented on each transaction version request. When sorted lexicographically, such versions satisfy the requirements above.
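The (TV, LV) ordering above can be sketched as a comparable tuple. This is an illustrative sketch only; the class and field names are assumptions, not the actual API:

```java
// Hypothetical sketch of the (TV, LV) transaction version tuple.
class TxVersion implements Comparable<TxVersion> {
    final long topVer;   // TV: cluster topology version, incremented on each topology change
    final long localVer; // LV: coordinator-local version, incremented on each version request

    TxVersion(long topVer, long localVer) {
        this.topVer = topVer;
        this.localVer = localVer;
    }

    // Lexicographic order: any version assigned by a newer coordinator (higher TV)
    // is greater than every version assigned by a previous coordinator.
    @Override public int compareTo(TxVersion o) {
        int cmp = Long.compare(topVer, o.topVer);
        return cmp != 0 ? cmp : Long.compare(localVer, o.localVer);
    }
}
```

Since a newly elected coordinator observes a topology change, its TV is strictly greater, so it cannot assign a version that collides with or precedes those of the failed coordinator.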

When a transaction write version is requested, the version coordinator generates a new version and adds it to the pending transactions set. When a transaction is committed or rolled back, the version coordinator removes its version from the pending set. When a transactional read is requested, the version coordinator captures the current state of the pending transactions set in order to determine which versions must not be evicted until the read is finished. When a read is finished, an acknowledgement is sent to the coordinator. The coordinator periodically broadcasts the last safe-to-keep version to the whole cluster so that nodes can discard temporarily saved versions.
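The pending-set bookkeeping can be sketched as follows. All names here are assumptions for illustration; deriving the safe-to-keep version as "one below the oldest pending version" is one simple policy consistent with the description above:

```java
import java.util.concurrent.ConcurrentSkipListSet;
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical version coordinator sketch: tracks pending write versions
// and derives the version below which history may safely be discarded.
class VersionCoordinator {
    private final AtomicLong cntr = new AtomicLong();
    private final ConcurrentSkipListSet<Long> pending = new ConcurrentSkipListSet<>();

    // Called when a transaction requests its write version.
    long assignWriteVersion() {
        long ver = cntr.incrementAndGet();
        pending.add(ver);
        return ver;
    }

    // Called when the transaction is committed or rolled back.
    void finish(long ver) {
        pending.remove(ver);
    }

    // Broadcast periodically: everything at or below this version may be merged
    // away, because no pending transaction snapshot can still need older history.
    long safeToKeepVersion() {
        Long oldest = pending.isEmpty() ? null : pending.first();
        return oldest == null ? cntr.get() : oldest - 1;
    }
}
```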

Internal Data Structures Changes

Besides the existing key-value storage, each node should introduce an additional versioned key-value storage used to filter out not-ready-to-be-read versions. Each versioned read should merge the results obtained from the versioned store and the general storage. Below is one possible data structure design for SQL indexes (the main PK index is the same, with the only exception that it returns a single value instead of a range cursor).

A transaction version is encoded using two longs: the coordinator version (the topology major version) and a counter starting from zero. To be able to restore all transaction states on cluster restart or coordinator failure, a special structure (TxLog) is introduced. TxLog is a table (persistent when persistence is enabled) which contains XID to transaction state [active, preparing, committed, aborted, etc.] mappings.

Only the MVCC coordinator has the whole table; other nodes keep only the TxLog subset related to node-local transactions.

On MVCC coordinator failure the new coordinator collects and merges the TxLog subsets from all other nodes, and only after that does it start serving requests. During this time MVCC counters can be neither assigned nor acknowledged, so all new and committing transactions wait until the operation is completed.
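The TxLog merge step can be sketched as below. The types and the conflict-resolution rule (a terminal state reported by any node wins over a transient one) are assumptions made for illustration:

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of TxLog recovery on the newly elected MVCC coordinator.
class TxLogRecovery {
    enum TxState { ACTIVE, PREPARING, COMMITTED, ABORTED }

    // Merge the per-node TxLog subsets into one full table.
    static Map<Long, TxState> merge(List<Map<Long, TxState>> subsets) {
        Map<Long, TxState> merged = new HashMap<>();

        for (Map<Long, TxState> subset : subsets) {
            for (Map.Entry<Long, TxState> e : subset.entrySet()) {
                merged.merge(e.getKey(), e.getValue(), (oldS, newS) ->
                    // COMMITTED/ABORTED are terminal and take precedence
                    // over ACTIVE/PREPARING reported by a lagging node.
                    (newS == TxState.COMMITTED || newS == TxState.ABORTED) ? newS : oldS);
            }
        }
        return merged;
    }
}
```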


The BTree leaf structure is changed as follows:

Code Block
languagetext
|           key part          |       |         |        |
|-----------------------------|  xid  |  flags  |  link  |
| cache_id | hash | ver | cid |       |         |        |

 

cache_id - cache ID, if the cache belongs to a cache group
hash - key hash
ver - XID of the transaction that created the row
xid - XID of the transaction that holds a lock on the row
cid - operation counter: the number of the operation within the transaction that changed this row
flags - allows a fast check of whether the row is visible
link - link to the data

Rows with the same key are placed from newest to oldest.

The index BTree leaf structure is changed as follows:

Code Block
languagetext
|     key part     |       |
|------------------| flags |
| link | ver | cid |       |

 

link - link to the data
ver - XID of the transaction that created the row
cid - operation counter: the number of the operation within the transaction that changed this row
flags - allows a fast check of whether the row is visible

Data row payload structure is changed as follows:

Code Block
languagetext
|              |          |         |         |           |             |             |             |
| payload size | cache_id | xid_min | xid_max | key_bytes | value_bytes | row_version | expire_time |

 

xid_min - ID of the transaction that created this row.
xid_max - ID of the transaction that updated this row, or NA if this is the last row version (used during secondary index scans).

The other fields are self-explanatory.

Code Block
languagejava
titleMVCC LSM
class VersionedIndex {
    private NavigableMap<Key, List<VersionedRow>> versionedVals;
 
    public Cursor find(SearchRow lower, SearchRow upper, VersionInfo readVersion) {
        // Here we must first get a range of keys matching the search boundaries and then filter out the visible versions.
        // Note that versionedVals must explicitly contain deleted keys in order to be able to correctly merge results.
    }
}
 
class MergeIndex {
    public Cursor find(SearchRow lower, SearchRow upper, VersionInfo readVersion) {
        Cursor versioned = versionedIdx.find(lower, upper, readVersion);
 
        // Committed index does not need readVersion because it contains values that are 100% safe to read.
        Cursor committed = committedIdx.find(lower, upper);
 
        // Here we must return a merge cursor. In a merge cursor, keys from the versioned cursor must override keys from the committed cursor.
    }
}

 

When a node receives a safe-to-keep version from the coordinator, it should merge the latest value (with respect to the safe-to-keep version) from the versioned index into the committed index.

During scans the 'ver' field is checked first: the row version is visible if the row was added by the current or a committed transaction. If so, the 'xid_max' field of the referenced data row is checked: the row is considered visible if it is the last version of the row ('xid_max' is NA).
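The two-step visibility rule above can be sketched as a predicate. The TxLog interface and field names here are assumptions for illustration only:

```java
// Hypothetical sketch of the scan-time visibility check.
class RowVisibility {
    static final long NA = -1L; // sentinel: no newer version of the row exists

    interface TxLog {
        boolean isCommitted(long xid);
    }

    // Step 1: the row must have been created by the current or a committed tx.
    // Step 2: the row must be the last version of the key (xid_max is NA).
    static boolean isVisible(long ver, long xidMax, long curXid, TxLog txLog) {
        boolean createdVisible = ver == curXid || txLog.isCommitted(ver);
        return createdVisible && xidMax == NA;
    }
}
```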

Locks

During DML or SELECT FOR UPDATE a transaction acquires locks one by one.

If a row is locked by another transaction, the current transaction saves its context (the cursor and the current position in it) and registers itself as a transaction state listener. As soon as the previous transaction is committed or rolled back, it fires an event, which means that all locks acquired by that transaction have been released. The transaction waiting on the locked row is then notified and continues acquiring locks.

TxLog is used to determine the lock state: if the transaction with XID equal to the row 'xid' field (see the BTree leaf structure) is active, the row is locked by this transaction. All newly created rows have the 'xid' field equal to the 'ver' field. Since, as described above, rows with the same key are ordered from newest to oldest, the lock state can be determined by checking the first row version only.

The size of the versioned index should be limited to avoid OOME. When the size of the versioned index exceeds the maximum allowed value, the oldest committed value is merged into the committed index, and the versioned index itself must be marked with this version value to disallow reads based on the evicted version (an exception must be thrown for such a read).
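The TxLog-based lock check can be sketched in a few lines; the interface and names are assumptions for illustration:

```java
// Hypothetical sketch: a row is locked iff the transaction recorded in the
// 'xid' field of its newest version is still active in TxLog.
class LockCheck {
    enum TxState { ACTIVE, PREPARING, COMMITTED, ABORTED }

    interface TxLog {
        TxState state(long xid);
    }

    // Rows with the same key are ordered newest-first, so only the first
    // (newest) version's 'xid' field needs to be examined.
    static boolean isLocked(long newestRowXid, TxLog txLog) {
        return txLog.state(newestRowXid) == TxState.ACTIVE;
    }
}
```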

Transactional Protocol Changes

Commit Protocol Changes

For the first implementation only optimistic serializable transactions will be supported (the semantics of sequential pessimistic locking and snapshot isolation are not yet defined).

On the transaction prepare step, when all transactional locks are acquired, the transaction coordinator must send a request to the version coordinator to acquire the transaction write version. At this step the acquired version is added to the coordinator's pending versions set.

1) At the finish stage each tx node adds a local TxLog record with the XID and the LOCALLY_COMMITTED flag.
2) The TX coordinator sends a tx finish message to the MVCC coordinator node.
3) The MVCC coordinator adds a TxLog record with the XID and the COMMITTED flag; all the changes become visible.
4) The TX coordinator (near tx node) sends a commit acknowledged message to the participants; the nodes asynchronously mark the tx as COMMITTED.
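The four finish steps above can be walked through as TxLog state transitions. This is a toy sketch under stated assumptions: each node-local TxLog is modeled as a plain XID-to-state map, and the message exchange in steps 2 and 4 is collapsed into direct method calls:

```java
import java.util.List;
import java.util.Map;

// Illustrative walk-through of the commit finish sequence (not the real API).
class CommitFlow {
    enum TxState { ACTIVE, LOCALLY_COMMITTED, COMMITTED }

    static void finish(long xid, Map<Long, TxState> mvccCrdLog, List<Map<Long, TxState>> participantLogs) {
        // 1) Each tx node adds a local TxLog record with the LOCALLY_COMMITTED flag.
        for (Map<Long, TxState> log : participantLogs)
            log.put(xid, TxState.LOCALLY_COMMITTED);

        // 2)-3) The TX coordinator sends a tx finish message; the MVCC coordinator
        // records COMMITTED, at which point the changes become visible cluster-wide.
        mvccCrdLog.put(xid, TxState.COMMITTED);

        // 4) On the commit acknowledged message, the participants asynchronously
        // mark the tx as COMMITTED in their local TxLog subsets as well.
        for (Map<Long, TxState> log : participantLogs)
            log.put(xid, TxState.COMMITTED);
    }
}
```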

Recovery Protocol Changes

...