...

Internal Data Structures Changes

The structure of BTree leaves is changed as follows:

|           key part          |           |         |
|-----------------------------|  lockVer  |   link  |
| cache_id | hash |  mvccVer  |           |         |

 

cache_id - cache ID if it is a cache in a cache group
hash - key hash
mvccVer - MVCC version of the transaction which has created the row
lockVer - MVCC version of the transaction which holds a lock on the row
link - link to the data

other fields are obvious.

Rows with the same key are placed from newest to oldest.
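The ordering described above can be illustrated with a minimal sketch. It assumes, purely for illustration, that the MVCC version is a single integer and that keys with equal cache_id and hash are the same key (a real comparator would also have to deal with hash collisions). The names LeafKey and sort_key are hypothetical.

```python
from typing import NamedTuple

class LeafKey(NamedTuple):
    cache_id: int
    key_hash: int
    mvcc_ver: int  # assumption: simplified to one integer

def sort_key(k):
    # Ascending by cache_id and hash, descending by mvcc_ver,
    # so the newest version of each key comes first.
    return (k.cache_id, k.key_hash, -k.mvcc_ver)

leaf = sorted(
    [LeafKey(1, 42, 5), LeafKey(1, 42, 9), LeafKey(1, 7, 3), LeafKey(1, 42, 7)],
    key=sort_key,
)

# The first item found for a key is its newest version.
newest = next(k for k in leaf if (k.cache_id, k.key_hash) == (1, 42))
```

With this ordering a lookup can stop at the first item it finds for a key, since every item after it is an older version.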

The structure of index BTree leaves is changed as follows:

|    key part    |
|----------------|
| link | mvccVer |

...

link - link to the data
mvccVer - XID of the transaction which created the row

The structure of the data row payload is changed as follows:

|              |           |         |            |          |           |             |             |             |
| payload size | next_link | mvccVer | newMvccVer | cache_id | key_bytes | value_bytes | row_version | expire_time |
|              |           |         |            |          |           |             |             |             |

...

other fields are obvious.

Locks

During DML or SELECT FOR UPDATE statements Tx acquires locks one by one.

If the row is locked by another Tx, the current Tx saves its context (the cursor and the current position in it) and registers itself as a Tx state listener. As soon as the previous Tx is committed or rolled back, it fires an event. This means that all locks acquired by that Tx are released, so the Tx waiting on the locked row is notified and continues locking/writing.
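The waiting scheme above can be sketched as a simple listener registry. This is only an illustration under assumed names: TxStateRegistry, on_finish and finish are hypothetical, and the saved context is reduced to a dictionary.

```python
from collections import defaultdict

class TxStateRegistry:
    """Hypothetical registry of listeners waiting for a Tx to finish."""

    def __init__(self):
        self._listeners = defaultdict(list)

    def on_finish(self, tx_ver, callback):
        # Called by a Tx that found a row locked by tx_ver.
        self._listeners[tx_ver].append(callback)

    def finish(self, tx_ver, state):
        # Called on commit or rollback; each listener is notified once.
        for callback in self._listeners.pop(tx_ver, []):
            callback(state)

registry = TxStateRegistry()
resumed = []

# Tx 2 found a row locked by Tx 1, so it saves its context and waits.
context = {"cursor": "scan-1", "position": 17}
registry.on_finish(1, lambda state: resumed.append((context["position"], state)))

# Tx 1 commits: its locks are released and the waiter resumes from position 17.
registry.finish(1, "COMMITTED")
```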

TxLog is used to determine the lock state: if the Tx whose MVCC version equals the row lock version (see BTree leafs structure) is active, the row is locked by this Tx. All newly created rows have a lock version equal to their MVCC version, so every newly created row is locked by the Tx in scope of which it was created.

Since rows with the same key are placed from newest to oldest, as described above, the lock state can be determined by checking only the first (newest) version of the row.
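The lock check described above can be sketched as follows. This is a minimal illustration, assuming TxLog can be modelled as a mapping from MVCC version to Tx state and that versions are plain integers. The function name locked_by is hypothetical.

```python
ACTIVE, COMMITTED, ABORTED = "ACTIVE", "COMMITTED", "ABORTED"

def locked_by(versions, tx_log):
    """Return the lock version holding the row, or None if the row is free.

    versions: (mvcc_ver, lock_ver) pairs for one key, newest first.
    tx_log:   assumed mapping from MVCC version to Tx state.
    """
    if not versions:
        return None
    _, lock_ver = versions[0]  # only the newest version needs checking
    return lock_ver if tx_log.get(lock_ver) == ACTIVE else None

tx_log = {5: COMMITTED, 7: ACTIVE}
```

For example, a row whose newest version was created by the still-active Tx 7 is reported as locked by 7, while a row whose newest version belongs to the committed Tx 5 is free.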

Transactional Protocol Changes

All changes are written to the cache (disk) at once so that they are visible to subsequent queries/scans in scope of the transaction.

2PC (Two Phase Commit) is used for the commit procedure but has no Tx entries (all the changes are already in the cache); it is needed only to keep TxLog consistent on all data nodes (Tx participants).

Recovery Protocol Changes

There are several participant roles:

  • MVCC coordinator
  • Tx coordinator
  • Primary data node
  • Backup data node

Each participant may have several roles at the same time.

The recovery steps for each type of participant are as follows:

On MVCC coordinator failure:

  1. A new coordinator is elected (the oldest server node, possibly with some additional filters).
  2. During exchange each node sends its active Tx list.
  3. The new coordinator merges the chunks and restores its active Tx list.
  4. After the merge is done it continues processing version requests.
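The merge in step 3 can be sketched as a set union over the chunks received during exchange. This is a simplified illustration: the function name and the representation of a chunk as a list of integer Tx versions per node are assumptions.

```python
def restore_active_txs(chunks):
    """Merge per-node active Tx lists into the new coordinator's active Tx list.

    chunks: assumed mapping from node ID to the Tx versions active on that node.
    """
    active = set()
    for tx_vers in chunks.values():
        # A Tx reported by several participants is recorded only once.
        active.update(tx_vers)
    return active

active = restore_active_txs({"node-1": [3, 5], "node-2": [5, 8], "node-3": []})
```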

On Tx coordinator failure:

  1. Each Tx node sends the Tx nodes list to the MVCC coordinator.
  2. Local Tx state is checked.
    1. If Tx is in COMMITTED state, the node adds a local TxLog record with XID and COMMITTED flag and sends a Tx committed message to MVCC coordinator (Tx is removed from the local active Tx list, all locks are released at this point).
    2. If Tx is in ABORTED state, the node adds a local TxLog record with XID and ABORTED flag and sends a Tx rolled back message to MVCC coordinator (Tx is removed from the local active Tx list, all locks are released at this point).
    3. If Tx is in ACTIVE state on the node, the node adds a local TxLog record with XID and ABORTED flag and sends a Tx rolled back message to MVCC coordinator (Tx is removed from the local active Tx list, all locks are released at this point).
  3. Other Tx participants are checked.
    1. If there is at least one node with COMMITTED state, the node adds a local TxLog record with XID and COMMITTED flag and sends a Tx committed message to MVCC coordinator (Tx is removed from the local active Tx list, all locks are released at this point).
    2. If there is at least one node with ABORTED state, the node adds a local TxLog record with XID and ABORTED flag and sends a Tx rolled back message to MVCC coordinator (Tx is removed from the local active Tx list, all locks are released at this point).
    3. If there is at least one node with Tx in ACTIVE state, the node adds a local TxLog record with XID and ABORTED flag and sends a Tx rolled back message to MVCC coordinator (Tx is removed from the local active Tx list, all locks are released at this point).
  4. When MVCC coordinator receives messages from all Tx participants, Tx is removed from the active Tx list and all the changes become visible.

There is a special case when Tx coordinator fails right before sending the Tx committed message:

  1. If MVCC coordinator doesn't receive the Tx nodes list from any Tx participant, it broadcasts a Tx check message.
  2. When MVCC coordinator receives acknowledgements from all Tx participants, Tx is removed from the active Tx list and all the changes become visible.

On primary data node failure

If the primary node fails during update, we may apply some strategy and either retry the statement ignoring previous changes (using cid) or roll back the Tx.

If the primary node fails during prepare, we check whether any partitions have been lost and continue the commit procedure if all partitions are still available.

On loss of partition

A local TxLog record with XID and IN_DOUBT state is added on each Tx participant. These records are preserved until all data owners rejoin the cluster. Such Txs are considered active for all readers.

On Tx data node rejoin

If there is a Tx in IN_DOUBT state for the rejoining node and all data is available now, the Tx is rolled back and removed from the active Tx list on all nodes.

Read (getAll and SQL)

Each read operation outside active transaction creates a special read only transaction and uses its tx snapshot for versions filtering.

RO transactions are added to active Tx lists on reader (near) node and MVCC coordinator.

On reader node failure all RO Txs are removed from the active Tx list on MVCC coordinator.

Each read operation within active transaction uses its tx snapshot for versions filtering (REPEATABLE_READ semantics).

During get operation the first passing MVCC filter item is returned.

During secondary index scans the 'ver' field of the tree item is checked. If the row version is visible (the row was added by the current or a committed Tx), the 'xid_max' field of the referenced data row is checked: the row is considered visible if it is the last version of the row, that is, 'xid_max' is NA, ACTIVE, ABORTED or higher than the assigned one.

During primary index scans the 'ver' field of the tree item is checked. If the row version is visible (the row was added by the current or a committed Tx), the 'xid_max' field of the referenced data row is checked: the row is considered visible if it is the last version of the row, that is, 'xid_max' is NA, ACTIVE, ABORTED or higher than the assigned one.

During primary index scans the 'ver' field of the tree item is checked; the first item passing the MVCC filter is returned and all subsequent versions of the row are skipped.

...

Near Tx node has to notify Version Coordinator about the final Tx state to make the changes visible for subsequent reads.

Version Coordinator recovery


Read (get and SQL)

Each read operation outside an active transaction or in scope of an optimistic transaction gets a Query Snapshot or uses a previously received one (for an optimistic Tx it is considered the read version). Note: optimistic transactions cannot be used in scope of DML operations.

All requested snapshots are tracked on Version Coordinator to prevent cleanup of the rows being read.

All received snapshots are tracked on local node for Version Coordinator recovery needs.

Query Snapshot is used for versions filtering (REPEATABLE_READ semantics).

Each read operation in scope of active pessimistic Tx uses its (transaction) snapshot for versions filtering (REPEATABLE_READ semantics).

If a node which requested a Query Snapshot fails before sending a QueryDone message to Version Coordinator, the snapshot is removed from the active queries map. Rows which are not visible to any other reader become available for cleanup.

A row is considered visible for a read operation when its MVCC version (create version) is visible (COMMITTED and in the past) and its new MVCC version (update version) is invisible (ACTIVE or ROLLED_BACK or in the future).
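The visibility rule above can be sketched as a small predicate. The sketch makes several simplifying assumptions: MVCC versions are plain integers, "in the past" means less than or equal to the snapshot version, and Tx states are looked up in a mapping that stands in for TxLog as seen by the snapshot. The names Row, version_visible and row_visible are hypothetical.

```python
from typing import NamedTuple, Optional

COMMITTED, ACTIVE, ROLLED_BACK = "COMMITTED", "ACTIVE", "ROLLED_BACK"

class Row(NamedTuple):
    mvcc_ver: int                # create version
    new_mvcc_ver: Optional[int]  # update version, None if never updated

def version_visible(ver, snapshot_ver, tx_log):
    # Visible means: committed and not newer than the snapshot.
    return ver <= snapshot_ver and tx_log.get(ver) == COMMITTED

def row_visible(row, snapshot_ver, tx_log):
    created = version_visible(row.mvcc_ver, snapshot_ver, tx_log)
    replaced = (row.new_mvcc_ver is not None
                and version_visible(row.new_mvcc_ver, snapshot_ver, tx_log))
    return created and not replaced

tx_log = {1: COMMITTED, 2: COMMITTED, 3: ACTIVE, 4: ROLLED_BACK}
```

With this log, Row(1, 2) is invisible to a snapshot at version 5 because its replacement is committed, but it is still visible to a snapshot at version 1, for which the update lies in the future.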

Update (put and DML)

Update consists of the following steps:

  1. obtain a lock (write current version into lockVer field)
  2. delete aborted versions of row if exist (may be omitted for performance reasons)
  3. delete previous committed versions (less or equal to cleanup version of Tx snapshot) if exist (may be omitted for performance reasons)
  4. delete previous committed versions (less or equal to cleanup version of Tx snapshot) if exist from all secondary indexes (may be omitted for performance reasons)
  5. update new MVCC version of previous committed row if exist (newMvccVer field)
  6. add a row with new version
  7. add a row with new version to all secondary indexes.

Delete consists of the following steps:

  1. obtain a lock (write current version into lockVer field)
  2. delete aborted versions of row if exist (may be omitted for performance reasons)
  3. delete previous committed versions (less or equal to cleanup version of Tx snapshot) if exist (may be omitted for performance reasons)
  4. delete previous committed versions (less or equal to cleanup version of Tx snapshot) if exist from all secondary indexes (may be omitted for performance reasons)
  5. update new MVCC version of previous committed row if exist (newMvccVer field). The row becomes invisible after Tx commit
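The update and delete steps above can be sketched on an in-memory version chain. This is a simplified illustration, not the actual implementation: the cleanup steps (which the lists mark as optional) and secondary indexes are left out, MVCC versions are plain integers, TxLog is a mapping, and a conflicting lock raises an error instead of waiting. All names are hypothetical.

```python
ACTIVE, COMMITTED, ABORTED = "ACTIVE", "COMMITTED", "ABORTED"

class Row:
    def __init__(self, mvcc_ver, value):
        self.mvcc_ver = mvcc_ver
        self.lock_ver = mvcc_ver   # a new row is locked by its creator
        self.new_mvcc_ver = None
        self.value = value

def _lock_and_mark(chain, tx_ver, tx_log):
    """Lock the newest version, then mark the last committed one as replaced."""
    if chain:
        holder = chain[0].lock_ver
        if holder != tx_ver and tx_log.get(holder) == ACTIVE:
            raise RuntimeError("row is locked by Tx %d" % holder)
        chain[0].lock_ver = tx_ver          # obtain a lock
    for row in chain:                       # chain is ordered newest first
        if tx_log.get(row.mvcc_ver) == COMMITTED:
            row.new_mvcc_ver = tx_ver       # update newMvccVer
            break

def update(chain, tx_ver, value, tx_log):
    _lock_and_mark(chain, tx_ver, tx_log)
    chain.insert(0, Row(tx_ver, value))     # add a row with the new version

def delete(chain, tx_ver, tx_log):
    # No new row is added; the old one becomes invisible after commit.
    _lock_and_mark(chain, tx_ver, tx_log)

tx_log = {1: COMMITTED, 2: ACTIVE}
chain = [Row(1, "a")]
update(chain, 2, "b", tx_log)
```

After the call the chain holds versions 2 and 1 (newest first), the old row carries newMvccVer 2, and a concurrent writer would find the newest version locked by Tx 2.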

Cleanup of old versions

Rows that are invisible to all readers are cleaned up by writers, as described above, or by a Vacuum procedure (by analogy with PostgreSQL).

During Vacuum, all checked rows which are still visible to at least one reader are synchronized with TxLog by setting special hint bits (the most significant bits of the MVCC operation counter) which show the state of the Tx that created the row.

After all rows are processed, corresponding TxLog records can be deleted as well.
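The hint-bit part of Vacuum can be sketched as follows. The bit layout is an assumption made for illustration only (a 32-bit operation counter whose two most significant bits carry the hint), as are the names and the representation of rows as dictionaries. Removal of rows that are invisible to all readers is not shown.

```python
HINT_SHIFT = 30                       # assumed layout: top 2 bits of a 32-bit counter
COUNTER_MASK = (1 << HINT_SHIFT) - 1
HINT_NONE, HINT_COMMITTED, HINT_ABORTED = 0, 1, 2

def with_hint(op_counter, hint):
    # Keep the counter bits and store the hint in the most significant bits.
    return (op_counter & COUNTER_MASK) | (hint << HINT_SHIFT)

def hint_of(op_counter):
    return op_counter >> HINT_SHIFT

def vacuum(rows, tx_log):
    """Copy final Tx states from tx_log into row hint bits, then trim tx_log."""
    hints = {"COMMITTED": HINT_COMMITTED, "ABORTED": HINT_ABORTED}
    resolved = set()
    for row in rows:
        state = tx_log.get(row["mvcc_ver"])
        if state in hints:
            row["op_counter"] = with_hint(row["op_counter"], hints[state])
            resolved.add(row["mvcc_ver"])
    # Once all rows are processed, the state is known from the rows themselves.
    for ver in resolved:
        del tx_log[ver]

rows = [
    {"mvcc_ver": 1, "op_counter": 5},
    {"mvcc_ver": 2, "op_counter": 9},
    {"mvcc_ver": 3, "op_counter": 1},
]
tx_log = {1: "COMMITTED", 2: "ABORTED", 3: "ACTIVE"}
vacuum(rows, tx_log)
```

Rows created by a Tx that is still active keep an empty hint, and the TxLog record of that Tx is preserved.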