Versions Compared

Key

  • This line was added.
  • This line was removed.
  • Formatting was changed.

...

Transactional Protocol Changes

Commit Protocol Changes

Commit

  1. When a tx Tx is started a new version is assighned and MVCC coordinator adds a local  TxLog record with XID and ACTIVE flagto its active Tx list.
  2. The the first change request to a datanode within the transaction produces a local TxLog record with XID and ACTIVE flag at the data node.the Tx is added to the local active Tx list.
  3. TX coordinator sends to each datanode node a tx prepare message.
  4. Each at the commit stage each tx node adds a local TxLog record with XID and PREPARED flag and sends an acknowledge to TX Tx coordinator.
  5. TX coordinator sends to MVCC coordinator each datanode node a tx committed finish message.
  6. MVCC coordinator adds TxLog record with XID and COMMITTED flag, all the changes become visible.
  7. MVCC coordinator sends to participants a commit acknowledged message, all tx datanodes mark tx as COMMITTED, all resources are released.

...

  1. Each tx node adds a local TxLog record with XID and COMMITTED flag and sends an acknowledge to Tx coordinator (Tx is remooved from the local active Tx list, all locks are released at this point).
  2. TX coordinator sends to MVCC coordinator Tx committed message, Tx is remooved from the active Tx list, all the changes become visible.

An error during commit

  1. When a tx is started a new version is assighned and MVCC coordinator adds a local  TxLog record with XID and ACTIVE flagto its active Tx list.
  2. The the first change request to a datanode within the transaction produces a local TxLog record with XID and ACTIVE flag at the data node.the Tx is added to the local active Tx list.
  3. TX coordinator sends to each datanode node a tx prepare message.
  4. Each at the commit stage each tx node adds a local TxLog record with XID and PREPARED flag and sends an acknowledge to TX coordinator Tx coordinator.
  5. In case at least one participant does not confirm commit, TX coordinator sends to each participant a tx rollback message.
  6. each Each tx node adds a local TxLog record with XID and ABORTED flag and sends an acknowledge to TX coordinatorcoordinator (Tx is remooved from the local active Tx list, all locks are released at this point).
  7. TX coordinator sends to MVCC coordinator node a tx rolled back message.
  8. MVCC coordinator adds TxLog record with XID and ABORTED flag.
  9. MVCC coordinator sends to participants a rollback acknowledged message, all resources are releasedmessage, Tx is remooved from the active Tx list.

Rollback

  1. When a tx Tx is started a new version is assighned and MVCC coordinator adds a local  TxLog record with XID and ACTIVE flagto its active Tx list.
  2. The the first change request to a datanode within the transaction produces a local TxLog record with XID and ACTIVE flag at the data node.the Tx is added to the local active Tx list.
  3. TX coordinator sends to each participant a tx rollback message.
  4. Each at the rollback stage each tx node adds a local TxLog record with XID and ABORTED flag and sends an acknowledge to TX coordinatorcoordinator (Tx is remooved from the local active Tx list, all locks are released at this point).
  5. TX coordinator sends to MVCC coordinator node coordinator node a tx rolled back message.
  6. MVCC coordinator adds TxLog record with XID and ABORTED flag.
  7. MVCC coordinator sends to participants a rollback acknowledged message, all resources are released, Tx is remooved from the active Tx list.

Recovery Protocol Changes

There are several participant roles:

...

  1. A new coordinator is elected (the oldest server node, may be some additional filters)
  2. During exchange each node sends its TxLogactive txs.
  3. The new coordinator merges all the TxLog chunks and checks all local states for each TX. In case data nodes have state conflicts next rules are used:
  4. if there is at least one node with TX in ABORTED state tx rollback message is send to all datanodes and whole TX is marked as ABORTED.
  5. if there is at least one node with TX in COMMITTED state whole TX is marked as COMMITTED and commit acknowledged message is send to all datanodes.
  6. if all datanodes have TX in PREPARED state whole TX is marked as COMMITTED and commit acknowledged message is send to all datanodes.
  7. TX cannot be in COMMITTED and ABORTED state at the same time on different nodes. In case a node cannot mark PREPARED tx as COMMITTED this node has to be forcibly stopped.chunks and restores its active Tx list.
  8. After merge is done it continues versions requests processing.

On TX coordinator failure:

  1. Each tx node sends to MVCC coordinator Tx nodes list.
  2. Local tx state is checked
    1. If tx in COMMITTED state it adds a local TxLog record with XID and COMMITTED flag and sends a tx committed message to MVCC coordinator (Tx is remooved from the local active Tx list, all locks are released at this point).
    2. If tx in ABORTED state it adds a local TxLog record with XID and ABORTED flag and sends a tx rolled back message to MVCC coordinator (Tx is remooved from the local active Tx list, all locks are released at this point).
    3. If tx in ACTIVE state on the node it adds a local TxLog record with XID and ABORTED flag and sends a tx rolled back message to MVCC coordinator (Tx is remooved from the local active Tx list, all locks are released at this point).
  3. Other tx participants are checked.
    1. If there is at least one node with COMMITTED state it adds a local TxLog record with XID and COMMITTED flag and sends a tx committed message to MVCC coordinator (Tx is remooved from the local active Tx list, all locks are released at this point).
    2. If there is at least one node with ABORTED state it adds a local TxLog record with XID and ABORTED flag and sends a tx rolled back message to MVCC coordinator (Tx is remooved from the local active Tx list, all locks are released at this point).
    3. If there is at least one node with tx in ACTIVE state it adds a local TxLog record with XID and ABORTED flag and sends a tx rolled back message to MVCC coordinator (Tx is remooved from the local active Tx list, all locks are released at this point).
  4. When MVCC coordinator recieves messages from all Tx participants, Tx is remooved from the active Tx list, all the changes become visible.

There is a special case when Tx coordinator fales right before tx committed message sending

  1. If MVCC coordinator doesn't recieve Tx nodes list from any Tx participant it broadcasts a Tx check message. 
  2. When MVCC coordinator recieves acknowledges from all Tx participants, Tx is remooved from the active Tx list, all the changes become visible
  3. A new coordinator is elected (an oldest server tx datanode it becames a Tx coordinator)
    1. in case the oldest server tx datanode has already finished the tx and released resources, nothing happens (that means that MVCC coordinator whether started acknowleging or failed and tx will be recovered during MVCC coordinator recovery)
  4. A new coordinator checks other nodes:
    1. In case at least one nodes has already finished the tx and released resources it does nothing (that means that MVCC coordinator whether started acknowleging or failed and tx will be recovered during MVCC coordinator recovery)
  5. In case the new coordinator has the transaction in ACTIVE state
    1. A tx rollback message is send to all tx data nodes.
    2. A tx rolled back message is send to MVCC coordinator node.
    3. MVCC coordinator sends to participants a rollback acknowledged message, all resources are released.
  6. In case the new coordinator has the transaction in COMMITTED state
    1. A tx committed message is send to MVCC coordinator node.
    2. MVCC coordinator sends to participants a commit acknowledged message, all tx datanodes mark tx as COMMITTED, all resources are released.
  7. In case the new coordinator has the transaction in ABORTED state:
    1. A tx rollback message is send to all tx data nodes.
    2. A tx rolled back message is send to MVCC coordinator node.
    3. MVCC coordinator sends to participants a rollback acknowledged message, all resources are released.
  8. In case the new coordinator has the transaction in PREPARED state
  9. A new coordinator checks all tx data nodes
  10. In case all participants have the tx in PREPARED state or at least one tx data node has the tx in COMMITTED state:
    1. A tx committed message is send to MVCC coordinator node.
    2. MVCC coordinator sends to participants a commit acknowledged message, all tx datanodes mark tx as COMMITTED, all resources are released.
  11. In case at least one datanode has tx in ABORTED state:
  12. A tx rollback message is send to all tx data nodes.
  13. A tx rolled back message is send to MVCC coordinator node.
  14. MVCC coordinator sends to participants a rollback acknowledged message, all resources are released.

On primary data node failure

...

if primary node fails during prepare we check whether partitions have not been lost and continue commit procedure in case all partitions are still available.

On loss of partition

We add a special meta record (let's say TxDataLostRecord) with tx, node id and list of lost partitions.

This means that we cannot clean out tx info with XID heigher than the lowest XID from the list.

The transaction during which the partition was lost will be aborted or committed in case PREPARE stage has been already compleeted.

The transaction will be fully finished on partition return or manually (deleting corresponding TxDataLostRecord) in case the partition is lost eternally.

During further rebalances nodes in addition to rows sends TxDataLostRecords and tx state records to finish corresponding transactions properly on failed node rejoinA local TxLog record with XID and IN_DOUBT state is added on each tx participant. These records are preserved until all data owners rejoins the cluster. Such txs are considered as active for all readers.

On tx datanode rejoin

In case there is a TxDataLostRecord a tx IN_DOUBT state on for rejoining node , partitions from the record will not be invalidated or rebalanced.Corresponding transactions are finished on all nodes (using tx->partitions map from tx metadata), TxDataLostRecords are deletedand all data is awailable now, tx is rolled back and removed from the active Tx list on all nodes.

Read (getAll and SQL)

Each read operation outside active transaction creates a special read only transaction and uses its tx snapshot for versions filtering.

...