Background

There are two major issues when applying MTree in a distributed environment: cross-group consistency and catch-up efficiency. The cross-group consistency problem is induced by the fact that we have both data groups and a meta group, despite that data groups are disjoint with each other (they operate on different sub-trees of the MTree), data groups and the meta group may modify the same sub-tree concurrently, causing consistency problem in different replicas. For example, the system received 3 operations: CREATE SG root.sg1(O1), CREATE TS root.sg1.s1(O2), DELETE SG root.sg1(O3), CREATE SG root.sg1(O4); notice that O1, O3, and O4 are meta group operations and O2 is a data group operation, so O1, O3, and O4 must be executed sequentially, but the order between O1, O3, O4, and O2 is undefined; as a result, the execution order may be O1, O2, O3, O4 on ReplicaA, while on ReplicaB it can be O1, O3, O4, O2, consequently, root.sg1.s1 is visible on ReplicaB but not on ReplicaA, implying an inconsistency.

Catch-up efficiency is the result of the inability to find the differences between two MTrees. When a Raft follower is detected to fall behind by the leader, the leader sends the missing logs of the follower to it; but if the desired logs are already removed, a snapshot is sent which is a compacted form of all removed logs. To make the follower's MTree catch-up, a snapshot contains all timeseries that the data group manages, because the leader cannot figure out what timeseries the follower already has and what it does not. This results in drastic bandwidth waste especially when there is no difference between the two MTrees.

Versioned MTree

To resolve the above two problems, we propose Versioned MTree, an extension of MTree that embeds Raft log index as versions in it, to distinguish if there are any changes between two MTrees. We introduce the structural modifications and behavioral modifications over MTree in this section.

structural modifications

  1. For each StorageGroupMNode(SGNode for short), there are two new fields: `long majorVersion` and `long minorVersion`.
  2. For each PhysicalPlan, there are also two new fields: `long majorVersion` and `long minorVersion`.

behavioral modifications

  1. When a CreateStorageGroupPlan is converted to a meta log, its majorVersion is set to the index of the meta log.
  2. When an SGNode is created, its majorVersion is set to the majorVersion of the CreateStorageGroupPlan that created it, and its minorVersion is set to 0.
  3. When a PhysicalPLan that modifies a storage group is converted to a data log(Create/DeleteTimeseriesPlan, SetTemplatePlan...), the majorVersion of the plan is set to the major version of the related SGNode, and the minorVersion is set to the index of the data log.
  4. When the sub-tree in MTree of a storage group is modified (by Create/DeleteTimeseriesPlan, SetTemplatePlan), the minorVersion of the SGNode is set to the minorVersion of the plan.
  5. When a PhysicalPLan that modifies a storage group (has a majorVersion(VM_plan) >= 0) is to be applied, check the majorVersion of the associated SGNode(VM_node):
    1. If VM_node does not exist (meaning the SGNode does not exist), or VM_node < VM_plan, sync with the meta leader to update SGNodes;
    2. If VM_node still does not exist, skip the execution of the (follower), or report a STORAGE_GROUP_NOT_SET (leader);
    3. If VM_node == VM_plan, execute the plan normally;
    4. If VM_node > VM_plan, then the SGNode is deleted and recreated after then plan, skip the execution, because it operates on an SGNode that has been deleted.
  6. When creating a data group snapshot, query the minorVersions of SGNodes managed by the follower and compare them with the minorVersions of the local SGNodes, and only add timeseries in the sub-trees of SGNodes which has a different minorVersion.

Correctness

The statements in behavioral modifications 5. guarantee that any data operation Od on an SGNode is either performed on the same version of the SGNode or do nothing when the SGNode has been removed. This can be verified in two folds: 1. only 5.c. allows the execution of a plan and it requires VM_node == VM_plan, so all plans that are successfully executed must operate on the same version of the SGNode. 2. As VM_plan is assigned before the execution, there must be some point where the SGNode exists. If VM_node does not exist after synchronizing the leader, it is obviously removed; while if VM_node has changed, it must have been removed and recreated between the creation and execution of  Od, so Od operates on a deleted SGNode and thus can be ignored.

If the operation is executed on some replicas while skipped on others, there must be one meta operation that deletes the SGNode where the operation is executed. So when the executed replicas synchronize with the meta leader, the deletion must also be synchronized so the operations on deleted SGNode are no longer visible, so all replicas are consistent eventually.

  • No labels