
The discussion is from this page: Chinese

In the discussion, we summarize some drawbacks of the current cluster module and propose some ideas.

Current Issues and Solutions

  • Raft’s linearizable serialization should stay consistent with the single-node module’s linearization.
  • Data partitioning (how to partition data, and how to assign data blocks to different nodes)
  • Data partitioning granularity: SG + time range.
  • Metrics in cluster mode
  • Scale-out can only be executed one node at a time
  • There are many synchronized keywords
  • How to partition the schema -> keep the current status
  • The current Raft and business code are coupled -> decouple them
  • There are too many thread pools -> SEDA
  • The meta group covers all the nodes, which makes it impossible to scale to hundreds of nodes -> reduce it to several nodes
  • Query framework -> MPP

New Design

1 Rules

  • Rule for the number of involved replicas (see the sketch after this list):
    • Write operation: quorum
    • Delete (metadata) operation: all nodes
      • This means the deletion operation has lower availability.
    • Query operation: one node
  • For all operations, if a node forwards a request incorrectly, the receiver replies with an error.
    • However, since we currently cannot guarantee that all Raft groups in a data partition have the same leader, we have to add a forwarding stage when the receiver is a follower. If the receiver does not belong to the group at all, it replies with an error.
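The replica rule above can be summarized in one place. Below is a minimal sketch; the enum and method names are only illustrative assumptions, not part of the current cluster module.

// Minimal sketch of the replica rule above; all names are illustrative only.
public enum ClusterOperation {
  WRITE, DELETE_METADATA, QUERY;

  /** How many of the involved nodes must acknowledge the operation. */
  public int requiredAcks(int involvedNodes) {
    switch (this) {
      case WRITE:
        return involvedNodes / 2 + 1; // quorum
      case DELETE_METADATA:
        return involvedNodes;         // all nodes, hence lower availability
      case QUERY:
        return 1;                     // a single node is enough
      default:
        throw new IllegalStateException("unknown operation " + this);
    }
  }
}

For example, with 3 involved nodes a write needs 2 acknowledgements, a metadata deletion needs all 3, and a query needs only 1.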

2 Roles

  • Partition Table Raft Group
    • Consists of several nodes in the cluster. All other nodes must know them (from the configuration file).
    • Knows how the schema is partitioned and how the data is partitioned.
  • Partitioned Schema
    • Held by several nodes in the cluster.
    • Each partitioned schema has its own Raft group.
  • Partitioned Data
    • Each data partition has several Raft groups (= the number of SGs * virtual nodes in the partition table).
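A minimal sketch of the state these three roles imply, keyed by SG, virtual node, and time range as in the granularity item above; every class and field name here is a hypothetical placeholder.

import java.util.List;
import java.util.Map;
import java.util.Objects;
import java.util.concurrent.ConcurrentHashMap;

// Illustrative sketch of the three roles above; all names are hypothetical.
public class PartitionTableSketch {

  /** Nodes of the partition table Raft group, known to everyone via the configuration file. */
  private final List<String> partitionTableGroupNodes;

  /** Storage group -> nodes that hold its partitioned schema (one Raft group per entry). */
  private final Map<String, List<String>> schemaPartition = new ConcurrentHashMap<>();

  /** (Storage group, virtual node, time range) -> members of the data Raft group. */
  private final Map<DataGroupKey, List<String>> dataPartition = new ConcurrentHashMap<>();

  public PartitionTableSketch(List<String> partitionTableGroupNodes) {
    this.partitionTableGroupNodes = partitionTableGroupNodes;
  }

  /** One data Raft group per (storage group, virtual node) pair within a time range. */
  public static final class DataGroupKey {
    final String storageGroup;
    final int virtualNode;
    final long timePartition;

    public DataGroupKey(String storageGroup, int virtualNode, long timePartition) {
      this.storageGroup = storageGroup;
      this.virtualNode = virtualNode;
      this.timePartition = timePartition;
    }

    @Override
    public boolean equals(Object o) {
      if (!(o instanceof DataGroupKey)) {
        return false;
      }
      DataGroupKey k = (DataGroupKey) o;
      return virtualNode == k.virtualNode
          && timePartition == k.timePartition
          && storageGroup.equals(k.storageGroup);
    }

    @Override
    public int hashCode() {
      return Objects.hash(storageGroup, virtualNode, timePartition);
    }
  }
}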

3 Process

3.1 Create SG

  • 1. The coordinator checks whether its PT cache already has the SG.
  • 2. If it already exists, reject the request.
  • 3. If not, send the request to the PT group.
  • 4. After success, update the local PT cache.
  • This means all other nodes still do not have the new SG in their PT caches.
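This check-then-forward flow could look roughly like the sketch below; PtCache, PtGroupClient, and the Status values are hypothetical placeholders rather than existing classes.

// Hypothetical sketch of the coordinator-side Create SG flow (steps 1-4).
public class CreateSgCoordinator {

  public enum Status { SUCCESS, ALREADY_EXISTS, ERROR }

  public interface PtCache {
    boolean containsStorageGroup(String sg);
    void addStorageGroup(String sg);
  }

  public interface PtGroupClient {
    Status createStorageGroup(String sg);
  }

  private final PtCache ptCache;
  private final PtGroupClient ptGroup;

  public CreateSgCoordinator(PtCache ptCache, PtGroupClient ptGroup) {
    this.ptCache = ptCache;
    this.ptGroup = ptGroup;
  }

  public Status createStorageGroup(String sg) {
    if (ptCache.containsStorageGroup(sg)) {         // 1. check the local PT cache
      return Status.ALREADY_EXISTS;                 // 2. reject if it already exists
    }
    Status status = ptGroup.createStorageGroup(sg); // 3. otherwise ask the PT group
    if (status == Status.SUCCESS) {
      ptCache.addStorageGroup(sg);                  // 4. only the local PT cache is updated
    }
    return status;
  }
}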

3.2 Delete SG

  • The coordinator sends the request to the PT group.
  • The PT group calls the Raft protocol (to record the start of the operation and get approval).
  • The PT group sends the request to ALL nodes.
  • Each node deletes its data and MTree entries and updates its local PT cache.
  • Once all nodes succeed, the PT group calls the Raft protocol to finish deleting the SG.
  • If some nodes fail, the PT group calls the Raft protocol to stop deleting the SG (which means some nodes still have the SG, its schema, and its data).

When restarting, if the Raft log contains a start operation but no matching stop/finish operation, call stop.
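The start/finish/stop markers could be driven as in the sketch below; the RaftLog and NodeClient abstractions are hypothetical placeholders, and the actual Raft calls are elided.

import java.util.List;

// Hypothetical sketch of the Delete SG flow driven by the PT group, using
// start/finish/stop markers recorded through the Raft protocol.
public class DeleteSgProcess {

  public enum Marker { START, FINISH, STOP }

  public interface RaftLog {
    void append(String sg, Marker marker);        // replicated via Raft
  }

  public interface NodeClient {
    boolean deleteStorageGroupLocally(String sg); // drop data, MTree, local PT cache entry
  }

  private final RaftLog raftLog;
  private final List<NodeClient> allNodes;

  public DeleteSgProcess(RaftLog raftLog, List<NodeClient> allNodes) {
    this.raftLog = raftLog;
    this.allNodes = allNodes;
  }

  public boolean deleteStorageGroup(String sg) {
    raftLog.append(sg, Marker.START);             // start the operation and get approval
    boolean allSucceeded = true;
    for (NodeClient node : allNodes) {            // broadcast to ALL nodes
      allSucceeded &= node.deleteStorageGroupLocally(sg);
    }
    // FINISH if every node succeeded; otherwise STOP, i.e. some nodes still
    // hold the SG, its schema, and its data.
    raftLog.append(sg, allSucceeded ? Marker.FINISH : Marker.STOP);
    return allSucceeded;
  }
}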

3.3 Query a given SG

  • 1. The coordinator checks whether its PT cache already has the SG.
  • 2. If it does, return the cached information.
  • 3. If not, send the request to the PT group.
  • 4. After success, update the local PT cache.
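This is essentially a read-through cache over the PT group. A minimal sketch, where PtCache, PtGroupClient, and StorageGroupInfo are hypothetical placeholders for whatever the real cache and PT group return:

// Hypothetical sketch of the read-through PT cache used when querying a given SG.
public class QuerySgCoordinator {

  /** Placeholder for whatever the PT group returns about a storage group. */
  public static final class StorageGroupInfo {
    public final String name;
    public StorageGroupInfo(String name) {
      this.name = name;
    }
  }

  public interface PtCache {
    StorageGroupInfo get(String sg);               // null if not cached
    void put(String sg, StorageGroupInfo info);
  }

  public interface PtGroupClient {
    StorageGroupInfo fetchStorageGroup(String sg); // ask the PT group
  }

  private final PtCache ptCache;
  private final PtGroupClient ptGroup;

  public QuerySgCoordinator(PtCache ptCache, PtGroupClient ptGroup) {
    this.ptCache = ptCache;
    this.ptGroup = ptGroup;
  }

  public StorageGroupInfo queryStorageGroup(String sg) {
    StorageGroupInfo cached = ptCache.get(sg);     // 1. check the local PT cache
    if (cached != null) {
      return cached;                               // 2. cache hit: return directly
    }
    StorageGroupInfo fetched = ptGroup.fetchStorageGroup(sg); // 3. ask the PT group
    if (fetched != null) {
      ptCache.put(sg, fetched);                    // 4. update the local PT cache
    }
    return fetched;
  }
}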


3.4 Query fuzzy SGs (“fuzzy” means paths that contain *)

  • 1. The coordinator sends the request to the PT group.
  • 2. After success, it may update its local PT cache.

3.5 Create TS

  • 1. The coordinator checks whether its local schema already has the TS.
  • 2. If it already exists, reject the request.
  • 3. If not, search the PT cache (and possibly pull a newer cache) and send the request to the corresponding data partition/group leader.

This means all other nodes still do not have this information in their PT caches.
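The routing in step 3 could look like the sketch below; LocalSchema (standing in for the local MManager), PtCache, and DataGroupClient are hypothetical interfaces used only for illustration.

// Hypothetical sketch of the Create TS routing on the coordinator.
public class CreateTsCoordinator {

  public enum Status { SUCCESS, ALREADY_EXISTS, ERROR }

  public interface LocalSchema {                  // stands in for the local MManager
    boolean containsTimeseries(String path);
  }

  public interface PtCache {
    String leaderOfGroupFor(String path);         // may pull a newer cache on a miss
  }

  public interface DataGroupClient {
    Status createTimeseries(String leader, String path);
  }

  private final LocalSchema localSchema;
  private final PtCache ptCache;
  private final DataGroupClient dataGroups;

  public CreateTsCoordinator(LocalSchema localSchema, PtCache ptCache, DataGroupClient dataGroups) {
    this.localSchema = localSchema;
    this.ptCache = ptCache;
    this.dataGroups = dataGroups;
  }

  public Status createTimeseries(String path) {
    if (localSchema.containsTimeseries(path)) {   // 1. check the local schema
      return Status.ALREADY_EXISTS;               // 2. reject if it already exists
    }
    // 3. resolve the owning data partition/group from the PT cache (possibly
    //    pulling a newer cache) and forward to that group's leader
    String leader = ptCache.leaderOfGroupFor(path);
    return dataGroups.createTimeseries(leader, path);
  }
}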

3.6 Delete TS

  • 1. The coordinator checks its PT cache (and may pull a newer cache).
  • 2. The coordinator sends the request to the corresponding data group leader.
  • 3. The leader forwards it to ALL nodes.
  • 4. Each node deletes its data and MTree entries.
  • Once all nodes succeed, the PT group calls the Raft protocol to finish deleting the TS.
  • If some nodes fail, the PT group calls the Raft protocol to stop deleting the TS (which means some nodes still have the TS and its data).

When restarting, if the Raft log contains a start operation but no matching stop/finish operation, call stop (the same rule as in 3.2).
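This restart rule, shared by Delete SG (3.2) and Delete TS, amounts to closing every unmatched start marker with a stop. A sketch, assuming a hypothetical deletion-log abstraction on top of Raft:

import java.util.List;

// Hypothetical sketch of the restart check shared by Delete SG (3.2) and
// Delete TS (3.6): any START without a FINISH/STOP is closed with a STOP.
public class DeletionRecovery {

  public enum Marker { START, FINISH, STOP }

  public interface DeletionLog {
    List<String> targetsStartedButNotClosed();    // STARTs with no FINISH/STOP
    void append(String target, Marker marker);    // replicated via Raft
  }

  private final DeletionLog log;

  public DeletionRecovery(DeletionLog log) {
    this.log = log;
  }

  /** Called when the node restarts, before serving new deletion requests. */
  public void recover() {
    for (String target : log.targetsStartedButNotClosed()) {
      log.append(target, Marker.STOP);            // the target may remain partially deleted
    }
  }
}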

3.7 Query given TSs (Schema)

  • 1. The coordinator checks whether its MManager already has the TSs.
  • 2. If it does, return them.
  • 3. If not, query the PT cache (and possibly pull a newer cache) and send the request to a data group node (not necessarily the leader).
  • 4. After success, update the MManager.
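A minimal sketch of this schema lookup; MManager here is only an interface placeholder for the local schema cache, and TsSchema stands in for whatever schema information the data group returns.

// Hypothetical sketch of looking up the schema of given timeseries.
public class QueryTsSchema {

  /** Placeholder for the schema information returned per timeseries. */
  public static final class TsSchema {
    public final String path;
    public TsSchema(String path) {
      this.path = path;
    }
  }

  public interface MManager {                     // local schema cache
    TsSchema get(String path);                    // null if unknown locally
    void cache(TsSchema schema);
  }

  public interface PtCache {
    String anyNodeOfGroupFor(String path);        // any member, not necessarily the leader
  }

  public interface DataGroupClient {
    TsSchema fetchSchema(String node, String path);
  }

  private final MManager mManager;
  private final PtCache ptCache;
  private final DataGroupClient dataGroups;

  public QueryTsSchema(MManager mManager, PtCache ptCache, DataGroupClient dataGroups) {
    this.mManager = mManager;
    this.ptCache = ptCache;
    this.dataGroups = dataGroups;
  }

  public TsSchema querySchema(String path) {
    TsSchema local = mManager.get(path);          // 1. check the local MManager
    if (local != null) {
      return local;                               // 2. already known: return
    }
    String node = ptCache.anyNodeOfGroupFor(path); // 3. route via the PT cache
    TsSchema remote = dataGroups.fetchSchema(node, path);
    if (remote != null) {
      mManager.cache(remote);                     // 4. update the local MManager
    }
    return remote;
  }
}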

3.8 Query fuzzy TSs (“fuzzy” means paths that contain *)

  • 1. Query the PT cache (and possibly pull a newer cache) and send the request to a data group node (not necessarily the leader).
  • 2. After success, possibly update the MManager.

3.9 Write

  • 1. The coordinator checks whether its PT cache can process the request (and may pull a newer cache).
    • If the SG does not exist in the cache, pull the cache together with an update request to the PT group.
    • If the SG exists in the PT but the time range does not, send an update request to the PT group.
  • 2. Send the request to the corresponding data partition (not group) leader, i.e., Node B in the figure.
    • Tip: a data partition may have several Raft groups (= SG * virtual SGs, for now).
  • 3. For each operation received by B, check whether B is the leader of the corresponding Raft group; if not, forward it to the leader (this is only a temporary solution and will be redesigned).
  • 4. For each operation received by a Raft group leader (B, C, …), if it has no schema locally, forward a “Create TS” request to the corresponding data partition/group leader. It may receive “already exists”, which means this node begins to process a new time range of the TS, or “create success”, which means the TS is new. Either way, update the local MManager.
  • 5. Call Raft to replicate the operation.
  • 6. Execute the operation.
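A rough end-to-end sketch of the write path, split into the coordinator side (steps 1-2) and the data-node side (steps 3-6); every interface and the WritePlan type are hypothetical placeholders, and the leader forwarding mirrors the temporary solution in step 3.

// Hypothetical sketch of the write path described in steps 1-6.
public class WriteProcess {

  public interface PtCache {
    boolean covers(WritePlan plan);               // SG and time range known?
    void pullAndUpdate(WritePlan plan);           // pull cache / send update to the PT group
    String dataPartitionLeader(WritePlan plan);   // the data partition leader, "Node B" in step 2
  }

  public interface DataNodeClient {
    void write(String node, WritePlan plan);
  }

  public interface RaftGroup {
    boolean isLocalLeader();
    String leader();
    boolean hasSchema(String path);
    void createSchema(String path);               // forward "Create TS" and update the local MManager
    void replicate(WritePlan plan);               // call Raft for the replicas
    void execute(WritePlan plan);
  }

  /** Placeholder for one insertion targeting a single timeseries. */
  public static final class WritePlan {
    public final String path;
    public final long time;
    public final Object value;
    public WritePlan(String path, long time, Object value) {
      this.path = path;
      this.time = time;
      this.value = value;
    }
  }

  // --- coordinator side (steps 1-2) ---
  public void routeWrite(WritePlan plan, PtCache ptCache, DataNodeClient nodes) {
    if (!ptCache.covers(plan)) {
      ptCache.pullAndUpdate(plan);                // missing SG or time range: ask the PT group
    }
    nodes.write(ptCache.dataPartitionLeader(plan), plan); // send to the data partition leader
  }

  // --- data-node side (steps 3-6), temporary forwarding solution ---
  public void handleWrite(WritePlan plan, RaftGroup group, DataNodeClient nodes) {
    if (!group.isLocalLeader()) {
      nodes.write(group.leader(), plan);          // 3. forward to the Raft group leader
      return;
    }
    if (!group.hasSchema(plan.path)) {
      group.createSchema(plan.path);              // 4. create the TS on demand
    }
    group.replicate(plan);                        // 5. call Raft for the replicas
    group.execute(plan);                          // 6. apply locally
  }
}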

3.10 Query given TSs
