...

In the discussion, we summarize some drawbacks of the current cluster module, and propose some ideas.

We welcome all contributors to supplement the design (reply with comments if you do not have edit permission).

...

  • Raft's linearizable serialization should stay consistent with the standalone module's linearization.
  • Data partitioning (how to partition data, and how to assign the data blocks to different nodes)
  • Data partitioning granularity: SG + time range (see the sketch after this list).
  • Metrics in cluster
  • Scale-out can be only executed one by one
  • There are many synchronized keywords
  • How to partition the schema -> keep the current status
  • The current Raft and business code are coupled -> decouple
  • There are too many thread pools -> SEDA
  • The meta group covers all the nodes, which makes it impossible to scale to hundreds of nodes -> reduce it to several nodes
  • Query framework -> MPP
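
The "SG + time range" granularity above can be made concrete with a small sketch. Everything below (the PartitionSlot class and the TIME_PARTITION_INTERVAL_MS constant) is an illustrative assumption, not the actual cluster module code: it only shows that all points of one storage group that fall into the same time range map to the same slot.

```java
// Hypothetical illustration of the "SG + time range" partitioning granularity.
// Class and constant names are assumptions, not the real implementation.
public class PartitionSlot {
  // Assumed width of one time partition, e.g. one day.
  private static final long TIME_PARTITION_INTERVAL_MS = 86_400_000L;

  private final String storageGroup;
  private final long timePartitionId;

  public PartitionSlot(String storageGroup, long timestampMs) {
    this.storageGroup = storageGroup;
    // Points of the same storage group in the same time range share one slot,
    // and therefore land on the same replica set.
    this.timePartitionId = timestampMs / TIME_PARTITION_INTERVAL_MS;
  }

  @Override
  public String toString() {
    return storageGroup + "#" + timePartitionId;
  }
}
```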

New Design

...

1、Rules

  • Rule for the number of involved replicas (sketched after this list):
    • Write operation: quorum
    • Delete (metadata) operation: All nodes
      • This means the deletion operation has lower availability.
    • Query operation: one node or half of the replicas
  • For all operations, if a node forwards a request to the wrong node, the receiver replies with an error.
    • But since we currently cannot guarantee that all Raft groups in a data partition have the same leader, we have to add a forwarding stage if the receiver is a follower. If the receiver does not belong to the group, it replies with an error.
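
A minimal sketch of the replica rule above, assuming a fixed replication factor per partition; the OperationType enum and the requiredReplicas method are illustrative names, not part of the actual cluster code.

```java
// Illustrative only: how many replicas each kind of operation must involve.
public enum OperationType {
  WRITE, DELETE_METADATA, QUERY;

  /** Replicas that must acknowledge, given the replication factor of the partition. */
  public int requiredReplicas(int replicationFactor) {
    switch (this) {
      case WRITE:
        return replicationFactor / 2 + 1;   // quorum: a majority of the replicas
      case DELETE_METADATA:
        return replicationFactor;           // all nodes, hence lower availability
      case QUERY:
        // Half of the replicas (rounded up); one node would be the weak-consistency choice.
        return (replicationFactor + 1) / 2;
      default:
        throw new IllegalStateException("unknown operation type: " + this);
    }
  }
}
```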

...

2、Roles

There are three types of managers in the cluster. Each node may host all three types of managers (sketched after the list below).

  • Partition Table Raft Group managers
    • Several nodes in the cluster. All other nodes must know them (from the configuration file).
    • Knows how the schema is partitioned, and how the data is partitioned.
  • Partitioned Schema managers
    • Several nodes in the cluster.
    • Each Partitioned Schema has a raft group.
  • Partitioned Data managers
    • Each data partition has several Raft groups (= SG count * virtual node count in the partition table).
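
Illustrative sketch only: one way to represent on a node the three manager roles listed above. The class and enum names are assumptions made for readability, not the actual interfaces of the cluster module.

```java
import java.util.EnumSet;
import java.util.Set;

public class ClusterNode {

  /** The three kinds of managers a node may host; a single node can hold all three. */
  public enum ManagerRole {
    PARTITION_TABLE,    // member of the Partition Table Raft group, known to all nodes
    PARTITIONED_SCHEMA, // member of one or more schema-partition Raft groups
    PARTITIONED_DATA    // member of one or more data-partition Raft groups
  }

  private final Set<ManagerRole> roles = EnumSet.noneOf(ManagerRole.class);

  public void addRole(ManagerRole role) {
    roles.add(role);
  }

  public boolean hasRole(ManagerRole role) {
    return roles.contains(role);
  }
}
```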

...

3、Process

3.1 CreateSG

  • 1. The coordinator checks whether the PT (Partition Table) cache has the SG (Storage Group).
  • 2. If it already exists, reject the request.
  • 3. If not, send the request to the PT Raft group.
  • 4. After success, update the local PT cache.
  • This means all other nodes do not have the info in their PT cache (the flow is sketched after this list).
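
The following sketch walks through the four steps above from the coordinator's point of view. PartitionTableCache and PartitionTableClient are hypothetical interfaces introduced only for this example.

```java
public class CreateSgCoordinator {

  /** Hypothetical local view of the Partition Table. */
  public interface PartitionTableCache {
    boolean containsStorageGroup(String sg);
    void addStorageGroup(String sg);
  }

  /** Hypothetical client of the PT Raft group. */
  public interface PartitionTableClient {
    boolean createStorageGroup(String sg); // true once the entry is committed
  }

  private final PartitionTableCache ptCache;
  private final PartitionTableClient ptGroup;

  public CreateSgCoordinator(PartitionTableCache ptCache, PartitionTableClient ptGroup) {
    this.ptCache = ptCache;
    this.ptGroup = ptGroup;
  }

  public boolean createStorageGroup(String sg) {
    // 1. Check the local PT cache for the SG.
    if (ptCache.containsStorageGroup(sg)) {
      return false; // 2. Already present: reject.
    }
    // 3. Not present: send the request to the PT Raft group.
    if (!ptGroup.createStorageGroup(sg)) {
      return false;
    }
    // 4. After success, update the local PT cache only;
    //    other nodes do not learn about the SG until they need it.
    ptCache.addStorageGroup(sg);
    return true;
  }
}
```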

3.2 DeleteSG

  • The coordinator sends the request to the PT Raft group.
  • The PT group calls the Raft protocol (to start the operation and get approval).
  • The PT group sends the request to ALL NODES.
  • Each node deletes its data and schema (MTree), and updates its local PT cache.
  • Once all nodes succeed, PT calls the Raft protocol to finish deleting the SG.
  • Once some nodes fail, PT calls the Raft protocol to stop deleting the SG (meaning some nodes still have the SG, schema, and data).

When restarting, if the Raft group has a start operation but no stop/finish operation, call stop (a sketch of this flow follows).
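
A sketch of the DeleteSG flow and the restart rule, under the assumption that the PT Raft group records replicated start/finish/stop entries. PtRaftLog, NodeClient, and recoverOnRestart are hypothetical names used only to make the bookkeeping concrete.

```java
import java.util.List;

public class DeleteSgProcess {

  /** Hypothetical replicated log of the PT Raft group. */
  public interface PtRaftLog {
    void logStartDelete(String sg);   // "start deleting SG" entry
    void logFinishDelete(String sg);  // "finish deleting SG" entry
    void logStopDelete(String sg);    // "stop deleting SG" (abort) entry
    List<String> unfinishedDeletes(); // starts without a matching stop/finish
  }

  /** Hypothetical client of one cluster node. */
  public interface NodeClient {
    boolean deleteStorageGroup(String sg); // delete data, schema, update local PT cache
  }

  private final PtRaftLog raftLog;
  private final List<NodeClient> allNodes;

  public DeleteSgProcess(PtRaftLog raftLog, List<NodeClient> allNodes) {
    this.raftLog = raftLog;
    this.allNodes = allNodes;
  }

  public boolean deleteStorageGroup(String sg) {
    // Start the operation through the Raft protocol to get approval.
    raftLog.logStartDelete(sg);

    // Send the deletion to ALL nodes.
    boolean allSucceeded = true;
    for (NodeClient node : allNodes) {
      allSucceeded &= node.deleteStorageGroup(sg);
    }

    if (allSucceeded) {
      raftLog.logFinishDelete(sg); // every node dropped the SG
    } else {
      raftLog.logStopDelete(sg);   // some nodes may still hold the SG, schema, and data
    }
    return allSucceeded;
  }

  /** On restart: any delete with a start entry but no stop/finish entry is stopped. */
  public void recoverOnRestart() {
    for (String sg : raftLog.unfinishedDeletes()) {
      raftLog.logStopDelete(sg);
    }
  }
}
```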

...