...

In the discussion, we summarize some drawbacks of the current cluster module, and propose some ideas.

We welcome all contributors to supplement the design (reply with comments if you do not have edit permission).

...

  • Raft's linearizable serialization should stay consistent with the standalone module's linearization.
  • Data partitioning (how to partition data, and how to assign the data blocks to different nodes)
  • Data partitioning granularity: SG + time range (see the sketch after this list).
  • Metrics in cluster
  • Scale-out can be only executed one by one
  • There are many synchronized keywords
  • How to partition the schema -> keep the current status
  • The current Raft and business code are coupled -> decouple
  • There are too many thread pools -> SEDA
  • The meta group covers all the nodes, which makes it impossible to scale to hundreds of nodes -> reduce it to several nodes
  • Query framework -> MPP
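
The "SG + time range" granularity above can be made concrete with a small sketch. Everything below (the PartitionSlot class and the TIME_PARTITION_INTERVAL_MS constant) is an illustrative assumption, not the actual cluster module code: it only shows that all points of one storage group that fall into the same time range map to the same slot.

```java
// Hypothetical illustration of the "SG + time range" partitioning granularity.
// Class and constant names are assumptions, not the real implementation.
public class PartitionSlot {
  // Assumed width of one time partition, e.g. one day.
  private static final long TIME_PARTITION_INTERVAL_MS = 86_400_000L;

  private final String storageGroup;
  private final long timePartitionId;

  public PartitionSlot(String storageGroup, long timestampMs) {
    this.storageGroup = storageGroup;
    // Points of the same storage group in the same time range share one slot,
    // and therefore land on the same replica set.
    this.timePartitionId = timestampMs / TIME_PARTITION_INTERVAL_MS;
  }

  @Override
  public String toString() {
    return storageGroup + "#" + timePartitionId;
  }
}
```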

New Design

...

1、Rules

  • Rule for the number of involved replicas (sketched after this list):
    • Write operation: quorum
    • Delete (metadata) operation: All nodes
      • This means the deletion operation has lower availability.
    • Query operation: one node or half of the replicas
  • For all operations, if a node forwards a request to the wrong node, the receiver replies with an error.
    • But since we currently cannot guarantee that all Raft groups in a data partition have the same leader, we have to add a forwarding stage if the receiver is a follower. If the receiver does not belong to the group, it replies with an error.
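
A minimal sketch of the replica rule above, assuming a fixed replication factor per partition; the OperationType enum and the requiredReplicas method are illustrative names, not part of the actual cluster code.

```java
// Illustrative only: how many replicas each kind of operation must involve.
public enum OperationType {
  WRITE, DELETE_METADATA, QUERY;

  /** Replicas that must acknowledge, given the replication factor of the partition. */
  public int requiredReplicas(int replicationFactor) {
    switch (this) {
      case WRITE:
        return replicationFactor / 2 + 1;   // quorum: a majority of the replicas
      case DELETE_METADATA:
        return replicationFactor;           // all nodes, hence lower availability
      case QUERY:
        // Half of the replicas (rounded up); one node would be the weak-consistency choice.
        return (replicationFactor + 1) / 2;
      default:
        throw new IllegalStateException("unknown operation type: " + this);
    }
  }
}
```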

...

2、Roles

There are three types of managers in the cluster. Each node may host all three types of managers (sketched after the list below).

  • Partition Table Raft Group managers
    • Several nodes in the cluster. All other nodes must know them (from the configuration file).
    • Knows how the schema is partitioned, and how the data is partitioned.
  • Partitioned Schema managers
    • Several nodes in the cluster.
    • Each Partitioned Schema has a raft group.
  • Partitioned Data managers
    • Each data partition has several Raft groups (= SG count * virtual node count in the partition table).
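
Illustrative sketch only: one way to represent on a node the three manager roles listed above. The class and enum names are assumptions made for readability, not the actual interfaces of the cluster module.

```java
import java.util.EnumSet;
import java.util.Set;

public class ClusterNode {

  /** The three kinds of managers a node may host; a single node can hold all three. */
  public enum ManagerRole {
    PARTITION_TABLE,    // member of the Partition Table Raft group, known to all nodes
    PARTITIONED_SCHEMA, // member of one or more schema-partition Raft groups
    PARTITIONED_DATA    // member of one or more data-partition Raft groups
  }

  private final Set<ManagerRole> roles = EnumSet.noneOf(ManagerRole.class);

  public void addRole(ManagerRole role) {
    roles.add(role);
  }

  public boolean hasRole(ManagerRole role) {
    return roles.contains(role);
  }
}
```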

...

3、Process

3.1 CreateSG

  • 1. The coordinator checks whether the PT (Partition Table) cache has the SG (Storage Group).
  • 2. If it already exists, reject the request.
  • 3. If not, send the request to the PT Raft group.
  • 4. After success, update the local PT cache.
  • This means all other nodes do not have the info in their PT cache (the flow is sketched after this list).
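
The following sketch walks through the four steps above from the coordinator's point of view. PartitionTableCache and PartitionTableClient are hypothetical interfaces introduced only for this example.

```java
public class CreateSgCoordinator {

  /** Hypothetical local view of the Partition Table. */
  public interface PartitionTableCache {
    boolean containsStorageGroup(String sg);
    void addStorageGroup(String sg);
  }

  /** Hypothetical client of the PT Raft group. */
  public interface PartitionTableClient {
    boolean createStorageGroup(String sg); // true once the entry is committed
  }

  private final PartitionTableCache ptCache;
  private final PartitionTableClient ptGroup;

  public CreateSgCoordinator(PartitionTableCache ptCache, PartitionTableClient ptGroup) {
    this.ptCache = ptCache;
    this.ptGroup = ptGroup;
  }

  public boolean createStorageGroup(String sg) {
    // 1. Check the local PT cache for the SG.
    if (ptCache.containsStorageGroup(sg)) {
      return false; // 2. Already present: reject.
    }
    // 3. Not present: send the request to the PT Raft group.
    if (!ptGroup.createStorageGroup(sg)) {
      return false;
    }
    // 4. After success, update the local PT cache only;
    //    other nodes do not learn about the SG until they need it.
    ptCache.addStorageGroup(sg);
    return true;
  }
}
```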

3.2 DeleteSG

  • The coordinator sends the request to the PT Raft group.
  • The PT group calls the Raft protocol (to start the operation and get approval).
  • The PT group sends the request to ALL NODES.
  • Each node deletes its data and schema (MTree), and updates its local PT cache.
  • Once all nodes succeed, PT calls the Raft protocol to finish deleting the SG.
  • Once some nodes fail, PT calls the Raft protocol to stop deleting the SG (meaning some nodes still have the SG, schema, and data).

When restarting, if the Raft group has a start operation but no stop/finish operation, call stop (a sketch of this flow follows).
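
A sketch of the DeleteSG flow and the restart rule, under the assumption that the PT Raft group records replicated start/finish/stop entries. PtRaftLog, NodeClient, and recoverOnRestart are hypothetical names used only to make the bookkeeping concrete.

```java
import java.util.List;

public class DeleteSgProcess {

  /** Hypothetical replicated log of the PT Raft group. */
  public interface PtRaftLog {
    void logStartDelete(String sg);   // "start deleting SG" entry
    void logFinishDelete(String sg);  // "finish deleting SG" entry
    void logStopDelete(String sg);    // "stop deleting SG" (abort) entry
    List<String> unfinishedDeletes(); // starts without a matching stop/finish
  }

  /** Hypothetical client of one cluster node. */
  public interface NodeClient {
    boolean deleteStorageGroup(String sg); // delete data, schema, update local PT cache
  }

  private final PtRaftLog raftLog;
  private final List<NodeClient> allNodes;

  public DeleteSgProcess(PtRaftLog raftLog, List<NodeClient> allNodes) {
    this.raftLog = raftLog;
    this.allNodes = allNodes;
  }

  public boolean deleteStorageGroup(String sg) {
    // Start the operation through the Raft protocol to get approval.
    raftLog.logStartDelete(sg);

    // Send the deletion to ALL nodes.
    boolean allSucceeded = true;
    for (NodeClient node : allNodes) {
      allSucceeded &= node.deleteStorageGroup(sg);
    }

    if (allSucceeded) {
      raftLog.logFinishDelete(sg); // every node dropped the SG
    } else {
      raftLog.logStopDelete(sg);   // some nodes may still hold the SG, schema, and data
    }
    return allSucceeded;
  }

  /** On restart: any delete with a start entry but no stop/finish entry is stopped. */
  public void recoverOnRestart() {
    for (String sg : raftLog.unfinishedDeletes()) {
      raftLog.logStopDelete(sg);
    }
  }
}
```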

...