...

  • Raft's serialization (linearization) order should stay consistent with the linearization of the standalone module.
  • Data partitioning (how to partition data, and how to assign the data blocks to different nodes)
  • Data partitioning granularity: SG + time range (see the sketch after this list).
  • Metrics in cluster
  • Scale-out can only be executed one node at a time
  • There are many synchronized keywords
  • How to partition the schema -> keep the current status
  • Currently the Raft and business code are coupled -> decouple them
  • There are too many thread pools -> SEDA
  • The meta group covers all the nodes, which makes it impossible to scale to hundreds of nodes -> reduce it to several nodes
  • Query framework -> MPP
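
As a minimal sketch of the partitioning granularity above (a data block is identified by a storage group plus a time range), the following Java class is purely illustrative; the class and field names are hypothetical and not the actual IoTDB implementation:

    // Sketch only: hypothetical names, not the actual IoTDB code.
    import java.util.Objects;

    public final class DataPartitionKey {
        private final String storageGroup;   // e.g. "root.sg1"
        private final long timePartitionId;  // timestamp / partition interval

        public DataPartitionKey(String storageGroup, long timestamp, long partitionIntervalMs) {
            this.storageGroup = storageGroup;
            this.timePartitionId = timestamp / partitionIntervalMs;
        }

        @Override
        public boolean equals(Object o) {
            if (!(o instanceof DataPartitionKey)) return false;
            DataPartitionKey k = (DataPartitionKey) o;
            return timePartitionId == k.timePartitionId && storageGroup.equals(k.storageGroup);
        }

        @Override
        public int hashCode() {
            return Objects.hash(storageGroup, timePartitionId);
        }
    }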

New Design

...

1. Rules

  • Rule for the number of involved replicas (see the sketch after this list):
    • Write operation: quorum
    • Delete (metadata) operation: all nodes
      • This means the deletion operation has lower availability.
    • Query operation: one node, half of the replica nodes, or two nodes (sync with the leader)
  • For all operations, if a node forwards a request incorrectly, the receiver replies with an error.
    • But since we currently cannot guarantee that all raft groups in a data partition have the same leader, we have to add a forwarding stage when the receiver is a follower. If the receiver does not belong to the group at all, it replies with an error.
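
A minimal Java sketch of the replica rule above (how many replica acknowledgements each operation type needs); the class, enum, and method names are hypothetical:

    // Sketch only: illustrates the replica-count rule per operation type; all names are hypothetical.
    public final class ReplicaRule {

        public enum Operation { WRITE, DELETE_METADATA, QUERY }

        /** How many replica acknowledgements an operation needs, given the replica number. */
        public static int requiredAcks(Operation op, int replicaNum, boolean syncWithLeader) {
            switch (op) {
                case WRITE:
                    return replicaNum / 2 + 1;       // quorum
                case DELETE_METADATA:
                    return replicaNum;               // all nodes -> lower availability
                case QUERY:
                    return syncWithLeader ? 2 : 1;   // one node, or two when syncing with the leader
                default:
                    throw new IllegalArgumentException("unknown operation: " + op);
            }
        }
    }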

...

2. Roles

There are three types of managers in the cluster. Each node may host all three types of managers.

  • Partition Table managers
    • Several nodes in the cluster. All other nodes must know them (from the configuration file).
    • Knows how the schema is partitioned, and how the data is partitioned.
  • Partitioned Schema managers
    • Several nodes in the cluster.
    • Each Partitioned Schema has a raft group.
  • Partitioned Data managers
    • Each data partition has several raft groups (= the number of SGs * virtual nodes in the partition table); see the sketch below.
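
A minimal Java sketch of the raft-group layout of one data partition described above (one raft group per storage group * virtual node); all names are hypothetical:

    // Sketch only: hypothetical names; one raft group per (storage group, virtual node) pair.
    import java.util.ArrayList;
    import java.util.List;

    public final class DataPartitionGroups {

        /** Raft group ids for one data partition: SG count * virtual node count groups. */
        public static List<String> raftGroupIds(List<String> storageGroups, int virtualNodeNum) {
            List<String> groupIds = new ArrayList<>();
            for (String sg : storageGroups) {
                for (int v = 0; v < virtualNodeNum; v++) {
                    groupIds.add(sg + "-vnode" + v);
                }
            }
            return groupIds;
        }
    }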

3. Module

  • Partition table manager: manages the data/schema partition table and provides partition-table services
  • Client RPC: receives requests (read/write * data/schema)
  • Coordinator: routes requests to the data/schema groups according to the partition table
  • PlanExecutor: handles requests; for write requests, replicates the request via the Raft module and executes the write.

...

  • For queries, splits the plan, routes the sub-plans to the related data groups, and merges the results.
  • Raft: log replication
  • StorageEngine: manages local write requests and data queries
  • QueryExecutor: manages local query requests
  • MManager: manages schema info
  • Node RPC: internal RPC module (module collaboration is sketched below)
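
The following Java interfaces are a rough sketch of how these modules are expected to collaborate; all names and signatures are hypothetical, not the real IoTDB APIs:

    // Sketch only: hypothetical interfaces, not the actual IoTDB classes.
    public final class ClusterModules {

        /** Partition table manager: answers "which group owns this data/schema?" */
        public interface PartitionTableManager {
            String routeDataGroup(String storageGroup, long timestamp);
            String routeSchemaGroup(String path);
        }

        /** Coordinator: routes client requests to data/schema groups via the partition table. */
        public interface Coordinator {
            void write(String path, long timestamp, Object value);
            Object query(String pathPattern, long startTime, long endTime);
        }

        /** PlanExecutor: replicates write plans through Raft, then applies them locally. */
        public interface PlanExecutor {
            void execute(Object physicalPlan);
        }
    }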

4. Process

4.1 CreateSG

  • 1. The coordinator checks whether the PT (Partition Table) cache has the SG (Storage Group)
  • 2. if it already has it, reject.
  • 3. if not, send the request to the PT raft group
  • 4. after success, update the local PT cache
  • This means all other nodes do not have the info in their PT cache (the flow is sketched below).
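
A minimal Java sketch of the coordinator-side CreateSG flow above (steps 1-4); class and method names are hypothetical, and the PT-group call is a placeholder:

    // Sketch only: hypothetical names; mirrors CreateSG steps 1-4.
    import java.util.Set;
    import java.util.concurrent.ConcurrentHashMap;

    public final class CreateSgCoordinator {

        private final Set<String> ptCache = ConcurrentHashMap.newKeySet(); // local PT cache of known SGs

        /** Returns false if the SG is already known (step 2), true once the PT raft group accepts it. */
        public boolean createStorageGroup(String sg) {
            if (ptCache.contains(sg)) {
                return false;                                 // step 2: reject
            }
            boolean accepted = sendToPartitionTableGroup(sg); // step 3: proposal to the PT raft group
            if (accepted) {
                ptCache.add(sg);                              // step 4: update only the local PT cache
            }
            return accepted;
        }

        private boolean sendToPartitionTableGroup(String sg) {
            return true; // placeholder for the RPC + raft call to the PT group
        }
    }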

4.2 DeleteSG

  • The coordinator sends the request to the PT raft group
  • The PT group calls the raft protocol (to start the operation and get approval).
  • The PT group sends the request to ALL NODES
  • for each node, delete the data and schema, then update the local PT cache
  • Once all nodes succeed, the PT group calls the raft protocol to finish deleting the SG
  • If some nodes fail, the PT group calls the raft protocol to stop deleting the SG (meaning some nodes still have the SG, schema, and data)

When restarting, if the raft log contains a start operation but no matching stop/finish operation, call stop (see the recovery sketch below).
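
A minimal Java sketch of the start/finish/stop bracketing used by DeleteSG and of the restart recovery rule above; the log representation and all names are hypothetical:

    // Sketch only: hypothetical names; a START without a matching FINISH/STOP must be stopped on restart.
    import java.util.List;

    public final class DeleteSgProcedure {

        enum LogType { START, FINISH, STOP }

        static final class LogEntry {
            final LogType type;
            final String sg;
            LogEntry(LogType type, String sg) { this.type = type; this.sg = sg; }
        }

        /** Scans the raft log on restart and reports whether a stop must be issued for this SG. */
        public static boolean needsStop(List<LogEntry> raftLog, String sg) {
            boolean started = false;
            for (LogEntry e : raftLog) {
                if (!e.sg.equals(sg)) continue;
                if (e.type == LogType.START) started = true;
                if (e.type == LogType.FINISH || e.type == LogType.STOP) started = false;
            }
            return started;
        }
    }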

4.3 Query a given SG

  • 1. the coordinator checks whether the PT cache has the SG
  • 2. if it already has it, return.
  • 3. if not, send the request to the PT group
  • 4. after success, update the local PT cache


4.4 Query fuzzy SGs (“fuzzy” means paths that contain *)

  • 1. the coordinator sends the request to the PT group
  • 2. after success, it may update the local PT cache

4.5 CreateTS

  • 1. the coordinator checks whether the local schema has the TS
  • 2. if it already has it, reject.
  • 3. if not, search the PT cache (and possibly pull a new cache) and send the request to the corresponding data partition/group leader

This means all other nodes do not have the info in their PT cache (steps 1-3 are sketched below).
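
A minimal Java sketch of the coordinator-side CreateTS checks above (steps 1-3); names are hypothetical and the routing/RPC calls are placeholders:

    // Sketch only: hypothetical names; mirrors CreateTS steps 1-3.
    import java.util.HashSet;
    import java.util.Set;

    public final class CreateTsCoordinator {

        private final Set<String> localSchema = new HashSet<>(); // stand-in for the local MManager

        public boolean createTimeseries(String path) {
            if (localSchema.contains(path)) {
                return false;                                // step 2: reject, TS already exists locally
            }
            String leader = lookupDataGroupLeader(path);     // step 3: PT cache lookup, may pull a new cache
            boolean created = sendCreateToLeader(leader, path);
            if (created) {
                localSchema.add(path);                       // remember the new TS locally
            }
            return created;
        }

        private String lookupDataGroupLeader(String path) { return "node-B"; } // placeholder
        private boolean sendCreateToLeader(String leader, String path) { return true; } // placeholder RPC
    }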

4.6 DeleteTS

  • 1. the coordinator checks the PT cache (and may pull a new cache)
  • 2. the coordinator sends the request to the corresponding data group leader
  • 3. the leader forwards it to ALL NODES
  • 4. for each node, delete the data and the MTree entry
  • Once all nodes succeed, the PT group calls the raft protocol to finish deleting the TS
  • If some nodes fail, the PT group calls the raft protocol to stop deleting the TS (meaning some nodes still have the TS and data)

When restarting, if the raft log contains a start operation but no matching stop/finish operation, call stop.

4.7 Query given TSs (Schema)

  • 1. the coordinator checks whether the MManager has the TSs
  • 2. if it already has them, return.
  • 3. if not, query the PT cache (and possibly pull the cache), and send the request to a data group node (maybe not the leader)
  • 4. after success, update the MManager

4.8 Query fuzzy TSs (“fuzzy” means paths that contain *)

  • 1. query the PT cache (and possibly pull the cache), and send the request to a data group node (maybe not the leader)
  • 2. after success, possibly update the MManager

4.9 Write

  • 1. the coordinator checks whether the PT cache can process the request (and may pull the cache)
    • 1. if the SG does not exist, pull the cache from the PT group with an update request attached.
    • 2. if the SG exists in the PT but the time range does not, send an update to the PT group.
  • 2. send the request to the corresponding data partition (not group) leader, i.e., Node B in the figure.
  • 3. Tips: a data partition may have several raft groups (= SG * virtual SG, for now)
  • 4. for each operation on B, check whether B is the leader of the corresponding raft group; if not, forward it to the leader (this is just a temporary solution and will be redesigned eventually)
  • 5. for each operation received by the raft group leader (maybe B, C, …), if it has no schema locally, forward “Create TS” to the corresponding data partition/group leader. (It may receive “already exist”, which means this node begins to process a new time range of the TS, or “create success”, which means this TS is new. Update the local MManager.)
  • 6. call raft for replication.
  • 7. execute (the full write path is sketched after this list).
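
A condensed Java sketch of the write path above; every class and method name is hypothetical, and the routing, schema, and raft calls are placeholders standing in for the real modules:

    // Sketch only: hypothetical names; condenses the write steps above.
    public final class WriteCoordinator {

        public void write(String sg, String path, long timestamp, Object value) {
            // Step 1: make sure the PT cache covers this SG + time range (may pull/update via the PT group).
            ensurePartitionTableCovers(sg, timestamp);

            // Steps 2-4: route to the data-partition leader (e.g. Node B); it may still forward the
            // operation to the leader of the raft group that owns this (SG, virtual node).
            String raftGroupLeader = routeToRaftGroupLeader(sg, timestamp);

            // Step 5: if the leader has no schema for the path, create it first. An "already exist"
            // reply only means a new time range of an existing TS is being processed.
            if (!hasLocalSchema(raftGroupLeader, path)) {
                createTimeseries(raftGroupLeader, path);
            }

            // Steps 6-7: replicate through raft, then execute locally.
            replicateThroughRaft(raftGroupLeader, path, timestamp, value);
            executeLocally(path, timestamp, value);
        }

        private void ensurePartitionTableCovers(String sg, long timestamp) { /* pull/update the PT cache */ }
        private String routeToRaftGroupLeader(String sg, long timestamp) { return "node-B"; } // placeholder
        private boolean hasLocalSchema(String node, String path) { return true; } // placeholder
        private void createTimeseries(String node, String path) { /* forward "Create TS" */ }
        private void replicateThroughRaft(String leader, String path, long ts, Object v) { /* raft call */ }
        private void executeLocally(String path, long ts, Object v) { /* StorageEngine insert */ }
    }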

4.10 Query given TSs

  • 1. the coordinator checks whether the PT cache can process the request (and may pull the cache)
    • 1. if the SG does not exist, pull the cache from the PT group with an update request attached.
    • 2. if the SG exists in the PT but the time range does not, send an update to the PT group.
  • 2. send the request to the corresponding data partition (not group) leader (with weak consistency, a follower is also OK), i.e., Node B in the figure.
  • 3. Tips: a data partition may have several raft groups (= SG * virtual SG, for now)
  • 4. for each operation on B, check whether to call raft replication according to the consistency level (see the sketch below)
  • 5. run locally.
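
A minimal Java sketch of the consistency decision in step 4 above (whether a follower must first sync with the leader before running the query locally); all names are hypothetical:

    // Sketch only: hypothetical names; strong consistency forces a follower to sync with the leader.
    public final class QueryConsistency {

        public enum Level { STRONG, WEAK }

        /** Step 4: decide whether the receiver must sync with the raft leader before running locally. */
        public static boolean mustSyncWithLeader(Level level, boolean receiverIsLeader) {
            return level == Level.STRONG && !receiverIsLeader;
        }
    }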


4.11 Query fuzzy TSs

  • 1. the coordinator checks whether the PT cache can process the request
    • 1. if not, the PT group returns the result, and the coordinator pulls the cache.
    • 2. if it can, the coordinator calculates the result locally.
  • 2. send the request to the corresponding data partition (not group) leader (with weak consistency, a follower is also OK), i.e., Node B in the figure.
  • 3. Tips: a data partition may have several raft groups (= SG * virtual SG, for now)
  • 4. for each operation on B, check whether to call raft replication according to the consistency level
  • 5. run locally.