...

Fig.1 shows the class diagram of the important classes in the cluster module. For simplicity, class members and methods are omitted, as there are too many to list. Helper and utility classes are also left out, so you will find many more classes in the actual cluster module.

Figure.1 Overall Class Topology

Class Functionalities

Program Entry Point

...

Servers are implementations of Thrift RPC servers; their main concerns are network transport and protocol parsing. Each of them listens on a port and an IP address, accepts connections from clients, wraps connections with certain types of Thrift transports, reads requests from the transports, parses them with specific protocols, and sends the requests to the associated RaftMembers. Each node has exactly one instance of each server.

  • MetaClusterServer: listens on internal_meta_port, receives MetaGroup-related Raft requests, and forwards them to the MetaMember. It also starts the underlying IoTDB once it is up.
  • MetaHeartbeatServer: listens on internal_meta_port + 1, receives MetaGroup heartbeats, and forwards them to the MetaMember.
  • DataClusterServer: listens on internal_data_port, receives DataGroup-related Raft requests, decides which DataMember should process each request, and forwards it to that member.
  • DataHeartbeatServer: listens on internal_data_port + 1, receives only DataGroup heartbeats, decides which DataMember should process each heartbeat, and forwards it to that member.
  • ClientClusterServer: listens on external_client_port, receives database operations from clients, and forwards them to the MetaMember.
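The dispatch pattern shared by these servers can be sketched as follows. This is an illustrative simplification, not the real Thrift-generated interfaces; the names `RaftMemberHandler`, `ClusterServer`, and `dispatch` are hypothetical, and transport/protocol handling is reduced to plain strings:

```java
// Illustrative sketch: each server pairs a listening port with the RaftMember
// that handles its requests. Thrift transports and protocols are omitted.
interface RaftMemberHandler {
  String process(String request); // stands in for the Thrift service methods
}

class ClusterServer {
  private final int port;                 // e.g. internal_meta_port
  private final RaftMemberHandler member; // the associated RaftMember

  ClusterServer(int port, RaftMemberHandler member) {
    this.port = port;
    this.member = member;
  }

  // Heartbeat servers follow the "RPC port + 1" convention described above.
  int heartbeatPort() { return port + 1; }

  // A decoded request is simply forwarded to the associated member.
  String dispatch(String request) { return member.process(request); }
}
```

The design point the sketch preserves is that servers own only transport concerns; all Raft logic lives behind the member interface.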

Raft

...

Members

Raft Members are implementations and extensions of the Raft algorithm, handling log generation, log replication, heartbeat, election, and catch-up of nodes that have fallen behind. Each RaftMember plays one role in exactly one Raft group. One cluster node has exactly one MetaMember and multiple DataMembers, depending on the number of replicas and the partitioning strategy.

  • RaftMember: an abstraction of common Raft procedures, like log generation, log replication, heartbeat, election, and catch-up of nodes that have fallen behind.
  • MetaMember: an extension of RaftMember, representing a member in the MetaGroup, where cluster configuration, authentication info, and storage groups are managed. It holds the partition table, processes operations that may modify the partition table, and also acts as a coordinator to route database operations to the responsible DataMembers with the help of its partition table. One node has only one MetaMember.
  • DataMember: an extension of RaftMember, representing a member in one DataGroup, where timeseries schemas and timeseries data of one cluster-level partition are managed. It also handles data queries and thus manages query resources. One node may have multiple DataMembers, depending on how many data partitions it manages, which is in turn decided by the number of replicas and the partitioning strategy.
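The membership rules above (exactly one MetaMember per node, one DataMember per data group the node serves) can be sketched as follows; the class names are illustrative placeholders, not the real IoTDB classes:

```java
import java.util.HashMap;
import java.util.Map;

// Placeholder classes: real members carry full Raft state.
class MetaMemberSketch { }

class DataMemberSketch {
  final String groupHeader; // identifies the DataGroup this member belongs to
  DataMemberSketch(String groupHeader) { this.groupHeader = groupHeader; }
}

class ClusterNodeSketch {
  // Exactly one MetaMember per node.
  final MetaMemberSketch metaMember = new MetaMemberSketch();
  // One DataMember per data group this node participates in.
  private final Map<String, DataMemberSketch> dataMembers = new HashMap<>();

  DataMemberSketch memberFor(String groupHeader) {
    return dataMembers.computeIfAbsent(groupHeader, DataMemberSketch::new);
  }

  int dataMemberCount() { return dataMembers.size(); }
}
```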

...

The Partition Table maintains the mapping from a partition key (storage group name and time partition id in IoTDB) to a DataMember (or a node). As timeseries schemas and data are replicated only on a subset of the cluster (i.e., a DataGroup), operations involving them can only be applied to specific nodes. When a node receives an operation but does not manage the corresponding data or schemas, it must forward the operation to the right node, and this is where the Partition Table comes in.
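The lookup described above can be sketched as a slot-style table. This is a hedged sketch under stated assumptions: the fixed slot count, the round-robin slot assignment, and the hash below are all illustrative, and the class and method names are hypothetical rather than the real IoTDB partition table API:

```java
import java.util.List;
import java.util.Objects;

// Hypothetical slot-based partition table: a partition key hashes to a slot,
// and each slot is owned by one DataGroup (identified by its header node).
class PartitionTableSketch {
  static final int TOTAL_SLOTS = 10000; // illustrative constant

  private final String[] slotToGroup; // slot -> header node of the owning group

  PartitionTableSketch(List<String> groupHeaders) {
    slotToGroup = new String[TOTAL_SLOTS];
    for (int slot = 0; slot < TOTAL_SLOTS; slot++) {
      // Round-robin assignment stands in for the real allocation strategy.
      slotToGroup[slot] = groupHeaders.get(slot % groupHeaders.size());
    }
  }

  // The partition key is (storage group, time partition id), as in the text.
  static int slotOf(String storageGroup, long timePartitionId) {
    return Math.floorMod(Objects.hash(storageGroup, timePartitionId), TOTAL_SLOTS);
  }

  // A node that does not own the key's slot forwards the operation here.
  String route(String storageGroup, long timePartitionId) {
    return slotToGroup[slotOf(storageGroup, timePartitionId)];
  }
}
```

The key property is determinism: every node computes the same slot for the same key, so any node can forward an operation without consulting the owner first.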

...

The Heartbeat Thread is an important component of a Raft Member. As the role of a Raft Member is always one of LEADER, FOLLOWER, or ELECTOR (CANDIDATE), it uses the Heartbeat Thread in three ways: as a LEADER, the Heartbeat Thread sends heartbeats to followers; as a FOLLOWER, the Heartbeat Thread periodically checks whether the leader has timed out; as a CANDIDATE, it starts elections and sends election requests to other nodes until one of the nodes becomes a valid leader.

  • HeartbeatThread: an abstraction of heartbeat-related transactions; each RaftMember has its own HeartbeatThread. When the associated member is a LEADER, the thread sends heartbeats to each follower periodically; if the member is a FOLLOWER, it checks the timestamp of the last received heartbeat and decides whether the leader has timed out; otherwise, the member is an ELECTOR, and the thread starts an indefinite election loop, asking for votes from other members in the Raft group and breaking from the loop once it or another node has become a leader.
  • MetaHeartbeatThread: an extension of HeartbeatThread for MetaMember. It requires the follower to generate a new identifier if its identifier conflicts with a known node during the cluster-building phase. It also sends the partition table to followers when required.
  • DataHeartbeatThread: an extension of HeartbeatThread for DataMember. It includes the header node of its group so the receiver knows which group the heartbeat is from, and it includes the log progress of its associated MetaMember to ensure that the leader of a data group also has an up-to-date partition table.
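The three-way behavior of the heartbeat thread can be condensed into a small decision function. This is an illustrative sketch with hypothetical names; the string results stand in for the real actions (sending RPCs, starting elections), which are omitted:

```java
// One iteration of a hypothetical heartbeat loop, keyed on the member's role.
enum RaftRole { LEADER, FOLLOWER, ELECTOR }

class HeartbeatThreadSketch {
  static String tick(RaftRole role, long nowMs, long lastHeartbeatMs, long timeoutMs) {
    switch (role) {
      case LEADER:
        // A leader periodically pushes heartbeats to every follower.
        return "send-heartbeats";
      case FOLLOWER:
        // A follower checks whether the leader has timed out.
        return (nowMs - lastHeartbeatMs > timeoutMs) ? "start-election" : "wait";
      default:
        // An elector loops, requesting votes until some node becomes leader.
        return "request-votes";
    }
  }
}
```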

...

  • RaftLogManager: a basic RaftLogManager, which manages the latest logs in memory and colder logs on disk, providing interfaces such as append, commit, and get. It applies logs after they are committed, but how they are applied depends on the passed LogApplier. It provides no snapshot implementation, so sub-classes extending it should focus on their own definition of snapshots.
  • MetaSingleSnapshotLogManager: an extension of RaftLogManager for MetaMembers, which provides MetaSimpleSnapshot containing storage groups, TTLs, users, roles, and partition tables.
  • FilePartitionedSnapshotLogManager: an extension of RaftLogManager for DataMembers, which provides PartitionedSnapshot<FileSnapshot>, consisting of a map whose keys are slot numbers and whose values are FileSnapshots. A FileSnapshot contains the timeseries schemas and hardlinked TsFiles (in the form of RemoteTsFileResource) of a slot, which are exactly what IoTDB holds in a slot.
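The append/commit/apply contract described above can be sketched as follows. This is a toy in-memory version with illustrative names, omitting the on-disk log and snapshots that the real RaftLogManager provides:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

// Toy log manager: entries are appended, then committed up to an index,
// and each entry is applied exactly once after it is committed.
class SimpleRaftLogManager {
  private final List<String> logs = new ArrayList<>();
  private int commitIndex = -1;
  private final Consumer<String> applier; // stands in for the pluggable LogApplier

  SimpleRaftLogManager(Consumer<String> applier) { this.applier = applier; }

  // Returns the index of the newly appended entry.
  int append(String entry) {
    logs.add(entry);
    return logs.size() - 1;
  }

  // Applies every entry between the old and new commit index, in order.
  void commitTo(int index) {
    while (commitIndex < index) {
      commitIndex++;
      applier.accept(logs.get(commitIndex));
    }
  }

  String get(int index) { return logs.get(index); }

  int getCommitIndex() { return commitIndex; }
}
```

Keeping the applier pluggable mirrors the design above: the base manager owns ordering and durability, while subclasses and appliers decide what "applying a log" means.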


Workflows

Below are some important workflows in the cluster module, including cluster setup, election, and data ingestion and query. Only non-exceptional flows are shown currently, because almost all exceptions are thrown directly to the upper level, converted to erroneous status codes, and finally returned to the clients. Some unimportant returns are omitted to reduce the number of lines and keep the figures clear.

Start-up (leader)

The start-up procedure of a MetaGroup leader is shown in Fig.2. Note that DataMembers are activated by MetaMembers once they have constructed or received a partition table, and DataMembers perform elections similarly to MetaMembers, so DataMembers are not introduced separately. The procedure is detailed below:

  1. Start-up script `start-node.sh` creates ClusterMain, the program entry point of a cluster node;
  2. ClusterMain creates a MetaClusterServer, which receives MetaGroup RPCs from internal_meta_port;
  3. MetaClusterServer creates a MetaMember, which is initialized as an ELECTOR, handles MetaGroup RPCs and manages a partition table;
  4. MetaMember tries to load the partition table from the local storage if it exists;
  5. MetaMember creates its MetaHeartbeatThread;
  6. MetaHeartbeatThread sends election requests to other cluster nodes;
  7. A quorum of the MetaGroup agrees with the election and sends responses to the MetaClusterServer;
  8. MetaClusterServer lets MetaMember handle these responses;
  9. MetaMember gathers the responses and confirms that it has become a LEADER, then creates a partition table if there is none;
  10. MetaMember creates a DataClusterServer, which receives DataGroup RPCs from internal_data_port;
  11. DataClusterServer creates DataMembers according to the partition table and the number of replicas;
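The ordering constraints in the steps above can be condensed into a hypothetical trace; the class names mirror the text, but every step is a stub that only records itself rather than doing real work:

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical trace of the leader start-up sequence; comments map each
// recorded event back to the numbered steps above.
class StartupSketch {
  final List<String> trace = new ArrayList<>();

  void run() {
    trace.add("ClusterMain");         // 1. entry point created by start-node.sh
    trace.add("MetaClusterServer");   // 2. RPC server on internal_meta_port
    trace.add("MetaMember(ELECTOR)"); // 3-5. member created, loads table, starts heartbeat thread
    trace.add("election");            // 6-8. election requests and quorum responses
    trace.add("MetaMember(LEADER)");  // 9. quorum confirmed, partition table ready
    trace.add("DataClusterServer");   // 10. RPC server on internal_data_port
    trace.add("DataMembers");         // 11. one per data group, per the partition table
  }
}
```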


Figure.2 Start-up of the MetaGroup leader