
GMSHealthMonitor makes sure that each member of the distributed system is alive and communicating with this member. To do so, it arranges the members of the current view into a ring. Each member monitors the next member in the ring (its neighbor) and records the timestamp of the last message received from that neighbor. If the neighbor has not communicated within the last member-timeout period, the member checks whether the neighbor is still alive. If that check fails, the member asks the probable coordinators to remove the neighbor from the view.
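The ring lookup described above can be sketched as follows. This is only an illustration, not the GMSHealthMonitor code: the class `RingNeighbor`, the method `nextNeighbor`, and the use of plain strings as member IDs are assumptions made for the example.

```java
import java.util.List;

class RingNeighbor {
    // Returns the member that follows `self` in the view, wrapping around
    // from the last member to the first. Returns null when `self` is not
    // in the view or when there is no other member to monitor.
    static String nextNeighbor(List<String> view, String self) {
        int idx = view.indexOf(self);
        if (idx < 0 || view.size() < 2) {
            return null;
        }
        return view.get((idx + 1) % view.size());
    }
}
```

With a view of A, B, C, member A watches B, B watches C, and C wraps around to watch A, so every member has exactly one watcher.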

HeartbeatMessage

Each member periodically sends a HeartbeatMessage (UDP) to all the other members in the distributed system, including the coordinator. Upon receiving a HeartbeatMessage, the receiving member updates the timestamp it keeps for the sender to the time at which the message was received. The receiver does not reply to such HeartbeatMessages.

This diagram shows a member (M) sending a HeartbeatMessage to all the other members in the distributed system (N, C):

M -> N: HeartbeatMessage(-1); N updates its record of the timestamp
M -> C: HeartbeatMessage(-1); C updates its record of the timestamp
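The receiver-side bookkeeping can be sketched like this. The class and method names (`HeartbeatTimestamps`, `onHeartbeat`, `isOverdue`) are hypothetical, and treating a member with no recorded heartbeat as overdue is an assumption of the sketch, not something the page specifies.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

class HeartbeatTimestamps {
    // Last time (in milliseconds) a heartbeat was received from each sender.
    private final Map<String, Long> lastSeenMillis = new ConcurrentHashMap<>();

    // Called when a HeartbeatMessage arrives; no reply is sent.
    // Long::max keeps the newest timestamp if messages are handled out of order.
    void onHeartbeat(String sender, long nowMillis) {
        lastSeenMillis.merge(sender, nowMillis, Long::max);
    }

    // True when the member has been silent for longer than the timeout.
    // Assumption: a member we have never heard from counts as overdue.
    boolean isOverdue(String member, long nowMillis, long memberTimeoutMillis) {
        Long last = lastSeenMillis.get(member);
        return last == null || nowMillis - last > memberTimeoutMillis;
    }
}
```

A concurrent map is used because heartbeats arrive on a receiving thread while a separate monitoring thread reads the timestamps.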

HeartbeatRequestMessage

On each member, a separate monitoring thread checks the timestamp of its neighbor's last HeartbeatMessage. If the member has not received a HeartbeatMessage from its neighbor for more than 5 seconds (the default member-timeout), it sends a HeartbeatRequestMessage to the neighbor to check whether the neighbor is still alive. If the neighbor is still alive, it responds with a HeartbeatMessage. Upon receiving the response, the member updates its timestamp record accordingly. If no response to the HeartbeatRequestMessage arrives, the member checks the timestamp record again, because the neighbor may have sent another HeartbeatMessage during the waiting period.

This diagram shows a member (M) using a HeartbeatRequestMessage to check its neighbor (N):

M: check its neighbor
M -> N: HeartbeatRequestMessage(requestID) via UDP
N -> M: HeartbeatMessage(requestID)
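The order of the checks above can be sketched as a small function. Everything named here (`NeighborCheck`, `checkNeighbor`, the `Runnable` hooks for sending the request and waiting) is hypothetical and stands in for the real messaging and threading code.

```java
import java.util.concurrent.atomic.AtomicLong;
import java.util.function.LongSupplier;

class NeighborCheck {
    enum Result { HEALTHY, SUSPECT }

    static Result checkNeighbor(AtomicLong lastHeartbeatMillis,
                                LongSupplier clock,
                                long memberTimeoutMillis,
                                Runnable sendHeartbeatRequest,
                                Runnable waitForResponse) {
        // Step 1: the neighbor is fine if it was heard from within the timeout.
        if (clock.getAsLong() - lastHeartbeatMillis.get() <= memberTimeoutMillis) {
            return Result.HEALTHY;
        }
        // Step 2: ask the neighbor directly and give it time to answer.
        sendHeartbeatRequest.run();
        waitForResponse.run();
        // Step 3: re-read the timestamp. A reply or an ordinary periodic
        // heartbeat that arrived while waiting both refresh it.
        if (clock.getAsLong() - lastHeartbeatMillis.get() <= memberTimeoutMillis) {
            return Result.HEALTHY;
        }
        return Result.SUSPECT;
    }
}
```

Re-reading the shared timestamp, rather than tracking the reply separately, means any message from the neighbor during the wait clears the suspicion.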

SuspectMembersMessage and Final Check

If there is still no response from its neighbor and no update to the timestamp, the member initiates a suspicion on its neighbor: it sends a SuspectMembersMessage, which includes a list of SuspectRequests, to the coordinators. Upon receiving the SuspectMembersMessage, the coordinator sends a HeartbeatMessage to the member and starts a final check on the suspect member. For the final check, the coordinator sends a HeartbeatRequestMessage to the suspect member. At the same time, it starts a TCP final check, which opens a TCP connection to the suspect member and exchanges messages with it if it is still alive.

This diagram shows a member (M) suspecting its neighbor (S) and the coordinator (C) running a final check that succeeds:

M -> S: HeartbeatRequestMessage(requestID) via UDP
M: no response from S after timeout
M -> C: SuspectMembersMessage
C -> S: HeartbeatRequestMessage(requestID)
C -> S: (Version, ViewID, UUID) via TCP
S -> C: OK via TCP

If the final check fails, the coordinator asks GMSJoinLeave to remove the suspect member from the system. This diagram shows a member (M) suspecting its neighbor (S) and the coordinator (C) running a final check that fails:

M -> S: HeartbeatRequestMessage(requestID) via UDP
M: no response from S after timeout
M -> C: SuspectMembersMessage
C -> S: HeartbeatRequestMessage(requestID)
C -> S: (Version, ViewID, UUID) via TCP
C: no response from S
C: ask GMSJoinLeave to remove S
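The coordinator's decision can be sketched as below. This is a simplified illustration under stated assumptions: the names `FinalCheck` and `finalCheck` are hypothetical, the two probes are passed in as callbacks instead of doing real UDP and TCP I/O, they run one after the other here although the page says they start at the same time, and the sketch assumes the suspect is kept if either probe gets an answer.

```java
import java.util.function.BooleanSupplier;
import java.util.function.Consumer;

class FinalCheck {
    // Returns true if the suspect is still considered alive.
    // `removeMember` stands in for asking GMSJoinLeave to remove the member.
    static boolean finalCheck(String suspect,
                              BooleanSupplier udpHeartbeatAnswered,
                              BooleanSupplier tcpExchangeOk,
                              Consumer<String> removeMember) {
        // Assumption: a response on either channel clears the suspicion.
        // Because of short-circuiting, the TCP probe is skipped when the
        // UDP probe has already succeeded.
        boolean alive = udpHeartbeatAnswered.getAsBoolean()
                || tcpExchangeOk.getAsBoolean();
        if (!alive) {
            removeMember.accept(suspect);
        }
        return alive;
    }
}
```

Passing the probes as callbacks keeps the decision logic separate from the networking, which makes the two outcomes in the diagrams easy to exercise.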

 
