...

Each member periodically sends a HeartbeatMessage (over UDP) to all the other members in the distributed system, including the coordinator. Upon receiving a HeartbeatMessage, the receiving member records the time of receipt as the latest timestamp associated with the sender. These periodic HeartbeatMessages carry a requestID of -1, and the receiver does not reply to them.

 

PlantUML
title This diagram shows a member (M) sending HeartbeatMessage to all the other members in the distributed system (N, C)
hide footbox
entity M
entity N
entity C
M -> N: HeartbeatMessage(-1)
note right : update the record of timestamp
M -> C : HeartbeatMessage(-1)
note right : update the record of timestamp
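
The receive-side bookkeeping described above can be sketched as follows. This is a minimal illustration, not the actual implementation: the class name `HeartbeatRecorder`, the method `on_heartbeat`, and the injectable `clock` parameter are all hypothetical. Only the requestID value of -1 for periodic heartbeats comes from the text.

```python
import time

# From the text: periodic heartbeats carry requestID -1 and are never answered.
PERIODIC_REQUEST_ID = -1


class HeartbeatRecorder:
    """Hypothetical helper: tracks when each sender was last heard from."""

    def __init__(self, clock=time.monotonic):
        self._clock = clock      # injectable so the sketch can be tested
        self.last_heard = {}     # sender id -> time of most recent heartbeat

    def on_heartbeat(self, sender, request_id=PERIODIC_REQUEST_ID):
        """Record the time of receipt for this sender.

        Returns True for a periodic heartbeat (requestID == -1) and False
        for a heartbeat that answers an earlier request. No reply is sent
        in either case.
        """
        self.last_heard[sender] = self._clock()
        return request_id == PERIODIC_REQUEST_ID
```

Every HeartbeatMessage refreshes the sender's timestamp, whether it is periodic or a response to a request, which is why the requestID only affects the return value here.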

...

On each member, a separate monitoring thread checks the timestamp of its neighbor's most recent HeartbeatMessage. If the member has not received a HeartbeatMessage from its neighbor for more than 5 seconds (the default membership timeout), it starts sending a HeartbeatRequestMessage to the neighbor to check whether the neighbor is still alive. If the neighbor is alive, it responds with a HeartbeatMessage, and the member updates its timestamp record when the response arrives. If no response to the HeartbeatRequestMessage arrives, the member checks its timestamp record again, because the neighbor may have sent another HeartbeatMessage during the waiting period.

PlantUML
title This diagram shows a member (M) using a HeartbeatRequestMessage to check its Neighbor (N)
hide footbox
entity M
entity N
note right of M
check its neighbor
end note
M -> N: HeartbeatRequestMessage(requestID)
note right : via UDP
N --> M : HeartbeatMessage(requestID)
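
One pass of the neighbor check can be sketched as below. The function `check_neighbor` and its callables `send_request` and `wait_for_reply` are hypothetical stand-ins for the real messaging layer. The 5-second value is the default membership timeout named in the text. The sketch assumes that any timestamp update during the wait, from a response or from a periodic heartbeat, counts as proof of life.

```python
MEMBER_TIMEOUT = 5.0  # seconds; the default membership timeout from the text


def check_neighbor(last_heard, neighbor, now, send_request, wait_for_reply):
    """Return "alive" or "suspect" for one pass of the monitoring check.

    last_heard     -- dict mapping member id to last heartbeat timestamp
    send_request   -- hypothetical callable that sends a HeartbeatRequestMessage
    wait_for_reply -- hypothetical callable that blocks for the response window
                      and returns the current time when it finishes
    """
    never = float("-inf")
    if now - last_heard.get(neighbor, never) <= MEMBER_TIMEOUT:
        return "alive"          # a recent heartbeat exists; nothing to do

    send_request(neighbor)
    now = wait_for_reply(neighbor)

    # Re-read the record: a response or a periodic heartbeat may have
    # refreshed the timestamp while this thread was waiting.
    if now - last_heard.get(neighbor, never) <= MEMBER_TIMEOUT:
        return "alive"
    return "suspect"
```

Re-reading the shared timestamp record, rather than tracking the reply directly, lets the same check pick up a late periodic heartbeat as well as an explicit response.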

...

If there is still no response from the neighbor and no update to the timestamp, the member initiates suspicion of its neighbor: it sends a SuspectMembersMessage, which includes a list of SuspectRequests, to the coordinator. Upon receiving the SuspectMembersMessage, the coordinator sends a HeartbeatMessage to the member and starts a final check on the suspect member. For the final check, the coordinator sends a HeartbeatRequestMessage to the suspect member and expects a response. At the same time, the coordinator also starts a TCP final check, which initiates a TCP connection to the suspect member and exchanges messages with it if the suspect member is still alive.

 

PlantUML
title This diagram shows a member (M) notifying the Coordinator (C) of a Suspect Member (S) and the Final Check Process
hide footbox
entity M
entity S
entity C
M -> S : HeartbeatRequestMessage(requestID)
note right : via UDP
note right of M
No response from S
after timeout
end note
M -> C : SuspectMembersMessage
note right of C : start final check of suspect member
C -> S : HeartbeatRequestMessage(requestID)
note right : via UDP
S --> C : HeartbeatMessage(requestID)
note right : via UDP
C -> S : (Version, ViewID, UUID)
note right : via TCP
S --> C : OK
note right : via TCP
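
The suspicion message can be pictured as a small container of per-member requests. The sketch below is only illustrative: the message and request names follow the text and diagrams, but the fields `sender`, `suspect`, and `reason`, and the helper `raise_suspicion`, are assumptions made for this example.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass(frozen=True)
class SuspectRequest:
    suspect: str   # id of the member under suspicion (field name assumed)
    reason: str    # human-readable reason (assumed field)


@dataclass
class SuspectMembersMessage:
    sender: str    # member raising the suspicion (field name assumed)
    requests: List[SuspectRequest] = field(default_factory=list)


def raise_suspicion(sender, suspects, reason="no heartbeat response"):
    """Build the message a member would send to the coordinator."""
    return SuspectMembersMessage(
        sender, [SuspectRequest(s, reason) for s in suspects]
    )
```

For example, `raise_suspicion("M", ["S"])` yields a message from M carrying a single request that names S.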
 

If the final check fails, the coordinator asks GMSJoinLeave to remove the suspect member from the system.

 

PlantUML
title This diagram shows a member (M) notifying the Coordinator (C) of a Suspect Member (S) and the Failed Final Check Process
hide footbox
entity M
entity S
entity C
M -> S : HeartbeatRequestMessage(requestID)
note right : via UDP
note right of M
no response from S
after timeout
end note
M -> C : SuspectMembersMessage
note right of C : start final check of suspect member
C -> S : HeartbeatRequestMessage(requestID)
note right : via UDP
C -> S : (Version, ViewID, UUID)
note right : via TCP
note right of C
no response from S after timeout
ask GMSJoinLeave to remove S
end note
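
The coordinator's final check can be sketched as two probes run at the same time, followed by removal when neither succeeds. The names `final_check`, `udp_check`, `tcp_check`, and `remove_member` are hypothetical, with `remove_member` standing in for the request to GMSJoinLeave. Treating a success on either probe as proof of life is an assumption: the text shows only the cases where both probes succeed or both fail.

```python
from concurrent.futures import ThreadPoolExecutor


def final_check(suspect, udp_check, tcp_check, remove_member):
    """Coordinator-side final check; returns True if the suspect survives.

    udp_check     -- hypothetical callable: HeartbeatRequestMessage probe
    tcp_check     -- hypothetical callable: TCP connect-and-exchange probe
    remove_member -- hypothetical callable standing in for GMSJoinLeave
    """
    # Run both probes concurrently, as the text describes.
    with ThreadPoolExecutor(max_workers=2) as pool:
        udp = pool.submit(udp_check, suspect)
        tcp = pool.submit(tcp_check, suspect)
        # Assumption: a success on either probe means the member is alive.
        alive = udp.result() or tcp.result()

    if not alive:
        remove_member(suspect)
    return alive
```

Each probe is expected to return False after its own timeout instead of blocking forever, so the function always completes.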
 

...