...

For each member, another monitoring thread checks the timestamp of the last HeartbeatMessage, or any other message, received from its neighbor. Any message received from another member counts as a heartbeat; see GMSHealthMonitor.contactedBy(). If the member has not heard from its neighbor for more than 5 seconds (the default membership timeout), it sends a HeartbeatRequestMessage to the neighbor to check whether the neighbor is still alive. A live neighbor responds with a HeartbeatMessage, and on receiving that response the member updates its timestamp record accordingly. If no response to the HeartbeatRequestMessage arrives, the member checks its timestamp record again, because the neighbor may have sent another HeartbeatMessage during the waiting period.
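
The check sequence described above can be sketched in Java as follows. This is a minimal illustration, not Geode's implementation: the class name, the `shouldSuspect` method, and the `sendRequestAndAwait` callback (which stands in for sending a HeartbeatRequestMessage and waiting for the reply) are assumptions made for the example. Only the 5-second default and the rule that any message counts as a heartbeat come from the text.

```java
import java.util.function.BooleanSupplier;
import java.util.function.LongSupplier;

// Hypothetical sketch of the neighbor check; names are illustrative only.
class NeighborCheckSketch {
    // Default membership timeout mentioned in the text: 5 seconds.
    static final long MEMBER_TIMEOUT_MS = 5_000;

    // Time of the last message of any kind received from the neighbor.
    private volatile long lastContactMs;

    NeighborCheckSketch(long nowMs) {
        this.lastContactMs = nowMs;
    }

    // Any message from the neighbor counts as a heartbeat.
    void contactedBy(long nowMs) {
        lastContactMs = nowMs;
    }

    boolean isStale(long nowMs) {
        return nowMs - lastContactMs > MEMBER_TIMEOUT_MS;
    }

    // Returns true when the neighbor should be suspected.
    // sendRequestAndAwait is an assumed helper: it sends a
    // HeartbeatRequestMessage and returns true if a reply arrived in time.
    boolean shouldSuspect(LongSupplier clock, BooleanSupplier sendRequestAndAwait) {
        if (!isStale(clock.getAsLong())) {
            return false; // heard from the neighbor recently enough
        }
        if (sendRequestAndAwait.getAsBoolean()) {
            contactedBy(clock.getAsLong()); // reply received, refresh timestamp
            return false;
        }
        // No reply: check the timestamp again, since another message
        // may have arrived from the neighbor during the wait.
        return isStale(clock.getAsLong());
    }
}
```

The second timestamp check is what prevents a member from suspecting a neighbor whose reply was lost but whose regular traffic still arrived.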

PlantUML
title This diagram shows a member (M) using a HeartbeatRequestMessage to check its Neighbor (N)
hide footbox
entity M
entity N
note right of M
check its neighbor
end note
M -> N: HeartbeatRequestMessage(requestID)
note right : via UDP
N --> M : HeartbeatMessage(requestID)

...

If there is still no response from its neighbor and no update to the timestamp, the member initiates suspicion of its neighbor: it sends a SuspectMembersMessage, which includes a list of SuspectRequests, to a list of recipients. The recipients depend on the size of the view. If the view size is less than or equal to 4, the list may contain all the members in the view. If the view size is larger than 4, the list may have up to 7 members: the 5 members preferred to be coordinator, the sender itself, and a random member. How a recipient of the SuspectMembersMessage reacts depends on whether it is the coordinator. If it is the coordinator, it starts a final check on the suspect member. For the final check, the coordinator sends a HeartbeatRequestMessage to the suspect member and expects a response. At the same time, the coordinator starts a TCP final check, which opens a TCP connection to the suspect member and exchanges messages with it if the suspect member is still alive.
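
As a rough illustration of how the list of recipients described above could be assembled, here is a small Java sketch. It is an assumption-laden example rather than Geode's code: members are plain strings, and `preferredCoordinators` is a hypothetical list already ordered by coordinator preference (how that ordering is produced is not covered here). Duplicates are collapsed, which is why the result has "up to" 7 entries.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Random;
import java.util.Set;

// Hypothetical sketch of recipient selection for a suspect message.
class SuspectRecipientsSketch {
    static final int SMALL_VIEW_SIZE = 4;        // threshold from the text
    static final int PREFERRED_COORDINATORS = 5; // count from the text

    static List<String> recipients(List<String> view,
                                   List<String> preferredCoordinators,
                                   String sender,
                                   Random random) {
        // Small views: every member in the view is a recipient.
        if (view.size() <= SMALL_VIEW_SIZE) {
            return new ArrayList<>(view);
        }
        // Larger views: 5 preferred coordinators, the sender, one random member.
        // A set removes duplicates, so the result may be smaller than 7.
        Set<String> result = new LinkedHashSet<>();
        for (String member : preferredCoordinators) {
            if (result.size() == PREFERRED_COORDINATORS) {
                break;
            }
            result.add(member);
        }
        result.add(sender);
        result.add(view.get(random.nextInt(view.size())));
        return new ArrayList<>(result);
    }
}
```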

...

 

PlantUML
title This diagram shows a member (M) notifying the Coordinator (C) of a Suspect Member (S) and the failed final check process
hide footbox
entity M
entity S
entity C
M -> S : HeartbeatRequestMessage(requestID)
note right : via UDP
note right of M
no response from S
after timeout
end note
M -> C : SuspectMembersMessage
note right : via UDP
note right of C : start final check of suspect member
C -> S : HeartbeatRequestMessage(requestID)
note right : via UDP
C -> S : (Version, ViewID, UUID)
note right : via TCP
note right of C
no response from S after timeout
ask GMSJoinLeave to remove S
end note
 

If the recipient of the SuspectMembersMessage is not the coordinator, it checks whether it should become the coordinator and, if so, initiates a final check. If it is not the coordinator and should not become the coordinator, it records the SuspectRequest for subsequent use.
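
The recipient's decision can be summarized in a few lines of Java. The sketch below is illustrative only: the `Action` enum and the two boolean inputs are hypothetical names, and how a member determines that it should become the coordinator is outside the scope of this example.

```java
// Hypothetical sketch of how a recipient reacts to a suspect message.
class SuspectReactionSketch {
    enum Action { START_FINAL_CHECK, RECORD_SUSPECT_REQUEST }

    static Action onSuspectMembersMessage(boolean isCoordinator,
                                          boolean shouldBecomeCoordinator) {
        // The coordinator, or a member that should take over as coordinator,
        // runs the final check on the suspect member.
        if (isCoordinator || shouldBecomeCoordinator) {
            return Action.START_FINAL_CHECK;
        }
        // Everyone else only records the SuspectRequest for subsequent use.
        return Action.RECORD_SUSPECT_REQUEST;
    }
}
```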

 

PlantUML
title This diagram shows a member (M) notifying a non-coordinator member (NC) of a Suspect Member (S)
hide footbox
entity M
entity S
entity NC
M -> S : HeartbeatRequestMessage(requestID)
note right : via UDP
note right of M
no response from S
after timeout
end note
M -> NC : SuspectMembersMessage
note right : via UDP
note right of NC
not a coordinator and
should not become a coordinator
record SuspectRequests
end note