Failed Member Timing

The purpose of this page is to describe the timing of the various events associated with detecting the failure of a single member. See the GMSHealthMonitor Message Sequence Diagram page for a more complete description of how the heartbeat monitoring protocol works, and what happens when heartbeats stop (or can't be detected). Here we'll focus only on the timing of some key steps in the process, and only in the situation where a single member (of the distributed system) stops generating periodic heartbeat messages, and is subsequently eliminated from the view.

In this scenario, a particular member (call it F) has failed in some way that has caused it to stop generating heartbeat messages to the member that's monitoring it (call it M). When M detects this situation, it requests a heartbeat from F and if it doesn't subsequently receive one, M will send the "suspect" message to a subset of the "preferred coordinators".

When the coordinator (C) sees a suspect message, it issues a heartbeat request message to F and simultaneously tries to make a TCP/IP connection to F. If it receives no heartbeat from F and the TCP/IP connection attempt fails then the coordinator removes F from the view, and shares that new view with the remaining members.

This diagram shows timelines for the three member (roles) of interest: the failing member F, the monitoring member M, and the coordinator C.

M is the member "to the left" or directly "counter-clockwise" in the view, relative to F.

Key variables that affect timing:

Tm is the member timeout, which defaults to 5s. It can be changed via the member-timeout configuration property.

L is the logical interval, which defaults to 2. It can be changed by the geode.logical-message-received-interval Java system property

T = Tm/L is the health monitor base period and it defaults to 2.5s

If M doesn't receive a heartbeat, or some other communication from F for Tm/L (=T) it will send a UDP unicast heartbeat request message to F.

After sending that heartbeat request, if M still detects no heartbeat message from F (and detects no other communication from F), then M will send a "suspect" message to a subset of the preferred coordinators.

When the coordinator C receives a suspect message (about F) it immediately sends a UDP unicast heartbeat request to F and attempts a TCP/IP connection to F. If the coordinator receives no heartbeat message from F, and receives no other communication from F, and the attempt to establish a direct TCP/IP connection to F fails, then the coordinator generates a new view with F removed. This view is shared with the remaining members.

The time from the final heartbeat generated from F to M, until the new view is distributed is, on average, about:

Tm/L + Tm/L/2 + Tm + Tm = (2 + 1/L + 1/2L) Tm

The first term Tm/L is the failure interval specified in GMSHealthMonitor.Monitor.run(). If F hasn't been heard from in Tm/L then M will send a heartbeat request to F.

The second term Tm/L/2 is due to the fact that GMSHealthMonitor.Monitor is scheduled (in a ScheduledThreadPoolExecutor) to run every Tm/L (=T). Sampling introduces an (average) T/2 delay into the timeline.

The third term is the time Tm that M waits for communication from F, after the heartbeat request is sent.

The fourth term is the period that the coordinator C waits (for communication from F, or for successful TCP/IP connection to F) after receiving the suspect message about F.

With default values of Tm = 5s and L = 2 we have:

(2 + 1/L + 1/2L) Tm = 13.75s

This analysis stopped with the new view production. It would be interesting to add a description of timings in F during this sequence, and afterward. But because we don't know why F has stopped sending heartbeats, it may be the case that F is operating in a degraded state.

Space shortcuts

Page tree

See Also