
PlantUML
title
Details of handling a network partition with L, M and N
on one side and A & B on the other side.
end title
hide footbox

entity L
entity M
entity N
entity A
entity B

note over L
The initial membership view is (L,A,B,M,N)
end note

M -> L : suspect(A)
note right of L
L receives a suspect message and asks
the health monitor to check on the
member.  The check fails, initiating
a removal.
end note

L ->x A : attempt to contact
L -> M : prepare view (L,B,M,N)
L -> N : prepare view (L,B,M,N)
L ->x B : prepare view (L,B,M,N)
note right
this message does not reach B due
to the network partition
end note
M --> L : ack
N --> L : ack
L -> L : wait for ack from\nB times out
L ->x B : attempt to contact

L -> L : quorum calculation on (L,M,N) fails!
L -> M : forceDisconnect!
L -> N : forceDisconnect!
|||

newpage the A,B side of the partition

B -> A : suspect(L)
note left of A
A & B start suspecting members on
the other side of the split. When L
is deemed dead A will become coordinator
because it is next in line in view
(L,A,B,M,N)
end note

A -> A : becomeCoordinator
note left of A
after suspect processing A decides
to become coordinator and send view
(A,B,M,N)
end note

A ->x M : prepare view(A,B,M,N)
A ->x N : prepare view(A,B,M,N)
A -> B : prepare view(A,B,M,N)
B --> A : ack
A -> A : times out waiting for acks
A ->x M : attempt to contact
A ->x N : attempt to contact
A -> A : quorum calculation on (A,B) passes
A -> B : prepare/install view (A,B)

 

Here we see a network partition with three members on one side (L, M and N) and two members on the other (A and B).  When this happens, the HealthMonitor is responsible for initiating suspect processing on members that haven't been heard from and cannot be contacted.  It eventually tells the JoinLeave component to remove the unresponsive member.  All members of the distributed system perform this kind of processing, and it can lead to a member deciding that the current Coordinator is no longer there and electing itself as the new Coordinator.
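
The sketch below illustrates that interplay.  The types and method names (HealthMonitor.checkIfAvailable, JoinLeave.remove, SuspectProcessor) are simplified stand-ins rather than the exact Geode APIs; it only shows the shape of the flow: a suspect message triggers a direct availability check, and a failed check triggers a removal request to the JoinLeave component.

Java
// Illustrative sketch only -- simplified stand-ins for the HealthMonitor and
// JoinLeave roles described above, not the real Geode interfaces.
interface HealthMonitor {
    /** Actively probe the member; returns false if it cannot be contacted. */
    boolean checkIfAvailable(String memberId, String reason);
}

interface JoinLeave {
    /** Begin removing the member from the membership view. */
    void remove(String memberId, String reason);
}

class SuspectProcessor {
    private final HealthMonitor healthMonitor;
    private final JoinLeave joinLeave;

    SuspectProcessor(HealthMonitor healthMonitor, JoinLeave joinLeave) {
        this.healthMonitor = healthMonitor;
        this.joinLeave = joinLeave;
    }

    /** Invoked when a suspect(member) message arrives, e.g. M -> L : suspect(A). */
    void onSuspect(String suspectId) {
        // Verify the suspicion with a direct check before acting on it.
        if (!healthMonitor.checkIfAvailable(suspectId, "suspected by a peer")) {
            // The check failed, so start removing the member from the view.
            joinLeave.remove(suspectId, "failed availability check");
        }
    }
}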

In the above diagram we see this happen on the A,B side of the partition.  B initiates suspect processing on the current Coordinator (process L) and notifies A.  A performs a health check on L and decides it is unreachable, electing itself coordinator.  It then sends a prepare-view message with (A,B,M,N) and expects acknowledgements, but receives none from M or N.  After checking on M and N via the HealthMonitor it kicks them out, creating the new view (A,B).  This view change passes the quorum check, so A prepares and installs it.
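
A rough sketch of that prepare/ack step follows, again with hypothetical names (Messenger, sendPrepareAndCollectAcks) standing in for the real messaging layer: the new coordinator proposes a view, waits a bounded time for acknowledgements, and drops any member that never responds before it runs the quorum check.

Java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

// Illustrative sketch: trim a proposed view down to the members that
// acknowledged the prepare message within the timeout.
class ViewPreparer {
    /** Hypothetical transport: send "prepare view" and report which members acked in time. */
    interface Messenger {
        Set<String> sendPrepareAndCollectAcks(List<String> proposedView, long timeoutMillis);
    }

    private final Messenger messenger;
    private final String self;   // the coordinator always stays in its own view

    ViewPreparer(Messenger messenger, String self) {
        this.messenger = messenger;
        this.self = self;
    }

    List<String> prepare(List<String> proposedView, long timeoutMillis) {
        Set<String> acked = messenger.sendPrepareAndCollectAcks(proposedView, timeoutMillis);
        List<String> surviving = new ArrayList<>();
        for (String member : proposedView) {
            if (member.equals(self) || acked.contains(member)) {
                surviving.add(member);
            }
            // M and N never ack across the partition, so (A,B,M,N) shrinks to (A,B)
            // before the quorum calculation is made.
        }
        return surviving;
    }
}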

On the losing side of the partition, L is notified that A is suspect and kicks it out, creating view (L,B,M,N).  It prepares the view but gets no response from B, so it kicks B out as well, forming view (L,M,N).  This view has lost quorum, so instead of sending it out L tells M and N to shut down and then shuts itself down as well.
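
Both coordinators end up at the same branch point, just with different outcomes.  The sketch below shows that decision with illustrative types (ViewInstaller is not a real Geode interface): if the surviving view keeps quorum it is prepared and installed; otherwise the remaining reachable members are told to disconnect and the coordinator shuts itself down.

Java
import java.util.List;

// Illustrative sketch of the install-or-shut-down decision after a partition.
class PartitionDecision {
    /** Hypothetical callbacks for the actions the coordinator can take. */
    interface ViewInstaller {
        void prepareAndInstall(List<String> view);           // winning side: view (A,B)
        void forceDisconnect(String member, String reason);  // losing side: sent to M and N
        void shutdown();                                     // losing coordinator L exits too
    }

    static void apply(List<String> survivingView, String self,
                      boolean hasQuorum, ViewInstaller installer) {
        if (hasQuorum) {
            // Quorum retained: send the new view out as usual.
            installer.prepareAndInstall(survivingView);
        } else {
            // Quorum lost: do not install the view; tell the others to shut down.
            for (String member : survivingView) {
                if (!member.equals(self)) {
                    installer.forceDisconnect(member, "lost quorum after network partition");
                }
            }
            installer.shutdown();
        }
    }
}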

Let's look at the quorum calculations themselves: if L is a locator and the others are normal members, the initial membership view (L,A,B,M,N) would have weights (3,15,10,10,10), for a total of 48.  Locators only have 3 points of weight, regular members get 10, and the so-called Lead Member (A here) gets an additional 5 points.  View (A,B) represents a loss of 23 of those points, while view (L,M,N) represents a loss of 25 points, which is more than 50% of the total weight of the initial view.  This is why A and B remained standing while L, M and N shut down, even though there were more processes on their side of the network split.
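
That arithmetic can be written out directly.  The class and method names below are just illustrative, but the weights and the more-than-half rule follow the description above: dropping L, M and N costs 23 of the 48 points and keeps quorum, while dropping A and B costs 25 and loses it.

Java
import java.util.List;
import java.util.Map;

// Worked version of the weight calculation: locators 3, ordinary members 10,
// the lead member an extra 5.  Quorum is lost when the departed members account
// for more than half of the previous view's total weight.
class QuorumWeights {
    static boolean lostQuorum(Map<String, Integer> previousViewWeights, List<String> departed) {
        int total = previousViewWeights.values().stream().mapToInt(Integer::intValue).sum();
        int lost = departed.stream().mapToInt(previousViewWeights::get).sum();
        return lost * 2 > total;   // strictly more than 50% of the weight is gone
    }

    public static void main(String[] args) {
        // (L,A,B,M,N) with L a locator and A the lead member: weights (3,15,10,10,10), total 48.
        Map<String, Integer> weights = Map.of("L", 3, "A", 15, "B", 10, "M", 10, "N", 10);
        System.out.println(lostQuorum(weights, List.of("L", "M", "N"))); // false: (A,B) keeps quorum
        System.out.println(lostQuorum(weights, List.of("A", "B")));      // true: (L,M,N) loses quorum
    }
}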