Versions Compared

Key

  • This line was added.
  • This line was removed.
  • Formatting was changed.

...

Code Block
 
class QuorumPacket { 
    int type; // Request, Ack, Commit, Ping, etc
    long zxid; 
    buffer data; 
    vector<org.apache.zookeeper.data.Id> authinfo; // only used for requests
} 

The following structure is used the SERVERINFO packet:

Code Block

class Info {
    long sid; // server id
    int version; // protocol version
    int currentEpoch;
    int acceptedEpoch;
}

There are a couple of different types of packets:

type

zxid

data

notation

meaning

SERVERINFO(11) aka FOLLOWERINFO

last zxid accepted

n/a Info

SERVERINFO(lastZxid, info)

The follower has accepted up to lastZxid.

NEWEPOCH(9)

newEpoch << 32

n/a

NEWEPOCH(newEpoch)

A proposal to move to a new epoch.

DIFF(13)

last committed zxid

n/a

DIFF(lastCommittedZxid)

lastCommittedZxid is the last zxid committed by the leader.

TRUNC(14)

truncZxid

n/a

TRUNC(truncZxid)

Truncate the history to truncZxid

SNAP(15)

n/a

n/a

SNAP

A state transfer (aka snapshot) will be sent to the follower. this will be a fuzzy state transfer that may include zxids being sent to the follower. The state transfer will immediately follow this packet.

OBSERVERINFO(16)

last zxid accepted

n/a

OBSERVERINFO(lastZxid)

The observer has accepted up to lastZxid.

NEWLEADER(10)

lastZxid

n/a

NEWLEADER(lastZxid)

Accept this leader as the leader of the epoch of lastZxid.

UPTODATE(12)

n/a

n/a

UPTODATE

The follower is now uptodate enough to begin serving clients.

PROPOSAL(2)

zxid of proposal

proposed message

PROPOSAL(zxid, data)

Propose a message. (Request that it be accepted into a followers history.)

ACK(3)

zxid of proposal to ack

n/a

ACK(zxid)

Everything sent to the follower by the leader up to zxid has been accepted into its history (logged to disk).

COMMIT(4)

zxid of proposal to commit

n/a

COMMIT(zxid)

Everything in the followers history up to zxid should be committed (aka delivered).

INFORM(8)

zxid of proposal

data of proposal

INFORM(zxid, data)

Deliver the data. (Only used with observers.) i

...

All servers start off looking for a leader. Once the instance of leader election at a given server indicates a leader has emerged it will move to phase 1. If the leader election instance indicates that the server is the leader, it moves to phase 1 as a leader, otherwise it moves to phase 1 as a follower.

Phase 1: Establish an epoch

...

In this phase an elected leader establishes leadership and syncs up with followers. It starts a new epoch and syncs with a quorum of followers to make sure that it has the most up to date makes sure that previous leaders cannot commit new proposals and decides on an initial history. (Note a leader is also considered a follower of itself.)

Any failures or timeouts will cause the server to go back to leader election.

  1. l The leader starts accepting connections from followers.
  2. f Followers connect the the leader and send SERVERINFO(lastZxid, info).
  3. l The leader stops waits for a quorum of followers to connect all of whom has:
    • f.currentEpoch <= l.currentEpoch
    • if f.currentEpoch == l.currentEpoch, then f.lastZxid < l.lastZxid
  4. l Once the leader has quorum, it stops accepting connections, and sends NEWEPOCH(e) to all followers, where e is greater than all f.acceptedEpoch in the quorum.
  5. f Followers send ACK(e<<32) if e > f.acceptedEpoch. (If e < f.acceptedEpoch, the follower will abandon the leader. If e == f.acceptedEpoch, the follower will not send ACK(e<<32), but will still maintain the connection to the leader.)
  6. l The leader waits for all followers to ack NEWEPOCH(e)

Sync with followers

  1. l
  2. A process whose leader election instance indicates that it is the leader will
    1. the epoch, e, of its highest lastZxid;
    2. increment e;
    3. start accepting connections from followers;
  3. Followers open a TCP connection to a leader that their instance of the leader election indicates and send FOLLOWER(lastZxid).
  4. The leader does the following with each follower that connects: connected to it:
      sends a NEWLEADER(lastZxid);
    1. adds the follower to the list of connections to send new proposals, so while the server is performing the next steps, it is queuing up any new proposals sent to the follower.
    2. does one of the following:
      • SNAP if the follower is so far behind that it is better to do a state transfer than send missing transactions.
      • TRUNC(zxid) if the follower has transactions that the leader has chosen to skip. The leader sets zxid to the last zxid in its history for the epoch of the follower.
      • DIFF if the leader is sending transactions that the follower is missing. The leader sends missing messages to the follower.
      The leader queues an UPTODATE packet.
    3. sends a NEWLEADER(e);
    4. The leader releases any queued messages to the follower.
    When the follower
  5. f The follower syncs with the leader, but doesn't modify its state until it receives the NEWLEADER(lastZxid) message, it makes sure that the follower has at least the last epoch in its history e) packet. Once it does it applies any changes and sends ACK(lastZxide << 32).
  6. l Once the leader has received an acknowledgements from a quorum of followers, it takes leadership of epoch e and queues UPTODATE to all followers.

Phase 3: Broadcast

The leader and followers can have multiple proposals in process and at various stages in the pipeline. The leader will remain active as long as there is a quorum of followers acknowledging its proposals or pings within a timeout interval. The follofollower will continue to support a leader as long as it receives proposals or pings within a timeout interval. Any failures or timeouts will result in the server going back to leader election.

  1. l the leader queues a packet PROPOSE(zxid, data) to all connected followers.
  2. f a follower will log and sync proposals to disk and send ACK(zxid).
  3. l when the leader receives acknowledgements from a quorum of followers it queues COMMIT(zxid) to all followers.