Comment: Added more details to the state reported in the new fields for the leader itself

...

Code Block
  "apiKey": 55,
  "type": "response",
  "name": "DescribeQuorumResponse",
  "validVersions": "0-1",
  "flexibleVersions": "0+",
  "fields": [
    { "name": "ErrorCode", "type": "int16", "versions": "0+",
      "about": "The top level error code."},
    { "name": "Topics", "type": "[]TopicData",
      "versions": "0+", "fields": [
      { "name": "TopicName", "type": "string", "versions": "0+", "entityType": "topicName",
        "about": "The topic name." },
      { "name": "Partitions", "type": "[]PartitionData",
        "versions": "0+", "fields": [
        { "name": "PartitionIndex", "type": "int32", "versions": "0+",
          "about": "The partition index." },
        { "name": "ErrorCode", "type": "int16", "versions": "0+"},
        { "name": "LeaderId", "type": "int32", "versions": "0+", "entityType": "brokerId",
          "about": "The ID of the current leader or -1 if the leader is unknown."},
        { "name": "LeaderEpoch", "type": "int32", "versions": "0+",
          "about": "The latest known leader epoch"},
        { "name": "HighWatermark", "type": "int64", "versions": "0+"},
        { "name": "CurrentVoters", "type": "[]ReplicaState", "versions": "0+" },
        { "name": "Observers", "type": "[]ReplicaState", "versions": "0+" }
      ]}
    ]}],
  "commonStructs": [
      { "name": "ReplicaState", "versions": "0+", "fields": [
      { "name": "ReplicaId", "type": "int32", "versions": "0+", "entityType": "brokerId" },
      { "name": "LogEndOffset", "type": "int64", "versions": "0+",
        "about": "The last known log end offset of the follower or -1 if it is unknown"},
      { "name": "LastFetchTime", "type": "int64", "versions": "1+",
        "about": "The last known leader wall clock time when a follower fetched from the leader. This is reported as -1 both for the current leader and for a voter whose last fetch time is unknown"},
      { "name": "LastCaughtUpTime", "type": "int64", "versions": "1+",
        "about": "The leader wall clock append time of the offset for which the follower made the most recent fetch request. This is reported as the current time for the leader and as -1 if unknown for a voter"}
    ]}
  ]
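
As an illustration of the sentinel values described in the schema above, the following is a minimal client-side sketch in Java. The class ReplicaStateView and its methods are hypothetical and are not part of any Kafka API. It assumes the two timestamps are in milliseconds and treats any negative value (the schema uses -1) as "unknown".

```java
import java.util.OptionalLong;

// Hypothetical client-side view of one ReplicaState entry; not a Kafka class.
final class ReplicaStateView {
    private final long lastFetchTime;     // raw LastFetchTime from the response
    private final long lastCaughtUpTime;  // raw LastCaughtUpTime from the response

    ReplicaStateView(long lastFetchTime, long lastCaughtUpTime) {
        this.lastFetchTime = lastFetchTime;
        this.lastCaughtUpTime = lastCaughtUpTime;
    }

    // Negative raw values (the schema uses -1) are mapped to "unknown".
    static OptionalLong known(long raw) {
        return raw < 0 ? OptionalLong.empty() : OptionalLong.of(raw);
    }

    OptionalLong lastFetchTime() {
        return known(lastFetchTime);
    }

    OptionalLong lastCaughtUpTime() {
        return known(lastCaughtUpTime);
    }

    // Time since the replica was last known to be caught up, if known.
    // Assumes both values are milliseconds on the leader's wall clock.
    OptionalLong caughtUpLagMs(long leaderNowMs) {
        OptionalLong t = lastCaughtUpTime();
        return t.isPresent()
            ? OptionalLong.of(Math.max(0L, leaderNowMs - t.getAsLong()))
            : OptionalLong.empty();
    }
}
```

Mapping -1 to an empty value keeps callers from accidentally doing arithmetic on the sentinel, for example when computing how long ago a voter was last caught up.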


Proposed Changes

This KIP proposes exposing the DescribeQuorum API to the admin client and adding two new fields (per voter) to the DescribeQuorum API response.

...

  1. Last Fetch Time
    This metric will be reported for each voter. This is a good approximation of the “liveness” of the voters and can be used to detect a network partition in the quorum.
    The leader already knows this information for every voter, so it only needs to be added to the response.

  2. Last Caught Up Time
    This metric will be reported for each voter. It is akin to the metric used to track lag for replicas in the ISR: it measures the approximate lag between the leader and the replica, based on the offsets requested in fetch requests and the times at which those requests were made.
    To compute this metric, the replica state maintains a few pieces of information about fetch requests as they are received. For each replica, the leader tracks the time, the requested offset, and the leader's own end offset for the most recent fetch request it processed. Whenever a new fetch request comes in, the replica's last caught up time is updated to the time of that request if the requested offset is greater than or equal to the leader's current end offset. Otherwise, the requested offset is compared to the leader's end offset at the time the previous request was received, and if it is greater than or equal to that offset, the last caught up time is updated to the time of the previous fetch request. This gives a notion of whether a replica is within an acceptable bound of lag from the leader. It is more accurate than relying solely on the offset difference because it factors the leader's load into the computation.
    Some of this is not currently tracked in the per-voter state that the Raft layer stores, but it can easily be added. The cost of tracking it is minimal: it only requires some additional bookkeeping in the existing fetch-handling path.
    NOTE: Given the leader is always caught up to itself, the Last Caught Up Time for the leader will be the leader's wall clock time when it responded to the DescribeQuorum request.
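
The bookkeeping described for the two metrics above can be sketched as follows. This is a minimal illustration in Java, not the Raft layer's actual code: the class FollowerFetchState and its method names are hypothetical. It assumes times are in milliseconds, that -1 denotes "unknown", and that offset comparisons are inclusive (the fetch offset equals the leader's end offset when the follower is fully caught up).

```java
// Hypothetical per-voter state kept by the leader; names are illustrative only.
final class FollowerFetchState {
    private long lastFetchTimeMs = -1L;              // -1 means no fetch seen yet
    private long lastFetchLeaderLogEndOffset = -1L;  // leader end offset at the previous fetch
    private long lastCaughtUpTimeMs = -1L;           // -1 means unknown

    // Called by the leader while it processes a fetch from this voter.
    void onFetch(long fetchOffset, long leaderLogEndOffset, long fetchTimeMs) {
        if (fetchOffset >= leaderLogEndOffset) {
            // The voter already has everything the leader has right now.
            lastCaughtUpTimeMs = Math.max(lastCaughtUpTimeMs, fetchTimeMs);
        } else if (lastFetchTimeMs >= 0 && fetchOffset >= lastFetchLeaderLogEndOffset) {
            // The voter has reached the leader's end offset as of the previous
            // fetch, so it counts as caught up as of that previous fetch time.
            lastCaughtUpTimeMs = Math.max(lastCaughtUpTimeMs, lastFetchTimeMs);
        }
        // Remember this fetch for the next comparison.
        lastFetchTimeMs = fetchTimeMs;
        lastFetchLeaderLogEndOffset = leaderLogEndOffset;
    }

    long lastFetchTimeMs() {
        return lastFetchTimeMs;
    }

    long lastCaughtUpTimeMs() {
        return lastCaughtUpTimeMs;
    }
}
```

For example, a voter that fetches offset 20 at time 300 while the leader's end offset is 30 is not caught up now, but if the leader's end offset was 20 at the previous fetch (time 200), its last caught up time advances to 200. For the leader's own entry, the response would instead carry -1 for the last fetch time and the current wall clock time for the last caught up time, as noted above.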

Compatibility, Deprecation, and Migration Plan

...