The contents of this KIP were authored by Jun Rao.

Status

Current state: Adopted

Discussion thread: here

JIRA: KAFKA-5135

Please keep the discussion on the mailing list rather than commenting on the wiki (wiki discussions get unwieldy fast).

Motivation

We are aware of a number of controller issues that we intend to resolve in a redesign. That effort is just starting so it will be a while before it will be ready. In the meantime, we need to be able to detect issues.

For known controller issues, such as deadlocks and slow progress, we want a reliable way to detect when they occur.

Ensuring that the Kafka Controller is healthy is an important part of monitoring the health of a Kafka Cluster. However, the metrics currently exposed are not sufficient for reliably detecting issues like slow progress or deadlocks. We propose a few new metrics that will solve this issue. Even though KAFKA-5028 will potentially fix existing deadlocks, there will still be known (and potentially unknown) issues that can cause slow or no progress so these metrics will still be useful.

Public Interfaces

All of the following will be added via the Yammer metrics library like most of the broker metrics. Retrieving a metric value will not acquire any Controller locks (which was an issue in the past).

Controller Metrics

(1) kafka.controller:type=KafkaController,name=ControllerState

type: gauge

value: reporting the state the controller is in, i.e. the event that is currently being processed. Some actions like partition reassignment may take a while and include many events (potentially interleaved with other events), but that doesn't change the fact that at most one event is processed at a time.

Valid states (the events that make up each state are in brackets):

0 - idle
1 - controller change (Startup, ControllerChange, Reelect)
2 - broker change (BrokerChange)
3 - topic creation/change (TopicChange, PartitionModifications)
4 - topic deletion (TopicDeletion, TopicDeletionStopReplicaResult)
5 - partition reassignment (PartitionReassignment, PartitionReassignmentIsrChange)
6 - auto leader balance (AutoPreferredReplicaLeaderElection)
7 - manual leader balance (PreferredReplicaLeaderElection)
8 - controlled shutdown (ControlledShutdown)
9 - isr change (IsrChangeNotification)
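
As a rough illustration of how a monitoring script might consume this gauge, the sketch below maps the numeric values listed above to names and flags a controller whose recent readings are all the same non-idle state. The `ControllerState` enum, the `looks_stuck` helper and the `min_samples` threshold are hypothetical names chosen for this example and are not part of Kafka. How the gauge readings are collected (for example over JMX) is left out.

```python
from enum import IntEnum
from typing import Sequence


class ControllerState(IntEnum):
    """Hypothetical mirror of the ControllerState gauge values in this KIP."""
    IDLE = 0
    CONTROLLER_CHANGE = 1
    BROKER_CHANGE = 2
    TOPIC_CHANGE = 3
    TOPIC_DELETION = 4
    PARTITION_REASSIGNMENT = 5
    AUTO_LEADER_BALANCE = 6
    MANUAL_LEADER_BALANCE = 7
    CONTROLLED_SHUTDOWN = 8
    ISR_CHANGE = 9


def looks_stuck(samples: Sequence[int], min_samples: int = 5) -> bool:
    """Return True if the last `min_samples` readings are the same non-idle state.

    This is only a heuristic: identical readings may also come from many
    short events of the same type, so a real alert would likely combine it
    with the per-state timers.
    """
    if len(samples) < min_samples:
        return False
    tail = samples[-min_samples:]
    state = ControllerState(tail[-1])
    return state is not ControllerState.IDLE and all(s == tail[-1] for s in tail)
```

For example, `looks_stuck([0, 5, 5, 5, 5, 5])` reports a possible stall in partition reassignment, while a run of idle readings never triggers.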

For each state there is a timer that reports rate and time, with two exceptions: BrokerChange (currently tracked as LeaderElectionRateAndTimeMs) and ControlledShutdown (tracked via RequestQueueTimeMs for the ControlledShutdown request).

(2) kafka.controller:type=ControllerStats,name=ControllerChangeRateAndTimeMs
type: timer
value: rate and latency for the controller to change state

(3) kafka.controller:type=ControllerStats,name=TopicChangeRateAndTimeMs
type: timer
value: rate and latency for the controller to create or change topics

(4) kafka.controller:type=ControllerStats,name=TopicDeletionRateAndTimeMs
type: timer
value: rate and latency for the controller to delete topics

(5) kafka.controller:type=ControllerStats,name=PartitionReassignmentRateAndTimeMs
type: timer
value: rate and latency for the controller to reassign partitions

(6) kafka.controller:type=ControllerStats,name=AutoLeaderBalanceRateAndTimeMs
type: timer
value: rate and latency for the controller to auto balance the leaders

(7) kafka.controller:type=ControllerStats,name=ManualLeaderBalanceRateAndTimeMs
type: timer
value: rate and latency for the controller to manually balance the leaders

(8) kafka.controller:type=ControllerStats,name=IsrChangeRateAndTimeMs
type: timer
value: rate and latency for the controller to process ISR changes

There is no need for ControllerStats.BrokerChangeRateAndTimeMs because LeaderElectionRateAndTimeMs already exists.

ControllerChannelManager Metrics

We also want to know the size of the queue in ControllerChannelManager:

(9) kafka.controller:type=ControllerChannelManager,name=TotalQueueSize

type: gauge

(10) kafka.controller:type=ControllerChannelManager,name=QueueSize,brokerId=10

type: gauge
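
To make the naming scheme concrete, here is a minimal sketch that splits a metric name of the form shown above into its domain and key properties, then sums the per-broker QueueSize gauges. Both `parse_mbean_name` and `total_queue_size` are hypothetical helpers written for this example. The sketch assumes that TotalQueueSize corresponds to the sum of the per-broker queue sizes, which this KIP does not state explicitly, and it does not handle quoted values in JMX object names.

```python
from typing import Dict, Tuple


def parse_mbean_name(name: str) -> Tuple[str, Dict[str, str]]:
    """Split 'domain:key=value,key=value' into (domain, properties)."""
    domain, _, props = name.partition(":")
    properties = dict(item.split("=", 1) for item in props.split(","))
    return domain, properties


def total_queue_size(gauges: Dict[str, int]) -> int:
    """Sum the per-broker QueueSize gauges, keyed by their metric names.

    Assumption: the total is simply the sum over all brokers.
    """
    total = 0
    for name, value in gauges.items():
        _, props = parse_mbean_name(name)
        if (props.get("type") == "ControllerChannelManager"
                and props.get("name") == "QueueSize"):
            total += value
    return total
```

Gauges of other types, such as ControllerState, are ignored by the filter, so a mixed dictionary of readings can be passed in directly.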

Partition Metrics

Quite a few JIRAs report continuous errors of the form "Cached zkVersion 54 not equal to that in zookeeper, skip updating ISR", so it would be useful to measure how often ISR updates in ZK fail.

(11) kafka.cluster:type=Partition,name=FailedIsrUpdatesPerSec

type: meter
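
A meter exposes a cumulative event count alongside its rates, so one simple way to alert on this metric is to compare two count readings taken some time apart. The sketch below shows that calculation. The `rate_per_sec` helper and its parameter names are hypothetical, and the choice of alert threshold is left to the operator.

```python
def rate_per_sec(count_start: int, count_end: int, elapsed_sec: float) -> float:
    """Average events per second between two cumulative count readings."""
    if elapsed_sec <= 0:
        raise ValueError("elapsed_sec must be positive")
    return (count_end - count_start) / elapsed_sec
```

For instance, 60 failed ISR updates over a 30 second window gives an average of 2 failures per second.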

Proposed Changes

We will add the relevant metric type to one of KafkaController, ControllerStats, ControllerChannelManager or Partition as specified in the Public Interfaces section.

Compatibility, Deprecation, and Migration Plan

We are introducing new metrics so there is no compatibility impact.

Rejected Alternatives

  1. Don't add these metrics: it's currently difficult to detect these issues, they impact cluster health and the overhead of the proposed metrics is low.
  2. Use Kafka metrics instead of Yammer metrics: most of the broker metrics use Yammer Metrics so it makes sense to stick with that until we have a plan on how to migrate them all to Kafka Metrics.

Future work

  1. KAFKA-5028 introduced a queue for Controller events. It would be useful to have a gauge for the queue size and a histogram for how long an event waits in the queue before being processed. However, we are in the process of making additional changes to improve the handling of soft failures and there's a possibility that the controller queue could be replaced by a broker queue for all ZK communication. We will see how that develops before deciding which metrics should be exposed. In the meantime, the ControllerState and other metrics should provide enough information to issue an alert if the Controller is not healthy.

 
