Comment: Address KIP discussion feedback

...

Ensuring that the Kafka Controller is healthy is an important part of monitoring the health of a Kafka cluster. However, the metrics currently exposed are not sufficient for reliably detecting issues like slow progress or deadlocks. We propose a few new metrics to close this gap. Even though KAFKA-5028 will potentially fix existing deadlocks, there will still be known (and potentially unknown) issues that can cause slow or no progress, so these metrics will remain useful.

Public Interfaces

...

All of the following will be added via the Yammer metrics library like most of the broker metrics.

Controller Metrics

(1) kafka.controller:type=KafkaController,name=ControllerState:

type: gauge

value: reporting the state the controller is in 
Valid states:
0 - idle
1 - starting
2 - resigning
3 - broker change
4 - topic creation
5 - topic deletion
6 - partition reassigning
7 - auto leader balancing
8 - manual leader balancing
9 - controlled shutdown
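
As an illustration of how a monitoring script might consume this gauge, the sketch below decodes the numeric value into the state names listed above and flags a controller that stays in the same non-idle state for too long. It is not part of the proposal: the helper names and the 60-second default threshold are hypothetical, and how the gauge value is sampled (for example over JMX) is deliberately left out.

```python
import time

# State codes as listed above for the ControllerState gauge.
CONTROLLER_STATES = {
    0: "idle",
    1: "starting",
    2: "resigning",
    3: "broker change",
    4: "topic creation",
    5: "topic deletion",
    6: "partition reassigning",
    7: "auto leader balancing",
    8: "manual leader balancing",
    9: "controlled shutdown",
}


def state_name(value):
    """Return the state name for a gauge value, or 'unknown(<value>)'."""
    return CONTROLLER_STATES.get(value, f"unknown({value})")


class StuckStateDetector:
    """Hypothetical helper: flags a controller stuck in one non-idle state."""

    def __init__(self, max_busy_seconds=60.0):
        # Assumed threshold; tune it to the cluster being monitored.
        self.max_busy_seconds = max_busy_seconds
        self._state = None
        self._since = None

    def observe(self, value, now=None):
        """Record one sample of the gauge.

        Returns True if the controller has been in the same non-idle
        state for longer than max_busy_seconds.
        """
        now = time.monotonic() if now is None else now
        if value != self._state:
            # State changed: restart the clock.
            self._state = value
            self._since = now
        return value != 0 and (now - self._since) > self.max_busy_seconds
```

A caller would sample the gauge periodically and pass each value to `observe`; a `True` result suggests slow progress or a deadlock worth investigating.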

...

(8). kafka.controller:type=ControllerStats,name=ManualLeaderBalancingRateAndTimeMs
type: timer
value: rate and latency for the controller to manually balance the leaders

(9). kafka.controller:type=ControllerStats,name=QueueSize

type: gauge

value: the size of the queue

(10). kafka.controller:type=ControllerStats,name=QueueTimeMs

type: histogram

value: how long an event is waiting in the controller queue before being processed

ControllerChannelManager Metrics

We also want to know the size of the queue in ControllerChannelManager:

(11) kafka.controller:type=ControllerChannelManager,name=TotalQueueSize

type: gauge

(12) kafka.controller:type=ControllerChannelManager,name=QueueSize,brokerId=10

type: gauge
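
To show how the per-broker gauges might be read by a monitoring script, the sketch below splits a metric name of the form shown above into its domain and key properties, then sums the per-broker QueueSize values. It is an illustration only. It assumes that TotalQueueSize corresponds to the sum of the per-broker queue sizes, which this document does not state, and the function names are hypothetical. The parser handles only simple unquoted `key=value` properties.

```python
def parse_mbean_name(name):
    """Split 'domain:key=value,key=value' into (domain, {key: value}).

    Simplified: quoted values and wildcard patterns are not handled.
    """
    domain, _, props = name.partition(":")
    pairs = (item.split("=", 1) for item in props.split(","))
    return domain, {key: value for key, value in pairs}


def total_queue_size(samples):
    """Sum the per-broker QueueSize gauges.

    samples maps a metric name to its sampled gauge value.
    Assumption: the total is the sum over all brokerId-tagged gauges.
    """
    total = 0
    for name, value in samples.items():
        domain, props = parse_mbean_name(name)
        if (
            domain == "kafka.controller"
            and props.get("type") == "ControllerChannelManager"
            and props.get("name") == "QueueSize"
            and "brokerId" in props
        ):
            total += value
    return total
```

Comparing this sum against the per-broker values makes it easy to see whether a backlog is concentrated on a single broker's queue.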

Partition Metrics

Quite a few JIRAs have reported continuous errors of the form "Cached zkVersion 54 not equal to that in zookeeper, skip updating ISR", so it would be useful to measure how often ISR updates fail in ZooKeeper.

(13) kafka.cluster:type=Partition,name=FailedIsrUpdatesPerSec

type: meter
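
A Yammer meter exposes a cumulative event count alongside its moving-average rates. As a sketch of how an external monitor could derive its own failure rate over a chosen window, the function below computes events per second from two samples of that count. The function name is hypothetical, and treating a decreasing count as a broker restart is an assumption of this sketch rather than something specified here.

```python
def rate_per_sec(count_before, count_after, elapsed_seconds):
    """Events per second between two samples of a meter's cumulative count."""
    if elapsed_seconds <= 0:
        raise ValueError("elapsed_seconds must be positive")
    if count_after < count_before:
        # Assumed: the count restarted from zero (e.g. the broker restarted),
        # so only the events since the restart are counted.
        delta = count_after
    else:
        delta = count_after - count_before
    return delta / elapsed_seconds
```

A sustained non-zero rate would point to a partition that repeatedly fails to update its ISR in ZooKeeper.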

Proposed Changes

We will add each metric, with the type given above, to one of KafkaController, ControllerStats, ControllerChannelManager or Partition, as specified in the Public Interfaces section.

...

  • Don't add these metrics: it's currently difficult to detect these issues, they impact cluster health and the overhead of the proposed metrics is low.
  • Use Kafka metrics instead of Yammer metrics: most of the broker metrics use Yammer Metrics so it makes sense to stick with that until we have a plan on how to migrate them all to Kafka Metrics.