The contents of this KIP were authored by Jun Rao.
Status
Current state: Under Discussion
Discussion thread:
JIRA:
Please keep the discussion on the mailing list rather than commenting on the wiki (wiki discussions get unwieldy fast).
Motivation
Ensuring that the Kafka Controller is healthy is an important part of monitoring the health of a Kafka cluster. However, the metrics currently exposed are not sufficient for reliably detecting issues like slow progress or deadlocks. We propose a few new metrics to close this gap. Even though KAFKA-5028 will potentially fix existing deadlocks, there will still be known (and potentially unknown) issues that can cause slow or no progress, so these metrics will remain useful.
Public Interfaces
All of the following will be added via the Yammer metrics library like most of the broker metrics.
Controller Metrics
(1). kafka.controller:type=KafkaController,name=ControllerState
type: gauge
value: the state the controller is currently in
Valid states:
0 - idle
1 - starting
2 - resigning
3 - broker change
4 - topic creation
5 - topic deletion
6 - partition reassigning
7 - auto leader balancing
8 - manual leader balancing
9 - controlled shutdown
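As an illustration only, a monitoring tool could translate the numeric ControllerState gauge value back into a readable name. The enum below is a minimal sketch; the enum name, constant names and the fromValue helper are hypothetical and are not part of this proposal, only the numeric values come from the list above.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper: mirrors the ControllerState gauge values listed above.
// The constant names are illustrative; only the numbers come from the KIP.
enum ControllerState {
    IDLE(0),
    STARTING(1),
    RESIGNING(2),
    BROKER_CHANGE(3),
    TOPIC_CREATION(4),
    TOPIC_DELETION(5),
    PARTITION_REASSIGNING(6),
    AUTO_LEADER_BALANCING(7),
    MANUAL_LEADER_BALANCING(8),
    CONTROLLED_SHUTDOWN(9);

    // Reverse lookup table from gauge value to enum constant.
    private static final Map<Integer, ControllerState> BY_VALUE = new HashMap<>();
    static {
        for (ControllerState s : values()) {
            BY_VALUE.put(s.value, s);
        }
    }

    private final int value;

    ControllerState(int value) {
        this.value = value;
    }

    int value() {
        return value;
    }

    // Decodes a gauge reading; rejects values outside the documented range.
    static ControllerState fromValue(int value) {
        ControllerState s = BY_VALUE.get(value);
        if (s == null) {
            throw new IllegalArgumentException("Unknown controller state: " + value);
        }
        return s;
    }
}
```

A dashboard could, for example, alert when the decoded state stays at anything other than IDLE for longer than expected.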
(2). kafka.controller:type=ControllerStats,name=ControllerStartRateAndTimeMs
type: timer
value: rate and latency for the controller to start
(3). kafka.controller:type=ControllerStats,name=ControllerResignRateAndTimeMs
type: timer
value: rate and latency for the controller to resign
(4). kafka.controller:type=ControllerStats,name=TopicCreationRateAndTimeMs
type: timer
value: rate and latency for the controller to create new topics
(5). kafka.controller:type=ControllerStats,name=TopicDeletionRateAndTimeMs
type: timer
value: rate and latency for the controller to delete topics
(6). kafka.controller:type=ControllerStats,name=PartitionReassigningRateAndTimeMs
type: timer
value: rate and latency for the controller to reassign partitions
(7). kafka.controller:type=ControllerStats,name=AutoLeaderBalancingRateAndTimeMs
type: timer
value: rate and latency for the controller to auto balance the leaders
(8). kafka.controller:type=ControllerStats,name=ManualLeaderBalancingRateAndTimeMs
type: timer
value: rate and latency for the controller to manually balance the leaders
(9). kafka.controller:type=ControllerStats,name=QueueSize
type: gauge
value: the number of events currently waiting in the controller queue
(10). kafka.controller:type=ControllerStats,name=QueueTimeMs
type: histogram
value: how long an event waits in the controller queue before being processed
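To make the relationship between QueueSize and QueueTimeMs concrete, here is a minimal sketch of one way queue time can be measured: each event is stamped when it is enqueued and the wait is computed when it is dequeued. This is not the controller's actual event queue; the class TimedEventQueue, its methods and the injected clock are all hypothetical, and a real implementation would record each wait into a histogram rather than keep only the last value.

```java
import java.util.ArrayDeque;
import java.util.Queue;
import java.util.function.LongSupplier;

// Hypothetical sketch, not Kafka code: a queue that tracks how long
// each event waited between being enqueued and being dequeued.
final class TimedEventQueue<E> {

    // Pairs an event with the time at which it entered the queue.
    private static final class Entry<T> {
        final T event;
        final long enqueuedAtMs;

        Entry(T event, long enqueuedAtMs) {
            this.event = event;
            this.enqueuedAtMs = enqueuedAtMs;
        }
    }

    private final Queue<Entry<E>> queue = new ArrayDeque<>();
    private final LongSupplier clockMs; // injected so the clock can be faked in tests
    private long lastQueueTimeMs = -1;  // -1 until the first event is dequeued

    TimedEventQueue(LongSupplier clockMs) {
        this.clockMs = clockMs;
    }

    synchronized void put(E event) {
        queue.add(new Entry<>(event, clockMs.getAsLong()));
    }

    // Returns the next event, or null if the queue is empty.
    synchronized E poll() {
        Entry<E> entry = queue.poll();
        if (entry == null) {
            return null;
        }
        // A real implementation would update a histogram here.
        lastQueueTimeMs = clockMs.getAsLong() - entry.enqueuedAtMs;
        return entry.event;
    }

    // What a QueueSize-style gauge would report.
    synchronized int size() {
        return queue.size();
    }

    // Wait time of the most recently dequeued event, in milliseconds.
    synchronized long lastQueueTimeMs() {
        return lastQueueTimeMs;
    }
}
```

In production code the clock would typically be `System::currentTimeMillis`. A growing size combined with rising queue times suggests that the controller is processing events more slowly than they arrive.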
ControllerChannelManager Metrics
We also want to know the size of the queue in ControllerChannelManager:
(11). kafka.controller:type=ControllerChannelManager,name=TotalQueueSize
type: gauge
(12). kafka.controller:type=ControllerChannelManager,name=QueueSize,brokerId=10
type: gauge
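Because the per-broker gauge carries a brokerId tag, a monitoring client has to match a family of metric names rather than a single one. The sketch below assumes the metrics are exposed over JMX under exactly the names listed above and uses the standard javax.management.ObjectName pattern support to select them. The class ChannelManagerMetricNames and its methods are hypothetical helpers, not part of this proposal.

```java
import javax.management.MalformedObjectNameException;
import javax.management.ObjectName;

// Hypothetical helper, assuming the metrics appear in JMX under the
// names given in this KIP.
final class ChannelManagerMetricNames {

    // Matches the per-broker QueueSize gauge for any brokerId value.
    static final String PER_BROKER_PATTERN =
        "kafka.controller:type=ControllerChannelManager,name=QueueSize,brokerId=*";

    private ChannelManagerMetricNames() {
    }

    private static ObjectName parse(String name) {
        try {
            return new ObjectName(name);
        } catch (MalformedObjectNameException e) {
            throw new IllegalArgumentException("Invalid MBean name: " + name, e);
        }
    }

    // True if the given MBean name is a per-broker queue size gauge.
    static boolean isPerBrokerQueueSize(String mbeanName) {
        return parse(PER_BROKER_PATTERN).apply(parse(mbeanName));
    }

    // Extracts the brokerId tag from a per-broker metric name.
    static int brokerIdOf(String mbeanName) {
        String id = parse(mbeanName).getKeyProperty("brokerId");
        if (id == null) {
            throw new IllegalArgumentException("No brokerId in: " + mbeanName);
        }
        return Integer.parseInt(id);
    }
}
```

The same pattern string could be passed to `MBeanServerConnection.queryNames` to list the queue size gauges of all brokers that the controller is connected to.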
Partition Metrics
Quite a few JIRAs have reported continuous errors due to "Cached zkVersion 54 not equal to that in zookeeper, skip updating ISR", so it would be useful to measure the rate of failed ISR updates in ZK.
(13). kafka.cluster:type=Partition,name=FailedIsrUpdatesPerSec
type: meter
Proposed Changes
We will add each metric, with the metric type listed, to one of KafkaController, ControllerStats, ControllerChannelManager or Partition, as specified in the Public Interfaces section.
Compatibility, Deprecation, and Migration Plan
We are introducing new metrics so there is no compatibility impact.
Rejected Alternatives
- Don't add these metrics: these issues are currently difficult to detect, they impact cluster health, and the overhead of the proposed metrics is low.
- Use Kafka metrics instead of Yammer metrics: most of the broker metrics use Yammer metrics, so it makes sense to stick with them until we have a plan for migrating them all to Kafka metrics.