...
Status
Current state: Adopted
Discussion thread: here
JIRA:
...
(1) kafka.controller:type=KafkaController,name=ControllerState
type: gauge
value: reporting the state the controller is in, i.e. the event that is currently being processed. Some actions like partition reassignment may take a while and include many events (potentially interleaved with other events), but that doesn't change the fact that at most one event is processed at a time.
Valid states (the events that make up each state are in brackets):
0 - idle
1 - controller change (Startup, ControllerChange, Reelect)
2 - broker change (BrokerChange)
3 - topic creation/change (TopicChange, PartitionModifications)
4 - topic deletion (TopicDeletion, TopicDeletionStopReplicaResult)
5 - partition reassignment (PartitionReassignment, PartitionReassignmentIsrChange)
6 - auto leader balance (AutoPreferredReplicaLeaderElection)
7 - manual leader balance (PreferredReplicaLeaderElection)
8 - controlled shutdown (ControlledShutdown)
9 - isr change (IsrChangeNotification)
For each state, there's a timer reporting the rate and time, with two exceptions: BrokerChange (currently tracked as LeaderElectionRateAndTimeMs) and ControlledShutdown (tracked via RequestQueueTimeMs for the ControlledShutdown request).
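For monitoring code that reads the ControllerState gauge, the numeric values can be turned back into readable states with a small lookup. The following is only a sketch: the enum `ControllerStateValue` and its method `fromGauge` are hypothetical names invented for illustration, not part of Kafka, and the values assume the 0 to 9 numbering listed above.

```java
import java.util.Optional;

// Hypothetical helper (not part of Kafka): maps ControllerState gauge
// values to the states listed in this proposal.
enum ControllerStateValue {
    IDLE(0, "idle"),
    CONTROLLER_CHANGE(1, "controller change"),
    BROKER_CHANGE(2, "broker change"),
    TOPIC_CHANGE(3, "topic creation/change"),
    TOPIC_DELETION(4, "topic deletion"),
    PARTITION_REASSIGNMENT(5, "partition reassignment"),
    AUTO_LEADER_BALANCE(6, "auto leader balance"),
    MANUAL_LEADER_BALANCE(7, "manual leader balance"),
    CONTROLLED_SHUTDOWN(8, "controlled shutdown"),
    ISR_CHANGE(9, "isr change");

    final int value;
    final String description;

    ControllerStateValue(int value, String description) {
        this.value = value;
        this.description = description;
    }

    // Returns the state for a sampled gauge value, or empty if the value
    // is not one of the states assumed above.
    static Optional<ControllerStateValue> fromGauge(int gaugeValue) {
        for (ControllerStateValue state : values()) {
            if (state.value == gaugeValue) {
                return Optional.of(state);
            }
        }
        return Optional.empty();
    }
}
```

Returning an `Optional` rather than throwing keeps a dashboard or alerting job working if a later broker version reports a value this table does not know about.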
(2) kafka.controller:type=ControllerStats,name=ControllerChangeRateAndTimeMs
type: timer
value: rate and latency for the controller to change state
(3) kafka.controller:type=ControllerStats,name=TopicChangeRateAndTimeMs
type: timer
value: rate and latency for the controller to create or change topics
(4) kafka.controller:type=ControllerStats,name=TopicDeletionRateAndTimeMs
type: timer
value: rate and latency for the controller to delete topics
(5) kafka.controller:type=ControllerStats,name=PartitionReassignmentRateAndTimeMs
type: timer
value: rate and latency for the controller to reassign partitions
(6) kafka.controller:type=ControllerStats,name=AutoLeaderBalanceRateAndTimeMs
type: timer
value: rate and latency for the controller to auto balance the leaders
(7) kafka.controller:type=ControllerStats,name=ManualLeaderBalanceRateAndTimeMs
type: timer
value: rate and latency for the controller to manually balance the leaders
(8) kafka.controller:type=ControllerStats,name=IsrChangeRateAndTimeMs
type: timer
value: rate and latency for the controller to process ISR change notifications
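A JMX client would address the timer metrics above by object name. As a rough sketch, the names can be built and validated with the standard `javax.management.ObjectName` class. The class `ControllerTimerNames` is a hypothetical helper for illustration, not Kafka code, and the metric names are the ones proposed here, assuming the revised naming (for example `TopicChangeRateAndTimeMs`).

```java
import java.util.ArrayList;
import java.util.List;
import javax.management.MalformedObjectNameException;
import javax.management.ObjectName;

// Hypothetical helper (not part of Kafka): builds JMX object names for the
// ControllerStats timers proposed in this document.
class ControllerTimerNames {
    static final String[] TIMER_NAMES = {
        "ControllerChangeRateAndTimeMs",
        "TopicChangeRateAndTimeMs",
        "TopicDeletionRateAndTimeMs",
        "PartitionReassignmentRateAndTimeMs",
        "AutoLeaderBalanceRateAndTimeMs",
        "ManualLeaderBalanceRateAndTimeMs",
        "IsrChangeRateAndTimeMs"
    };

    // Returns one ObjectName per timer, in the order listed above.
    static List<ObjectName> timerObjectNames() {
        List<ObjectName> names = new ArrayList<>();
        for (String timer : TIMER_NAMES) {
            String raw = "kafka.controller:type=ControllerStats,name=" + timer;
            try {
                names.add(new ObjectName(raw));
            } catch (MalformedObjectNameException e) {
                // Not expected for these constant names; rethrow unchecked.
                throw new IllegalArgumentException(raw, e);
            }
        }
        return names;
    }
}
```

Each resulting name could then be passed to an `MBeanServerConnection` for the broker that is the active controller. Which attributes the timer exposes depends on the metrics library's JMX reporter and is not specified here.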
ControllerChannelManager Metrics
We also want to know the size of the queue in ControllerChannelManager:
(9) kafka.controller:type=ControllerChannelManager,name=TotalQueueSize
type: gauge
(10) kafka.controller:type=ControllerChannelManager,name=QueueSize,brokerId=10
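Because the per-broker queue metric carries the broker id as a key property, a JMX client can select all brokers at once with an object name pattern. The sketch below assumes the `brokerId` key exactly as written in the example above. `ChannelManagerQueueNames` is a hypothetical helper, not part of Kafka.

```java
import javax.management.MalformedObjectNameException;
import javax.management.ObjectName;

// Hypothetical helper (not part of Kafka): object names for the
// ControllerChannelManager queue size gauges.
class ChannelManagerQueueNames {
    private static final String BASE =
        "kafka.controller:type=ControllerChannelManager,name=";

    // Name of the aggregate gauge across all brokers.
    static ObjectName total() {
        return parse(BASE + "TotalQueueSize");
    }

    // Name of the per-broker gauge for one broker id.
    static ObjectName forBroker(int brokerId) {
        return parse(BASE + "QueueSize,brokerId=" + brokerId);
    }

    // Pattern matching the per-broker gauge of every broker.
    static ObjectName allBrokers() {
        return parse(BASE + "QueueSize,brokerId=*");
    }

    private static ObjectName parse(String raw) {
        try {
            return new ObjectName(raw);
        } catch (MalformedObjectNameException e) {
            throw new IllegalArgumentException(raw, e);
        }
    }
}
```

The pattern returned by `allBrokers()` could be handed to `MBeanServerConnection.queryNames` to discover which per-broker gauges currently exist.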
...
We are introducing new metrics so there is no compatibility impact.
Rejected Alternatives
- Don't add these metrics: these issues are currently difficult to detect, they impact cluster health, and the overhead of the proposed metrics is low.
- Use Kafka metrics instead of Yammer metrics: most of the broker metrics use Yammer Metrics so it makes sense to stick with that until we have a plan on how to migrate them all to Kafka Metrics.
Future work
- KAFKA-5028 introduced a queue for Controller events. It would be useful to have a gauge for the queue size and a histogram for how long an event waits in the queue before being processed. However, we are in the process of making additional changes to improve the handling of soft failures and there's a possibility that the controller queue could be replaced by a broker queue for all ZK communication. We will see how that develops before deciding which metrics should be exposed. In the meantime, the ControllerState and other metrics should provide enough information to issue an alert if the Controller is not healthy.
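One way an alert on the ControllerState gauge might be built is to flag a controller that reports the same non-idle state for longer than some threshold. This is a sketch under stated assumptions, not a recommendation from this proposal: the class `ControllerStuckDetector`, the threshold, and the definition of "not healthy" are all hypothetical. Because it works on sampled values, it cannot tell one long-running event from many consecutive events of the same type.

```java
// Hypothetical health check (not part of Kafka): flags a controller whose
// sampled ControllerState gauge stays in one non-idle state for too long.
class ControllerStuckDetector {
    private static final int IDLE = 0;

    private final long maxBusyMs;
    private int lastState = IDLE;
    private long stateSinceMs = -1; // time the current state was first seen

    ControllerStuckDetector(long maxBusyMs) {
        this.maxBusyMs = maxBusyMs;
    }

    // Records one sampled gauge value taken at nowMs and returns true if
    // the same non-idle state has been observed for more than maxBusyMs.
    boolean observe(int state, long nowMs) {
        if (stateSinceMs < 0 || state != lastState) {
            // First sample or a state transition: restart the clock.
            lastState = state;
            stateSinceMs = nowMs;
        }
        return state != IDLE && nowMs - stateSinceMs > maxBusyMs;
    }
}
```

A poller would call `observe` with each sampled gauge value and the sample time in milliseconds. In practice the threshold would probably need to be tuned per state, since operations such as partition reassignment can legitimately take a while.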
...