The contents of this KIP were authored by Jun Rao.

Status

Current state: Draft

Discussion thread:

JIRA:

Please keep the discussion on the mailing list rather than commenting on the wiki (wiki discussions get unwieldy fast).

Motivation

We are aware of a number of controller issues that we intend to resolve in a redesign. That effort is just starting, so it will be a while before it is ready. In the meantime, we need to be able to detect these issues.

For known controller issues, such as deadlocks and slow progress, we want a reliable way to detect that they have occurred.

Public Interfaces

New metrics in controller:
(1) kafka.controller:type=KafkaController,name=ControllerState: 
type: Gauge
value: reporting the state the controller is in 
Valid states:
0 - idle
1 - starting
2 - resigning
3 - broker change
4 - topic creation
5 - topic deletion
6 - partition reassigning
7 - auto leader balancing
8 - manual leader balancing
9 - controlled shutdown
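
As an illustration only, the state values above could be modeled as an enum whose numeric value is what the gauge reports. The enum name, constant names, and the fromValue helper are hypothetical and are not part of this proposal; only the numeric values come from the list above.

```java
import java.util.Arrays;
import java.util.Optional;

// Hypothetical sketch: constant names are illustrative, the numeric
// values mirror the ControllerState list in this KIP.
enum ControllerState {
    IDLE(0),
    STARTING(1),
    RESIGNING(2),
    BROKER_CHANGE(3),
    TOPIC_CREATION(4),
    TOPIC_DELETION(5),
    PARTITION_REASSIGNING(6),
    AUTO_LEADER_BALANCING(7),
    MANUAL_LEADER_BALANCING(8),
    CONTROLLED_SHUTDOWN(9);

    private final int value;

    ControllerState(int value) {
        this.value = value;
    }

    // The number a gauge would report for this state.
    int value() {
        return value;
    }

    // Maps a reported gauge value back to a state; empty if the value is unknown.
    static Optional<ControllerState> fromValue(int value) {
        return Arrays.stream(values()).filter(s -> s.value == value).findFirst();
    }
}
```

A monitoring tool could use such a mapping to turn a gauge reading into a readable state name, and to alert when the controller stays in a non-idle state for an unusually long time.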

(2). kafka.controller:type=ControllerStats,name=ControllerStartRateAndTimeMs
type: timer
value: rate and latency for the controller to start

(3). kafka.controller:type=ControllerStats,name=ControllerResignRateAndTimeMs
type: timer
value: rate and latency for the controller to resign

(4). kafka.controller:type=ControllerStats,name=TopicCreationRateAndTimeMs
type: timer
value: rate and latency for the controller to create new topics

(5). kafka.controller:type=ControllerStats,name=TopicDeletionRateAndTimeMs
type: timer
value: rate and latency for the controller to delete topics

(6). kafka.controller:type=ControllerStats,name=PartitionReassigningRateAndTimeMs
type: timer
value: rate and latency for the controller to reassign partitions

(7). kafka.controller:type=ControllerStats,name=AutoLeaderBalancingRateAndTimeMs
type: timer
value: rate and latency for the controller to auto balance the leaders

(8). kafka.controller:type=ControllerStats,name=ManualLeaderBalancingRateAndTimeMs
type: timer
value: rate and latency for the controller to manually balance the leaders

(9). kafka.controller:type=ControllerStats,name=ControlledShutdownRateAndTimeMs
type: timer
value: rate and latency for the controller to shut down a broker in a controlled way

A ControllerStats BrokerChangeRateAndTimeMs metric is not needed because LeaderElectionRateAndTimeMs already exists.

We also want to know the size of the queue in ControllerChannelManager:
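
A minimal sketch of how a monitoring client might address the timers above over JMX, using only the standard javax.management API. The class and method names are hypothetical. The attribute name passed to read (for example "Count" or "Mean") is an assumption about what the metrics reporter exposes and should be checked against a running broker.

```java
import java.io.IOException;
import javax.management.JMException;
import javax.management.MBeanServerConnection;
import javax.management.MalformedObjectNameException;
import javax.management.ObjectName;

// Hypothetical helper, not part of Kafka.
final class ControllerTimerMetrics {
    private ControllerTimerMetrics() {}

    // Builds the MBean name for one of the ControllerStats timers,
    // e.g. objectName("TopicDeletionRateAndTimeMs").
    static ObjectName objectName(String metricName) {
        try {
            return new ObjectName("kafka.controller:type=ControllerStats,name=" + metricName);
        } catch (MalformedObjectNameException e) {
            throw new IllegalArgumentException("Invalid metric name: " + metricName, e);
        }
    }

    // Reads a single attribute of a timer MBean over an existing JMX connection.
    // Which attributes exist is an assumption; verify against the broker.
    static Object read(MBeanServerConnection conn, String metricName, String attribute)
            throws JMException, IOException {
        return conn.getAttribute(objectName(metricName), attribute);
    }
}
```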

(10) kafka.controller:type=ControllerChannelManager,name=TotalQueueSize

(11) kafka.controller:type=ControllerChannelManager,name=QueueSize,brokerId=10
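
The sketch below shows one way the two queue metrics could relate. It assumes one request queue per broker and that TotalQueueSize is the sum of the per-broker QueueSize values; this KIP does not state that relationship, and the class and method names are hypothetical.

```java
import java.util.Map;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.LinkedBlockingQueue;

// Hypothetical sketch, not the actual ControllerChannelManager.
final class QueueSizeSketch {
    // One pending-request queue per broker id (an assumption for illustration).
    private final Map<Integer, BlockingQueue<String>> queues = new ConcurrentHashMap<>();

    void enqueue(int brokerId, String request) {
        queues.computeIfAbsent(brokerId, id -> new LinkedBlockingQueue<>()).add(request);
    }

    // What a per-broker QueueSize gauge (tagged with brokerId) would report.
    int queueSize(int brokerId) {
        BlockingQueue<String> queue = queues.get(brokerId);
        return queue == null ? 0 : queue.size();
    }

    // What a TotalQueueSize gauge would report under the summing assumption.
    int totalQueueSize() {
        return queues.values().stream().mapToInt(BlockingQueue::size).sum();
    }
}
```

A steadily growing per-broker value would point at a single slow or unreachable broker, while a growing total with evenly spread per-broker values would suggest the controller itself is falling behind.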

Quite a few JIRAs have reported continuous errors due to "Cached zkVersion 54 not equal to that in zookeeper, skip updating ISR", so it would be useful to measure the rate of failed ISR updates in ZK.

(12) kafka.cluster:type=Partition,name=FailedIsrUpdateRate

Proposed Changes

Describe the new thing you want to do in appropriate detail. This may be fairly extensive and have large subsections of its own. Or it may be a few sentences. Use judgement based on the scope of the change.

Compatibility, Deprecation, and Migration Plan

We are introducing new metrics so there is no compatibility impact.

Rejected Alternatives

  • Don't add these metrics: rejected because these issues are currently difficult to detect, they impact cluster health, and the overhead of the proposed metrics is low.

 

 
