Status
Current state: Under Discussion
Discussion thread:
JIRA:
Please keep the discussion on the mailing list rather than commenting on the wiki (wiki discussions get unwieldy fast).
Motivation
This KIP adds several new metrics to KRaft mode to make its performance easier to measure.
Public Interfaces
We propose adding the following new metrics, which will be present only in KRaft mode.
Name | Context | Type | Description |
---|---|---|---|
kafka.controller:type=KafkaController,name=TimedOutBrokerHeartbeatCount | Controller | Long | The number of broker heartbeats that timed out on this controller since the process was started. Note that only active controllers handle heartbeats, so only they will see increases in this metric. |
kafka.controller:type=KafkaController,name=EventQueueOperationsPerformedCount | Controller | Long | The total number of event queue operations that were performed. This includes deferred operations. |
kafka.controller:type=KafkaController,name=EventQueueOperationsTimedOutCount | Controller | Long | The total number of event queue operations that timed out before they could be performed. |
kafka.controller:type=KafkaController,name=NewActiveControllersCount | Controller | Long | Counts the number of times this node has seen a new controller elected. A transition to the "no leader" state is not counted here. If the same controller as before becomes active, that still counts. |
kafka.server:type=MetadataLoader,name=CurrentMetadataVersion | Broker and Controller | Integer | Outputs the current effective metadata version as an integer value. |
kafka.server:type=MetadataLoader,name=HandleLoadSnapshotCount | Broker and Controller | Long | The total number of times we have loaded a KRaft snapshot since the process was started. |
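For readers wiring these metrics into a JMX-based collector, the names in the table follow the usual JMX object name layout of a domain followed by `type` and `name` keys. The sketch below only parses the proposed names with `javax.management.ObjectName` to show how those parts separate. The class `KraftMetricNames` is a hypothetical helper for illustration, not part of Kafka, and no connection to a broker or controller is made.

```java
import javax.management.MalformedObjectNameException;
import javax.management.ObjectName;

// Hypothetical helper for illustration only; not part of Kafka.
class KraftMetricNames {
    // The six MBean names proposed in the table above.
    static final String[] PROPOSED = {
        "kafka.controller:type=KafkaController,name=TimedOutBrokerHeartbeatCount",
        "kafka.controller:type=KafkaController,name=EventQueueOperationsPerformedCount",
        "kafka.controller:type=KafkaController,name=EventQueueOperationsTimedOutCount",
        "kafka.controller:type=KafkaController,name=NewActiveControllersCount",
        "kafka.server:type=MetadataLoader,name=CurrentMetadataVersion",
        "kafka.server:type=MetadataLoader,name=HandleLoadSnapshotCount"
    };

    // Splits a JMX object name into "domain / type / name".
    static String describe(String mbean) {
        try {
            ObjectName objectName = new ObjectName(mbean);
            return objectName.getDomain()
                + " / " + objectName.getKeyProperty("type")
                + " / " + objectName.getKeyProperty("name");
        } catch (MalformedObjectNameException e) {
            throw new IllegalArgumentException("Not a valid JMX object name: " + mbean, e);
        }
    }

    public static void main(String[] args) {
        for (String mbean : PROPOSED) {
            System.out.println(describe(mbean));
        }
    }
}
```

A collector would typically use the domain to tell the controller metrics (`kafka.controller`) apart from the metadata loader metrics (`kafka.server`).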
Implementation Notes
In order to avoid excessive performance impacts from these new metrics, none of them will require additional locks.
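One common way to keep counters like these lock-free is to back each one with a `java.util.concurrent.atomic.AtomicLong`. The following is a minimal sketch of that approach, not the actual implementation: the class `ControllerCounterSketch` and its method names are hypothetical, and the KIP does not state which mechanism Kafka uses.

```java
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical holder for one of the proposed counters; not Kafka's actual class.
class ControllerCounterSketch {
    private final AtomicLong timedOutHeartbeats = new AtomicLong();

    // Called by the thread that detects the timeout: one atomic add, no lock.
    void recordTimedOutHeartbeat() {
        timedOutHeartbeats.incrementAndGet();
    }

    // Called by the metrics reporter thread: a volatile read, no lock.
    long timedOutHeartbeatCount() {
        return timedOutHeartbeats.get();
    }

    // Increments the counter from several threads at once and returns the final value.
    static long hammer(int threads, int incrementsPerThread) {
        ControllerCounterSketch counters = new ControllerCounterSketch();
        Thread[] workers = new Thread[threads];
        for (int i = 0; i < threads; i++) {
            workers[i] = new Thread(() -> {
                for (int j = 0; j < incrementsPerThread; j++) {
                    counters.recordTimedOutHeartbeat();
                }
            });
            workers[i].start();
        }
        for (Thread worker : workers) {
            try {
                worker.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException(e);
            }
        }
        return counters.timedOutHeartbeatCount();
    }

    public static void main(String[] args) {
        // No increments are lost even without synchronization.
        System.out.println(hammer(4, 10_000)); // prints 40000
    }
}
```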
Rationale
TimedOutBrokerHeartbeatCount
This metric is useful to monitor because broker heartbeats timing out indicates a performance problem on the active controller.
EventQueueOperationsPerformedCount
This is a rough measure of how busy the controller is. This lets us know how many operations per second different quorum controller clusters can perform.
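Because the metric is a cumulative total, an operations-per-second figure has to be derived from two samples. The sketch below shows one way a monitoring tool might do that. `CounterRate` is a hypothetical helper, and returning `NaN` when the counter goes backwards (for example after a process restart) is an assumption made for this illustration.

```java
// Hypothetical helper for illustration only; not part of Kafka.
class CounterRate {
    // Average operations per second between two samples of a cumulative counter.
    static double perSecond(long earlierCount, long laterCount, long elapsedMillis) {
        if (elapsedMillis <= 0) {
            throw new IllegalArgumentException("elapsedMillis must be positive");
        }
        if (laterCount < earlierCount) {
            // The counter went backwards, most likely because the process restarted.
            // This sketch reports "unknown" rather than guessing a rate.
            return Double.NaN;
        }
        return (laterCount - earlierCount) * 1000.0 / elapsedMillis;
    }

    public static void main(String[] args) {
        // 3000 operations over 60 seconds.
        System.out.println(perSecond(1_000, 4_000, 60_000)); // prints 50.0
    }
}
```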
EventQueueOperationsTimedOutCount
This is a rough measure of how much load we are shedding by means of timeouts. If we see this increase faster than TimedOutBrokerHeartbeatCount, we know that operations other than heartbeats are being impacted by timeouts.
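The comparison described above can be sketched as a check on the increase of each counter over the same interval. `TimeoutComparison` and its parameter names are hypothetical. The code only compares the two increases, as the text does, and makes no assumption about how the two counters overlap.

```java
// Hypothetical helper for illustration only; not part of Kafka.
class TimeoutComparison {
    // True when EventQueueOperationsTimedOutCount grew faster than
    // TimedOutBrokerHeartbeatCount between two samples taken at the same times.
    static boolean otherOperationsAffected(long queueTimeoutsBefore, long queueTimeoutsAfter,
                                           long heartbeatTimeoutsBefore, long heartbeatTimeoutsAfter) {
        long queueDelta = queueTimeoutsAfter - queueTimeoutsBefore;
        long heartbeatDelta = heartbeatTimeoutsAfter - heartbeatTimeoutsBefore;
        return queueDelta > heartbeatDelta;
    }

    public static void main(String[] args) {
        // Queue timeouts rose by 40 while heartbeat timeouts rose by 5.
        System.out.println(otherOperationsAffected(10, 50, 3, 8)); // prints true
    }
}
```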
NewActiveControllersCount
The main reason to monitor this metric is to make sure we are not electing too many new controllers per minute.
CurrentMetadataVersion
This metric simply reflects the current metadata version. It is useful for administrators with multiple clusters who want to ensure that they are all up to date. It is also helpful for seeing at a glance when a metadata version transition occurred.
HandleLoadSnapshotCount
This metric counts the number of times we have loaded a metadata snapshot. Loading a snapshot is an O(N) operation, since it involves reloading the full metadata state, so it is helpful to know when this has occurred.
Compatibility, Deprecation, and Migration Plan
These are newly exposed metrics, so there will be no impact on existing Kafka versions.
Test Plan
We will add JUnit tests to verify the new metrics.
Rejected Alternatives
Rather than adding NewActiveControllersCount, we could monitor existing metrics such as the metadata log epoch, or the ActiveController metric that is either 0 or 1. However, these alternatives are not as good, for the following reasons.
- The metadata log epoch may increase multiple times during a Raft election even when only one new leader results.
- Measuring transitions in the ActiveController metric is difficult in many metrics collection systems. It's also easy to lose track of a transition if the sampling period is too long.
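The sampling problem in the second point can be illustrated with a small simulation. The sketch below assumes a second-by-second timeline of which node is the active controller, with `-1` meaning "no leader". The class `SamplingSketch`, the timeline format, and the rule for counting an election (a leader is present and differs from the previous second) are simplifications made for this example, not Kafka code. In particular, this model cannot represent the same controller being re-elected without an intervening "no leader" second.

```java
// Hypothetical simulation for illustration only; not part of Kafka.
class SamplingSketch {
    static final int NO_LEADER = -1;

    // Cumulative number of elections seen up to and including each second.
    // A move to NO_LEADER is not counted; a move to any leader is.
    static long[] electionCounter(int[] leaderBySecond) {
        long[] counter = new long[leaderBySecond.length];
        long count = 0;
        for (int t = 1; t < leaderBySecond.length; t++) {
            boolean leaderPresent = leaderBySecond[t] != NO_LEADER;
            boolean changed = leaderBySecond[t] != leaderBySecond[t - 1];
            if (leaderPresent && changed) {
                count++;
            }
            counter[t] = count;
        }
        return counter;
    }

    // Transitions of the 0/1 "am I active" gauge that a collector sees
    // when it samples every periodSeconds.
    static int observedGaugeTransitions(int[] leaderBySecond, int thisNode, int periodSeconds) {
        if (periodSeconds <= 0) {
            throw new IllegalArgumentException("periodSeconds must be positive");
        }
        int transitions = 0;
        int previous = leaderBySecond[0] == thisNode ? 1 : 0;
        for (int t = periodSeconds; t < leaderBySecond.length; t += periodSeconds) {
            int current = leaderBySecond[t] == thisNode ? 1 : 0;
            if (current != previous) {
                transitions++;
            }
            previous = current;
        }
        return transitions;
    }

    // Elections that the same collector sees by subtracting counter samples.
    static long observedElections(int[] leaderBySecond, int periodSeconds) {
        if (periodSeconds <= 0) {
            throw new IllegalArgumentException("periodSeconds must be positive");
        }
        long[] counter = electionCounter(leaderBySecond);
        int lastSample = ((leaderBySecond.length - 1) / periodSeconds) * periodSeconds;
        return counter[lastSample] - counter[0];
    }

    public static void main(String[] args) {
        // Node 1 loses leadership to node 2 and regains it, all within five seconds.
        int[] timeline = {1, 1, NO_LEADER, 2, NO_LEADER, 1, 1, 1, 1, 1};
        System.out.println(observedGaugeTransitions(timeline, 1, 5)); // prints 0
        System.out.println(observedElections(timeline, 5));           // prints 2
    }
}
```

With a five-second sampling period the gauge reads 1 at both samples, so the two elections are invisible, while the difference between the counter samples still reports both.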