Status

Current state: Accepted

Discussion thread: https://lists.apache.org/thread/wtx9qcr613gyqtm0bx8rlsckrg5pl276

JIRA: KAFKA-15183

Please keep the discussion on the mailing list rather than commenting on the wiki (wiki discussions get unwieldy fast).

...

The motivation of this KIP is to add more metrics for measuring the performance of Kafka in KRaft mode. Most of these metrics apply only to KRaft mode.

Public Interfaces

We propose adding the following new metrics. Most will be present only in KRaft mode; the ForwardingManager metrics will be present in both KRaft and ZK modes.

| Name | Context | Type | Mode | Description |
| --- | --- | --- | --- | --- |
| kafka.controller:type=KafkaController,name=TimedOutBrokerHeartbeatCount | Controller | Long | KRaft only | The number of broker heartbeats that timed out on this controller since the process was started. Note that only active controllers handle heartbeats, so only they will see increases in this metric. |
| kafka.controller:type=KafkaController,name=EventQueueOperationsStartedCount | Controller | Long | KRaft only | The total number of controller event queue operations that were started. This includes deferred operations. |
| kafka.controller:type=KafkaController,name=EventQueueOperationsTimedOutCount | Controller | Long | KRaft only | The total number of controller event queue operations that timed out before they could be performed. |
| kafka.controller:type=KafkaController,name=NewActiveControllersCount | Controller | Long | KRaft only | Counts the number of times this node has seen a new controller elected. A transition to the "no leader" state is not counted here. If the same controller as before becomes active, that still counts. |
| kafka.server:type=MetadataLoader,name=CurrentMetadataVersion | Broker and Controller | Integer | KRaft only | Outputs the feature level of the current effective metadata version. |
| kafka.server:type=MetadataLoader,name=HandleLoadSnapshotCount | Broker and Controller | Long | KRaft only | The total number of times we have loaded a KRaft snapshot since the process was started. |
| kafka.server:type=SnapshotEmitter,name=LatestSnapshotGeneratedBytes | Broker and Controller | Long | KRaft only | The total size in bytes of the latest snapshot that the node has generated. If none have been generated yet, this is the size of the latest snapshot that was loaded. If no snapshots have been generated or loaded, this is 0. |
| kafka.server:type=SnapshotEmitter,name=LatestSnapshotGeneratedAgeMs | Broker and Controller | Long | KRaft only | The time in milliseconds since the node generated its latest snapshot. If none have been generated yet, this is approximately the time delta since the process was started. |
| kafka.server:type=ForwardingManager,name=QueueTimeMs | Broker | Histogram | KRaft and ZK | A histogram describing the amount of time in milliseconds each admin request spends in the broker's forwarding manager queue, waiting to be sent to the controller. This does not include the time that the request spends waiting for a response from the controller. |
| kafka.server:type=ForwardingManager,name=QueueLength | Broker | Integer | KRaft and ZK | The current number of RPCs in the broker's forwarding manager queue, waiting to be sent to the controller. |
| kafka.server:type=ForwardingManager,name=RemoteTimeMs | Broker | Histogram | KRaft and ZK | A histogram describing the amount of time in milliseconds each request sent by the ForwardingManager spends waiting for a response. This does not include the time spent in the queue. |
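The metric names listed here follow the JMX object name syntax (a domain, then key=value properties). As a minimal sketch of how a monitoring tool might address one of them, the following parses a name with the standard `javax.management.ObjectName` class. The class name `MetricNameExample` and its helper method are hypothetical and exist only for illustration.

```java
import javax.management.MalformedObjectNameException;
import javax.management.ObjectName;

// Hypothetical helper, not part of Kafka: parses one of the proposed metric names.
final class MetricNameExample {
    // Metric name copied from the list of proposed metrics.
    static final String TIMED_OUT_HEARTBEATS =
        "kafka.controller:type=KafkaController,name=TimedOutBrokerHeartbeatCount";

    // Wraps the checked exception so callers do not need a throws clause.
    static ObjectName parse(String mbeanName) {
        try {
            return new ObjectName(mbeanName);
        } catch (MalformedObjectNameException e) {
            throw new IllegalArgumentException("Not a valid JMX object name: " + mbeanName, e);
        }
    }

    public static void main(String[] args) {
        ObjectName name = parse(TIMED_OUT_HEARTBEATS);
        System.out.println("domain = " + name.getDomain());
        System.out.println("type   = " + name.getKeyProperty("type"));
        System.out.println("name   = " + name.getKeyProperty("name"));
    }
}
```

Against a live process, the parsed name could be passed to `MBeanServerConnection.getAttribute`. Which attribute to request (for example `Value` for a gauge) is an assumption about how the metric is exported and should be checked against the running broker or controller.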

Implementation Notes

Lockless

To avoid an excessive performance impact, none of these new metrics will require taking a lock to read, apart from any locks inside the Yammer metrics library, the JMX implementation, and so on.
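One common way to get lock-free reads is to back each counter with an atomic variable that the metrics reporter reads directly. The sketch below illustrates that idea only; it is not the implementation proposed by this KIP. The class `LocklessCounterSketch` and its methods are hypothetical, and a plain `LongSupplier` stands in for a metrics-library gauge.

```java
import java.util.concurrent.atomic.AtomicLong;
import java.util.function.LongSupplier;

// Hypothetical illustration of a counter that can be read without taking a lock.
final class LocklessCounterSketch {
    // Updated by the thread that detects a timed-out heartbeat.
    private final AtomicLong timedOutHeartbeats = new AtomicLong();

    // Called on the write path; an atomic increment, no lock taken.
    void recordTimedOutHeartbeat() {
        timedOutHeartbeats.incrementAndGet();
    }

    // Stand-in for a gauge: the reporter thread calls getAsLong() to read the value.
    LongSupplier gauge() {
        return timedOutHeartbeats::get;
    }

    public static void main(String[] args) {
        LocklessCounterSketch metrics = new LocklessCounterSketch();
        metrics.recordTimedOutHeartbeat();
        metrics.recordTimedOutHeartbeat();
        System.out.println("timed out heartbeats = " + metrics.gauge().getAsLong());
    }
}
```

Reading the gauge is a single volatile read, so a metrics reporter polling it does not contend with the thread doing the updates.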

...

This metric is useful to monitor because when broker heartbeats are timing out, that indicates a performance problem on the active controller.

...

EventQueueOperationsStarted

This is a rough measure of how busy the controller is. This lets us know how many operations per second different quorum controller clusters can perform.
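Because the metric is a running total, an operations-per-second figure comes from the difference between two samples divided by the time between them. The following is a minimal sketch of that arithmetic; the class and method names are hypothetical, and it assumes the controller process did not restart between the two samples (a restart would reset the count).

```java
// Hypothetical helper: derives a rate from two samples of a monotonically increasing counter.
final class OperationsRateSketch {
    // Returns operations per second between two (count, timestamp in ms) samples.
    static double operationsPerSecond(long firstCount, long firstTimeMs,
                                      long secondCount, long secondTimeMs) {
        long elapsedMs = secondTimeMs - firstTimeMs;
        if (elapsedMs <= 0) {
            throw new IllegalArgumentException("second sample must be later than the first");
        }
        return (secondCount - firstCount) * 1000.0 / elapsedMs;
    }

    public static void main(String[] args) {
        // Two samples taken 60 seconds apart, 3000 operations started in between.
        double rate = operationsPerSecond(1_000L, 0L, 4_000L, 60_000L);
        System.out.println("operations per second = " + rate);
    }
}
```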

...