...
- metrics - a comma-separated list of metric name prefixes, e.g., "client.producer.partition., client.io.wait". Whitespaces are ignored.
- interval - metrics push interval in milliseconds. Defaults to 5 minutes if not specified.
- match_<selector> - Client matching selector that is evaluated as an anchored regexp (i.e., "something.*" is treated as "^something.*$"). Any client that matches all of the match_.. selectors will be eligible for this metrics subscription. Initially supported selectors:
  - client_instance_id - CLIENT_INSTANCE_ID UUID string representation.
  - client_software_name - client software implementation name.
  - client_software_version - client software implementation version.
  - client_source_address - client connection's source address from the broker's point of view.
  - client_source_port - client connection's source port from the broker's point of view.
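The matching rules above can be sketched as follows. This is a minimal illustration, not the broker implementation: the names `parse_metric_prefixes` and `client_matches`, the dict-based representation of selectors and client attributes, and the use of Python's `re.fullmatch` to get anchored matching are all assumptions made for the example.

```python
from __future__ import annotations

import re

# 5 minutes, the default push interval when `interval` is not specified.
DEFAULT_INTERVAL_MS = 5 * 60 * 1000


def parse_metric_prefixes(value: str) -> list[str]:
    """Split the comma-separated `metrics` value, dropping surrounding whitespace."""
    return [p.strip() for p in value.split(",") if p.strip()]


def client_matches(selectors: dict[str, str], client: dict[str, str]) -> bool:
    """Return True only if the client matches ALL selector patterns.

    `selectors` maps a selector name (without the "match_" prefix, an
    assumption of this sketch) to a regexp; `client` maps the same names
    to the client's attribute values.
    """
    for name, pattern in selectors.items():
        value = client.get(name)
        # fullmatch anchors the pattern at both ends, like "^pattern$".
        if value is None or re.fullmatch(pattern, value) is None:
            return False
    return True


# Hypothetical client attributes, for illustration only.
client = {"client_software_name": "apache-kafka-java"}
print(client_matches({"client_software_name": "apache-kafka-.*"}, client))  # True
print(client_matches({"client_software_name": "kafka"}, client))  # False
```

The second call returns False because an anchored pattern must cover the whole value, not just a substring of it.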
Example using the standard kafka-configs.sh tool:
...
Broker-added labels
The following labels should be added by the broker plugin as metrics are received:
Label name | Description |
client_instance_id | The generated CLIENT_INSTANCE_ID. |
client_id | client.id as reported in the Kafka protocol header. |
client_software_name | The client’s implementation name as reported in ApiVersionRequest. |
client_software_version | The client’s version as reported in ApiVersionRequest. |
client_source_address | The client connection’s source address. |
client_source_port | The client connection’s source port. |
principal | Client’s security principal. Content depends on authentication method. |
broker_id | Receiving broker’s node-id. |
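As a rough sketch of how a plugin might attach these labels, the example below builds the label set from what the broker knows about the connection. The `ConnectionContext` type and both function names are hypothetical and not part of any Kafka API. Letting broker-added labels override client-supplied labels of the same name is a design choice of this sketch, not something the text above specifies.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class ConnectionContext:
    """Hypothetical view of what the broker knows about a pushing client."""
    client_instance_id: str
    client_id: str
    client_software_name: str
    client_software_version: str
    client_source_address: str
    client_source_port: int
    principal: str
    broker_id: int


def broker_labels(ctx: ConnectionContext) -> dict[str, str]:
    """Build the broker-added labels for one received metrics push."""
    return {
        "client_instance_id": ctx.client_instance_id,
        "client_id": ctx.client_id,
        "client_software_name": ctx.client_software_name,
        "client_software_version": ctx.client_software_version,
        "client_source_address": ctx.client_source_address,
        "client_source_port": str(ctx.client_source_port),
        "principal": ctx.principal,
        "broker_id": str(ctx.broker_id),
    }


def with_broker_labels(metric_labels: dict[str, str], ctx: ConnectionContext) -> dict[str, str]:
    """Merge labels; broker-added values win on name clashes (sketch assumption)."""
    merged = dict(metric_labels)
    merged.update(broker_labels(ctx))
    return merged
```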
...
Before sending the alert to the incident management system, the monitoring system collects a set of labels that are associated with this CLIENT_INSTANCE_ID, such as:
- client.id
- client_source_address and client_source_port on broker id X (1 or more such mappings based on how many connections the client has used to push metrics).
- principal
- tenant
- client_software_name and client_software_version
- In case of consumer: group_id, group_instance_id (if configured) and the latest known group_member_id.
- In case of transactional producer: transactional_id
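One way a monitoring system could gather these labels is to keep an index keyed by client_instance_id and update it on every metrics push. The sketch below is illustrative only: `ClientLabelIndex` and its methods are hypothetical names, and it assumes each push arrives as a flat dict of string labels that includes the broker-added ones.

```python
from __future__ import annotations

from collections import defaultdict

# Labels that describe one connection rather than the client as a whole.
CONNECTION_KEYS = ("broker_id", "client_source_address", "client_source_port")


class ClientLabelIndex:
    """Hypothetical in-memory index: client_instance_id -> labels seen so far."""

    def __init__(self) -> None:
        self._labels: dict[str, dict[str, str]] = defaultdict(dict)
        self._connections: dict[str, set[tuple[str, str, str]]] = defaultdict(set)

    def observe(self, labels: dict[str, str]) -> None:
        """Record the labels carried by one metrics push."""
        cid = labels["client_instance_id"]
        # A client may push over several connections; keep every mapping.
        self._connections[cid].add(tuple(labels[k] for k in CONNECTION_KEYS))
        for key, value in labels.items():
            if key not in CONNECTION_KEYS:
                # Latest value wins, e.g. for the latest known group_member_id.
                self._labels[cid][key] = value

    def alert_context(self, cid: str) -> dict:
        """Return the labels to attach to an alert for this client instance."""
        context: dict = dict(self._labels.get(cid, {}))
        context["connections"] = sorted(self._connections.get(cid, set()))
        return context
```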
...
The Kafka cluster configuration for metrics collection (i.e., metrics subscriptions) is irrelevant to this use-case, given that a proper metrics plugin is enabled on the brokers. The metrics plugin is configured to write metrics to a topic. A support system with an interactive interface reads from this metrics topic and has an Admin client to configure the cluster with the desired metrics subscriptions.
The application owner reports a lagging consumer that is not able to keep up with the incoming message rate and asks for the Kafka operator to help troubleshoot. The application owner, who unfortunately does not know the client instance id of the consumer, provides the client.id, userid, and source address.
The Kafka operator adds a metrics subscription for metrics matching the prefix “org.apache.kafka.client.consumer.”, with the corresponding client_id and source_address as metrics matching selectors. Since this is a live troubleshooting case, the metrics push interval is set to a low 10 seconds.
...
Upon the next PushTelemetryRequest, which now includes the subscribed metrics, the metrics are written to the output topic and the PushIntervalMs is adjusted to the configured interval of 10 seconds. This repeats until the metrics subscription configuration is changed.
Multiple consumers from the same source address may now be pushing metrics to the cluster. The support system starts receiving the metrics and soon finds a metric push from the desired client.id, which provides a mapping from client.id to client_instance_id. At this point the metrics subscription may be altered to only match the client_instance_id of the matching client. In either case the support system, seeing that there is an active viewer for the metrics matching the given client.id, displays the metrics to the operator.
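The lookup and the optional narrowing step can be sketched as below. This is an assumption-laden illustration: the function names are hypothetical, pushes are modeled as dicts of labels, and the subscription is modeled as a dict whose selector keys follow the match_<selector> naming pattern.

```python
from __future__ import annotations

from typing import Iterable, Optional


def find_client_instance_id(pushes: Iterable[dict], client_id: str) -> Optional[str]:
    """Return the client_instance_id of the first push carrying the given client_id."""
    for labels in pushes:
        if labels.get("client_id") == client_id:
            return labels.get("client_instance_id")
    return None


def narrow_subscription(subscription: dict, client_instance_id: str) -> dict:
    """Copy the subscription, replacing every match_ selector with the instance id."""
    narrowed = {k: v for k, v in subscription.items() if not k.startswith("match_")}
    # A UUID holds only hex digits and hyphens, so it is already a literal
    # pattern when evaluated as an anchored regexp.
    narrowed["match_client_instance_id"] = client_instance_id
    return narrowed
```

Narrowing returns a new dict rather than mutating the original, so the support system can fall back to the broader subscription if the client reconnects with a new instance id.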
The operator identifies an increasing trend in client.consumer.processing.time, which indicates slow per-message processing in the application, and reports this back to the application owner, ruling out the client and Kafka cluster from the problem space.
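If the support system wanted to flag such a trend automatically, one simple option is a least-squares slope over the most recent samples. The sketch below is a hypothetical helper, not something the proposal describes. It assumes the samples are evenly spaced, one per push interval.

```python
from __future__ import annotations


def slope(samples: list[float]) -> float:
    """Least-squares slope of evenly spaced samples, in units per push interval."""
    n = len(samples)
    if n < 2:
        return 0.0  # not enough points to define a trend
    mean_x = (n - 1) / 2
    mean_y = sum(samples) / n
    numerator = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(samples))
    denominator = sum((x - mean_x) ** 2 for x in range(n))
    return numerator / denominator


def is_increasing(samples: list[float], threshold: float = 0.0) -> bool:
    """Treat a slope above the threshold (an arbitrary default here) as a rising trend."""
    return slope(samples) > threshold
```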
...