...
Motivation
Being able to centrally, proactively, and reactively monitor and troubleshoot problems with Kafka clients is becoming increasingly important as the use of Kafka expands within organizations and across hosted Kafka services. The typical Kafka client user is now an application owner with little experience operating Kafka clients, while the cluster operator has profound Kafka knowledge but little insight into the client application.
...
Metrics will be named in a hierarchical format:
<namespace>/.<metric.name>
E.g.:
org.apache.kafka/.client.producer.partition.queue.bytes
The metric namespace is org.apache.kafka for the standard metrics, vendor/implementation specific metrics may be added in separate namespaces, e.g:
librdkafka/.client.producer.xmitq.latency
io.confluent/.client.python.object.count
The Apache Kafka Java client would provide implementation specific metrics in:
org.apache.kafka/.client.java.producer.socket.buffer.full.count
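The namespacing rule above can be sketched with a small, hypothetical helper (the class, method, and allow-list below are illustrative, not part of this KIP) that builds fully qualified metric names and rejects non-standard metrics placed under the reserved org.apache.kafka namespace:

```java
import java.util.Set;

public class MetricNames {
    // Reserved namespace for standard metrics, per this KIP.
    static final String STANDARD_NS = "org.apache.kafka";

    // Hypothetical allow-list standing in for the KIP-approved metric set.
    static final Set<String> STANDARD_METRICS =
        Set.of("client.producer.partition.queue.bytes");

    /** Joins a namespace and metric name using the <namespace>/.<metric.name> format. */
    static String qualify(String namespace, String metric) {
        if (STANDARD_NS.equals(namespace) && !STANDARD_METRICS.contains(metric)) {
            throw new IllegalArgumentException(
                "Non-standard metric under reserved namespace: " + metric);
        }
        return namespace + "/." + metric;
    }

    public static void main(String[] args) {
        // Standard metric in the reserved namespace.
        System.out.println(qualify(STANDARD_NS, "client.producer.partition.queue.bytes"));
        // Vendor-specific metric in its own namespace is always allowed.
        System.out.println(qualify("librdkafka", "client.producer.xmitq.latency"));
    }
}
```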
...
All standard metrics are prefixed with “org.apache.kafka/.”; this prefix is omitted from the following tables for brevity.
A client implementation must not add additional metrics or labels under the “org.apache.kafka/.” prefix without a corresponding accepted KIP.
...
The default metrics collection on the client must take extra care not to expose any information about the application and the system it runs on, as such information may identify internal projects, infrastructure, etc., that the user/customer may not want to expose to the Kafka infrastructure owner. This includes information such as hostname, operating system, credentials, runtime environment, etc.
Pushing these types of metrics, in particular the runtime environment, could on the other hand be valuable in troubleshooting, and may therefore be provided as an opt-in configuration property. These types of metrics are referred to as private metrics and must not be enabled by default. The configuration property to enable them is enable.private.metrics.push=true. This configuration property, the private metrics and labels themselves, and the related host metrics toggle (enable.telemetry.host.metrics=false) are optional to implement. The metrics and labels covered by this constraint are indicated in-place in the following tables.
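The opt-in gate can be sketched as follows. This is a minimal illustration, not the client implementation: only the enable.private.metrics.push property name is taken from this KIP, while the helper class and the hard-coded label values are assumptions.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Properties;

public class PrivateMetricsGate {
    /**
     * Returns the resource labels to attach to pushed metrics.
     * Private labels (hostname, os, runtime) are only included when the
     * user has opted in via enable.private.metrics.push=true.
     */
    static Map<String, String> resourceLabels(Properties config) {
        Map<String, String> labels = new HashMap<>();
        labels.put("client_software_name", "apache-kafka-java"); // always safe to emit
        boolean privateMetrics =
            Boolean.parseBoolean(config.getProperty("enable.private.metrics.push", "false"));
        if (privateMetrics) {
            // Illustrative values; a real client would query the OS/runtime.
            labels.put("hostname", "host.example.internal");
            labels.put("os", System.getProperty("os.name"));
            labels.put("runtime", "JVM " + System.getProperty("java.version"));
        }
        return labels;
    }

    public static void main(String[] args) {
        Properties p = new Properties(); // opt-in not set: defaults to false
        System.out.println(resourceLabels(p).containsKey("hostname")); // prints false
    }
}
```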
OpenTelemetry specifies a range of relevant metrics.
Metric types
The metric types in the following tables correspond to the OpenTelemetry v1 metrics protobuf message types. A short summary:
...
The “partition” label should be “unassigned” for messages that have not yet been partitioned, as they are not yet assigned to a partition queue.
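A minimal sketch of deriving the “partition” label value; the sentinel constant and method name are assumptions for illustration only:

```java
public class PartitionLabel {
    // Assumed internal sentinel for a message not yet assigned to a partition.
    static final int UNASSIGNED = -1;

    /** Maps an internal partition id to the "partition" metric label value. */
    static String partitionLabel(int partition) {
        return partition == UNASSIGNED ? "unassigned" : Integer.toString(partition);
    }

    public static void main(String[] args) {
        System.out.println(partitionLabel(UNASSIGNED)); // prints unassigned
        System.out.println(partitionLabel(3));          // prints 3
    }
}
```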
Host process metrics (optional)
These metrics are optional to implement and provide runtime information about the operating system process the client runs in.
Metric name | Type | Labels | Description |
client.process.memory.bytes | Gauge | | Current process/runtime memory usage (RSS, not virtual). |
client.process.cpu.user.time | Sum | | User CPU time used (seconds). |
client.process.cpu.system.time | Sum | | System CPU time used (seconds). |
client.process.io.wait.time | Sum | | IO wait time (seconds). |
client.process.pid | Gauge | | The process id. Can be used, in conjunction with the client host name, to map multiple client instances to the same process. Only emitted if private metrics are enabled. |
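On the JVM, some of these values are portably available while others are not; this sketch (class and method names are illustrative) shows what the standard library offers. True RSS and the user/system CPU time split require platform-specific APIs (e.g. /proc on Linux), so only an approximation is shown here:

```java
import java.lang.management.ManagementFactory;

public class HostProcessMetrics {
    /** Gauge: the process id (only pushed when private metrics are enabled). */
    static long pid() {
        return ProcessHandle.current().pid(); // Java 9+
    }

    /**
     * Gauge: approximate process memory usage. The JVM only exposes heap
     * usage portably; true RSS requires platform-specific APIs.
     */
    static long memoryBytes() {
        Runtime rt = Runtime.getRuntime();
        return rt.totalMemory() - rt.freeMemory();
    }

    public static void main(String[] args) {
        System.out.println("client.process.pid = " + pid());
        System.out.println("client.process.memory.bytes ~= " + memoryBytes());
        // The user/system CPU time split is platform-specific; HotSpot exposes
        // total process CPU time via com.sun.management.OperatingSystemMXBean.
        System.out.println("uptime ms = " + ManagementFactory.getRuntimeMXBean().getUptime());
    }
}
```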
...
Label name | Description |
client_software_name | The client’s implementation name. |
client_software_version | The client’s version. |
client_instance_id | The generated CLIENT_INSTANCE_ID. |
client_id | client.id |
application_id | application.id (Kafka Streams only) |
client_rack | client.rack (if configured) |
group_id | group.id (consumer) |
group_instance_id | group.instance.id (consumer) |
group_member_id | Group member id (if any, consumer) |
transactional_id | transactional.id (producer) |
hostname | Hostname of the client machine. Only emitted if private metrics are enabled. |
os | Operating system name, version, architecture, distro, etc. Only emitted if private metrics are enabled. |
runtime | Runtime environment, e.g., the JVM version, .NET runtime, Python interpreter version, etc. Only emitted if private metrics are enabled. |
Broker-added labels
The following labels are added by the broker as metrics are received:
...
```
bin/kafka-client-metrics.sh [arguments] --bootstrap-server <brokers>

List configured metrics:
  --list [--id <client-instance-id|prefix-match>]

Add metrics:
  --add
  --id <client-instance-id|prefix-match>
  --metric <metric-prefix>..
  --interval-ms <interval>

Delete metrics:
  --delete
  --id <client-instance-id|prefix-match>
  [--metric <metric-prefix>]..

Example:
  # Subscribe to producer partition queue and memory usage
  # metrics every 60s from all librdkafka clients.
  $ kafka-client-metrics.sh --bootstrap-server localhost:9092 \
      --add \
      --id rdkafka \
      --metric org.apache.kafka/.client.producer.partition. \
      --metric librdkafka/.client.memory. \
      --interval-ms 60000

  # The metrics themselves are not viewable with this CLI tool
  # since the storage of metrics is plugin-dependent.
```
...
The monitoring system detects an anomaly for CLIENT_INSTANCE_ID=java-producer-1234’s metric org.apache.kafka/.client.producer.partition.queue.latency which for more than 180 seconds has exceeded the threshold of 5000 milliseconds.
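The detection logic in this scenario (latency above 5000 ms sustained for more than 180 seconds) can be sketched as a simple state machine. This is an illustration of the monitoring-side check, not part of the KIP; all names are hypothetical:

```java
public class LatencyAnomalyDetector {
    // Thresholds taken from the scenario above.
    static final double THRESHOLD_MS = 5000.0;
    static final long SUSTAINED_MS = 180_000;

    private long breachStart = -1; // -1 means not currently above threshold

    /** Feeds one latency sample; returns true once the breach has lasted >= 180s. */
    boolean observe(long nowMs, double latencyMs) {
        if (latencyMs <= THRESHOLD_MS) {
            breachStart = -1; // back under threshold: reset
            return false;
        }
        if (breachStart < 0) breachStart = nowMs; // breach begins
        return nowMs - breachStart >= SUSTAINED_MS;
    }

    public static void main(String[] args) {
        LatencyAnomalyDetector d = new LatencyAnomalyDetector();
        System.out.println(d.observe(0, 6000));        // prints false (breach just began)
        System.out.println(d.observe(180_000, 7000));  // prints true (sustained 180s)
    }
}
```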
...
The Kafka operator adds a metrics subscription for metrics matching the prefix “org.apache.kafka/.client.consumer.” and with the corresponding client.id as resource-name prefix. Since this is a live troubleshooting case, the metrics push interval is set to a low 10 seconds.
...