Comment: Updates following Eric Sirianni's comments

...

Motivation

Being able to centrally monitor and troubleshoot problems with Kafka clients, both proactively and reactively, is becoming increasingly important as the use of Kafka expands within organizations as well as for hosted Kafka services. The typical Kafka client user is now an application owner with little experience in operating Kafka clients, while the cluster operator has profound Kafka knowledge but little insight into the client application.

...

Metrics will be named in a hierarchical format:

<namespace>.<metric.name>

E.g.:

org.apache.kafka.client.producer.partition.queue.bytes

The metric namespace is org.apache.kafka for the standard metrics; vendor- or implementation-specific metrics may be added in separate namespaces, e.g.:

librdkafka.client.producer.xmitq.latency
io.confluent.client.python.object.count

The Apache Kafka Java client would provide implementation-specific metrics in:

org.apache.kafka.client.java.producer.socket.buffer.full.count
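
As a rough illustration of the naming scheme, the sketch below joins a namespace and a metric name and checks whether a full name falls under the standard namespace. It assumes the two parts are separated by a single dot; the helper names full_metric_name and in_standard_namespace are hypothetical and not part of this proposal.

```python
# Hypothetical helpers illustrating the hierarchical naming scheme.
# Assumption: namespace and metric name are joined by a single ".".
STANDARD_NAMESPACE = "org.apache.kafka"


def full_metric_name(namespace: str, metric_name: str) -> str:
    """Join a namespace and a metric name into one hierarchical name."""
    return f"{namespace}.{metric_name}"


def in_standard_namespace(name: str) -> bool:
    """Return True if the full metric name lives under org.apache.kafka."""
    return name.startswith(STANDARD_NAMESPACE + ".")


standard = full_metric_name(STANDARD_NAMESPACE, "client.producer.partition.queue.bytes")
vendor = full_metric_name("librdkafka", "client.producer.xmitq.latency")
print(standard, in_standard_namespace(standard))
print(vendor, in_standard_namespace(vendor))
```

A check like in_standard_namespace could, for instance, be used in a client's test suite to flag metrics added under the standard namespace by mistake.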

...

All standard metrics are prefixed with “org.apache.kafka.”; this prefix is omitted from the following tables for brevity.

A client implementation must not add additional metrics or labels under the “org.apache.kafka.” prefix without a corresponding accepted KIP.

...

The default metrics collection on the client must take extra care not to expose any information about the application and the system it runs on, as this may identify internal projects, infrastructure, etc., that the user/customer may not want to expose to the Kafka infrastructure owner. This includes information such as hostname, operating system, credentials, runtime environment, etc.

Pushing these types of metrics, in particular the runtime environment, could on the other hand be valuable in troubleshooting and may be provided through an opt-in configuration property. These types of metrics are referred to as private metrics and must not be enabled by default. The configuration property to enable these metrics is enable.private.metrics.push=true. This configuration property and these private metrics and labels are optional to implement. The metrics and labels covered by this constraint are indicated in-place in the following tables.
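
A minimal sketch of how a client might withhold private labels unless the user has opted in is shown below. The set of private labels is taken from the labels marked "Only emitted if private metrics are enabled" in this document (hostname, os, runtime); the function name filter_labels and the private_metrics_enabled flag are hypothetical.

```python
# Hypothetical sketch: drop private labels unless the user opted in.
# PRIVATE_LABELS mirrors the labels this document marks as
# "Only emitted if private metrics are enabled".
PRIVATE_LABELS = frozenset({"hostname", "os", "runtime"})


def filter_labels(labels: dict, private_metrics_enabled: bool = False) -> dict:
    """Return the labels that may be pushed to the broker."""
    if private_metrics_enabled:
        return dict(labels)
    return {k: v for k, v in labels.items() if k not in PRIVATE_LABELS}


labels = {
    "client_id": "orders-producer",
    "client_rack": "rack-a",
    "hostname": "build-42.internal.example",
    "os": "Linux",
}
print(filter_labels(labels))                                # default: private labels removed
print(filter_labels(labels, private_metrics_enabled=True))  # opt-in: everything kept
```

Defaulting the flag to False keeps the safe behaviour when the configuration is absent.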

OpenTelemetry specifies a range of relevant metrics:

Metric types

The metric types in the following tables correspond to the OpenTelemetry v1 metrics protobuf message types. A short summary:

...


The “partition” label should be “unassigned” for not yet partitioned messages, as they are not yet assigned to a partition queue.


Host process metrics (optional)

These metrics are optional to implement and provide runtime information about the operating system process the client runs in.

| Metric name | Type | Labels | Description |
| --- | --- | --- | --- |
| client.process.memory.bytes | Gauge | | Current process/runtime memory usage (RSS, not virtual). |
| client.process.cpu.user.time | Sum | | User CPU time used (seconds). |
| client.process.cpu.system.time | Sum | | System CPU time used (seconds). |
| client.process.io.wait.time | Sum | | IO wait time (seconds). |
| client.process.pid | Gauge | | The process id. Can be used, in conjunction with the client host name, to map multiple client instances to the same process. Only emitted if private metrics are enabled. |

...

| Label name | Description |
| --- | --- |
| client_software_name | The client’s implementation name. |
| client_software_version | The client’s version. |
| client_instance_id | The generated CLIENT_INSTANCE_ID. |
| client_id | client.id |
| application_id | application.id (Kafka Streams only) |
| client_rack | client.rack (if configured) |
| group_id | group.id (consumer) |
| group_instance_id | group.instance.id (consumer) |
| group_member_id | Group member id (if any, consumer) |
| transactional_id | transactional.id (producer) |
| hostname | Hostname of the client machine. Only emitted if private metrics are enabled. |
| os | Operating system name, version, architecture, distro, etc. Only emitted if private metrics are enabled. |
| runtime | Runtime environment, e.g., the JVM version, .NET runtime, Python interpreter version, etc. Only emitted if private metrics are enabled. |

Broker-added labels

The following labels are added by the broker as metrics are received:

...

bin/kafka-client-metrics.sh [arguments] --bootstrap-server <brokers>

List configured metrics:
	--list
	[--id <client-instance-id|prefix-match>]

Add metrics:
	--add
	 --id <client-instance-id|prefix-match>
	 --metric <metric-prefix>..
	 --interval-ms <interval>

Delete metrics:
	--delete
	 --id <client-instance-id|prefix-match>
	[--metric <metric-prefix>]..

Example:

# Subscribe to producer partition queue and memory usage
# metrics every 60s from all librdkafka clients.

$ kafka-client-metrics.sh --bootstrap-server localhost:9092 \
	--add \
	--id rdkafka \
	--metric org.apache.kafka.client.producer.partition. \
	--metric librdkafka.client.memory. \
	--interval-ms 60000

# The metrics themselves are not viewable with this CLI tool
# since the storage of metrics is plugin-dependent.
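
The prefix matching implied by the --id and --metric arguments can be sketched as follows. This is only an illustration, assuming plain string prefix matching on both the client instance id and the metric name, and dot-separated metric names; the Subscription class and the metrics_to_push function are hypothetical and do not describe the broker's actual implementation.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Subscription:
    """Hypothetical in-memory form of one metrics subscription."""
    id_prefix: str              # value passed to --id
    metric_prefixes: List[str]  # values passed to --metric
    interval_ms: int            # value passed to --interval-ms


def metrics_to_push(sub: Subscription, client_instance_id: str,
                    available: List[str]) -> List[str]:
    """Return the available metrics this client should push for the subscription."""
    if not client_instance_id.startswith(sub.id_prefix):
        return []  # subscription does not apply to this client
    return [m for m in available
            if any(m.startswith(p) for p in sub.metric_prefixes)]


sub = Subscription(
    id_prefix="rdkafka",
    metric_prefixes=["org.apache.kafka.client.producer.partition.",
                     "librdkafka.client.memory."],
    interval_ms=60000,
)
available = [
    "org.apache.kafka.client.producer.partition.queue.bytes",
    "org.apache.kafka.client.consumer.poll.interval",  # made-up name for the example
    "librdkafka.client.memory.bytes",                  # made-up name for the example
]
print(metrics_to_push(sub, "rdkafka-7f3a", available))
print(metrics_to_push(sub, "java-producer-1234", available))
```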

...

The monitoring system detects an anomaly for CLIENT_INSTANCE_ID=java-producer-1234: its metric org.apache.kafka.client.producer.partition.queue.latency has exceeded the threshold of 5000 milliseconds for more than 180 seconds.
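
The kind of rule described here, a value staying above a threshold for longer than a given duration, could be checked roughly as in the sketch below. How a real monitoring system evaluates such rules is outside this proposal; the function name sustained_breach and the sample data are made up for illustration.

```python
from typing import Iterable, Tuple


def sustained_breach(samples: Iterable[Tuple[float, float]],
                     threshold: float, min_duration_s: float) -> bool:
    """Return True if the value stayed above `threshold` continuously
    for more than `min_duration_s` seconds.

    `samples` is an iterable of (timestamp_seconds, value), sorted by time.
    """
    breach_start = None
    for ts, value in samples:
        if value > threshold:
            if breach_start is None:
                breach_start = ts       # first sample of a new breach
            if ts - breach_start > min_duration_s:
                return True
        else:
            breach_start = None         # breach interrupted, start over
    return False


# Queue latency in milliseconds, sampled every 60 seconds (made-up data).
latency = [(0, 6000), (60, 6200), (120, 7000), (180, 6500), (240, 6100)]
print(sustained_breach(latency, threshold=5000, min_duration_s=180))
```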

...

The Kafka operator adds a metrics subscription for metrics matching the prefix “org.apache.kafka.client.consumer.” and with the corresponding client.id as resource-name prefix. Since this is a live troubleshooting case, the metrics push interval is set to a low 10 seconds.

...