...
Motivation
Being able to centrally, proactively, and reactively monitor and troubleshoot problems with Kafka clients is becoming increasingly important as the use of Kafka expands within organizations and across hosted Kafka services. The typical Kafka client user is now an application owner with little experience operating Kafka clients, while the cluster operator has profound Kafka knowledge but little insight into the client application.
...
Metrics will be named in a hierarchical format:
<namespace>/.<metric.name>
E.g.:
org.apache.kafka/.client.producer.partition.queue.bytes
The metric namespace is org.apache.kafka for the standard metrics, vendor/implementation specific metrics may be added in separate namespaces, e.g:
librdkafka/.client.producer.xmitq.latency
io.confluent/.client.python.object.count
The Apache Kafka Java client would provide implementation specific metrics in:
org.apache.kafka/.client.java.producer.socket.buffer.full.count
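The namespacing rule above can be sketched with a small, hypothetical helper (the class, method, and allow-list below are illustrative, not part of this KIP) that builds fully qualified metric names and rejects non-standard metrics placed under the reserved org.apache.kafka namespace:

```java
import java.util.Set;

public class MetricNames {
    // Reserved namespace for standard metrics, per this KIP.
    static final String STANDARD_NS = "org.apache.kafka";

    // Hypothetical allow-list standing in for the KIP-approved metric set.
    static final Set<String> STANDARD_METRICS =
        Set.of("client.producer.partition.queue.bytes");

    /** Joins a namespace and metric name using the <namespace>/.<metric.name> format. */
    static String qualify(String namespace, String metric) {
        if (STANDARD_NS.equals(namespace) && !STANDARD_METRICS.contains(metric)) {
            throw new IllegalArgumentException(
                "Non-standard metric under reserved namespace: " + metric);
        }
        return namespace + "/." + metric;
    }

    public static void main(String[] args) {
        // Standard metric in the reserved namespace.
        System.out.println(qualify(STANDARD_NS, "client.producer.partition.queue.bytes"));
        // Vendor-specific metric in its own namespace is always allowed.
        System.out.println(qualify("librdkafka", "client.producer.xmitq.latency"));
    }
}
```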
...
All standard metrics are prefixed with “org.apache.kafka/.”; this prefix is omitted from the following tables for brevity.
A client implementation must not add additional metrics or labels under the “org.apache.kafka/.” prefix without a corresponding accepted KIP.
...
The default metrics collection on the client must take extra care not to expose any information about the application and the system it runs on, as such information may identify internal projects, infrastructure, etc., that the user/customer may not want to expose to the Kafka infrastructure owner. This includes information such as hostname, operating system, credentials, runtime environment, etc.
Pushing these types of metrics, in particular the runtime environment, could on the other hand be valuable in troubleshooting, and may therefore be provided as an opt-in configuration property. These types of metrics are referred to as private metrics and must not be enabled by default. The configuration property to enable them is enable.private.metrics.push=true. This configuration property, the private metrics and labels themselves, and the related host metrics toggle (enable.telemetry.host.metrics=false) are optional to implement. The metrics and labels covered by this constraint are indicated in-place in the following tables.
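The opt-in gate can be sketched as follows. This is a minimal illustration, not the client implementation: only the enable.private.metrics.push property name is taken from this KIP, while the helper class and the hard-coded label values are assumptions.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Properties;

public class PrivateMetricsGate {
    /**
     * Returns the resource labels to attach to pushed metrics.
     * Private labels (hostname, os, runtime) are only included when the
     * user has opted in via enable.private.metrics.push=true.
     */
    static Map<String, String> resourceLabels(Properties config) {
        Map<String, String> labels = new HashMap<>();
        labels.put("client_software_name", "apache-kafka-java"); // always safe to emit
        boolean privateMetrics =
            Boolean.parseBoolean(config.getProperty("enable.private.metrics.push", "false"));
        if (privateMetrics) {
            // Illustrative values; a real client would query the OS/runtime.
            labels.put("hostname", "host.example.internal");
            labels.put("os", System.getProperty("os.name"));
            labels.put("runtime", "JVM " + System.getProperty("java.version"));
        }
        return labels;
    }

    public static void main(String[] args) {
        Properties p = new Properties(); // opt-in not set: defaults to false
        System.out.println(resourceLabels(p).containsKey("hostname")); // prints false
    }
}
```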
OpenTelemetry specifies a range of relevant metrics.
Metric types
The metric types in the following tables correspond to the OpenTelemetry v1 metrics protobuf message types. A short summary:
...
The “partition” label should be “unassigned” for messages that have not yet been partitioned, as they are not yet assigned to a partition queue.
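A minimal sketch of deriving the “partition” label value; the sentinel constant and method name are assumptions for illustration only:

```java
public class PartitionLabel {
    // Assumed internal sentinel for a message not yet assigned to a partition.
    static final int UNASSIGNED = -1;

    /** Maps an internal partition id to the "partition" metric label value. */
    static String partitionLabel(int partition) {
        return partition == UNASSIGNED ? "unassigned" : Integer.toString(partition);
    }

    public static void main(String[] args) {
        System.out.println(partitionLabel(UNASSIGNED)); // prints unassigned
        System.out.println(partitionLabel(3));          // prints 3
    }
}
```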
Host process metrics (optional)
These metrics are optional to implement and provide runtime information about the operating system process the client runs in.
Metric name | Type | Labels | Description |
client.process.memory.bytes | Gauge | | Current process/runtime memory usage (RSS, not virtual). |
client.process.cpu.user.time | Sum | | User CPU time used (seconds). |
client.process.cpu.system.time | Sum | | System CPU time used (seconds). |
client.process.io.wait.time | Sum | | IO wait time (seconds). |
client.process.pid | Gauge | | The process id. Can be used, in conjunction with the client host name, to map multiple client instances to the same process. Only emitted if private metrics are enabled. |
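On the JVM, some of these values are portably available while others are not; this sketch (class and method names are illustrative) shows what the standard library offers. True RSS and the user/system CPU time split require platform-specific APIs (e.g. /proc on Linux), so only an approximation is shown here:

```java
import java.lang.management.ManagementFactory;

public class HostProcessMetrics {
    /** Gauge: the process id (only pushed when private metrics are enabled). */
    static long pid() {
        return ProcessHandle.current().pid(); // Java 9+
    }

    /**
     * Gauge: approximate process memory usage. The JVM only exposes heap
     * usage portably; true RSS requires platform-specific APIs.
     */
    static long memoryBytes() {
        Runtime rt = Runtime.getRuntime();
        return rt.totalMemory() - rt.freeMemory();
    }

    public static void main(String[] args) {
        System.out.println("client.process.pid = " + pid());
        System.out.println("client.process.memory.bytes ~= " + memoryBytes());
        // The user/system CPU time split is platform-specific; HotSpot exposes
        // total process CPU time via com.sun.management.OperatingSystemMXBean.
        System.out.println("uptime ms = " + ManagementFactory.getRuntimeMXBean().getUptime());
    }
}
```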
...
Label name | Description |
client_software_name | The client’s implementation name. |
client_software_version | The client’s version. |
client_instance_id | The generated CLIENT_INSTANCE_ID. |
client_id | client.id |
application_id | application.id (Kafka Streams only) |
client_rack | client.rack (if configured) |
group_id | group.id (consumer) |
group_instance_id | group.instance.id (consumer) |
group_member_id | Group member id (if any, consumer) |
transactional_id | transactional.id (producer) |
hostname | Hostname of the client machine. Only emitted if private metrics are enabled. |
os | Operating system name, version, architecture, distro, etc. Only emitted if private metrics are enabled. |
runtime | Runtime environment, e.g., the JVM version, .NET runtime, Python interpreter version, etc. Only emitted if private metrics are enabled. |
Broker-added labels
The following labels are added by the broker as metrics are received:
...
```
bin/kafka-client-metrics.sh [arguments] --bootstrap-server <brokers>

List configured metrics:
  --list [--id <client-instance-id|prefix-match>]

Add metrics:
  --add
  --id <client-instance-id|prefix-match>
  --metric <metric-prefix>..
  --interval-ms <interval>

Delete metrics:
  --delete
  --id <client-instance-id|prefix-match>
  [--metric <metric-prefix>]..

Example:
  # Subscribe to producer partition queue and memory usage
  # metrics every 60s from all librdkafka clients.
  $ kafka-client-metrics.sh --bootstrap-server localhost:9092 \
      --add \
      --id rdkafka \
      --metric org.apache.kafka/.client.producer.partition. \
      --metric librdkafka/.client.memory. \
      --interval-ms 60000

  # The metrics themselves are not viewable with this CLI tool
  # since the storage of metrics is plugin-dependent.
```
...
The monitoring system detects an anomaly for CLIENT_INSTANCE_ID=java-producer-1234’s metric org.apache.kafka/.client.producer.partition.queue.latency which for more than 180 seconds has exceeded the threshold of 5000 milliseconds.
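The detection logic in this scenario (latency above 5000 ms sustained for more than 180 seconds) can be sketched as a simple state machine. This is an illustration of the monitoring-side check, not part of the KIP; all names are hypothetical:

```java
public class LatencyAnomalyDetector {
    // Thresholds taken from the scenario above.
    static final double THRESHOLD_MS = 5000.0;
    static final long SUSTAINED_MS = 180_000;

    private long breachStart = -1; // -1 means not currently above threshold

    /** Feeds one latency sample; returns true once the breach has lasted >= 180s. */
    boolean observe(long nowMs, double latencyMs) {
        if (latencyMs <= THRESHOLD_MS) {
            breachStart = -1; // back under threshold: reset
            return false;
        }
        if (breachStart < 0) breachStart = nowMs; // breach begins
        return nowMs - breachStart >= SUSTAINED_MS;
    }

    public static void main(String[] args) {
        LatencyAnomalyDetector d = new LatencyAnomalyDetector();
        System.out.println(d.observe(0, 6000));        // prints false (breach just began)
        System.out.println(d.observe(180_000, 7000));  // prints true (sustained 180s)
    }
}
```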
...
The Kafka operator adds a metrics subscription for metrics matching the prefix “org.apache.kafka/.client.consumer.” and with the corresponding client.id as resource-name prefix. Since this is a live troubleshooting case, the metrics push interval is set to a low 10 seconds.
...