...
Current state: Under discussion
Discussion thread: here and now here
JIRA: here TBD
Please keep the discussion on the mailing list rather than commenting on the wiki (wiki discussions get unwieldy fast).
...
The following examples illustrate the derivation of the telemetry metric names from Kafka metric names:
Kafka metric name | Telemetry metric name |
---|---|
"connection-creation-rate", group="producer-metrics" |
|
"rebalance-latency-max", group="consumer-coordinator-metrics" |
|
Other vendor or implementation-specific metrics can be added according to the following examples, using "contrib" followed by an implementation-specific name as the namespace:
Implementation-specific metric name | Telemetry metric name |
---|---|
Client "io.confluent.librdkafka" Metric name "client.produce.xmitq.latency" |
|
Python client "com.example.client.python" Metric name "object.count" |
|
Metrics may also hold any number of attributes which provide the multi-dimensionality of metrics. These are similarly derived from the tags of the Kafka metrics, and thus the properties of the equivalent JMX MBeans, replacing '-' with '_'. For example:
Kafka metric name | Telemetry metric name |
---|---|
"request-latency-avg", group="producer-node-metrics", client-id={client-id}, node-id={node-id} |
Attribute keys: |
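The tag-to-attribute rule described above (replace '-' with '_' in each tag key) can be sketched as follows. This is a minimal illustration, not the client implementation; the function name `attribute_keys` and the sample tag values are assumptions made for the example.

```python
def attribute_keys(tags: dict[str, str]) -> dict[str, str]:
    """Derive telemetry attributes from Kafka metric tags.

    Only the keys are rewritten ('-' becomes '_'); values are passed through.
    """
    return {key.replace("-", "_"): value for key, value in tags.items()}


# Sample tag values are made up for illustration.
tags = {"client-id": "producer-1", "node-id": "3"}
attrs = attribute_keys(tags)
```

Here `attrs` holds the keys `client_id` and `node_id` with the original values unchanged.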
Sparse metrics
To keep metrics volume down, it is recommended that a client only sends metrics with a recorded value.
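One way to picture the sparse-metrics recommendation is a filter applied just before a push. The sketch below assumes, purely for illustration, that a metric without a recorded value is represented as `None`; the function name and the shortened metric names are hypothetical.

```python
from typing import Optional


def sparse(metrics: dict[str, Optional[float]]) -> dict[str, float]:
    """Keep only metrics that have a recorded value.

    Assumption: an unrecorded metric is represented as None. A recorded
    value of 0.0 is still a recorded value and is kept.
    """
    return {name: value for name, value in metrics.items() if value is not None}


# Hypothetical snapshot; the names are shortened for readability.
snapshot = {"request.latency.avg": 12.5, "request.latency.max": None, "connection.count": 0.0}
to_push = sparse(snapshot)
```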
...
All standard telemetry metric names begin with the prefix "messaging.kafka.". This is omitted from the table for brevity. The required metrics are bold.
Telemetry metric name | Type | Labels | Description | Existing Kafka metric name |
---|---|---|---|---|
| Gauge | | The rate of connections established per second. | "connection-creation-rate", group="producer-metrics" |
| Sum | | The total number of connections established. | "connection-creation-total", group="producer-metrics" |
| Gauge | node_id | The average request latency in ms for a node. | "request-latency-avg", group="producer-node-metrics" |
| Gauge | node_id | The maximum request latency in ms for a node. | "request-latency-max", group="producer-node-metrics" |
| Gauge | | The average time in ms a request was throttled by the broker. | "produce-throttle-time-avg", group="producer-metrics" |
| Gauge | | The maximum time in ms a request was throttled by the broker. | "produce-throttle-time-max", group="producer-metrics" |
| Gauge | | The average time in ms record batches spent in the send buffer. | "record-queue-time-avg", group="producer-metrics" |
| Gauge | | The maximum time in ms record batches spent in the send buffer. | "record-queue-time-max", group="producer-metrics" |
Standard consumer metrics
All standard telemetry metric names begin with the prefix "messaging.kafka.". This is omitted from the table for brevity. The required metrics are bold.
Telemetry metric name | Type | Labels | Description | Existing metric name |
---|---|---|---|---|
| Gauge | | The rate of connections established per second. | "connection-creation-rate", group="consumer-metrics" |
| Sum | | The total number of connections established. | "connection-creation-total", group="consumer-metrics" |
| Gauge | node_id | The average request latency in ms for a node. | "request-latency-avg", group="consumer-node-metrics" |
| Gauge | node_id | The maximum request latency in ms for a node. | "request-latency-max", group="consumer-node-metrics" |
| Gauge | | The average fraction of time the consumer's poll() is idle as opposed to waiting for the user code to process records. | "poll-idle-ratio-avg", group="consumer-metrics" |
| Gauge | | The average time taken for a commit request. | "commit-latency-avg", group="consumer-coordinator-metrics" |
| Gauge | | The maximum time taken for a commit request. | "commit-latency-max", group="consumer-coordinator-metrics" |
| Gauge | | The number of partitions currently assigned to this consumer. | "assigned-partitions", group="consumer-coordinator-metrics" |
| Gauge | | The average time taken for a group rebalance. | "rebalance-latency-avg", group="consumer-coordinator-metrics" |
| Gauge | | The maximum time taken for a group rebalance. | "rebalance-latency-max", group="consumer-coordinator-metrics" |
| Sum | | The total time taken for group rebalances. | "rebalance-latency-total", group="consumer-coordinator-metrics" |
| Gauge | | The average time taken for a fetch request. | "fetch-latency-avg", group="consumer-fetch-manager-metrics" |
| Gauge | | The maximum time taken for a fetch request. | "fetch-latency-max", group="consumer-fetch-manager-metrics" |
Standard client resource labels
The following labels should be added by the client as appropriate before metrics are pushed.
Label name | Description |
---|---|
application_id | application.id (Kafka Streams only) |
client_rack | client.rack (if configured) |
group_id | group.id (consumer) |
group_instance_id | group.instance.id (consumer) |
group_member_id | Group member id (if any, consumer) |
transactional_id | transactional.id (producer) |
Broker-added labels
The following labels should be added by the broker plugin as metrics are received.
Label name | Description |
---|---|
client_instance_id | The generated CLIENT_INSTANCE_ID. |
client_id | client.id as reported in the Kafka protocol header. |
client_software_name | The client’s implementation name as reported in ApiVersionRequest. |
client_software_version | The client’s version as reported in ApiVersionRequest. |
client_source_address | The client connection’s source address. |
client_source_port | The client connection’s source port. |
principal | Client’s security principal. Content depends on authentication method. |
broker_id | Receiving broker’s node-id. |
Client behavior
A client that supports this metric interface and identifies a supporting broker (by detecting at least GetTelemetrySubscriptionsRequestV0 in the ApiVersionResponse) starts by sending a GetTelemetrySubscriptionsRequest with the ClientInstanceId field set to Null to one randomly selected connected broker. The response provides the client instance id, the subscribed metrics, the push interval, the accepted compression types, etc. This handshake with a Null ClientInstanceId is performed only once in a client instance's lifetime. Subsequent GetTelemetrySubscriptionsRequests must include the ClientInstanceId returned in the first response, regardless of broker.
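The handshake can be sketched with an in-memory stand-in for the broker. Everything named here (`FakeBroker`, `TelemetryClient`, the response dictionary, and its single field) is hypothetical and exists only to show the rule: the first request carries a null ClientInstanceId, and every later request reuses the id returned in the first response, whichever broker it goes to. Choosing a random broker for later requests as well is an assumption of this sketch.

```python
import random
import uuid
from typing import Optional


class FakeBroker:
    """Hypothetical in-memory stand-in for a broker; not a Kafka API."""

    def get_telemetry_subscriptions(self, client_instance_id: Optional[str]) -> dict:
        # A null ClientInstanceId asks the broker to generate one.
        assigned = client_instance_id if client_instance_id is not None else str(uuid.uuid4())
        return {"client_instance_id": assigned}


class TelemetryClient:
    def __init__(self, brokers: list):
        self.brokers = brokers
        self.client_instance_id: Optional[str] = None
        self.sent_ids: list = []  # id sent with each request, kept for inspection

    def get_subscriptions(self) -> dict:
        broker = random.choice(self.brokers)  # one randomly selected connected broker
        self.sent_ids.append(self.client_instance_id)
        response = broker.get_telemetry_subscriptions(self.client_instance_id)
        if self.client_instance_id is None:
            # Only the first response assigns the id; it is then fixed for the
            # lifetime of this client instance.
            self.client_instance_id = response["client_instance_id"]
        return response


client = TelemetryClient([FakeBroker(), FakeBroker(), FakeBroker()])
client.get_subscriptions()  # handshake: sends a null ClientInstanceId
client.get_subscriptions()  # later request: sends the assigned id
```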
...
The following table lists the actions to be taken by the client if the GetTelemetrySubscriptionsResponse.ErrorCode or PushTelemetryResponse.ErrorCode is set to a non-zero value.
Error code | Reason | Client action |
---|---|---|
InvalidRecord (87) | Broker failed to decode or validate the client's encoded metrics. | Log a warning to the application and schedule the next GetTelemetrySubscriptionsRequest in 5 minutes. |
UnknownSubscriptionId (NEW) | Client sent a PushTelemetryRequest with an invalid or outdated SubscriptionId because the configured subscriptions have changed. | Send a GetTelemetrySubscriptionsRequest to update the client's subscriptions. |
UnsupportedCompressionType (76) | Client's compression type is not supported by the broker. | Send a GetTelemetrySubscriptionsRequest to get an up-to-date list of the broker's supported compression types (and any subscription changes). |
The 5-minute and 30-minute delays ensure that the client eventually retries, which avoids having to restart clients if the cluster metrics configuration is disabled temporarily, e.g., by operator error or during rolling upgrades.
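The error handling in the table above can be expressed as a lookup from error name to client action. This is a rough sketch covering only the three rows shown; the `ClientAction` type and its field names are hypothetical, and errors are keyed by name because UnknownSubscriptionId has no numeric code assigned yet.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ClientAction:
    log_warning: bool
    # Delay before the next GetTelemetrySubscriptionsRequest, in seconds.
    # None means the table states no delay for this error.
    resend_delay_s: Optional[int]


ACTIONS = {
    "InvalidRecord": ClientAction(log_warning=True, resend_delay_s=5 * 60),
    "UnknownSubscriptionId": ClientAction(log_warning=False, resend_delay_s=None),
    "UnsupportedCompressionType": ClientAction(log_warning=False, resend_delay_s=None),
}


def action_for(error_name: str) -> Optional[ClientAction]:
    """Return the action for an error name, or None if it is not in this sketch."""
    return ACTIONS.get(error_name)


invalid_record = action_for("InvalidRecord")
```

In all three cases the client ends up sending a new GetTelemetrySubscriptionsRequest; the rows differ only in whether a warning is logged and whether the request is delayed.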
...
This applies to producers, consumers, admin clients, and embedded uses of these clients in frameworks such as Kafka Connect.
Configuration | Description | Values |
---|---|---|
enable.metrics.push | Whether to enable pushing of client metrics to the cluster, if the cluster has a client metrics subscription which matches this client. |
|
Client metrics configuration
These are the configurations for client metrics resources. A client metrics subscription is defined by the configurations for a resource of type CLIENT_METRICS.
Configuration | Description | Values |
---|---|---|
metrics | A list of telemetry metric name prefixes which specify the metrics of interest. | An empty list means no metrics subscribed. A list containing just an empty string means all metrics subscribed. Otherwise, the list entries are prefix-matched against the metric names. |
interval.ms | The client metrics push interval in milliseconds. | Default: 300000 (5 minutes) |
match | The match criteria for selecting which clients the subscription matches. If a client matches all of these criteria, the client matches the subscription. | A list of key-value pairs. The valid keys are:
The values are anchored regular expressions. |
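The matching semantics of `metrics` and `match` can be sketched as two small predicates. This is an illustration under stated assumptions, not the broker's implementation: the function names are hypothetical, the sample metric name is made up, and `client_software_name` is used as a match key only for illustration, since the list of valid keys is not reproduced here. Anchoring is modelled with `re.fullmatch`, so a pattern must match the entire value.

```python
import re


def metric_subscribed(metric_name: str, prefixes: list[str]) -> bool:
    """Prefix-match a telemetry metric name against the `metrics` list.

    An empty list subscribes to nothing (any() over nothing is False), and a
    list containing just "" subscribes to everything (every string starts
    with the empty string).
    """
    return any(metric_name.startswith(prefix) for prefix in prefixes)


def client_matches(labels: dict[str, str], match: dict[str, str]) -> bool:
    """A client matches only if every criterion matches its whole label value.

    Empty criteria match vacuously in this sketch; that is an assumption.
    """
    return all(
        re.fullmatch(pattern, labels.get(key, "")) is not None
        for key, pattern in match.items()
    )


name = "messaging.kafka.producer.example"  # made-up metric name
labels = {"client_software_name": "librdkafka"}  # made-up label value
```

For example, `metric_subscribed(name, ["messaging.kafka.producer."])` is true, while `client_matches(labels, {"client_software_name": "lib"})` is false because the anchored pattern must cover the entire value.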
New error codes
UnknownSubscriptionId
- Client sent a PushTelemetryRequest with an invalid or outdated SubscriptionId. The configured subscriptions have changed.
...