...
- Sum - Monotonic total count meter (Counter). Suitable for total number of X counters, e.g., total number of bytes sent.
- Gauge - Non-monotonic current value meter (UpDownCounter). Suitable for current value of Y, e.g., current queue count.
- Histogram - Value distribution meter (ValueRecorder). Suitable for latency values, etc.
For simplicity, a client implementation may choose to provide an average value as a Gauge instead of a Histogram. These averages should use the original Histogram metric name with a suffix naming the averaging period, such as ".avg60s", e.g., "client.request.rtt.avg10s" for a 10-second average.
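The naming rule for averaged Gauges can be illustrated with a minimal sketch. The helper `averaged_metric_name` and the class `WindowedAverage` are hypothetical names used only for illustration, and the simple arithmetic mean over a sliding window is an assumption; the text above does not prescribe how the average is computed.

```python
from collections import deque


def averaged_metric_name(histogram_name, period_seconds):
    """Build the Gauge name for an average taken over period_seconds."""
    return f"{histogram_name}.avg{period_seconds}s"


class WindowedAverage:
    """Arithmetic mean of the samples recorded in the last period_seconds.

    Hypothetical helper: the averaging method is an assumption.
    """

    def __init__(self, period_seconds):
        self.period_seconds = period_seconds
        self._samples = deque()  # (timestamp, value) pairs, oldest first

    def record(self, value, now):
        self._samples.append((now, value))

    def value(self, now):
        # Drop samples that have fallen out of the averaging window.
        cutoff = now - self.period_seconds
        while self._samples and self._samples[0][0] <= cutoff:
            self._samples.popleft()
        if not self._samples:
            return None  # nothing to report for this period
        return sum(v for _, v in self._samples) / len(self._samples)
```

Timestamps are passed in explicitly so the sketch stays deterministic; a real client would likely read a monotonic clock instead.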
Client instance-level metrics
Metric name | Type | Labels | Description |
client.connection.creations | Sum | FIXME: with broker_id label? | Total number of broker connections made. |
client.connection.count | Gauge | | Current number of broker connections. |
client.connection.errors | Sum | reason | Total number of broker connection failures. Label ‘reason’ indicates the reason: disconnect - remote peer closed the connection. auth - authentication failure. TLS - TLS failure. timeout - client request timeout. close - client closed the connection. |
client.request.rtt | Histogram | broker_id | Average request latency / round-trip-time to broker and back. |
client.request.queue.latency | Histogram | broker_id | Average request queue latency waiting for request to be sent to broker. |
client.request.queue.count | Gauge | broker_id | Number of requests in queue waiting to be sent to broker. |
client.request.success | Sum | broker_id | Number of successful requests to broker, that is, requests where a response is received without a request-level error (but there may be per-sub-resource errors, e.g., errors for certain partitions within an OffsetCommitResponse). |
client.request.errors | Sum | broker_id, reason | Number of failed requests. Label ‘reason’ indicates the reason: timeout - client timed out the request, disconnect - broker connection was closed before response could be received, error - request-level protocol error. |
client.io.wait.time | Histogram | | Amount of time waiting for socket I/O writability (POLLOUT). A high number indicates socket send buffer congestion. FIXME: histogram? Avg? Total? Should this be for POLLOUT only? |
As the client will not know the broker ids of its bootstrap servers, the broker_id label should be set to “bootstrap” for them. FIXME: Should we have a broker_address (“host:port”) label for this purpose?
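The label rules above can be sketched as follows. The function names, and the use of `None` to represent a broker whose id is not yet known, are assumptions made for illustration. The accepted reason values are the ones listed for client.request.errors.

```python
BOOTSTRAP_LABEL = "bootstrap"

# Reasons listed for client.request.errors in the table above.
REQUEST_ERROR_REASONS = {"timeout", "disconnect", "error"}


def broker_id_label(broker_id):
    """Return the broker_id label value.

    Assumption: None stands for a bootstrap server whose id is unknown.
    """
    if broker_id is None:
        return BOOTSTRAP_LABEL
    return str(broker_id)


def request_error_labels(broker_id, reason):
    """Build the label set for one client.request.errors increment."""
    if reason not in REQUEST_ERROR_REASONS:
        raise ValueError(f"unknown reason: {reason!r}")
    return {"broker_id": broker_id_label(broker_id), "reason": reason}
```

For example, `request_error_labels(None, "timeout")` labels a timed-out request to a bootstrap server, while `request_error_labels(3, "disconnect")` labels a dropped connection to broker 3.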
...
Metric name | Type | Labels | Description |
client.consumer.poll.interval | Gauge (FIXME: Histogram?) | | The interval at which the application calls poll(), in seconds. |
client.consumer.poll.last | Gauge | | The number of seconds since the last poll() invocation. |
client.consumer.poll.latency | Histogram | | The time it takes poll() to return a new message to the application. |
client.consumer.commit.count | Sum | | Number of commit requests sent. |
client.consumer.group.assignment.strategy | String | | Current group assignment strategy in use. |
client.consumer.group.assignment.partition.count | Gauge | | Number of partitions currently assigned to this consumer by the group leader. |
client.consumer.assignment.partition.count | Gauge | | Number of partitions currently assigned to this consumer, either through the group protocol or through assign(). |
client.consumer.group.rebalance.count | Sum | | Number of group rebalances. |
client.consumer.group.error.count | Sum | error | Consumer group error counts. The error label depicts the actual error, e.g., "MaxPollExceeded", "HeartbeatTimeout", etc. |
client.consumer.record.queue.count | Gauge | | Number of records in consumer pre-fetch queue. |
client.consumer.record.queue.bytes | Gauge | | Amount of record memory in consumer pre-fetch queue. This may also include per-record overhead. |
client.consumer.record.application.count | Sum | | Number of records consumed by application. |
client.consumer.record.application.bytes | Sum | | Memory of records consumed by application. |
client.consumer.fetch.latency | Histogram | | FetchRequest latency. |
client.consumer.fetch.count | Sum | | Total number of FetchRequests sent. |
client.consumer.fetch.failures | Sum | | Total number of FetchRequest failures. |
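The two poll() timing metrics in the table above differ in when they are evaluated: client.consumer.poll.interval is updated on each poll() call, while client.consumer.poll.last is computed at the time the metric is read. A minimal sketch of that bookkeeping follows; the class `PollTracker` and its methods are hypothetical, and timestamps are passed in explicitly (in seconds) to keep the example deterministic.

```python
class PollTracker:
    """Tracks poll() timing for the poll.interval and poll.last gauges.

    Hypothetical helper, not part of any client API.
    """

    def __init__(self):
        self._last_poll = None  # timestamp of the most recent poll()
        self.interval = None    # seconds between the two most recent polls

    def on_poll(self, now):
        """Call each time the application invokes poll()."""
        if self._last_poll is not None:
            self.interval = now - self._last_poll
        self._last_poll = now

    def seconds_since_last_poll(self, now):
        """Value for client.consumer.poll.last at read time."""
        if self._last_poll is None:
            return None  # poll() has not been called yet
        return now - self._last_poll
```

A steadily growing `seconds_since_last_poll` value with an unchanged `interval` would suggest that the application has stopped polling.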
...
Metric name | Type | Labels | Description |
client.producer.partition.queue.bytes | Gauge | topic, partition, acks (all, none, or leader) | Number of bytes queued on partition queue. |
client.producer.partition.queue.count | Gauge | topic, partition, acks (all, none, or leader) | Number of records queued on partition queue. |
client.producer.partition.latency | Histogram | topic, partition, acks (all, none, or leader) | Total produce record latency, from application calling send()/produce() to ack received from broker. |
client.producer.partition.queue.latency | Histogram | topic, partition, acks (all, none, or leader) | Time between send()/produce() and record being sent to broker. |
client.producer.partition.record.retries | Sum | topic, partition, acks (all, none, or leader) | Number of ProduceRequest retries. |
client.producer.partition.record.failures | Sum | topic, partition, acks (all, none, or leader), reason | Number of records that permanently failed delivery. Reason is a short string representation of the reason, which is typically the name of a Kafka protocol error code, e.g., “RequestTimedOut”. |
client.producer.partition.record.success | Sum | topic, partition, acks (all, none, or leader) | Number of records that have been successfully produced. |
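The two producer latency metrics above share a starting point (the send()/produce() call) but end at different events. The sketch below shows one way to derive both from per-record timestamps. The `RecordTimestamps` type, its field names, and the helper functions are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass

ACKS_VALUES = {"all", "none", "leader"}


@dataclass
class RecordTimestamps:
    """Per-record timestamps in seconds (hypothetical structure)."""
    enqueued_at: float  # application called send()/produce()
    sent_at: float      # record was sent to the broker
    acked_at: float     # ack was received from the broker


def queue_latency(ts):
    """Sample for client.producer.partition.queue.latency."""
    return ts.sent_at - ts.enqueued_at


def total_latency(ts):
    """Sample for client.producer.partition.latency."""
    return ts.acked_at - ts.enqueued_at


def partition_labels(topic, partition, acks):
    """Build the topic/partition/acks label set used by the producer metrics."""
    if acks not in ACKS_VALUES:
        raise ValueError(f"acks must be one of {sorted(ACKS_VALUES)}")
    return {"topic": topic, "partition": str(partition), "acks": acks}
```

Because both samples are measured from the same enqueue time, the queue latency of a record can never exceed its total latency, which gives a cheap sanity check on recorded values.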
...