...

  • Sum  - Monotonic total count meter (Counter). Suitable for total number of X counters, e.g., total number of bytes sent.
  • Gauge  - Non-monotonic current value meter (UpDownCounter). Suitable for current value of Y, e.g., current queue count.
  • Histogram  - Value distribution meter (ValueRecorder). Suitable for latency values, etc.
    For simplicity, a client implementation may choose to provide an average value as a Gauge instead of a Histogram. These averages should use the original Histogram metric name + ".avg60s" (or whatever the averaging period is), e.g., "client.request.rtt.avg10s".
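The naming rule for averaged Histograms can be illustrated with a minimal sketch. The helper `average_metric_name` and the `WindowedAverage` class are hypothetical names introduced here for illustration only; the text above defines just the naming convention, not how the average is computed, so the sliding-window mean below is an assumption.

```python
from collections import deque


def average_metric_name(histogram_name: str, period_seconds: int) -> str:
    """Derive the Gauge name for a windowed average of a Histogram metric."""
    return f"{histogram_name}.avg{period_seconds}s"


class WindowedAverage:
    """Mean of the values recorded during the last `period_seconds` (assumed semantics)."""

    def __init__(self, period_seconds: float):
        self.period_seconds = period_seconds
        self._samples = deque()  # (timestamp, value) pairs, oldest first

    def record(self, value: float, now: float) -> None:
        self._samples.append((now, value))

    def value(self, now: float) -> float:
        # Drop samples that have fallen out of the averaging window.
        cutoff = now - self.period_seconds
        while self._samples and self._samples[0][0] <= cutoff:
            self._samples.popleft()
        if not self._samples:
            return 0.0
        return sum(v for _, v in self._samples) / len(self._samples)
```

A client that cannot export a full distribution for "client.request.rtt" could, under these assumptions, export `WindowedAverage(10).value(now)` as a Gauge named `average_metric_name("client.request.rtt", 10)`.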

Client instance-level metrics

Metric name

Type

Labels

Description

client.connection.creations

Sum

FIXME: with broker_id label?

Total number of broker connections made.

client.connection.count

Gauge


Current number of broker connections.

client.connection.errors

Sum

reason

Total number of broker connection failures. Label ‘reason’ indicates the reason: 

disconnect - remote peer closed the connection.

auth - authentication failure.

TLS - TLS failure.

timeout - client request timeout.

close - client closed the connection.

client.request.rtt

Histogram

broker_id

Average request latency / round-trip time to the broker and back.

client.request.queue.latency

Histogram

broker_id

Average request queue latency waiting for request to be sent to broker.

client.request.queue.count

Gauge

broker_id

Number of requests in queue waiting to be sent to broker.

client.request.success

Sum

broker_id

Number of successful requests to broker, that is, requests for which a response is received without a request-level error (but there may be per-sub-resource errors, e.g., errors for certain partitions within an OffsetCommitResponse).

client.request.errors

Sum

broker_id

reason

Number of failed requests.

Label ‘reason’ indicates the reason:

timeout - client timed out the request,

disconnect - broker connection was closed before response could be received,

error - request-level protocol error.

client.io.wait.time

Histogram


Amount of time waiting for socket I/O writability (POLLOUT). A high number indicates socket send buffer congestion.


As the client will not know the broker id of its bootstrap servers, the broker_id label should be set to “bootstrap” for them. FIXME: Should we have a broker_address (“host:port”) for this purpose?
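As a rough sketch of how the labels in the table above might be applied, the following keeps a monotonic Sum for client.request.errors keyed by the broker_id and reason labels, using “bootstrap” when the broker id is not yet known. The names `broker_id_label` and `RequestErrorCounter` are hypothetical, and representing an unknown broker id as `None` is an assumption made for this example.

```python
from collections import Counter
from typing import Optional

# Reason values listed for client.request.errors.
REQUEST_ERROR_REASONS = {"timeout", "disconnect", "error"}


def broker_id_label(broker_id: Optional[int]) -> str:
    """Return the broker_id label; an unknown id (bootstrap server) maps to "bootstrap"."""
    return "bootstrap" if broker_id is None else str(broker_id)


class RequestErrorCounter:
    """Monotonic Sum for client.request.errors, keyed by (broker_id, reason)."""

    def __init__(self):
        self._counts = Counter()

    def record(self, broker_id: Optional[int], reason: str) -> None:
        if reason not in REQUEST_ERROR_REASONS:
            raise ValueError(f"unknown reason: {reason!r}")
        self._counts[(broker_id_label(broker_id), reason)] += 1

    def value(self, broker_id: Optional[int], reason: str) -> int:
        return self._counts[(broker_id_label(broker_id), reason)]
```

Rejecting unknown reason strings keeps the label cardinality bounded to the values enumerated in the table.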

...

Metric name

Type

Labels

Description

client.consumer.poll.interval

Histogram


The interval at which the application calls poll(), in seconds.

client.consumer.poll.last

Gauge


The number of seconds since the last poll() invocation.

client.consumer.poll.latency

Histogram


The time it takes poll() to return a new message to the application.

client.consumer.commit.count

Sum


Number of commit requests sent.

client.consumer.group.assignment.strategy

String


Current group assignment strategy in use.

client.consumer.group.assignment.partition.count

Gauge


Number of partitions currently assigned to this consumer by the group leader.

client.consumer.assignment.partition.count

Gauge


Number of partitions currently assigned to this consumer, either through the group protocol or through assign().

client.consumer.group.rebalance.count

Sum


Number of group rebalances.

client.consumer.group.error.count

Sum

error

Consumer group error counts. The error label depicts the actual error, e.g., "MaxPollExceeded", "HeartbeatTimeout", etc.

client.consumer.record.queue.count

Gauge


Number of records in consumer pre-fetch queue.

client.consumer.record.queue.bytes

Gauge


Amount of record memory in consumer pre-fetch queue. This may also include per-record overhead.

client.consumer.record.application.count

Sum


Number of records consumed by application.

client.consumer.record.application.bytes

Sum


Memory of records consumed by application.

client.consumer.fetch.latency

Histogram


FetchRequest latency.

client.consumer.fetch.count

Sum


Total number of FetchRequests sent.

client.consumer.fetch.failures

Sum


Total number of FetchRequest failures.
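The two poll timing metrics in the table above, client.consumer.poll.interval and client.consumer.poll.last, could be derived from a single timestamp kept by the consumer. The sketch below is only an illustration: `PollTracker` is a hypothetical name, timestamps are passed in explicitly (in seconds) so the example stays deterministic, and returning `None` before the first poll() is an assumption, since the table does not say what to report in that case.

```python
from typing import List, Optional


class PollTracker:
    """Tracks poll() call times for the poll.interval and poll.last metrics."""

    def __init__(self):
        self._last_poll: Optional[float] = None
        # Samples for client.consumer.poll.interval, in seconds.
        self.intervals: List[float] = []

    def on_poll(self, now: float) -> None:
        """Call at the start of every poll(); records the gap since the previous call."""
        if self._last_poll is not None:
            self.intervals.append(now - self._last_poll)
        self._last_poll = now

    def seconds_since_last_poll(self, now: float) -> Optional[float]:
        """Value for client.consumer.poll.last; None until poll() has been called once."""
        if self._last_poll is None:
            return None
        return now - self._last_poll
```

In a real client the `now` argument would typically come from a monotonic clock such as `time.monotonic()`, so that wall-clock adjustments do not distort the intervals.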

...

Metric name

Type

Labels

Description

client.producer.partition.queue.bytes

Gauge

topic

partition

acks=all|none|leader

Number of bytes queued on partition queue.

client.producer.partition.queue.count

Gauge

topic

partition

acks=all|none|leader

Number of records queued on partition queue.

client.producer.partition.latency

Histogram

topic

partition

acks=all|none|leader

Total produce record latency, from application calling send()/produce() to ack received from broker.

client.producer.partition.queue.latency

Histogram

topic

partition

acks=all|none|leader

Time between send()/produce() and record being sent to broker.

client.producer.partition.record.retries

Sum

topic

partition

acks=all|none|leader

Number of ProduceRequest retries.

client.producer.partition.record.failures

Sum

topic

partition

acks=all|none|leader

reason

Number of records that permanently failed delivery. Reason is a short string representation of the reason, which is typically the name of a Kafka protocol error code, e.g., “RequestTimedOut”.

client.producer.partition.record.success

Sum

topic

partition

acks=all|none|leader

Number of records that have been successfully produced.
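Every producer partition metric above carries the same topic, partition and acks labels. A small sketch of how that label set might be built is shown here. The function names are hypothetical, and the mapping from the producer `acks` configuration values ("all" or -1, 0, 1) to the label values all, none and leader is an assumption for illustration; the table only lists the label values themselves.

```python
from typing import Dict, Union


def acks_label(acks: Union[int, str]) -> str:
    """Map a producer acks setting to the acks label value (assumed mapping)."""
    value = str(acks).strip().lower()
    if value in ("all", "-1"):
        return "all"
    if value == "0":
        return "none"
    if value == "1":
        return "leader"
    raise ValueError(f"unsupported acks value: {acks!r}")


def partition_labels(topic: str, partition: int, acks: Union[int, str]) -> Dict[str, str]:
    """Build the label set shared by the client.producer.partition.* metrics."""
    return {
        "topic": topic,
        "partition": str(partition),
        "acks": acks_label(acks),
    }
```

For example, `partition_labels("orders", 0, "all")` yields the labels topic=orders, partition=0, acks=all, where "orders" is just a placeholder topic name.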

...