Status

Current state: Accepted

Discussion thread: here

JIRA: KAFKA-13229

Please keep the discussion on the mailing list rather than commenting on the wiki (wiki discussions get unwieldy fast).

Motivation

When debugging Kafka Streams application performance, it can be hard to isolate the problem to a bottleneck in Kafka Streams. The poll-ratio and processing-ratio metrics reported today are helpful, but limited in sample size: they report the proportion of time spent in poll or in processing records for the last poll loop only, at a single instant in time. They are also too coarse-grained: poll wraps some computationally significant work, such as calls to consumer interceptors. What we really want to know is the proportion of time the application spent processing records versus blocked waiting on Kafka. This KIP proposes a new metric, `blocked-time-ns-total`, that measures the total time a thread has spent blocked since it was started. Users can sample this metric periodically and use the difference between samples to measure the time blocked during an interval.

Public Interfaces

New Metrics

Kafka Streams

blocked-time-ns-total
tags: thread-id, application-id
group: stream-thread-metrics
level: INFO. The proposed metrics should be collectible at INFO level without adding meaningful overhead. They require sampling the time twice during the corresponding API calls, which nowadays is very cheap.
description: the total time the Kafka Streams thread spent blocked on Kafka.

thread-start-time
tags: thread-id
group: stream-thread-metrics
level: INFO
description: the epoch time the Kafka Streams thread was started. This is useful for computing the processing ratio during the first interval after the thread starts. 
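As an illustration of how a user might combine these two metrics, the following is a minimal sketch that turns two samples of the blocked-time metric into a blocked ratio for an interval. The class and method names are hypothetical and not part of Kafka. It assumes the blocked time is reported in nanoseconds and that thread-start-time is an epoch timestamp in milliseconds; the KIP does not state the unit of thread-start-time.

```java
/** Hypothetical helper, not part of Kafka: derives a blocked ratio from sampled metric values. */
final class BlockedRatioSketch {
    private BlockedRatioSketch() {}

    /**
     * Ratio of time blocked between two samples.
     * prevBlockedNs / currBlockedNs: sampled blocked-time values (assumed nanoseconds).
     * intervalStartMs / intervalEndMs: wall-clock bounds of the interval (assumed epoch milliseconds).
     */
    static double blockedRatio(double prevBlockedNs, double currBlockedNs,
                               long intervalStartMs, long intervalEndMs) {
        long elapsedMs = intervalEndMs - intervalStartMs;
        if (elapsedMs <= 0) {
            throw new IllegalArgumentException("interval must have positive length");
        }
        double blockedMs = (currBlockedNs - prevBlockedNs) / 1_000_000.0;
        // Clamp to [0, 1] to absorb small clock skew between the two time sources.
        return Math.min(1.0, Math.max(0.0, blockedMs / elapsedMs));
    }

    /**
     * First interval after the thread starts: there is no previous sample,
     * so the interval begins at thread-start-time and the previous total is zero.
     */
    static double firstIntervalBlockedRatio(double currBlockedNs, long threadStartMs, long nowMs) {
        return blockedRatio(0.0, currBlockedNs, threadStartMs, nowMs);
    }
}
```

The processing ratio for the same interval would then be one minus the blocked ratio, under the assumption that the thread is either blocked on Kafka or processing.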

Producer

flush-time-ns-total
tags: client-id
group: producer-metrics
level: INFO
description: the total time the Producer spent in `Producer.flush` in nanoseconds.

txn-init-time-ns-total
tags: client-id
group: producer-metrics
level: INFO
description: the total time the Producer spent initializing transactions in nanoseconds (for EOS).

txn-begin-time-ns-total
tags: client-id
group: producer-metrics
level: INFO
description: the total time the Producer spent in beginTransaction in nanoseconds (for EOS).

txn-send-offsets-time-ns-total
tags: client-id
group: producer-metrics
level: INFO
description: the total time the Producer spent sending offsets to transactions in nanoseconds (for EOS).

txn-commit-time-ns-total
tags: client-id
group: producer-metrics
level: INFO
description: the total time the Producer spent committing transactions in nanoseconds (for EOS).

txn-abort-time-ns-total
tags: client-id
group: producer-metrics
level: INFO
description: the total time the Producer spent aborting transactions in nanoseconds (for EOS).

Consumer

committed-time-ns-total
tags: client-id
group: consumer-metrics
level: INFO
description: the total time the Consumer spent in `Consumer.committed` in nanoseconds.

commit-sync-time-ns-total
tags: client-id
group: consumer-metrics
level: INFO
description: the total time the Consumer spent committing offsets in nanoseconds (for AOS).

Proposed Changes

flush-time-ns-total: this will be a Producer metric computed as the cumulative sum of time elapsed during calls to Producer.flush.

txn-init-time-ns-total: this will be a Producer metric computed as the cumulative sum of time elapsed during calls to Producer.initTransactions.

txn-begin-time-ns-total: this will be a Producer metric computed as the cumulative sum of time elapsed during calls to Producer.beginTransaction.

txn-send-offsets-time-ns-total: this will be a Producer metric computed as the cumulative sum of time elapsed during calls to Producer.sendOffsetsToTransaction.

txn-commit-time-ns-total: this will be a Producer metric computed as the cumulative sum of time elapsed during calls to Producer.commitTransaction.

txn-abort-time-ns-total: this will be a Producer metric computed as the cumulative sum of time elapsed during calls to Producer.abortTransaction.

committed-time-ns-total: this will be a Consumer metric computed as the cumulative sum of time elapsed during calls to Consumer.committed.

commit-sync-time-ns-total: this will be a Consumer metric computed as the cumulative sum of time elapsed during calls to Consumer.commitSync.
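The per-call metrics above all follow the same pattern: read the clock before and after the call and add the difference to a running total. A minimal sketch of that pattern is shown here. The class name and the injected clock are hypothetical and do not reflect the actual Kafka client internals, and recording the elapsed time even when the call throws is an assumption of this sketch.

```java
import java.util.concurrent.atomic.AtomicLong;
import java.util.function.LongSupplier;

/** Hypothetical illustration of a cumulative-sum timing metric; not Kafka's implementation. */
final class CumulativeTimerSketch {
    private final LongSupplier nanoClock;          // e.g. System::nanoTime in real use
    private final AtomicLong totalNs = new AtomicLong();

    CumulativeTimerSketch(LongSupplier nanoClock) {
        this.nanoClock = nanoClock;
    }

    /** Runs the call and adds its elapsed time to the running total. */
    void time(Runnable call) {
        long start = nanoClock.getAsLong();        // first clock sample
        try {
            call.run();
        } finally {
            // Second clock sample; recorded even if the call throws (an assumption of this sketch).
            totalNs.addAndGet(nanoClock.getAsLong() - start);
        }
    }

    /** Cumulative time in nanoseconds across all timed calls so far. */
    long totalNs() {
        return totalNs.get();
    }
}
```

In real use the timer would be constructed with `System::nanoTime` and wrapped around a call such as `producer.flush()`; injecting the clock only makes the sketch easy to exercise deterministically.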

blocked-time-ns-total: this will be a Value that returns the sum of the following metrics:

  • consumer’s io-waittime-total
  • consumer’s iotime-total
  • consumer’s committed-time-ns-total
  • consumer’s commit-sync-time-ns-total
  • restore consumer’s io-waittime-total
  • restore consumer’s iotime-total
  • admin client’s io-waittime-total
  • admin client’s iotime-total
  • producer’s bufferpool-wait-time-total
  • producer's flush-time-ns-total
  • producer's txn-init-time-ns-total
  • producer's txn-begin-time-ns-total
  • producer's txn-send-offsets-time-ns-total
  • producer's txn-commit-time-ns-total
  • producer's txn-abort-time-ns-total
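The summation described in the list above can be sketched as follows. This is only an illustration: the class name and the "<client>/<metric>" key scheme are hypothetical, the input map stands in for values read from the respective clients, and the sketch assumes every component is reported in the same unit and that a missing component counts as zero.

```java
import java.util.List;
import java.util.Map;

/** Hypothetical illustration of summing component metrics into one blocked-time value. */
final class BlockedTimeSumSketch {
    private BlockedTimeSumSketch() {}

    // Hypothetical "<client>/<metric>" keys, one per component named in the list above.
    static final List<String> COMPONENTS = List.of(
        "consumer/io-waittime-total",
        "consumer/iotime-total",
        "consumer/committed-time-ns-total",
        "consumer/commit-sync-time-ns-total",
        "restore-consumer/io-waittime-total",
        "restore-consumer/iotime-total",
        "admin/io-waittime-total",
        "admin/iotime-total",
        "producer/bufferpool-wait-time-total",
        "producer/flush-time-ns-total",
        "producer/txn-init-time-ns-total",
        "producer/txn-begin-time-ns-total",
        "producer/txn-send-offsets-time-ns-total",
        "producer/txn-commit-time-ns-total",
        "producer/txn-abort-time-ns-total"
    );

    /** Sums the listed components; entries not in the list are ignored, missing ones count as 0. */
    static double blockedTimeNsTotal(Map<String, Double> metricValues) {
        double sum = 0.0;
        for (String key : COMPONENTS) {
            sum += metricValues.getOrDefault(key, 0.0);
        }
        return sum;
    }
}
```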

Compatibility, Deprecation, and Migration Plan

We're not changing or removing existing metrics, so compatibility/migration is not a concern.

Rejected Alternatives

One alternative we considered was to compute a blocked-ratio at a fixed interval, or possibly a configurable interval. It seems more flexible to just expose the total blocked time and leave windowing to the user.