Status
Current state: Under Discussion
Discussion thread: here
JIRA:
Please keep the discussion on the mailing list rather than commenting on the wiki (wiki discussions get unwieldy fast).
Motivation
When debugging Kafka Streams application performance, it can be hard to isolate the problem to a bottleneck in Kafka Streams. The poll-ratio and processing-ratio metrics reported today are helpful but limited in sample size: they report the proportion of time spent in poll or in processing records for the most recent poll loop only, at a single instant in time. They are also too coarse-grained: poll wraps some computationally significant work, such as calls to consumer interceptors. What we really want to know is the proportion of time the application spent processing records versus blocked waiting on Kafka. This KIP proposes a new metric, `blocked-time-total`, that measures the total time a thread has spent blocked since it was started. Users can sample this metric periodically and use the difference between samples to measure the time blocked during an interval.
Public Interfaces
New Metrics
blocked-time-total
tags: thread-id, application-id
description: the total time the Kafka Streams thread spent blocked on Kafka.
flush-time-total
tags: thread-id, application-id
description: the total time the Kafka Streams thread spent in `Producer.flush`
txn-commit-time-total
tags: thread-id, application-id
description: the total time the Kafka Streams thread spent committing transactions (for EOS).
offset-commit-time-total
tags: thread-id, application-id
description: the total time the Kafka Streams thread spent committing offsets (for AOS).
thread-start-time
tags: thread-id, application-id
description: the epoch time the Kafka Streams thread was started. This is useful for computing utilization during intervals close to the stream thread start time.
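As an illustration of how a user might consume these metrics, the following is a minimal sketch of deriving a utilization figure from two samples of `blocked-time-total`, with a fallback to `thread-start-time` for the first interval after a thread starts. The class and method names are hypothetical and not part of Kafka Streams, and the sketch assumes both metrics are expressed in milliseconds, which this KIP does not specify.

```java
/** Hypothetical helper for metric consumers; not part of Kafka Streams. */
final class BlockedTimeUtilization {

    /** One reading of blocked-time-total and the wall-clock time it was taken (both assumed ms). */
    static final class Sample {
        final long wallClockMs;
        final double blockedTotalMs;

        Sample(long wallClockMs, double blockedTotalMs) {
            this.wallClockMs = wallClockMs;
            this.blockedTotalMs = blockedTotalMs;
        }
    }

    /** Fraction of the interval between two samples during which the thread was not blocked. */
    static double utilization(Sample prev, Sample curr) {
        return ratio(curr.wallClockMs - prev.wallClockMs,
                     curr.blockedTotalMs - prev.blockedTotalMs);
    }

    /**
     * Same calculation when no earlier sample exists: the interval begins at
     * thread-start-time, when the blocked total was zero.
     */
    static double utilizationSinceStart(long threadStartMs, Sample curr) {
        return ratio(curr.wallClockMs - threadStartMs, curr.blockedTotalMs);
    }

    private static double ratio(long elapsedMs, double blockedMs) {
        if (elapsedMs <= 0) {
            throw new IllegalArgumentException("interval must be positive");
        }
        // Clamp to [0, 1] to absorb small skew between the clock and the metric reading.
        return Math.max(0.0, Math.min(1.0, 1.0 - blockedMs / elapsedMs));
    }
}
```

For example, if the blocked total grows by 2,500 ms between two samples taken 10,000 ms apart, `utilization` reports that the thread was busy for 75% of that interval.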
Proposed Changes
flush-time-total: this will be computed as the cumulative sum of time elapsed during calls to Producer.flush.
txn-commit-time-total: this will be computed as the cumulative sum of time elapsed during calls to StreamsProducer.commitTransaction from TaskManager.
offset-commit-time-total: this will be computed as the cumulative sum of time elapsed during calls to Consumer.commitSync from TaskManager.
blocked-time-total: this will be a Value that returns the sum of the following metrics:
consumer’s io-waittime-total
consumer’s iotime-total
restore consumer’s io-waittime-total
restore consumer’s iotime-total
admin client’s io-waittime-total
admin client’s iotime-total
producer’s bufferpool-wait-time-total
flush-time-total
txn-commit-time-total
txn-begin-time-total
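The two ideas above, accumulating elapsed time around a blocking call and exposing a value that sums several component totals, can be sketched as follows. This is an illustrative sketch only: `CumulativeTimer` and `sumOf` are hypothetical names, the clock is injected so the example is deterministic, and none of this reflects the actual Kafka Streams implementation or the Kafka metrics API.

```java
import java.util.List;
import java.util.function.DoubleSupplier;
import java.util.function.LongSupplier;

/** Hypothetical sketch; names and structure are assumptions, not Kafka Streams code. */
final class BlockedTimeSketch {

    /** Accumulates the time elapsed across timed calls, in the manner of flush-time-total. */
    static final class CumulativeTimer {
        private final LongSupplier clock;
        // Written only by the owning thread; volatile so a reporter thread sees updates.
        private volatile long total;

        CumulativeTimer(LongSupplier clock) {
            this.clock = clock;
        }

        /** Runs the call and adds its duration to the total, even if it throws. */
        void time(Runnable call) {
            long start = clock.getAsLong();
            try {
                call.run();
            } finally {
                total += clock.getAsLong() - start;
            }
        }

        double total() {
            return total;
        }
    }

    /** A derived value that re-reads every component and sums them on each read. */
    static DoubleSupplier sumOf(List<DoubleSupplier> components) {
        return () -> components.stream()
                               .mapToDouble(DoubleSupplier::getAsDouble)
                               .sum();
    }
}
```

Because the sum is evaluated lazily on each read, the derived value always reflects the latest component totals without a separate update step. In a real setup the components would be the client metrics listed above rather than hand-built suppliers.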
Compatibility, Deprecation, and Migration Plan
We're not changing or removing existing metrics, so compatibility/migration is not a concern.
Rejected Alternatives
One alternative we considered was to compute a blocked-ratio at a fixed interval, or possibly a configurable interval. It seems more flexible to just expose the total blocked time and leave windowing to the user.