This page is meant as a template for writing a KIP. To create a KIP choose Tools->Copy on this page and modify with your content and replace the heading with the next KIP number and a description of your issue. Replace anything in italics with your own description.
Status
Current state: "Under Discussion"
Discussion thread: here
JIRA: here
Please keep the discussion on the mailing list rather than commenting on the wiki (wiki discussions get unwieldy fast).
Motivation
With the metrics available today it is possible for users of Kafka Streams to derive the consumed throughput of their applications at the task level, but the same is not true for the produced throughput. The consumer client currently reports a cumulative sum metric at the topic level, ie the -total metrics which reports both the number of records and the bytes consumed. These are tagged with the topic name and the (consumer) client id, which is sufficient for computing the bytes/records consumed per subtopology by aggregating over all clients in the Stream application. This computation relies on the fact that Kafka Streams topologies can only consume from a given topic at most once. Unfortunately there is no such guarantee on the other end, and applications can have one or more subtopologies producing to any given output topic.
To fill in this gap and give end users a way to compute the production rate of each subtopology, this KIP proposes two new metrics as follows:
Public Interfaces
The following metrics would be added:
- bytes-produced-total
- records-produced-total
These will be exposed on the processor-level with the following tags:
- type = stream-processor-node-metrics
- thread-id=[threadId]
- task-id=[taskId]
- processor-node-id=[processorNodeId]
These will be reported in the sink nodes at the recording level INFO.
Proposed Changes
Pretty much everything outside of the metric names, recording levels, and tags is an implementation detail so this document won't go much farther here. But just to assuage any concerns about how these metrics will be computed, note that the serialization into raw bytes occurs within Streams before handing them off to the embedded Producer client. And so we should have everything we need to record both the number of records and the size in bytes while we are still within Streams code and have the taskId/subtopology number available.
Compatibility, Deprecation, and Migration Plan
N/A – this KIP is only adding new metrics and should not present any compatibility problems.
Test Plan
Since we're just adding metrics we should have sufficient coverage with basic unit and integration tests. We will also monitor for performance regressions in our benchmarks that sink to output topics just in case this incurs some unexpected overhead and would be better suited at the DEBUG level. See #2 under Rejected Alternatives for more on this.
Rejected Alternatives
- Including the corresponding -rate metrics alongside the -total metrics, as is often done for Streams metrics. For one thing reporting the cumulative sums should be sufficient for any monitoring service to compute the rate if desired, and furthermore allows that service to have full control over how this rate is defined and computed over what window of time.
- In addition the -total metrics don't require checking the current system time as an input and thus don't have the potential performance impact of the -rate or other time-based metrics discussed in previous KIPs. This means we can avoid the tradeoff and feel safe reporting these new metrics at the INFO level despite typically reporting anything task-level or below at DEBUG.