Status

Current state: Under Discussion

Discussion thread:

JIRA:

Unable to render Jira issues macro, execution error.

Please keep the discussion on the mailing list rather than commenting on the wiki (wiki discussions get unwieldy fast).

Motivation

Kafka Streams exposes metrics on various levels. The number of metrics grows with the number of tasks, the number of processors, the number state stores and the number of buffers in a Kafka Streams application. Hence, the number of metrics might grow substantially for a Kafka Streams client if it executes many task and/or many processors as well as if it has many state stores and/or many buffers.

Some users monitor their Kafka Streams applications by using commercial monitoring services. Those services often limit the number of metrics that can be reported to them. Some providers truncate the metrics when the limit is exceeded. That means, that some metrics are then not sent to the monitoring service, which might lead to false alerts. For example, In Kafka Streams the metric alive-stream-threads records the number of alive stream threads. Users might configure their monitoring service to alert them on this metric when a stream thread dies. If metric alive-stream-threads is removed from the reported metrics because the limit of the number of reported metrics of the monitoring service is exceeded, users will get an alert although no stream thread actually died.

In this KIP, we propose to add an API to the Kafka Streams client that adds a metric that records the aggregation of other metrics. The metric that records the aggregation can then be reported to the monitoring service instead of reporting multiple metrics that would be aggregated in the monitoring service anyway. In such a way users can avoid exceeding the limit of number of reported metrics of the monitoring service and the associated possible false alerts.

Public Interfaces

package org.apache.kafka.streams;

public class KafkaStreams implements AutoCloseable {

	public <AGG, V> void addMetricsAggregation(final String name,
                                          	   final String description,
                                               final List<String> tagLabels,
                                               final RecordingLevel recordingLevel,
                                               final Supplier<AGG> initialAggregateSupplier,
                                               final BiFunction<AGG, V, AGG> aggregationFunction,
                                               final String groupOfMetricsToAggregate,
                                               final String nameOfMetricsToAggregate);
}

Proposed Changes

We propose to add the above method to the KafkaStreams class. The behavior of the method is described in this section.

Method KafkaStreams#addMetricsAggregation() creates one or more metrics on client-level that record the aggregation of the metrics specified by the arguments groupOfMetricsToAggregate and nameOfMetricsToAggregate. Before the specified metrics are aggregated, they are grouped by the tag labels provided in argument tagLabels. For example, if users want to aggregate state-store-level metric size-all-mem-tables grouped by stream threads, they need to provide the name size-all-mem-tables as argument nameOfMetricsToAggregate, the type stream-state-metrics as argument groupOfMetricsToAggregate, and the list of tag labels [thread-id] as argument tagLabels. If they additionally want to aggregate the metrics by task, they need to provide [thread-id, task-id] as argument tagLabels.

Assuming argument tagLabels has n elements, the metrics that record the aggregation of the specified metrics are added with the following configuration:

type: stream-metrics
client-id: [client-id]
[tagLabels.get(0)]: [tag value of the aggregated metrics for tag label tagLabels.get(0)]
...
[tagLabels.get(n)]: [tag value of the aggregated metrics for tag label tagLabels.get(n)]
name: [name provided as argument name]

In the example where users want to aggregate metric size-all-mem-tables by stream threads and tasks, the added metric will have the following configuration:

type: stream-metrics
client-id: [client-id]
[thread-id]: [thread-id of metrics size-all-mem-tables that are aggregated in this metric]
[task-id]: [task-id of metrics size-all-mem-tables that are aggregated in this metric]
name: [name provided as argument name]

One metric per each combination of tag values of different tag labels for which the metric to aggregate exists is added.

Consider the previous example and let's assume there exist stream-thread-1 and stream-thread-2. Stream-thread-1 has tasks 0_1 , 1_0, 1_1 and 1_2, and stream-thread-2 has task 0_0 and 0_2. Furthermore, let's assume that only tasks 0_0, 0_1, and 0_2 contain the metric (e.g. have a RocksDB state store). Then three metrics that record aggregations are added:

stream-thread-1 and task 0_1
stream-thread-2 and task 0_0
stream-thread-2 and task 0_2

The function that is used for the aggregation is specified by argument aggregationFunction and the initial value of the aggregate is specified by initialAggregateSupplier. The aggregation function takes the current aggregate as first argument and the value to add to the aggregate as second argument.

The following code example shows how to add a metrics aggregation for the state-store-level metric size-all-mem-tables by stream threads and tasks:

kafkaStreams.addMetricsAggregation(
	"size-all-mem-tables-aggregation",
	"records the aggregation of the sizes of all mem-tables grouped by stream threads and task",
	Arrays.asList("thread-id", "task-id"),
	RecordingLevel.INFO,
	() -> 0,
	(oldAggregate, value) -> oldAggregate + value,
	"stream-state-metrics",
	"size-all-mem-tables"    
);

Compatibility, Deprecation, and Migration Plan

The proposal is backward-compatible because it only adds a new method and does not change any existing methods.

No methods need to be deprecated and no migration plan is required.

Rejected Alternatives

Add the method to the StreamsMetrics interface: Adding the method to the StreamsMetrics would imply that the method could be called from everywhere within a processor that has access to an implementation of the StreamsMetrics interface. That would require, some ...

Space shortcuts

Child pages

Status

Motivation

Public Interfaces

Proposed Changes

Compatibility, Deprecation, and Migration Plan

Rejected Alternatives

Space shortcuts

Child pages

KIP-674: API to Aggregate Metrics in Kafka Streams

Status

Motivation

Public Interfaces

Proposed Changes

Compatibility, Deprecation, and Migration Plan

Rejected Alternatives