Versions Compared

Key

  • This line was added.
  • This line was removed.
  • Formatting was changed.

...

JIRA: KAFKA-3715: Higher granularity Streams metrics

Motivation

 

  • This KIP proposes the addition of latency and throughput metrics for Kafka Streams at the granularity of each processor node and the addition of count metrics at the granularity of each task. This is in addition to the global rate (which already exists). The idea is to allow users to toggle the recording of these metrics when needed for debugging. The RecordLevel for these granular metrics is debug and a client can toggle the record level by changing the  “metrics.record.level” in the client config. (The introduction of RecordLevel and client config changes are covered in a separate KIP)

Public Interfaces

  • none

Proposed Changes

  • .
  • Enumeration of Sensors: This KIP also proposes the introduction of the following sensorsexposing the metrics registry and several helper functions so that Kafka Streams users can register their own metrics. 


 

Public Interfaces

    • Node punctuate time sensor: This sensor is associated with latency metrics depicting the average and max latency in the punctuate time of a node.

    • Node creation time sensor: This sensor is associated with latency metrics depicting the average and max latency in the creation time of a node.

    • Node destruction time sensor:  This sensor is associated with latency metrics depicting the average and max latency in the destruction time of a node.

    • Node process time sensor: This sensor is associated with latency metrics depicting the average and max latency in the process time of a node.

    • Node throughput sensor: This sensor is associated with throughput metrics depicting the context forwarding rate of metrics through a node, i.e., indicating how many records were forwarded downstream from this processor node.

    • Skipped records sensor in StreamTask:This sensor is associated with a count metric, which helps monitor if streams are well synchronized. The metric measures the difference in the total record count and the number of added records between the last record time. This is useful during debugging as this count should not be off by too much during normal operations.

    Addition of throughput sensors, (either by calling the method addThroughputSensor or the method sensor), now takes a Sensor.RecordLevel as follows
  • A StreamsMetrics class with the following methods:
    public interface StreamsMetrics {

    /**
    * @return The base registry where all the metrics are recorded.
    */
     Metrics registry();

     
  • /**
    * Add a
  • throughput
  • latency sensor. This is equivalent to adding a sensor with metrics on latency and rate.
    *
    * @param scopeName Name of the scope, could be the type of the state store, etc.
    * @param entityName Name of the entity, could be the name of the state store instance, etc.
    * @param recordLevel The recording level (e.g., INFO or DEBUG) for this sensor.
    * @param operationName Name of the operation, could be get / put / delete / etc.
    * @param tags Additional tags of the sensor.
    * @return The added sensor.
    */
     Sensor
  • addThroughputSensor * or {@link #addLatencySensor(String, String, String, Sensor.RecordLevel, String...)} to ensure metric name well-formedness and conformity with the rest
    * of the streams code base.
  • addLatencySensor(String scopeName, String entityName, String operationName, Sensor.RecordLevel recordLevel, String... tags);

     /**
    *
  • Generic sensor creation. Note that for most cases it is advisable to use {@link #addThroughputSensor(String, String, String, Sensor.RecordLevel, String...)}
  •  Record the given latency value of the sensor.
    * @param sensor sensor whose latency we are recording.
    * @param startNs start of measurement time in nanoseconds.
    * @param endNs end of measurement time in nanoseconds.
    */
     void recordLatency(Sensor sensor, long startNs, long endNs);

     /**
    * Add a throughput sensor. This is equivalent to adding a sensor with metrics rate.
    *

  • * @param scopeName Name of the scope, could be the type of the state store, etc.
    * @param entityName Name of the entity, could be the name of the state store instance, etc.
    * @param recordLevel The recording level (e.g., INFO or DEBUG) for this sensor.
    * @param operationName Name of the operation, could be get / put / delete / etc.
    * @param tags Additional tags of the sensor.
    * @return The added sensor.
    */
     Sensor
  • sensor
  • addThroughputSensor(String scopeName, String entityName, String operationName, Sensor.RecordLevel recordLevel, String... tags);

  • At the processor node level, throughput sensors are registered using RecordLevel of DEBUG, i.e,

    metrics.addThroughputSensor(scope, sensorNamePrefix + "." + name, "process-throughput", Sensor.RecordLevel.DEBUG, tagKey, tagValue);


  •  /**
    * Records the throughput value of a sensor.
    * @param sensor sensor whose throughput we are recording.
    * @param value throughput value.
    */
     void recordThroughput(Sensor sensor, long value);

     /**
    * Remove a sensor with the given name.
    * @param name Sensor name to be removed.
    */
     void removeSensor(String name);
    }

Proposed Changes

  • Enumeration of Sensors: This KIP proposes the introduction of the following sensors

    • Node punctuate time sensor: This sensor is associated with latency metrics depicting the average and max latency in the punctuate time of a node.

    • Node creation time sensor: This sensor is associated with latency metrics depicting the average and max latency in the creation time of a node.

    • Node destruction time sensor:  This sensor is associated with latency metrics depicting the average and max latency in the destruction time of a node.

    • Node process time sensor: This sensor is associated with latency metrics depicting the average and max latency in the process time of a node.

    • Node throughput sensor: This sensor is associated with throughput metrics depicting the context forwarding rate of metrics through a node, i.e., indicating how many records were forwarded downstream from this processor node.

    • Skipped records sensor in StreamTask:This sensor is associated with a count metric, which helps monitor if streams are well synchronized. The metric measures the difference in the total record count and the number of added records between the last record time. This is useful during debugging as this count should not be off by too much during normal operations.

  • Addition of new sensors

    • Users can use the provided helped functions addLatencySensor and addThroughputSensor or register metrics directly with the exposed underlying metrics registry obtained through registry().

Compatibility, Deprecation, and Migration Plan

...