...

Restructuring metrics around skipped records

I propose to remove all the existing skipped-record metrics and replace them with only the following metrics at the StreamThread level:

MetricName[name=skipped-records-rate, group=stream-metrics, description=The average number of skipped records per second, tags={client-id=...(per-thread client-id)}]
MetricName[name=skipped-records-total, group=stream-metrics, description=The total number of skipped records, tags={client-id=...(per-thread client-id)}]

Both of these metrics would be INFO level; the only tag is the per-thread client-id.
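
For illustration, here is a minimal sketch of how an operator might read these thread-level metrics from a running application via KafkaStreams#metrics(). The helper name and the 'streams' variable are hypothetical; the metric and group names are as proposed above.

    import java.util.Map;
    import org.apache.kafka.common.Metric;
    import org.apache.kafka.common.MetricName;
    import org.apache.kafka.streams.KafkaStreams;

    // Sum the per-thread totals: one metric is registered per StreamThread,
    // distinguished by its client-id tag.
    static double totalSkippedRecords(final KafkaStreams streams) {
        double total = 0.0;
        for (final Map.Entry<MetricName, ? extends Metric> entry : streams.metrics().entrySet()) {
            final MetricName name = entry.getKey();
            if ("skipped-records-total".equals(name.name()) && "stream-metrics".equals(name.group())) {
                total += (Double) entry.getValue().metricValue();
            }
        }
        return total;
    }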

Instead of maintaining additional granular DEBUG-level skip metrics (such as "skippedDueToDeserializationError"), we will capture useful details about the record that got skipped (topic, partition, offset), as well as the reason for the skip ("deserialization error", "negative timestamp", etc.), in a WARN-level log message.
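
As a sketch, such a log line might look like the following (assuming an SLF4J logger; the exact wording and the 'record' and 'timestamp' locals are illustrative, not the final implementation):

    import org.slf4j.Logger;
    import org.slf4j.LoggerFactory;

    private static final Logger log = LoggerFactory.getLogger(StreamThread.class);

    // At the point where the skip is detected:
    log.warn(
        "Skipping record due to negative extracted timestamp. topic=[{}] partition=[{}] offset=[{}] extractedTimestamp=[{}]",
        record.topic(), record.partition(), record.offset(), timestamp);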

...

We also discussed keeping one metric per skip reason, but that felt too granular. Plus, having the metrics set to DEBUG creates a visibility problem: (a) people cannot discover the metric just by looking through the available metrics at run time, and (b) if an unexpected skip is reported, operators may turn on the DEBUG metrics but cannot be guaranteed to ever see another skip. Logging the granular information is the alternative we settled on.

Finally, we discussed including some built-in rollup mechanism to produce aggregated metrics at the top level. This is difficult to do well from within the Streams library, since it would involve each instance observing the metrics of all the other instances. We could create some kind of tool to do metric aggregation, but that seems out of scope for this KIP. If metric aggregation is a general problem the community surfaces, we will tackle it in a separate KIP. This decision allows us to focus, within the library, on surfacing the right information at the right level of granularity.

We discussed moving the skipped-records metrics down to the StreamTask level. This has the advantage that they would automatically be included in the TopologyTestDriver, but the disadvantage that it increases the total number of metrics to pay attention to. The StreamThread is the highest level of summary we can provide without introducing lock contention while updating the metrics.
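
To illustrate the lock-contention point, here is a rough sketch using Kafka's Metrics/Sensor API. The wiring and names are illustrative, not the actual Streams internals: the key idea is that each StreamThread records skips on a sensor that only it touches, so the sensor's internal synchronization is never contended across threads.

    import java.util.Collections;
    import org.apache.kafka.common.metrics.Metrics;
    import org.apache.kafka.common.metrics.Sensor;
    import org.apache.kafka.common.metrics.stats.CumulativeSum;
    import org.apache.kafka.common.metrics.stats.Rate;

    final Metrics metrics = new Metrics();
    final String threadClientId = "my-app-StreamThread-1"; // hypothetical per-thread client-id

    // One sensor per thread, tagged with that thread's client-id.
    final Sensor skippedRecordsSensor = metrics.sensor(threadClientId + ".skipped-records");
    skippedRecordsSensor.add(
        metrics.metricName("skipped-records-rate", "stream-metrics",
            "The average number of skipped records per second",
            Collections.singletonMap("client-id", threadClientId)),
        new Rate());
    skippedRecordsSensor.add(
        metrics.metricName("skipped-records-total", "stream-metrics",
            "The total number of skipped records",
            Collections.singletonMap("client-id", threadClientId)),
        new CumulativeSum());

    // Wherever this thread detects a skip:
    skippedRecordsSensor.record();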