
...

Based on community feedback about Streams operational experience, we are lacking several key metrics for the following tasks:

Monitoring: users build UI consoles that display key metrics around the clock. Only the most critical high-level health and status metrics are shown here (e.g. instance state, thread state). Alert triggers are usually set on thresholds for these metrics (e.g. skip-record > 0, consume-latency > 10k, etc.).
Information: this can be considered part of the monitoring category as well, but it covers a different kind of metrics. Such information could include, for example, the Kafka version, the application version (the same appId may evolve over time), num.tasks hosted on an instance, num.partitions subscribed by clients, etc. These are mostly static gauges for which users normally would not build consoles, but whose values they commonly query during operational tasks.
Debugging: when an issue is discovered, users need to look at finer-grained metrics. In other words, these are queried less frequently than those in the second category.
Programmables: sometimes users would like to query the metrics programmatically, either inside their JVMs or from side-cars colocated with the application, with additional reporting logic on top.
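As an illustration of the "Programmables" case, the following is a minimal sketch of reading metrics from inside the same JVM through the platform MBean server. It assumes the metrics are exposed over JMX under the `kafka.streams` domain with the `type=stream-metrics` tag; the class name `StreamsMetricsQuery` and the helper `readAll` are hypothetical and not part of any Kafka API.

```java
import java.lang.management.ManagementFactory;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import javax.management.MBeanAttributeInfo;
import javax.management.MBeanServer;
import javax.management.ObjectName;

// Hypothetical helper, not part of Kafka: reads metric values over JMX.
class StreamsMetricsQuery {

    // Returns "<canonical MBean name>/<attribute>" -> value for every readable
    // attribute of every MBean whose name matches the given pattern.
    static Map<String, Object> readAll(MBeanServer server, String pattern) throws Exception {
        Map<String, Object> values = new TreeMap<>();
        Set<ObjectName> names = server.queryNames(new ObjectName(pattern), null);
        for (ObjectName name : names) {
            for (MBeanAttributeInfo attr : server.getMBeanInfo(name).getAttributes()) {
                if (!attr.isReadable()) {
                    continue;
                }
                try {
                    values.put(name.getCanonicalName() + "/" + attr.getName(),
                               server.getAttribute(name, attr.getName()));
                } catch (Exception e) {
                    // Skip attributes that cannot be resolved at the moment.
                }
            }
        }
        return values;
    }

    public static void main(String[] args) throws Exception {
        MBeanServer server = ManagementFactory.getPlatformMBeanServer();
        // Assumed pattern: client-level metrics in the kafka.streams JMX domain.
        // Prints nothing if no Streams instance is running in this JVM.
        readAll(server, "kafka.streams:type=stream-metrics,*")
            .forEach((key, value) -> System.out.println(key + " = " + value));
    }
}
```

A side-car process could apply the same query through a remote JMX connection instead of the platform MBean server, and add its own reporting or threshold checks on top of the returned map.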

For the above purposes, we want to 1) clean up the Streams built-in metrics to provide more useful metrics out of the box while trimming the non-useful ones, and 2) improve the APIs for user-customized metrics that let users register their own metrics, based on the "operationName / scopeName / entityName" notions; we would simplify this interface for users' needs and make sure it functions correctly.

...

And for Streams built-in metrics, we will clean them up by 1) adding a few instance-level metrics, 2) removing a few metrics that are not useful or overlap in function, and 3) changing some metrics' recording levels. Note the symbol tags used in the table below (the descriptions of the metrics are omitted since their semantics are straightforward from the names "rate, total, max, avg, static gauge", etc.):

$: newly added

! : breaking changes

* : the sensors are created lazily

(→) : parent sensor

	Per-Client	Per-Thread	Per-Task	Per-Processor-Node	Per-State-Store	Per-Cache
	LEVEL 0	LEVEL 1	LEVEL 2	LEVEL 3	LEVEL 3	LEVEL 3
TAGS	type=stream-metrics,client-id=[client-id]	type=stream-thread-metrics,thread-name=[threadId] (! tag name changed)	type=stream-task-metrics,thread-name=[threadId],task-id=[taskId] (! tag name changed)	type=stream-processor-node-metrics,thread-name=[threadId],task-id=[taskId],processor-node-id=[processorNodeId] (! tag name changed)	type=stream-state-metrics,client-id=[threadId],thread-name=[taskId],[storeType]-state-id=[storeName] (! tag name changed)	type=stream-record-cache-metrics,thread-name=[threadId],task-id=[taskId],record-cache-id=[storeName] (! tag name changed)
version \| commit-id (static gauge)	INFO ($)
application-id (static gauge)	INFO ($)
topology-description (static gauge)	INFO ($)
state (dynamic gauge)	INFO ($)
rebalance-latency (avg \| max)	INFO ($)
rebalance (rate \| total)	INFO ($)
last-rebalance-time (dynamic gauge)	INFO ($)
active-task-process (ratio)		INFO ($)
standby-task-process (ratio)		INFO ($)
process-latency (avg \| max)		INFO	DEBUG	(! removed for now)
process (rate \| total)		INFO	DEBUG ( → ) on source-nodes only	DEBUG
punctuate-latency (avg \| max)		INFO	DEBUG
punctuate (rate \| total)		INFO	DEBUG
commit-latency (avg \| max)		INFO	DEBUG
commit (rate \| total)		INFO	DEBUG
poll-latency (avg \| max)		INFO
poll (rate \| total)		INFO
task-created \| closed (rate \| total)		INFO
enforced-processing (rate \| total)			DEBUG
record-lateness (avg \| max)			DEBUG
dropped-late-records (rate \| total)				INFO * (window processor only) (! name changed)
suppression-emit (rate \| total)				DEBUG * (suppress processor only)
skipped-records (rate \| total)		(! moved to lower level)	INFO * ( → )	INFO * (few processors + record queue only)
suppression-buffer-size (avg \| max)					DEBUG * (suppression buffer only)
suppression-buffer-count (avg \| max)					DEBUG * (suppression buffer only)
expired-window-record-drop (rate \| total)					DEBUG * (window store only)
put \| put-if-absent .. \| get-latency (avg \| max)					DEBUG * (excluding suppression buffer) (! name changed)
put \| put-if-absent .. \| get (rate)					DEBUG * (excluding suppression buffer) (! name changed)
hit-ratio (avg \| min \| max)						DEBUG (! name changed)
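The TAGS row nests from level to level: each lower level repeats the tags of its parent and adds one more. The sketch below shows how the thread-level and task-level tag sets from the table could be turned into JMX object names. It assumes the metrics are registered under the `kafka.streams` JMX domain; the class `StreamsMetricNames` and its methods are hypothetical helpers for illustration only.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import javax.management.ObjectName;

// Hypothetical helper, not part of Kafka: builds JMX names from the TAGS row.
class StreamsMetricNames {

    // Assumption: Streams metrics are exposed under this JMX domain.
    static final String DOMAIN = "kafka.streams";

    // Joins the type and the remaining tags into "domain:type=...,k=v,...".
    static ObjectName of(String type, Map<String, String> tags) throws Exception {
        StringBuilder name = new StringBuilder(DOMAIN).append(":type=").append(type);
        for (Map.Entry<String, String> tag : tags.entrySet()) {
            name.append(',').append(tag.getKey()).append('=').append(tag.getValue());
        }
        return new ObjectName(name.toString());
    }

    // LEVEL 1: type=stream-thread-metrics,thread-name=[threadId]
    static ObjectName threadLevel(String threadId) throws Exception {
        Map<String, String> tags = new LinkedHashMap<>();
        tags.put("thread-name", threadId);
        return of("stream-thread-metrics", tags);
    }

    // LEVEL 2: type=stream-task-metrics,thread-name=[threadId],task-id=[taskId]
    static ObjectName taskLevel(String threadId, String taskId) throws Exception {
        Map<String, String> tags = new LinkedHashMap<>();
        tags.put("thread-name", threadId);
        tags.put("task-id", taskId);
        return of("stream-task-metrics", tags);
    }

    public static void main(String[] args) throws Exception {
        // Example IDs are made up for illustration.
        System.out.println(threadLevel("app-StreamThread-1"));
        System.out.println(taskLevel("app-StreamThread-1", "0_1"));
    }
}
```

Because every level carries its parent's tags, a wildcard pattern such as `kafka.streams:type=stream-task-metrics,thread-name=app-StreamThread-1,*` would select all task-level metrics of one thread under the same assumptions.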

A few philosophies behind this cleanup:

...
