...

Based on community feedback on operating Streams, several key metrics are missing for the following tasks:

  • Monitoring: users build UI consoles that display key metrics around the clock. Only the most critical high-level health and status metrics are shown there (e.g. instance state, thread state). Alert triggers are usually set on thresholds for these metrics (e.g. skip-record > 0, consume-latency > 10k, etc).
  • Information: this can be considered part of the monitoring category, but it covers a different kind of metric. Such information could include, for example, the Kafka version, the application version (the same appId may evolve over time), num.tasks hosted on an instance, num.partitions subscribed on clients, etc. These are mostly static gauges for which users normally would not build consoles, but whose values they commonly query during operational tasks.
  • Debugging: when an issue is discovered, users need to look at finer-grained metrics. In other words, these are queried less frequently than the metrics in the second category.
  • Programmables: sometimes users would like to query the metrics programmatically, either inside their JVMs or from side-cars collocated with additional reporting logic on top.
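
As an illustration of the programmatic use case, the sketch below locates Streams metric MBeans by name pattern through the JDK's JMX API. It assumes that the metrics are exposed under the `kafka.streams` JMX domain and uses the `type=...` and `thread-name` tags proposed on this page; the class and method names are hypothetical, and this is not part of the proposed API.

```java
import java.lang.management.ManagementFactory;
import java.util.Set;
import javax.management.MBeanServer;
import javax.management.MalformedObjectNameException;
import javax.management.ObjectName;

/** Hypothetical helper: finds Streams metric MBeans by JMX name pattern. */
final class StreamsMetricsJmxQuery {

    // Assumption: Streams metrics are registered under this JMX domain.
    static final String DOMAIN = "kafka.streams";

    /** Builds a pattern matching every MBean of the given type, whatever its other tags. */
    static ObjectName patternFor(String type) {
        try {
            return new ObjectName(DOMAIN + ":type=" + type + ",*");
        } catch (MalformedObjectNameException e) {
            throw new IllegalArgumentException("invalid metrics type: " + type, e);
        }
    }

    /** Checks whether a concrete MBean name belongs to the given metrics type. */
    static boolean matches(String type, String mbeanName) {
        try {
            return patternFor(type).apply(new ObjectName(mbeanName));
        } catch (MalformedObjectNameException e) {
            throw new IllegalArgumentException("invalid MBean name: " + mbeanName, e);
        }
    }

    /** Returns the names of all matching MBeans registered in this JVM. */
    static Set<ObjectName> find(String type) {
        MBeanServer server = ManagementFactory.getPlatformMBeanServer();
        return server.queryNames(patternFor(type), null);
    }

    public static void main(String[] args) {
        // Prints one line per stream thread found in the local JVM (none if Streams is not running).
        for (ObjectName name : find("stream-thread-metrics")) {
            System.out.println(name.getKeyProperty("thread-name") + " -> " + name);
        }
    }
}
```

A side-car process would run the same query against a remote `MBeanServerConnection` instead of the platform MBean server.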

For the above purposes, we want to 1) clean up the Streams built-in metrics to provide more useful metrics out of the box while trimming the non-useful ones, and 2) improve the APIs for user-customized metrics that let users register their own metrics, based on the "operationName / scopeName / entityName" notions; we would simplify this interface for users' needs and make sure it functions correctly.

...

For Streams built-in metrics, we will clean them up by 1) adding a few instance-level metrics, 2) removing a few non-useful or functionally overlapping metrics, and 3) changing the recording level of some metrics. Note the symbol tags used below (the descriptions of the metrics are omitted since their semantics are straightforward from the names "rate, total, max, avg, static gauge", etc.):


$: newly added

! : breaking changes

* : the sensors are created lazily

(→) : parent sensor



TAGS (by level and scope):

  • LEVEL 0, Per-Client: type=stream-metrics,client-id=[client-id]
  • LEVEL 1, Per-Thread: type=stream-thread-metrics,thread-name=[threadId] (! tag name changed)
  • LEVEL 2, Per-Task: type=stream-task-metrics,thread-name=[threadId],task-id=[taskId] (! tag name changed)
  • LEVEL 3, Per-Processor-Node: type=stream-processor-node-metrics,thread-name=[threadId],task-id=[taskId],processor-node-id=[processorNodeId] (! tag name changed)
  • LEVEL 3, Per-State-Store: stream-state-metrics,client-id=[threadId],thread-name=[taskId],[storeType]-state-id=[storeName] (! tag name changed)
  • LEVEL 3, Per-Cache: type=stream-record-cache-metrics,thread-name=[threadId],task-id=[taskId],record-cache-id=[storeName] (! tag name changed)

Metrics (recording level at each scope where the metric is reported):

  • version | commit-id (static gauge): Per-Client INFO ($)
  • application-id (static gauge): Per-Client INFO ($)
  • topology-description (static gauge): Per-Client INFO ($)
  • state (dynamic gauge): Per-Client INFO ($)
  • rebalance-latency (avg | max): Per-Client INFO ($)
  • rebalance (rate | total): Per-Client INFO ($)
  • last-rebalance-time (dynamic gauge): Per-Client INFO ($)
  • active-task-process (ratio): Per-Thread INFO ($)
  • standby-task-process (ratio): Per-Thread INFO ($)
  • process-latency (avg | max): Per-Thread INFO; Per-Task DEBUG; Per-Processor-Node (! removed for now)
  • process (rate | total): Per-Thread INFO; Per-Task DEBUG (→) on source-nodes only; Per-Processor-Node DEBUG
  • punctuate-latency (avg | max): Per-Thread INFO; Per-Task DEBUG
  • punctuate (rate | total): Per-Thread INFO; Per-Task DEBUG
  • commit-latency (avg | max): Per-Thread INFO; Per-Task DEBUG
  • commit (rate | total): Per-Thread INFO; Per-Task DEBUG
  • poll-latency (avg | max): Per-Thread INFO
  • poll (rate | total): Per-Thread INFO
  • task-created | closed (rate | total): Per-Thread INFO
  • enforced-processing (rate | total): Per-Task DEBUG
  • record-lateness (avg | max): Per-Task DEBUG
  • dropped-late-records (rate | total): Per-Processor-Node INFO * (window processor only) (! name changed)
  • suppression-emit (rate | total): Per-Processor-Node DEBUG * (suppress processor only)
  • skipped-records (rate | total): Per-Thread (! moved to lower level); Per-Task INFO * (→); Per-Processor-Node INFO * (few processors + record queue only)
  • suppression-buffer-size (avg | max): Per-State-Store DEBUG * (suppression buffer only)
  • suppression-buffer-count (avg | max): Per-State-Store DEBUG * (suppression buffer only)
  • expired-window-record-drop (rate | total): Per-State-Store DEBUG * (window store only)
  • put | put-if-absent .. | get-latency (avg | max): Per-State-Store DEBUG * (excluding suppression buffer) (! name changed)
  • put | put-if-absent .. | get (rate): Per-State-Store DEBUG * (excluding suppression buffer) (! name changed)
  • hit-ratio (avg | min | max): Per-Cache DEBUG (! name changed)
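
To make the INFO / DEBUG recording levels and the (→) parent-sensor tag used above more concrete, here is a minimal toy model. It is a sketch under stated assumptions, not Kafka's own `Sensor` class: the `SensorSketch` types are hypothetical, the configured level is passed in explicitly, and a recorded event is reduced to a simple counter.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy model of recording levels and parent sensors (hypothetical, for illustration only). */
final class SensorSketch {

    enum Level { INFO, DEBUG }

    static final class Sensor {
        final String name;
        final Level level;
        final List<Sensor> parents = new ArrayList<>();
        long count = 0;

        Sensor(String name, Level level, Sensor... parents) {
            this.name = name;
            this.level = level;
            for (Sensor parent : parents) {
                this.parents.add(parent);
            }
        }

        /** INFO sensors always record; DEBUG sensors record only when DEBUG is configured. */
        boolean shouldRecord(Level configured) {
            return configured == Level.DEBUG || level == Level.INFO;
        }

        /** Records one event and, if it was recorded, rolls it up to the parent sensors. */
        void record(Level configured) {
            if (!shouldRecord(configured)) {
                return;
            }
            count++;
            for (Sensor parent : parents) {
                parent.record(configured);
            }
        }
    }

    public static void main(String[] args) {
        // Mirrors the "process (rate | total)" entry: the task-level sensor is the parent (→)
        // of the source-node-level sensor, and both are DEBUG.
        Sensor taskProcess = new Sensor("process@task", Level.DEBUG);
        Sensor nodeProcess = new Sensor("process@source-node", Level.DEBUG, taskProcess);

        nodeProcess.record(Level.INFO);   // dropped: DEBUG sensor under INFO configuration
        nodeProcess.record(Level.DEBUG);  // recorded on the node and rolled up to the task
        System.out.println(nodeProcess.count + " " + taskProcess.count);
    }
}
```

In Kafka itself the configured level comes from the `metrics.recording.level` setting, so the DEBUG-level entries above are only reported when that setting is raised to DEBUG.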


A few philosophies behind this cleanup:

...