...

The public API changes in this KIP are summarized below.

Restoration metrics

We propose to add metrics at both the thread level (default reporting level is INFO) and the task level (default reporting level is DEBUG).

Note that separate threads will handle restoration, and hence their thread ids will differ from those of the stream threads.


Thread-level metric tags are:

  • type=stream-state-updater-metrics
  • client-id=[clientId]
  • thread-id=[threadId]

...

Task-level metric tags are:

  • type=stream-task-metrics
  • client-id=[clientId]
  • thread-id=[threadId]
  • task-id=[taskId]
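
As an illustration of how the two tag sets could be addressed over JMX, here is a minimal sketch that builds an object name from each of them. It assumes the proposed metrics are registered under the same `kafka.streams` JMX domain as the existing Kafka Streams metrics; the class name `RestorationMetricNames` and the client, thread and task ids are hypothetical, and tag values are assumed to contain no characters that JMX would require quoting.

```java
import java.util.Hashtable;
import javax.management.MalformedObjectNameException;
import javax.management.ObjectName;

/** Hypothetical helper: builds JMX object names from the metric tags listed above. */
final class RestorationMetricNames {
    // Assumption: same JMX domain as the existing Kafka Streams metrics.
    static final String DOMAIN = "kafka.streams";

    /** Object name for the thread-level tags (type, client-id, thread-id). */
    static ObjectName threadLevel(String clientId, String threadId) {
        Hashtable<String, String> tags = new Hashtable<>();
        tags.put("type", "stream-state-updater-metrics");
        tags.put("client-id", clientId);
        tags.put("thread-id", threadId);
        return build(tags);
    }

    /** Object name for the task-level tags (thread-level tags plus task-id). */
    static ObjectName taskLevel(String clientId, String threadId, String taskId) {
        Hashtable<String, String> tags = new Hashtable<>();
        tags.put("type", "stream-task-metrics");
        tags.put("client-id", clientId);
        tags.put("thread-id", threadId);
        tags.put("task-id", taskId);
        return build(tags);
    }

    private static ObjectName build(Hashtable<String, String> tags) {
        try {
            return new ObjectName(DOMAIN, tags);
        } catch (MalformedObjectNameException e) {
            // Raised when a tag value contains characters JMX does not allow unquoted.
            throw new IllegalArgumentException("invalid JMX tag value", e);
        }
    }

    public static void main(String[] args) {
        // Hypothetical ids, for illustration only.
        System.out.println(threadLevel("my-app", "my-app-StateUpdater-1"));
        System.out.println(taskLevel("my-app", "my-app-StateUpdater-1", "0_1"));
    }
}
```

Because the restoration thread has its own thread id, the `thread-id` value used here would not match the id of any stream thread.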


The POC implementation of the proposed metrics can be found here: https://github.com/apache/kafka/pull/12391


| Metric Name | Level | Type | Description | Notes |
|---|---|---|---|---|
| active-restoring-tasks | thread / INFO | count | The number of active tasks currently undergoing restoration | |
| standby-updating-tasks | thread / INFO | count | The number of standby tasks currently undergoing updating | |
| active-paused-tasks | thread / INFO | count | The number of active tasks whose restoration is paused | |
| standby-paused-tasks | thread / INFO | count | The number of standby tasks whose updating is paused | |
| idle-ratio | thread / INFO | gauge (percentage) | The fraction of time the thread spent idle | idle-ratio + restore-ratio + checkpoint-ratio should be 1 |
| restore-ratio | thread / INFO | gauge (percentage) | The fraction of time the thread spent restoring active or standby tasks | idle-ratio + restore-ratio + checkpoint-ratio should be 1 |
| checkpoint-ratio | thread / INFO | gauge (percentage) | The fraction of time the thread spent checkpointing restored progress | idle-ratio + restore-ratio + checkpoint-ratio should be 1 |
| active-restore-records-rate | thread / INFO | rate | The average per-second number of records restored for all active tasks | min(active-restore-records-rate, standby-update-records-rate) == 0 |
| standby-update-records-rate | thread / INFO | rate | The average per-second number of records updated for all standby tasks | min(active-restore-records-rate, standby-update-records-rate) == 0 |
| restore-call-rate | thread / INFO | rate | The average per-second number of restore calls triggered | |
| restore-total | task / DEBUG | count | The total number of records processed during restoration | |
| restore-rate | task / DEBUG | rate | The average per-second number of records restored (for active tasks) or updated (for standby tasks) | It counts for both active and standby tasks, but at any given time it should be either restoring active tasks, updating standby tasks, or idle |
| restore-remaining-records-total | task / INFO | count | The number of records remaining to be restored | |
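
The two invariants stated in the notes above (the three ratios sum to 1, and the active-restore and standby-update rates are never both non-zero) can be expressed as simple checks on sampled values. The following is a minimal sketch, not part of the proposal: the class name `RestorationMetricChecks`, the method names and the tolerance `EPSILON` are hypothetical, and the metric values are assumed to have already been read as doubles.

```java
/** Hypothetical sanity checks for sampled thread-level restoration metrics. */
final class RestorationMetricChecks {
    // Assumed tolerance for floating-point error in the reported ratios.
    static final double EPSILON = 1e-6;

    /** idle-ratio + restore-ratio + checkpoint-ratio should be 1. */
    static boolean ratiosSumToOne(double idleRatio, double restoreRatio, double checkpointRatio) {
        return Math.abs(idleRatio + restoreRatio + checkpointRatio - 1.0) <= EPSILON;
    }

    /** min(active-restore-records-rate, standby-update-records-rate) == 0. */
    static boolean atMostOneRateNonZero(double activeRestoreRate, double standbyUpdateRate) {
        return Math.min(activeRestoreRate, standbyUpdateRate) == 0.0;
    }

    public static void main(String[] args) {
        // Made-up sample values, for illustration only.
        System.out.println(ratiosSumToOne(0.70, 0.25, 0.05));
        System.out.println(atMostOneRateNonZero(1200.0, 0.0));
    }
}
```

In practice the reported values are averaged over a sampling window, so a monitoring check would likely need a looser tolerance than the one assumed here.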


Along with these new metrics, we would also deprecate the metrics below:

...