Status
Current state: Accepted
JIRA: KAFKA-10199, KAFKA-10575
...
The public API changes in this KIP are summarized below.
Restoration metrics
We propose to add metrics at both the thread level (default reporting level is INFO) and the task level (default reporting level is DEBUG).
Note that restoration will be handled by a separate thread, so its thread id will differ from those of the stream threads.
Thread-level metric tags are:
- type=stream-state-updater-metrics
- thread-id=[threadId]
Task-level metric tags are:
- type=stream-task-metrics
- thread-id=[threadId]
...
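As a rough illustration of how the thread-level tags above could be used to select these metrics over JMX, the sketch below builds an `ObjectName` pattern from the tag set. It assumes the usual Kafka JMX mapping (domain `kafka.streams`, with the metric group exposed as `type` and each tag as a key property); the class name, helper methods, and the thread id value are hypothetical and not part of this KIP.

```java
import javax.management.MalformedObjectNameException;
import javax.management.ObjectName;

public class StateUpdaterMetricNames {
    // Assumption: Kafka Streams metrics are exposed under this JMX domain.
    static final String DOMAIN = "kafka.streams";

    /** Pattern matching the thread-level restoration metrics of any thread. */
    static ObjectName threadLevelPattern() throws MalformedObjectNameException {
        return new ObjectName(DOMAIN + ":type=stream-state-updater-metrics,thread-id=*");
    }

    /** Concrete name for a single thread; the caller supplies the thread id. */
    static ObjectName threadLevelName(String threadId) throws MalformedObjectNameException {
        return new ObjectName(DOMAIN + ":type=stream-state-updater-metrics,thread-id=" + threadId);
    }

    public static void main(String[] args) throws MalformedObjectNameException {
        ObjectName pattern = threadLevelPattern();
        // Hypothetical thread id, used only for illustration.
        ObjectName updater = threadLevelName("example-app-StateUpdater-1");
        ObjectName task = new ObjectName(
            DOMAIN + ":type=stream-task-metrics,thread-id=example-app-StateUpdater-1");

        // ObjectName.apply reports whether a concrete name matches the pattern.
        System.out.println("thread-level name matches: " + pattern.apply(updater));
        System.out.println("task-level name matches:   " + pattern.apply(task));
    }
}
```

The same pattern object could be passed to `MBeanServerConnection.queryNames` to list the restoration thread's MBeans without knowing its thread id in advance.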
The POC implementation of the proposed metrics can be found here: https://github.com/apache/kafka/pull/12391
Metric Name | Level | Type | Description | Notes |
---|
active-restoring-tasks | thread / INFO | count | The number of active tasks currently undergoing restoration |
|
---|
standby-updating-tasks | thread / INFO | count | The number of standby tasks currently undergoing updating |
|
---|
active-paused-tasks | thread / INFO | count | The number of active tasks whose restoration is paused |
|
---|
standby-paused-tasks | thread / INFO | count | The number of standby tasks whose updating is paused |
|
---|
idle-ratio | thread / INFO | gauge (percentage) | The fraction of time the thread spent on being idle | idle-ratio + restore/update-ratio + checkpoint-ratio should be 1 |
---|
active-restore-ratio | thread / INFO | gauge (percentage) | The fraction of time the thread spent on restoring active tasks | idle-ratio + restore/update-ratio + checkpoint-ratio should be 1; only one of the restore/update-ratio should be non-zero |
---|
standby-update-ratio | thread / INFO | gauge (percentage) | The fraction of time the thread spent on updating standby tasks | idle-ratio + restore/update-ratio + checkpoint-ratio should be 1; only one of the restore/update-ratio should be non-zero |
---|
checkpoint-ratio | thread / INFO | gauge (percentage) | The fraction of time the thread spent on checkpointing restored progress | idle-ratio + restore/update-ratio + checkpoint-ratio should be 1 |
---|
restore-records-rate | thread / INFO | rate | The average per-second number of records restored/updated for all tasks |
|
---|
restore-call-rate | thread / INFO | rate | The average per-second number of restore calls triggered |
|
---|
restore-total | task / DEBUG | count | The total number of records processed during restoration for active task | the metric persists even after the task has completed restoration, and is removed only when the task is removed from the thread. |
---|
restore-rate | task / DEBUG | rate | The average per-second number of records restored for active task | the metric drops to zero when the task has completed restoration, and is removed only when the task is removed from the thread. |
---|
update-total | task / DEBUG | count | The total number of records updated for standby task | same as above |
---|
update-rate | task / DEBUG | rate | The average per-second number of records updated for standby task | same as above |
---|
restore-remaining-records-total | task / INFO | count | The number of records remaining to be restored for active tasks |
|
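The Notes column states two invariants for the thread-level ratio gauges: they should sum to 1, and at most one of the restore/update ratios should be non-zero. A monitoring check for this might look like the following minimal sketch. The class name, method name, and the tolerance parameter are assumptions made for illustration; how the four values are read from the metrics registry is left out.

```java
public class RestorationRatios {
    /**
     * Returns true when the four thread-level ratios satisfy the invariants
     * from the Notes column. The tolerance absorbs floating-point rounding
     * and is an assumption of this sketch, not something the KIP specifies.
     */
    static boolean consistent(double idleRatio,
                              double activeRestoreRatio,
                              double standbyUpdateRatio,
                              double checkpointRatio,
                              double tolerance) {
        double sum = idleRatio + activeRestoreRatio + standbyUpdateRatio + checkpointRatio;
        boolean sumsToOne = Math.abs(sum - 1.0) <= tolerance;
        // Only one of active-restore-ratio / standby-update-ratio may be non-zero.
        boolean atMostOneBusy = activeRestoreRatio == 0.0 || standbyUpdateRatio == 0.0;
        return sumsToOne && atMostOneBusy;
    }

    public static void main(String[] args) {
        // Hypothetical sample: the thread is restoring active tasks.
        System.out.println(consistent(0.7, 0.2, 0.0, 0.1, 1e-9));
        // Hypothetical sample: both restore and update ratios are non-zero.
        System.out.println(consistent(0.5, 0.2, 0.2, 0.1, 1e-9));
    }
}
```

In practice the gauges are sampled at slightly different instants, so a looser tolerance than the one used here may be more appropriate.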
Along with these new metrics, we would also deprecate the metrics below:
...