Status

Current state: Accepted

JIRA: KAFKA-10199, KAFKA-10575

...

Below summarizes the public API changes in this KIP.

Restoration metrics

We propose to add metrics at both the thread level (default recording level is INFO) and the task level (default recording level is DEBUG).

Note that a separate thread will handle the restoration procedures, so its thread id will differ from those of the stream threads.


Thread-level metric tags are:

  • type=stream-state-updater-metrics
  • thread-id=[threadId]

Task-level metric tags are:

  • type=stream-task-metrics
  • thread-id=[threadId]
  • task-id=[taskId]
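
As a rough illustration of how the two tag sets differ, the sketch below builds JMX object-name patterns from them and matches example bean names. It assumes the new metrics are exposed under the `kafka.streams` JMX domain like the existing Kafka Streams metrics. The class name `RestorationMetricNames` and the thread and task ids are hypothetical, chosen only for this example.

```java
import javax.management.MalformedObjectNameException;
import javax.management.ObjectName;

/** Illustrative helper, not part of Kafka. */
final class RestorationMetricNames {
    // Assumption: the new metrics use the same JMX domain as existing Kafka Streams metrics.
    static final String DOMAIN = "kafka.streams";

    // Thread-level tags: type and thread-id only.
    static final String THREAD_LEVEL_PATTERN =
        DOMAIN + ":type=stream-state-updater-metrics,thread-id=*";

    // Task-level tags: type, thread-id and task-id.
    static final String TASK_LEVEL_PATTERN =
        DOMAIN + ":type=stream-task-metrics,thread-id=*,task-id=*";

    /** Returns true if the MBean name matches the given ObjectName pattern. */
    static boolean matches(String pattern, String mbeanName) {
        try {
            return new ObjectName(pattern).apply(new ObjectName(mbeanName));
        } catch (MalformedObjectNameException e) {
            throw new IllegalArgumentException("Malformed JMX object name", e);
        }
    }

    public static void main(String[] args) {
        // Hypothetical ids, for illustration only.
        String threadBean =
            DOMAIN + ":type=stream-state-updater-metrics,thread-id=example-state-updater-1";
        String taskBean =
            DOMAIN + ":type=stream-task-metrics,thread-id=example-state-updater-1,task-id=0_1";

        System.out.println(matches(THREAD_LEVEL_PATTERN, threadBean)); // true
        System.out.println(matches(TASK_LEVEL_PATTERN, taskBean));     // true
        System.out.println(matches(THREAD_LEVEL_PATTERN, taskBean));   // false
    }
}
```

Because neither pattern ends in a `,*` wildcard, a bean matches only when it carries exactly the listed tag keys, which keeps thread-level and task-level beans apart.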


The POC implementation of the proposed metrics can be found here: https://github.com/apache/kafka/pull/12391


| Metric Name | Level | Type | Description | Notes |
|---|---|---|---|---|
| active-restoring-tasks | thread / INFO | count | The number of active tasks currently undergoing restoration | |
| standby-updating-tasks | thread / INFO | count | The number of standby tasks currently being updated | |
| active-paused-tasks | thread / INFO | count | The number of active tasks whose restoration is paused | |
| standby-paused-tasks | thread / INFO | count | The number of standby tasks whose updating is paused | |
| idle-ratio | thread / INFO | gauge (percentage) | The fraction of time the thread spent idle | idle-ratio + restore/update-ratio + checkpoint-ratio should sum to 1 |
| active-restore-ratio | thread / INFO | gauge (percentage) | The fraction of time the thread spent restoring active tasks | idle-ratio + restore/update-ratio + checkpoint-ratio should sum to 1; only one of active-restore-ratio and standby-update-ratio should be non-zero |
| standby-update-ratio | thread / INFO | gauge (percentage) | The fraction of time the thread spent updating standby tasks | idle-ratio + restore/update-ratio + checkpoint-ratio should sum to 1; only one of active-restore-ratio and standby-update-ratio should be non-zero |
| checkpoint-ratio | thread / INFO | gauge (percentage) | The fraction of time the thread spent checkpointing restored progress | idle-ratio + restore/update-ratio + checkpoint-ratio should sum to 1 |
| restore-records-rate | thread / INFO | rate | The average per-second number of records restored/updated across all tasks | |
| restore-call-rate | thread / INFO | rate | The average per-second number of restore calls triggered | |
| restore-total | task / DEBUG | count | The total number of records restored for the active task | The metric persists after the task completes restoration and is removed only when the task is removed from the thread. |
| restore-rate | task / DEBUG | rate | The average per-second number of records restored for the active task | The metric drops to zero when the task completes restoration and is removed only when the task is removed from the thread. |
| update-total | task / DEBUG | count | The total number of records updated for the standby task | Same as above |
| update-rate | task / DEBUG | rate | The average per-second number of records updated for the standby task | Same as above |
| restore-remaining-records-total | task / INFO | count | The number of records remaining to be restored for the active task | |
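
The notes on the four ratio metrics can be read as two invariants: the ratios sum to 1, and at most one of the restore and update ratios is non-zero. The sketch below derives the ratios from the time spent in each phase of one measurement window and checks both properties. It is only an illustration and says nothing about how the state updater thread actually records these values; the class `StateUpdaterRatios` and its fields are hypothetical.

```java
/** Illustrative only: derives the four ratios from time spent per phase in one window. */
final class StateUpdaterRatios {
    final double idleRatio;
    final double activeRestoreRatio;
    final double standbyUpdateRatio;
    final double checkpointRatio;

    StateUpdaterRatios(long idleMs, long restoreMs, long updateMs, long checkpointMs) {
        if (idleMs < 0 || restoreMs < 0 || updateMs < 0 || checkpointMs < 0) {
            throw new IllegalArgumentException("durations must be non-negative");
        }
        long totalMs = idleMs + restoreMs + updateMs + checkpointMs;
        if (totalMs == 0) {
            throw new IllegalArgumentException("window must have a positive duration");
        }
        // Each ratio is the share of the window spent in that phase.
        idleRatio = (double) idleMs / totalMs;
        activeRestoreRatio = (double) restoreMs / totalMs;
        standbyUpdateRatio = (double) updateMs / totalMs;
        checkpointRatio = (double) checkpointMs / totalMs;
    }

    /** idle-ratio + restore/update-ratio + checkpoint-ratio. */
    double sum() {
        return idleRatio + activeRestoreRatio + standbyUpdateRatio + checkpointRatio;
    }

    /** True if at most one of the restore and update ratios is non-zero. */
    boolean atMostOneWorkRatioNonZero() {
        return activeRestoreRatio == 0.0 || standbyUpdateRatio == 0.0;
    }

    public static void main(String[] args) {
        // A window in which the thread restored active tasks and updated no standby tasks.
        StateUpdaterRatios r = new StateUpdaterRatios(250, 700, 0, 50);
        System.out.println(r.idleRatio);                        // 0.25
        System.out.println(Math.abs(r.sum() - 1.0) < 1e-9);     // true
        System.out.println(r.atMostOneWorkRatioNonZero());      // true
    }
}
```

A monitoring check built on these metrics could alert when the sum drifts away from 1 or when both work ratios are non-zero in the same window.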


Along with these new metrics, we would also deprecate the metrics below:

| Metric Name | Type | Description | Notes |
|---|---|---|---|
| standby-process-ratio | gauge | Task-level; the fraction of time the processing thread spent on processing this standby task | Removed since standby tasks are not processed by the stream thread |


New Method in StateRestoreListener

...

Code Block
languagejava
public interface StateRestoreListener {

    void onRestoreStart(final TopicPartition topicPartition,
                        final String storeName,
                        final long startingOffset,
                        final long endingOffset);

    void onRestoreEnd(final TopicPartition topicPartition,
                      final String storeName,
                      final long totalRestored);

    ...

    /**
     * NEW FUNC. Method called when restoring the {@link StateStore} is suspended due to the task being suspended from the host.
     *           If the task is resumed after suspension and restoration continues, {@link #onRestoreStart} would be called again.
     */
    default void onRestoreSuspended(final TopicPartition topicPartition,
                                    final String storeName,
                                    final long totalRestored) {
        // do nothing
    }
}
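
The callback sequence described in the Javadoc above (start, suspend, then possibly start again on resume, and finally end) can be sketched with a small state tracker. To stay self-contained, the class below does not implement `StateRestoreListener`; its methods only mirror the shape of the callbacks, with the topic partition reduced to a plain string. `RestoreStateTracker`, its `State` enum, and the example names are hypothetical. A real listener would implement the interface and delegate to logic like this.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical tracker; method shapes mirror the StateRestoreListener callbacks. */
final class RestoreStateTracker {
    enum State { RESTORING, SUSPENDED, COMPLETED }

    private final Map<String, State> states = new ConcurrentHashMap<>();
    private final Map<String, Long> lastReportedTotal = new ConcurrentHashMap<>();

    private static String key(String topicPartition, String storeName) {
        return storeName + "@" + topicPartition;
    }

    // Called on the first start and again if restoration resumes after a suspension.
    void onRestoreStart(String topicPartition, String storeName,
                        long startingOffset, long endingOffset) {
        states.put(key(topicPartition, storeName), State.RESTORING);
    }

    // Called when restoration is suspended before reaching the end offset.
    void onRestoreSuspended(String topicPartition, String storeName, long totalRestored) {
        String k = key(topicPartition, storeName);
        states.put(k, State.SUSPENDED);
        lastReportedTotal.put(k, totalRestored);
    }

    // Called when restoration has completed.
    void onRestoreEnd(String topicPartition, String storeName, long totalRestored) {
        String k = key(topicPartition, storeName);
        states.put(k, State.COMPLETED);
        lastReportedTotal.put(k, totalRestored);
    }

    /** Returns the last observed state, or null if no callback was seen. */
    State stateOf(String topicPartition, String storeName) {
        return states.get(key(topicPartition, storeName));
    }

    /** Returns the totalRestored value from the most recent suspend or end callback. */
    long lastReportedTotal(String topicPartition, String storeName) {
        return lastReportedTotal.getOrDefault(key(topicPartition, storeName), 0L);
    }
}
```

Without the new callback, a store whose restoration was interrupted would stay in the `RESTORING` state indefinitely from the listener's point of view; the suspended state makes that case observable.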

...