...

Currently, there is no way to track log recovery progress. All we can do is check the broker log and see that the broker is busy recovering. In this KIP, we propose exposing a RemainingLogsToRecovery metric for each log.dir and a RemainingSegmentsToRecovery metric for each recovery thread, so that admins can monitor the progress of log recovery.

...

The RemainingLogsToRecovery metric will be added to "kafka.log" → LogManager for each log.dir.

The RemainingSegmentsToRecovery metric will be added to "kafka.log" → LogManager for each recovery thread.

Proposed Changes

The proposal is to add 2 metrics:

1. RemainingLogsToRecovery: Shows the number of logs remaining to be recovered for each log.dir. The total number of logs to be recovered is summed in step (1.b) described in the "motivation" section. Each time a log finishes recovering all of its segments, RemainingLogsToRecovery is decremented, so it reaches 0 once recovery is complete. When the broker is not in the log recovery state, the value is always 0.

2. RemainingSegmentsToRecovery: Shows the number of segments remaining to be recovered in each recovery thread (i.e. in each replica log). The total number of segments to be recovered is calculated in step (1.b.ii) described in the "motivation" section. Each time a segment finishes recovery, RemainingSegmentsToRecovery is decremented, so it reaches 0 once recovery is complete. When the broker is not in the log recovery state, the value is always 0.
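The counting behaviour described above can be sketched as follows. This is only an illustration of the decrement semantics, not the actual LogManager code: the class RecoveryProgress and all of its method names are hypothetical, and it assumes one counter per log.dir and one counter per (log.dir, thread) pair.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical illustration only; names are assumptions, not Kafka's API.
final class RecoveryProgress {
    // log.dir -> number of logs still to be recovered
    private final Map<String, AtomicInteger> remainingLogs = new ConcurrentHashMap<>();
    // "logDir#thread" -> number of segments still to be recovered by that thread
    private final Map<String, AtomicInteger> remainingSegments = new ConcurrentHashMap<>();

    private static String key(String logDir, int thread) {
        return logDir + "#" + thread;
    }

    // Called once per log.dir with the total number of logs needing recovery.
    void startDir(String logDir, int logsToRecover) {
        remainingLogs.put(logDir, new AtomicInteger(logsToRecover));
    }

    // Called when a recovery thread picks up a log with the given segment count.
    void startLog(String logDir, int thread, int segmentsToRecover) {
        remainingSegments.put(key(logDir, thread), new AtomicInteger(segmentsToRecover));
    }

    // Called after a single segment has been recovered.
    void segmentRecovered(String logDir, int thread) {
        remainingSegments.get(key(logDir, thread)).decrementAndGet();
    }

    // Called after all segments of one log have been recovered.
    void logRecovered(String logDir) {
        remainingLogs.get(logDir).decrementAndGet();
    }

    // Both gauges report 0 when nothing is being recovered.
    int remainingLogs(String logDir) {
        AtomicInteger c = remainingLogs.get(logDir);
        return c == null ? 0 : c.get();
    }

    int remainingSegments(String logDir, int thread) {
        AtomicInteger c = remainingSegments.get(key(logDir, thread));
        return c == null ? 0 : c.get();
    }
}
```

In this sketch, a directory or thread that was never registered reads as 0, which mirrors the "not in the log recovery state" case.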

For example:

log.dirs=/tmp/log1,/tmp/log2

num.recovery.threads.per.data.dir=2

In JMX, we'll see:

  • kafka.log
    • LogManager
      • RemainingLogsToRecover 
        • /tmp/log1 => 5            ← 5 logs under /tmp/log1 still need to be recovered
        • /tmp/log2 => 0
      • RemainingSegmentsToRecover
        • /tmp/log1                     ← 2 threads are doing log recovery for /tmp/log1
          • 0 => 1000         ← thread 0 still has 1000 segments to recover
          • 1 => 10
        • /tmp/log2
          • 0 => 0
          • 1 => 0
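One way the tree above could map onto JMX object names is sketched below with the standard javax.management.ObjectName class. The tag keys dirName and threadNum, and the helper class RecoveryMetricNames, are assumptions made for illustration; this section does not specify the exact MBean naming.

```java
import javax.management.MalformedObjectNameException;
import javax.management.ObjectName;

// Hypothetical naming helper; "dirName" and "threadNum" are assumed tag keys.
final class RecoveryMetricNames {
    private static final String PREFIX = "kafka.log:type=LogManager,name=";

    private static ObjectName parse(String name) {
        try {
            return new ObjectName(name);
        } catch (MalformedObjectNameException e) {
            throw new IllegalArgumentException("Invalid object name: " + name, e);
        }
    }

    // One gauge per log.dir.
    static ObjectName remainingLogs(String logDir) {
        return parse(PREFIX + "RemainingLogsToRecover,dirName=" + logDir);
    }

    // One gauge per (log.dir, recovery thread) pair.
    static ObjectName remainingSegments(String logDir, int thread) {
        return parse(PREFIX + "RemainingSegmentsToRecover,dirName=" + logDir
                + ",threadNum=" + thread);
    }

    // Pattern matching every per-thread segment gauge, whatever its tags.
    static ObjectName allRemainingSegments() {
        return parse(PREFIX + "RemainingSegmentsToRecover,*");
    }
}
```

Under these assumptions, a monitoring client could pass the pattern from allRemainingSegments() to MBeanServerConnection.queryNames to list the gauges for every directory and thread.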

Compatibility, Deprecation, and Migration Plan

...