...

  • kafka.log
    • LogManager
      • RemainingLogsToRecover
        • /tmp/log1 => 5            ← 5 logs under /tmp/log1 still need to be recovered
        • /tmp/log2 => 0
      • RemainingSegmentsToRecover
        • /tmp/log1                       ← 2 threads are doing log recovery for /tmp/log1
          • 0 => 10000         ← thread 0 has 10000 segments left to recover
          • 1 => 3
        • /tmp/log2
          • 0 => 0
          • 1 => 0

This shows that 5 logs (partitions) under the /tmp/log1 directory still need to be recovered. Two threads are doing the work: one has 10000 segments left to recover, and the other has 3.


After a while, the metrics might look like the tree below.
Now only 3 logs still need to be recovered in /tmp/log1; thread 0 has 9000 segments left and thread 1 has 5 (which should imply that 2 log recoveries were completed in the meantime).

  • kafka.log
    • LogManager
      • RemainingLogsToRecover
        • /tmp/log1 => 3            ← 3 logs under /tmp/log1 still need to be recovered
        • /tmp/log2 => 0
      • RemainingSegmentsToRecover
        • /tmp/log1                     ← 2 threads are doing log recovery for /tmp/log1
          • 0 => 9000         ← thread 0 has 9000 segments left to recover
          • 1 => 5
        • /tmp/log2
          • 0 => 0
          • 1 => 0
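
To make the two snapshots above concrete, here is a minimal Python sketch that models them as nested dictionaries and derives a per-directory summary. The dictionary layout and the helper names (`summarize`, `logs_completed`) are assumptions made for illustration only; they are not part of Kafka or of this KIP.

```python
# Hypothetical snapshots mirroring the two metric trees above.
# Layout: metric name -> log dir -> value (or -> thread number -> value).
BEFORE = {
    "RemainingLogsToRecover": {"/tmp/log1": 5, "/tmp/log2": 0},
    "RemainingSegmentsToRecover": {
        "/tmp/log1": {0: 10000, 1: 3},
        "/tmp/log2": {0: 0, 1: 0},
    },
}
AFTER = {
    "RemainingLogsToRecover": {"/tmp/log1": 3, "/tmp/log2": 0},
    "RemainingSegmentsToRecover": {
        "/tmp/log1": {0: 9000, 1: 5},
        "/tmp/log2": {0: 0, 1: 0},
    },
}


def summarize(snapshot):
    """Return {log_dir: (logs_left, segments_left_summed_over_threads)}."""
    summary = {}
    for log_dir, logs_left in snapshot["RemainingLogsToRecover"].items():
        per_thread = snapshot["RemainingSegmentsToRecover"].get(log_dir, {})
        summary[log_dir] = (logs_left, sum(per_thread.values()))
    return summary


def logs_completed(before, after):
    """Return {log_dir: number of logs recovered between two snapshots}."""
    return {
        log_dir: before["RemainingLogsToRecover"][log_dir] - logs_left
        for log_dir, logs_left in after["RemainingLogsToRecover"].items()
    }


if __name__ == "__main__":
    print(summarize(BEFORE))
    print(summarize(AFTER))
    print(logs_completed(BEFORE, AFTER))
```

In this sketch the per-thread segment count can rise between snapshots (thread 1 goes from 3 to 5), presumably because a thread picks up a new log once it finishes the previous one. The remaining-logs count only goes down, so it is likely the steadier indicator of overall progress.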


Compatibility, Deprecation, and Migration Plan

...

This does not conflict with the KIP, but finding the log recovery progress in the broker logs is not easy for admins. During the implementation, we will also improve the log output to give much clearer information about log recovery progress. On the other hand, metrics are still a better way for admins to monitor log recovery progress.


2. Provide a RemainingBytesToRecovery metric:

Currently, when the log manager starts up, it tries to load all logs (segments), and during loading it recovers logs if necessary.
Log loading uses a thread pool, as you assumed.

So, here's the problem:
Each log recovery thread loads all segments in one log folder (partition) at a time, and only after they are loaded can we know how many segments (or how many bytes) need to be recovered.

That means that if we have 10 partition logs in one broker and 2 log recovery threads (num.recovery.threads.per.data.dir=2), then before the threads load the segments in each log, we only know how many logs (partitions) the broker has (i.e. the RemainingLogsToRecover metric). We cannot know how many segments or bytes need to be recovered until a thread starts loading the segments of a log (partition).
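
The ordering constraint described above can be illustrated with a small Python simulation. This is only a sketch under stated assumptions: the `LOGS` table, the `RecoveryProgress` class, and `load_segments` are hypothetical stand-ins and do not reflect how the broker is actually implemented. The point it shows is that the log count is available before any work starts, while each segment count is discovered only when a thread picks up that log.

```python
import queue
import threading

# Hypothetical data: log (partition) name -> segments that turn out to need recovery.
# In the scenario above, these numbers are unknown until a thread loads the log.
LOGS = {f"topic-{i}": n for i, n in enumerate([4, 0, 7, 1, 0, 3, 2, 0, 5, 6])}
NUM_THREADS = 2  # stands in for num.recovery.threads.per.data.dir=2


class RecoveryProgress:
    """Toy counterpart of the two proposed metrics for a single log dir."""

    def __init__(self, num_logs, num_threads):
        self.lock = threading.Lock()
        self.remaining_logs = num_logs  # known up front
        self.remaining_segments = {t: 0 for t in range(num_threads)}
        self.discovered = 0  # segments learned about so far


def load_segments(log_name):
    """Stand-in for loading a log: only now do we learn its segment count."""
    return LOGS[log_name]


def worker(thread_id, pending, progress):
    while True:
        try:
            log_name = pending.get_nowait()
        except queue.Empty:
            return
        segments = load_segments(log_name)
        with progress.lock:
            progress.remaining_segments[thread_id] = segments
            progress.discovered += segments
        for _ in range(segments):  # "recover" one segment at a time
            with progress.lock:
                progress.remaining_segments[thread_id] -= 1
        with progress.lock:
            progress.remaining_logs -= 1


def run():
    pending = queue.Queue()
    for name in LOGS:
        pending.put(name)
    progress = RecoveryProgress(len(LOGS), NUM_THREADS)
    # Before any thread starts, only the log count is meaningful.
    initial = (progress.remaining_logs, dict(progress.remaining_segments))
    threads = [
        threading.Thread(target=worker, args=(i, pending, progress))
        for i in range(NUM_THREADS)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return initial, progress


if __name__ == "__main__":
    initial, progress = run()
    print("before start:", initial)
    print("segments discovered during recovery:", progress.discovered)
```

Under these assumptions, a total such as remaining bytes or remaining segments across all logs could only be reported up front if every log were scanned before recovery began, which is not how the loading described above proceeds.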

That said, the `RemainingBytesToRecovery` metric is difficult to implement in the way you expected. I think the current proposal, with `RemainingLogsToRecover` and `RemainingSegmentsToRecover`, already provides enough information about log recovery progress.