Status
Current state: "Under Discussion"
Discussion thread: here
JIRA: here
Please keep the discussion on the mailing list rather than commenting on the wiki (wiki discussions get unwieldy fast).
Motivation
Log recovery is the process a broker runs at startup after an unclean shutdown, to make sure its logs are in a good state and not corrupted. The process is as follows:
- Iterate over all dirs in the "log.dirs" config, one by one.
  - Find the topic partition log folders under the dir.
  - Iterate over all the topic partition log folders and add them as jobs to a thread pool with "num.recovery.threads.per.data.dir" threads.
    - Load all the segments under the log folder; suppose there are 10 segments.
    - Select the segments after the "recovery checkpoint"; suppose 5 segments need to be recovered.
    - Recover the 5 segments, one by one:
      - Iterate over all the record batches inside the segment.
      - Validate all the batches.
      - Rebuild the indexes.
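The steps above can be sketched in Java as follows. This is only an illustrative model, not the broker's code: `RecoveryFlowSketch`, `Segment`, `PartitionLog`, and their methods are hypothetical names, segment recovery is an empty placeholder, and the rule "a segment needs recovery if its base offset is at or after the checkpoint" is a simplifying assumption.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical, simplified model of the recovery flow; these are not Kafka's classes.
final class RecoveryFlowSketch {

    /** A segment, identified here only by its base offset. */
    static final class Segment {
        final long baseOffset;
        Segment(long baseOffset) { this.baseOffset = baseOffset; }
    }

    /** A topic partition log: its segments plus its recovery checkpoint offset. */
    static final class PartitionLog {
        final List<Segment> segments;
        final long recoveryCheckpoint;
        PartitionLog(List<Segment> segments, long recoveryCheckpoint) {
            this.segments = segments;
            this.recoveryCheckpoint = recoveryCheckpoint;
        }
    }

    /** Simplifying assumption: a segment needs recovery if it starts at or after the checkpoint. */
    static List<Segment> segmentsToRecover(PartitionLog log) {
        List<Segment> result = new ArrayList<>();
        for (Segment s : log.segments) {
            if (s.baseOffset >= log.recoveryCheckpoint) {
                result.add(s);
            }
        }
        return result;
    }

    /** Placeholder for: iterate the record batches, validate them, rebuild the indexes. */
    static void recoverSegment(Segment segment) {
        // Intentionally empty in this sketch.
    }

    /** Recovers one data dir; each log is one job on a fixed-size pool. Returns segments recovered. */
    static int recoverDir(List<PartitionLog> logs, int numRecoveryThreadsPerDataDir) {
        ExecutorService pool = Executors.newFixedThreadPool(numRecoveryThreadsPerDataDir);
        try {
            List<Future<Integer>> jobs = new ArrayList<>();
            for (PartitionLog log : logs) {
                jobs.add(pool.submit(() -> {
                    int recovered = 0;
                    for (Segment s : segmentsToRecover(log)) {
                        recoverSegment(s); // segments of one log are recovered one by one
                        recovered++;
                    }
                    return recovered;
                }));
            }
            int total = 0;
            for (Future<Integer> job : jobs) {
                total += job.get();
            }
            return total;
        } catch (InterruptedException | ExecutionException e) {
            throw new RuntimeException(e);
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        List<Segment> segments = new ArrayList<>();
        for (long i = 0; i < 10; i++) {
            segments.add(new Segment(i * 100));
        }
        PartitionLog log = new PartitionLog(segments, 500L);
        System.out.println(segmentsToRecover(log).size());          // prints 5
        System.out.println(recoverDir(Arrays.asList(log, log), 2)); // prints 10
    }
}
```

The sketch shows why progress is hard to observe from the outside: logs are recovered in parallel per data dir, but the segments of each log are processed sequentially, so a single large log can keep a thread busy for a long time.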
If the broker stores a lot of logs, log recovery might take hours or even days.
So far, there is no way to know the log recovery progress. All we can do is check the broker log and see that the broker is busy doing recovery. In this KIP, we're going to expose a remainingLogsToRecovery metric so that admins can monitor the progress of log recovery.
Public Interfaces
A "remainingLogsToRecovery" metric will be added to "kafka.log" → LogManager, in the same way as the existing OfflineLogDirectoryCount and LogDirectoryOffline metrics.
Proposed Changes
The proposal is to add a remainingLogsToRecovery metric that tracks the number of logs remaining to be recovered. The total number of logs to be recovered will be added in step (b) described in the "Motivation" section. Each time a log completes recovery for all of its segments, remainingLogsToRecovery is decremented, so in the end it is 0. When the broker is not in the log recovery state, the value is always 0.
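As a minimal sketch of these semantics, the value behind the metric could be modeled as a thread-safe counter. The class and method names below (`RemainingLogsGauge`, `addLogsToRecover`, `logRecovered`, `value`) are hypothetical and only illustrate the add-then-decrement behavior; they are not the proposed implementation, and how the value is wired into the broker's metrics registry is left out.

```java
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical stand-in for the value behind remainingLogsToRecovery; not Kafka's code.
final class RemainingLogsGauge {
    private final AtomicInteger remaining = new AtomicInteger(0);

    /** Called once the number of logs that need recovery is known. */
    void addLogsToRecover(int count) {
        remaining.addAndGet(count);
    }

    /** Called when one log has finished recovering all of its segments. */
    void logRecovered() {
        remaining.decrementAndGet();
    }

    /** The value the metric would report; 0 when no recovery is in progress. */
    int value() {
        return remaining.get();
    }

    public static void main(String[] args) {
        RemainingLogsGauge gauge = new RemainingLogsGauge();
        System.out.println(gauge.value()); // prints 0: no recovery in progress
        gauge.addLogsToRecover(3);
        System.out.println(gauge.value()); // prints 3
        gauge.logRecovered();
        gauge.logRecovered();
        gauge.logRecovered();
        System.out.println(gauge.value()); // prints 0: recovery finished
    }
}
```

An atomic counter is used in this sketch because logs are recovered by several threads, so the decrements can arrive concurrently.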
Compatibility, Deprecation, and Migration Plan
There are no compatibility issues and no migration plan is needed, because this KIP only adds a metric for log recovery.
Rejected Alternatives
1. Output the log recovery progress in logs
This does not conflict with the KIP, but finding the log recovery progress inside the broker logs is not easy for admins. In fact, during the implementation, we'll also improve the log output to give much clearer information about log recovery progress. Even so, a metric is still a better way for admins to monitor the log recovery progress.