Status

Current state: [One of "Under Discussion", "Accepted", "Rejected"]

Discussion thread: here [Change the link from the KIP proposal email archive to your own email thread]

JIRA: here [Change the link from KAFKA-1 to your own ticket]

Please keep the discussion on the mailing list rather than commenting on the wiki (wiki discussions get unwieldy fast).

Motivation

Log recovery is a process that runs when a broker starts up after an unclean shutdown. It makes sure the log is in a good state and not corrupted. The log recovery process is as follows:

1. Iterate over all directories in the "log.dirs" config, one by one:
  a. Find the topic partition log folders under the directory.
  b. Iterate over all the topic partition log folders and add them as jobs to a thread pool whose number of threads is set by the "num.recovery.threads.per.data.dir" config. Each job does the following:
    i. Load all the segments under the log folder (suppose there are 10 segments).
    ii. Select the segments after the "recovery checkpoint" (suppose 5 segments need to be recovered).
    iii. Recover the 5 segments, one by one.
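
The steps above can be sketched roughly as follows. This is only an illustration, not the broker's actual code: the class LogRecoverySketch, the PartitionLog model, and all method names are hypothetical, and "segments after the recovery checkpoint" is simplified here to mean segments whose base offset is at or after the checkpoint.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class LogRecoverySketch {

    /** Hypothetical model of one topic partition log folder. */
    public static final class PartitionLog {
        final String name;
        final List<Long> segmentBaseOffsets; // assumed sorted ascending
        final long recoveryCheckpoint;

        public PartitionLog(String name, List<Long> segmentBaseOffsets, long recoveryCheckpoint) {
            this.name = name;
            this.segmentBaseOffsets = segmentBaseOffsets;
            this.recoveryCheckpoint = recoveryCheckpoint;
        }
    }

    /** Selects the segments after the recovery checkpoint (simplified rule, see lead-in). */
    public static List<Long> segmentsToRecover(PartitionLog log) {
        List<Long> result = new ArrayList<>();
        for (long baseOffset : log.segmentBaseOffsets) {
            if (baseOffset >= log.recoveryCheckpoint) {
                result.add(baseOffset);
            }
        }
        return result;
    }

    /** Recovers the selected segments of one log, one by one; returns how many were recovered. */
    static int recoverLog(PartitionLog log) {
        int recovered = 0;
        for (long baseOffset : segmentsToRecover(log)) {
            // Placeholder for the actual per-segment recovery work.
            recovered++;
        }
        return recovered;
    }

    /** Walks the directories one by one; logs within a directory are recovered by a thread pool. */
    public static int recoverAll(Map<String, List<PartitionLog>> logsByDir, int threadsPerDir)
            throws Exception {
        int totalSegments = 0;
        for (Map.Entry<String, List<PartitionLog>> dir : logsByDir.entrySet()) {
            ExecutorService pool = Executors.newFixedThreadPool(threadsPerDir);
            try {
                List<Future<Integer>> jobs = new ArrayList<>();
                for (PartitionLog log : dir.getValue()) {
                    Callable<Integer> job = () -> recoverLog(log);
                    jobs.add(pool.submit(job));
                }
                for (Future<Integer> job : jobs) {
                    totalSegments += job.get(); // wait for every log in this directory
                }
            } finally {
                pool.shutdown();
            }
        }
        return totalSegments;
    }

    /** The example from the text: 10 segments, 5 of them after the checkpoint. */
    public static PartitionLog exampleLog() {
        return new PartitionLog("topicA-0",
                Arrays.asList(0L, 100L, 200L, 300L, 400L, 500L, 600L, 700L, 800L, 900L), 500L);
    }
}
```

With the example log, 10 segments are loaded and the 5 with base offsets 500 through 900 are selected for recovery.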

As we can imagine, if the broker stores a lot of logs, log recovery might take hours or even days.

So far, we have no way to know the log recovery progress. All we can do is check the broker log and see that the broker is busy doing recovery. In this KIP, we're going to expose a "remainingLogsToRecovery" metric to give admins a way to monitor the progress of log recovery.

Public Interfaces

A "remainingLogsToRecovery" metric will be added to "kafka.log" → LogManager.

Proposed Changes

The proposal is to add a remainingLogsToRecovery metric that tracks the number of logs remaining to be recovered. The total number of logs to be recovered is added in step (b) described in the "Motivation" section, when the topic partition log folders are added as jobs to the thread pool. Each time a log completes recovery for all of its segments, remainingLogsToRecovery is decremented, so in the end it will be 0. When the broker is not in the log recovery state, the value is always 0.
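
The counting behavior described above can be sketched as a small thread-safe counter. This is a minimal sketch under assumptions: the class RemainingLogsGaugeSketch and its method names are hypothetical, and how the value is actually registered as a metric under LogManager is not shown.

```java
import java.util.concurrent.atomic.AtomicInteger;

public class RemainingLogsGaugeSketch {
    // Stays at 0 whenever the broker is not recovering logs.
    private final AtomicInteger remainingLogsToRecovery = new AtomicInteger(0);

    /** Called when logs are handed to the recovery thread pool. */
    public void addLogsToRecover(int count) {
        remainingLogsToRecovery.addAndGet(count);
    }

    /** Called once a log has finished recovering all of its segments. */
    public void logRecovered() {
        // Never drop below 0, even if called more often than expected.
        remainingLogsToRecovery.updateAndGet(v -> Math.max(0, v - 1));
    }

    /** The value a metric gauge would report. */
    public int value() {
        return remainingLogsToRecovery.get();
    }

    public static void main(String[] args) {
        RemainingLogsGaugeSketch gauge = new RemainingLogsGaugeSketch();
        gauge.addLogsToRecover(3);
        System.out.println("remaining: " + gauge.value());
        for (int i = 0; i < 3; i++) {
            gauge.logRecovered();
            System.out.println("remaining: " + gauge.value());
        }
    }
}
```

An atomic counter is used because recovery jobs finish on different threads of the pool, so decrements can arrive concurrently.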

Compatibility, Deprecation, and Migration Plan

There is no compatibility issue and no migration plan is needed, because this KIP only adds a metric for log recovery.

Rejected Alternatives

1. Output the log recovery progress in logs

This does not conflict with the KIP, but finding the log recovery progress in the broker logs is not easy for admins. In fact, during the implementation we'll also improve the log output to give much clearer information about log recovery progress. On the other hand, a metric is still a better way for admins to monitor log recovery progress.
