FailedPartitionsCount - Count of partitions that have failed. Instead of separate metrics, clientId is used as a tag to distinguish between Replica and ReplicaAlterLogDir fetchers.
TotalReplicaFetcherThreads - Total number of replica fetcher threads (may be added if it proves useful).

Proposed Changes

If a partition fails, the replica fetcher thread would stop tracking the failed partition. Instead of throwing an exception that ends up terminating the thread, an error message would be logged and the partition would be added to the failedPartitions set. The thread would continue monitoring the rest of the partitions, which are lost in the current scenario.
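The failure handling described above can be sketched as follows. This is a minimal illustration, not the actual fetcher code: the class FetcherLoopSketch, its methods, and the use of plain strings as partition identifiers are all assumptions made for the example. Only the failedPartitions set and the log-and-continue behavior come from the proposal.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.Set;
import java.util.function.Consumer;

// Hypothetical stand-in for a fetcher thread; not the real Kafka class.
final class FetcherLoopSketch {
    private final Set<String> trackedPartitions = new LinkedHashSet<>();
    private final Set<String> failedPartitions = new LinkedHashSet<>();

    void addPartition(String partition) {
        trackedPartitions.add(partition);
    }

    // Processes every tracked partition. A failure in one partition is logged
    // and recorded instead of being rethrown, so the loop (and the thread
    // running it) keeps serving the remaining partitions.
    void processFetch(Consumer<String> processPartition) {
        // Iterate over a copy because failed partitions are removed inside the loop.
        for (String partition : new ArrayList<>(trackedPartitions)) {
            try {
                processPartition.accept(partition);
            } catch (RuntimeException e) {
                System.err.println("Error processing partition " + partition + ": " + e.getMessage());
                trackedPartitions.remove(partition);
                failedPartitions.add(partition);
            }
        }
    }

    Set<String> trackedPartitions() {
        return trackedPartitions;
    }

    Set<String> failedPartitions() {
        return failedPartitions;
    }

    public static void main(String[] args) {
        FetcherLoopSketch fetcher = new FetcherLoopSketch();
        fetcher.addPartition("topic-0");
        fetcher.addPartition("topic-1");
        fetcher.addPartition("topic-2");
        // Simulate a failure on a single partition.
        fetcher.processFetch(p -> {
            if (p.equals("topic-1")) {
                throw new IllegalStateException("simulated failure");
            }
        });
        System.out.println("still tracked: " + fetcher.trackedPartitions());
        System.out.println("failed: " + fetcher.failedPartitions());
    }
}
```

The key design point is that the try/catch sits inside the per-partition loop rather than around it, so one bad partition cannot take the healthy ones down with it.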

The failedPartitions set would keep track of failed partitions. Until the next leader epoch, a failed partition would remain in the failedPartitions set. At the leader epoch, the failed partitions would be removed from fetcherLagStats and partitionStates, and would be marked as un-failed by removing them from the failedPartitions set. Thereafter, the controller can choose the partition as a leader or follower, and it would follow the usual behavior.
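The lifecycle of a failed partition might look like the sketch below. The names failedPartitions, fetcherLagStats, and partitionStates come from the proposal. Everything else is hypothetical: the class FailedPartitionsSketch, its methods, the string partition identifiers, and the simple maps standing in for the real state and lag structures.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical model of the bookkeeping only; not the real Kafka classes.
final class FailedPartitionsSketch {
    private final Set<String> failedPartitions = new HashSet<>();
    private final Map<String, Long> fetcherLagStats = new HashMap<>(); // partition -> lag
    private final Map<String, Long> partitionStates = new HashMap<>(); // partition -> fetch offset

    void track(String partition, long fetchOffset, long lag) {
        partitionStates.put(partition, fetchOffset);
        fetcherLagStats.put(partition, lag);
    }

    // Called when processing a partition fails: it stays in the set
    // until the next leader epoch.
    void markFailed(String partition) {
        failedPartitions.add(partition);
    }

    // Called at the leader epoch for the given partitions: failed ones are
    // dropped from the lag stats and partition states and marked as un-failed.
    void onLeaderEpoch(Set<String> partitions) {
        for (String partition : partitions) {
            if (failedPartitions.remove(partition)) {
                fetcherLagStats.remove(partition);
                partitionStates.remove(partition);
            }
        }
    }

    // Value a FailedPartitionsCount-style gauge would report.
    int failedPartitionsCount() {
        return failedPartitions.size();
    }

    boolean isTracked(String partition) {
        return partitionStates.containsKey(partition);
    }
}
```

After onLeaderEpoch the partition is in neither the failed set nor the tracked state, so it can be assigned again as a leader or follower through the normal path.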

Since the two replica fetchers (ReplicaFetcherThread and ReplicaAlterLogDirsThread) behave quite similarly and extend the same class, one should probably not deviate much from the other.

...

Compatibility, Deprecation, and Migration Plan

The FailedPartitionsCount metric is newly added and would keep track of the failed partitions. The proposed change handles partition failure more gracefully: it would avoid losing several healthy partitions when a partition failure occurs.

Rejected Alternatives

TBD
