Status

Current state: Under Discussion

Discussion thread: here

JIRA: here

Please keep the discussion on the mailing list rather than commenting on the wiki (wiki discussions get unwieldy fast).

Motivation

Each replica fetcher thread handles multiple partitions. If one of those partitions fails, the replica fetcher thread responsible for it terminates. The other partitions handled by that thread, including those that had caught up and were replicating normally, are then no longer tracked, which leads to under-replicated partitions. A better approach is for the thread to stop tracking only the failed partition and continue handling the rest of its partitions.

Public Interfaces

New metrics:

  • `FailedPartitionsCount` - Count of partitions that have failed.

  • `TotalReplicaFetcherThreads` - Total number of replica fetcher threads (may be added if it proves useful).

Proposed Changes

When a partition fails, the replica fetcher thread associated with it stops tracking that partition and continues to handle the rest of its partitions.
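The behavior described above can be illustrated with a minimal Java sketch. It is not Kafka's actual fetcher code: `FetcherSketch`, `PartitionProcessor`, and the string partition names are hypothetical stand-ins, and the offset bookkeeping is reduced to a counter. The point is only that an exception raised while processing one partition moves that partition to a failed set instead of ending the loop.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical stand-in for a fetcher thread; not Kafka's real class hierarchy.
class FetcherSketch {
    // Hypothetical per-partition work, e.g. appending fetched data to the local log.
    interface PartitionProcessor {
        void process(String topicPartition) throws Exception;
    }

    private final Map<String, Long> trackedOffsets = new LinkedHashMap<>();
    private final Set<String> failedPartitions = new LinkedHashSet<>();

    void addPartition(String topicPartition, long fetchOffset) {
        // Re-adding a partition clears any earlier failure mark.
        failedPartitions.remove(topicPartition);
        trackedOffsets.put(topicPartition, fetchOffset);
    }

    // One pass over all tracked partitions. A failure in one partition is
    // isolated: it is moved to the failed set and the loop carries on.
    void doWork(PartitionProcessor processor) {
        List<String> snapshot = new ArrayList<>(trackedOffsets.keySet());
        for (String tp : snapshot) {
            try {
                processor.process(tp);
                trackedOffsets.merge(tp, 1L, Long::sum); // pretend one record was fetched
            } catch (Exception e) {
                trackedOffsets.remove(tp);
                failedPartitions.add(tp);
            }
        }
    }

    Set<String> trackedPartitions() {
        return trackedOffsets.keySet();
    }

    // Value that a metric such as FailedPartitionsCount could report.
    int failedPartitionsCount() {
        return failedPartitions.size();
    }
}
```

In this sketch a failed partition stays in the failed set until something calls `addPartition` for it again; how and when a partition should be re-added is left open here.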

Since the two replica fetchers (ReplicaFetcherThread and ReplicaAlterLogDirsThread) behave similarly and extend the same class, this change should probably not make one deviate much from the other.

Some other potential problems that could be addressed:

  • Handling exceptions raised during truncation
  • If more than 50% of the partitions on a broker have failed, the broker could be shut down.
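
The 50% rule in the last bullet could be checked along the following lines. This is a sketch only: `FailureThresholdSketch` and `shouldShutDown` are hypothetical names, and how the broker would obtain the two counts is not specified in this proposal.

```java
// Hypothetical helper; the 50% threshold comes from the bullet above.
class FailureThresholdSketch {
    // True only when strictly more than half of the partitions have failed.
    static boolean shouldShutDown(int failedPartitions, int totalPartitions) {
        if (totalPartitions <= 0) {
            return false; // nothing is hosted, so there is nothing to act on
        }
        // Integer comparison avoids floating-point rounding at the boundary.
        return 2L * failedPartitions > totalPartitions;
    }
}
```

With this reading, exactly half of the partitions failing (for example 5 of 10) does not trigger a shutdown, while 6 of 10 does.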


Compatibility, Deprecation, and Migration Plan

  • TBD

Rejected Alternatives

  • Retries: If the thread retried whenever a partition failed, it would hit the same partition failure exception repeatedly.