...

If a partition fails, the replica fetcher thread would stop tracking that partition. Instead of throwing an exception that ends up terminating the thread, it would log an error message and add the partition to the failedPartitions set. The partition would be removed from fetcherLagStats and partitionStates, since partition lag cannot be tracked accurately once fetching has stopped. The thread would continue monitoring the rest of its partitions, which under the current behavior are lost when the thread terminates.

The partition would remain in the failedPartitions set until the next leader epoch. At that leader epoch, the failed partitions would be removed from fetcherLagStats and partitionStates and marked as un-failed by removing them from the failedPartitions set. From then on, the controller can choose the partition as leader or follower, and the partition would follow the usual behavior.

This logic will be implemented in AbstractFetcherThread so that it applies to both replica and log dir fetchers.
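
The failure handling described above can be illustrated with a minimal sketch. It is not Kafka's actual AbstractFetcherThread: the class FetcherSketch, the interface PartitionFetcher, the methods addPartition and doWork, and the use of plain strings as partition names are all assumptions made for illustration. Only the names failedPartitions, fetcherLagStats and partitionStates come from the text.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical callback: fetches one partition from the given offset and
// returns the leader's log end offset; may throw if the partition fails.
interface PartitionFetcher {
    long fetch(String partition, long fetchOffset);
}

// Simplified stand-in for a fetcher thread (illustration only).
class FetcherSketch {
    // Partition name -> next fetch offset (stand-in for partitionStates).
    final Map<String, Long> partitionStates = new HashMap<>();
    // Partition name -> last observed lag (stand-in for fetcherLagStats).
    final Map<String, Long> fetcherLagStats = new HashMap<>();
    // Partitions that failed and are no longer fetched.
    final Set<String> failedPartitions = new HashSet<>();

    // Assumed to be called on a new leader epoch: clears any failed
    // state and (re)starts tracking the partition.
    void addPartition(String partition, long fetchOffset) {
        failedPartitions.remove(partition);
        fetcherLagStats.remove(partition);
        partitionStates.put(partition, fetchOffset);
    }

    // One pass over all tracked partitions. A failure in one partition
    // is logged and isolated instead of propagating out of the loop.
    void doWork(PartitionFetcher fetcher) {
        for (String partition : new ArrayList<>(partitionStates.keySet())) {
            long offset = partitionStates.get(partition);
            try {
                long leaderEndOffset = fetcher.fetch(partition, offset);
                fetcherLagStats.put(partition, leaderEndOffset - offset);
                partitionStates.put(partition, leaderEndOffset);
            } catch (RuntimeException e) {
                System.err.println("Fetch failed for " + partition + ": " + e.getMessage());
                failedPartitions.add(partition);
                // Lag cannot be tracked once fetching has stopped.
                fetcherLagStats.remove(partition);
                partitionStates.remove(partition);
            }
        }
    }
}
```

In this sketch, a partition whose fetch throws ends up only in failedPartitions, while the other partitions keep advancing. A later addPartition call for the same partition, standing in for the next leader epoch, removes it from the failed set and resumes tracking.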

Some other potential problems that can be addressed:

...