...

  • FailedPartitionsCount - Count of partitions that have failed. Instead of separate metrics, clientId is used as a tag to distinguish between Replica and ReplicaAlterLogDir fetchers.

    TotalReplicaFetcherThreads - Total number of replica fetcher threads. (We might add this if it's useful.)

    This keeps it consistent with other metrics such as MaxLag.
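To illustrate the tagging idea, here is a minimal sketch, not Kafka's metrics code, of a single metric name whose clientId tag separates the two kinds of fetcher. The `GaugeRegistrySketch` class, its methods, and the tag values `"Replica"` and `"ReplicaAlterLogDir"` are assumptions made for illustration only.

```python
from typing import Callable, Dict, Tuple

# A gauge is identified by its name plus its sorted (tag, value) pairs.
MetricKey = Tuple[str, Tuple[Tuple[str, str], ...]]


class GaugeRegistrySketch:
    """Hypothetical registry: one metric name, many tag combinations."""

    def __init__(self) -> None:
        self._gauges: Dict[MetricKey, Callable[[], int]] = {}

    @staticmethod
    def _key(name: str, tags: Dict[str, str]) -> MetricKey:
        return (name, tuple(sorted(tags.items())))

    def register(self, name: str, tags: Dict[str, str],
                 supplier: Callable[[], int]) -> None:
        # The supplier is called on every read, so the gauge always
        # reflects the current size of the underlying set.
        self._gauges[self._key(name, tags)] = supplier

    def value(self, name: str, tags: Dict[str, str]) -> int:
        return self._gauges[self._key(name, tags)]()


# Assumed example data: one failed partition on the replica fetcher,
# none on the alter-log-dir fetcher.
replica_failed = {"topic-0"}
alter_log_dir_failed = set()

registry = GaugeRegistrySketch()
registry.register("FailedPartitionsCount", {"clientId": "Replica"},
                  lambda: len(replica_failed))
registry.register("FailedPartitionsCount", {"clientId": "ReplicaAlterLogDir"},
                  lambda: len(alter_log_dir_failed))
```

Reading `registry.value("FailedPartitionsCount", {"clientId": "Replica"})` returns the size of the first set only, so a single metric name covers both fetcher types without a second metric.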

Proposed Changes

If a partition fails, the replica fetcher thread stops tracking that partition. Instead of throwing an exception, which ends up terminating the thread, it logs an error message and adds the partition to the failedPartitions set. The thread then continues monitoring the rest of its partitions, which are lost in the current behavior because the thread terminates.
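The behavior described above can be sketched roughly as follows. This is an illustrative stand-in, not the actual fetcher implementation: the `FetcherThreadSketch` class, the `process` callback, and the partition names are all hypothetical, and only the idea of a failed-partitions set comes from the proposal.

```python
import logging
from typing import Callable, Iterable, Set

logger = logging.getLogger("replica_fetcher_sketch")


class FetcherThreadSketch:
    """Illustrative stand-in for a replica fetcher thread (not Kafka code)."""

    def __init__(self, partitions: Iterable[str],
                 process: Callable[[str], None]) -> None:
        self.partitions: Set[str] = set(partitions)  # partitions still tracked
        self.failed_partitions: Set[str] = set()     # partitions marked failed
        self._process = process                      # hypothetical per-partition work

    def do_work(self) -> None:
        """Run one pass over the tracked partitions."""
        # sorted() copies the set, so removing entries inside the loop is safe.
        for tp in sorted(self.partitions):
            try:
                self._process(tp)
            except Exception:
                # Log the error and mark the partition as failed instead of
                # letting the exception terminate the whole thread.
                logger.exception("Marking partition %s as failed", tp)
                self.partitions.discard(tp)
                self.failed_partitions.add(tp)
```

For example, if `process` raises for `"topic-1"` while the thread tracks `"topic-0"`, `"topic-1"`, and `"topic-2"`, one call to `do_work()` moves `"topic-1"` into `failed_partitions` and the other two partitions keep being processed on later passes.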

...

  • Handling exceptions raised during truncation.
  • If more than 50% of the partitions on a broker have failed, the broker can be shut down.


Compatibility, Deprecation, and Migration Plan

...

  • Retries - The thread could retry the failed partition, but the retries would most likely hit the same problem.
  • Shutting down the broker - If more than 50% of the partitions on a broker have failed, the broker can be shut down.