...

If a partition is "AtMinIsr", something severe has likely happened. More importantly, one more failure can result in unavailability, so some action should be taken (e.g. repartitioning).
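The check itself can be sketched in a few lines. This is only an illustration, assuming that "AtMinIsr" means the size of the partition's in-sync replica set equals the topic's configured minimum ISR; the `PartitionState` type and both function names are hypothetical and are not part of Kafka.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass
class PartitionState:
    """Hypothetical snapshot of one topic partition (not a Kafka class)."""
    topic: str
    partition: int
    isr: List[int]   # broker ids currently in the in-sync replica set
    min_isr: int     # the topic's configured minimum in-sync replicas


def is_at_min_isr(p: PartitionState) -> bool:
    # Assumption: AtMinIsr means the ISR size is exactly the configured minimum,
    # so losing one more in-sync replica would drop the partition below it.
    return len(p.isr) == p.min_isr


def at_min_isr_count(partitions: Iterable[PartitionState]) -> int:
    # Number of partitions that currently sit exactly at their minimum ISR.
    return sum(1 for p in partitions if is_at_min_isr(p))


if __name__ == "__main__":
    states = [
        PartitionState("orders", 0, isr=[0, 1, 2], min_isr=2),  # healthy
        PartitionState("orders", 1, isr=[0, 1], min_isr=2),     # at min ISR
    ]
    print(at_min_isr_count(states))
```

A count like this is what an operator would watch: any non-zero value means at least one partition has no remaining headroom.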

Examples

Example 1:

1 partition

minIsrCount=2

...

In this example, AtMinIsr triggers when there is only 1 in-sync replica remaining, and tells us that 1 more failure will cause the partition to go completely offline!

AtMinIsr Values + Possible Explanations

1. AtMinIsr is zero

Everything is fine, and business as usual. Nothing to see here.

2. AtMinIsr is consistently greater than zero for a prolonged period of time

Broker(s) may have failed, so this could warrant an alert prompting an operator to check the health of the cluster and see whether any brokers are down.

3. AtMinIsr spikes from zero to non-zero and back down to zero repeatedly

Broker(s) may be experiencing trouble, such as high load or temporary network issues, that causes replicas of the partition to temporarily fall out of sync.
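The three cases above could be told apart automatically from a series of periodic samples of the AtMinIsr count. The following is a rough sketch, not a recommended alerting rule: the function name, the labels it returns, and both thresholds (`sustained_samples`, `min_spikes`) are assumptions chosen for illustration and would need tuning to the sampling interval in use.

```python
from typing import Sequence


def classify_at_min_isr(
    samples: Sequence[int],
    sustained_samples: int = 5,  # hypothetical: consecutive non-zero samples treated as "prolonged"
    min_spikes: int = 2,         # hypothetical: separate non-zero bursts treated as "repeated"
) -> str:
    """Map a window of AtMinIsr samples onto the three cases described above."""
    longest_run = 0  # longest stretch of consecutive non-zero samples
    run = 0          # length of the current non-zero stretch
    spikes = 0       # number of separate non-zero stretches

    for value in samples:
        if value > 0:
            if run == 0:
                spikes += 1  # a new non-zero stretch begins
            run += 1
            longest_run = max(longest_run, run)
        else:
            run = 0

    if spikes == 0:
        return "ok"             # case 1: AtMinIsr is zero
    if longest_run >= sustained_samples:
        return "sustained"      # case 2: non-zero for a prolonged period
    if spikes >= min_spikes:
        return "flapping"       # case 3: repeated spikes back down to zero
    return "indeterminate"      # a single short blip; not enough evidence yet


if __name__ == "__main__":
    print(classify_at_min_isr([0, 0, 0, 0]))
    print(classify_at_min_isr([0, 2, 2, 2, 2, 2, 2]))
    print(classify_at_min_isr([0, 1, 0, 0, 3, 0, 1, 0]))
```

The sustained check runs before the flapping check, so a window that contains both a long outage and short spikes is reported as the more serious case.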

Public Interfaces

We will introduce a new metric and a new TopicCommand option to identify AtMinIsr partitions.

...