...

The current categorization of topic partitions has a gap: UnderReplicatedPartitions is generic and is triggered in the various situations listed above, so it does not tell operators whether the reduced ISR set is intentional (repartitioning or restarts) or whether something is wrong, such as a broker that has completely failed. This makes it incredibly difficult to use UnderReplicatedPartitions as an indicator for alerting: alerts configured on this metric trigger whenever there is a change in the ISR set, which can be too noisy, and increasing the number of samples needed to trigger the alert increases the time to detect failures.


In reality, we can tolerate a reduced ISR set as long as it meets the minimum insync replica count (as configured by "min.insync.replicas"). If the ISR set falls below that count, producers configured with acks=ALL will fail.


This KIP aims to fill this gap and improve monitoring by proposing a new categorization of partitions (and a corresponding metric group): AtMinIsr, which consists of partitions that have only the minimum number of insync replicas remaining in the ISR set (as configured by "min.insync.replicas"). If a partition is AtMinIsr, it suggests that something severe has happened and, more importantly, that one more failure can result in unavailability, so some action should be taken (e.g. repartitioning). The new metric can therefore be used as a warning, since any partition that is AtMinIsr is in danger of causing producer unavailability (for acks=ALL producers) if one more replica drops out of the ISR set.
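The categorization described above can be sketched as a comparison of the ISR size against the replication factor and "min.insync.replicas". This is an illustrative sketch only, not the broker implementation; the names `IsrState` and `classify_partition` are hypothetical and were chosen for this example.

```python
from enum import Enum


class IsrState(Enum):
    """Hypothetical labels for illustration; not names used by Kafka itself."""
    FULLY_REPLICATED = "fully replicated"
    UNDER_REPLICATED = "under-replicated, still above min ISR"
    AT_MIN_ISR = "at min ISR"
    UNDER_MIN_ISR = "under min ISR"


def classify_partition(isr_size: int, replication_factor: int,
                       min_insync_replicas: int) -> IsrState:
    """Return the most severe category that applies to a partition."""
    if isr_size < min_insync_replicas:
        # acks=ALL producers fail in this state.
        return IsrState.UNDER_MIN_ISR
    if isr_size == min_insync_replicas:
        # One more replica leaving the ISR makes acks=ALL producers fail.
        return IsrState.AT_MIN_ISR
    if isr_size < replication_factor:
        # Reduced ISR set, but still more than one failure away from trouble.
        return IsrState.UNDER_REPLICATED
    return IsrState.FULLY_REPLICATED
```

Under this sketch, a partition with a replication factor of 3, "min.insync.replicas" of 2, and 2 replicas in the ISR is classified as AT_MIN_ISR, while the same partition with all 3 replicas in sync is FULLY_REPLICATED.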

Examples

Example 1:

1 partition

...

In this example, AtMinIsr triggers when there is only 1 insync replica remaining, and tells us that 1 more failure will cause producers with acks=ALL to be unavailable (the whole partition will be unavailable in this scenario).

Usage

A potential usage of this new AtMinIsr category is:

  1. Set up an alert to trigger when AtMinIsr > 0 for a period of time
  2. If the alert is triggered, then assess the health of the cluster:
    1. If there is ongoing maintenance, then no action is needed
    2. Otherwise a broker may be unhealthy. The AtMinIsr partition metric or the --at-min-isr-partitions TopicCommand option can be used to quickly determine the list of topics to repartition if the unhealthy broker(s) cannot be fixed quickly
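The "AtMinIsr > 0 for a period of time" condition in step 1 could be evaluated along the following lines. This is a minimal sketch that assumes the metric is sampled at a fixed interval and that the "period of time" is expressed as a number of consecutive samples; `should_alert` and `required_consecutive` are hypothetical names, not part of any Kafka or monitoring API.

```python
from typing import Sequence


def should_alert(samples: Sequence[int], required_consecutive: int) -> bool:
    """Return True if the most recent `required_consecutive` samples are all > 0.

    `samples` is the AtMinIsr count over time, oldest first (an assumption
    of this sketch about how the metric is collected).
    """
    if required_consecutive <= 0:
        raise ValueError("required_consecutive must be positive")
    if len(samples) < required_consecutive:
        # Not enough history yet to say the condition has held long enough.
        return False
    return all(value > 0 for value in samples[-required_consecutive:])
```

A larger `required_consecutive` filters out short blips but delays detection, which is the same trade-off that applies to any sample-count threshold on a metric.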

AtMinIsr Values + Possible Explanations

...

Everything is fine, and it is business as usual. Nothing to do here.

2. AtMinIsr is consistently greater than zero for a prolonged period of time

Broker(s) may have failed, so this could warrant an alert for an operator to check the health of the cluster and see whether any brokers are down.

3. AtMinIsr bounces between zero and non-zero repeatedly

Broker(s) may be experiencing trouble, such as high load or temporary network issues, which is causing replicas of the partition to temporarily fall out of sync.
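The patterns described above can be told apart mechanically from a series of AtMinIsr samples. The following is a rough sketch under stated assumptions: the thresholds `sustained_len` and `flap_count` and the function name `describe_pattern` are hypothetical, and suitable values would depend on the sampling interval and the cluster.

```python
from typing import Sequence


def describe_pattern(samples: Sequence[int], sustained_len: int = 5,
                     flap_count: int = 2) -> str:
    """Label a series of AtMinIsr samples (oldest first)."""
    if all(value == 0 for value in samples):
        return "always zero"

    # Length of the run of non-zero samples at the end of the series.
    trailing_run = 0
    for value in reversed(samples):
        if value == 0:
            break
        trailing_run += 1
    if trailing_run >= sustained_len:
        return "sustained non-zero"

    # Count transitions from zero to non-zero.
    rises = sum(1 for prev, cur in zip(samples, samples[1:])
                if prev == 0 and cur > 0)
    if rises >= flap_count:
        return "bouncing"
    return "inconclusive"
```

With the default thresholds, a series such as `[0, 0, 2, 2, 2, 2, 2]` is labeled as sustained, and `[0, 1, 0, 2, 0, 1, 0]` as bouncing.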

NOTE: There are still scenarios in which AtMinIsr will be non-zero during planned maintenance. For example, if RF=3 and minIsr is set to 2, then a planned restart of a broker can cause AtMinIsr to be non-zero. This should not, however, occur outside of planned maintenance.
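The RF=3, minIsr=2 case in the note can be worked through with a few lines of arithmetic. This sketch assumes a single partition replicated on three brokers with hypothetical ids 0, 1 and 2, and assumes that a broker drops out of the ISR for the duration of its maintenance window.

```python
MIN_INSYNC_REPLICAS = 2
REPLICAS = {0, 1, 2}  # hypothetical broker ids hosting the partition (RF=3)


def is_at_min_isr(isr: set, min_insync_replicas: int) -> bool:
    """A partition is AtMinIsr when its ISR size equals min.insync.replicas."""
    return len(isr) == min_insync_replicas


# All brokers up: the ISR has 3 members, so the partition is not AtMinIsr.
steady_state = is_at_min_isr(REPLICAS, MIN_INSYNC_REPLICAS)

# Take one broker down at a time: the ISR shrinks to 2 members, which
# equals min.insync.replicas, so the partition is AtMinIsr each time.
during_maintenance = [
    is_at_min_isr(REPLICAS - {broker}, MIN_INSYNC_REPLICAS)
    for broker in sorted(REPLICAS)
]
```

Here `steady_state` is False and every entry of `during_maintenance` is True, which is why an alert on AtMinIsr may need to be silenced or tolerated while such maintenance is in progress.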

Public Interfaces

We will introduce two new metrics and a new TopicCommand option to identify AtMinIsr partitions.

...