Please keep the discussion on the mailing list rather than commenting on the wiki (wiki discussions get unwieldy fast).


Motivation

A topic partition can be in one of four states (assuming a replication factor of 3):

(ISR = in-sync replica)

3/3 ISRs: OK

2/3 ISRs: WARNING (under-replicated partition)

1/3 ISRs: CRITICAL (under-replicated partition)

0/3 ISRs: FATAL (offline/unavailable partition)

TopicCommand already has the --under-replicated-partitions and --unavailable-partitions flags, but it would be beneficial to add a --critical-partitions option that lists only the partitions in CRITICAL state (a single remaining ISR).

With this new option, Kafka users can identify the exact topic partitions that are critical and need immediate repartitioning. They can also set up critical alerts that trigger whenever the output of this command contains partitions.

A couple of cases where identifying this CRITICAL state is useful for alerting:

  • Users with a large number of topics in a single cluster, for whom manually repartitioning every topic that has under-replicated partitions is impractical, so they only take action once a partition reaches CRITICAL state
  • Users with a high replication factor who can tolerate some broker failures and only take action once a partition reaches CRITICAL state

The “min.insync.replicas” configuration specifies the minimum number of in-sync replicas required for a partition to accept messages from a producer. If the in-sync replica count of a partition falls below the configured “min.insync.replicas”, the broker rejects messages from producers using acks=all. These producers suffer unavailability, as they see a NotEnoughReplicas or NotEnoughReplicasAfterAppend exception.


We currently have an UnderMinIsrPartitionCount metric, which is useful for detecting when partitions fall below “min.insync.replicas”; however, it is still difficult to identify which topic partitions are affected and need fixing.


We can leverage the describe topics command in TopicCommand by adding an option, --under-minisr-partitions, that lists exactly which topic partitions are below “min.insync.replicas”.
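The under-min-ISR condition described above can be sketched in a few lines of Java. This is only an illustration, not TopicCommand's actual code: the class `UnderMinIsrSketch`, the `PartitionInfo` stand-in, and both method names are hypothetical, and the sketch assumes the effective “min.insync.replicas” value is already known.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch; none of these names come from Kafka itself.
final class UnderMinIsrSketch {

    /** Minimal stand-in for the per-partition data that describe already prints. */
    static final class PartitionInfo {
        final String topic;
        final int partition;
        final int isrCount;

        PartitionInfo(String topic, int partition, int isrCount) {
            this.topic = topic;
            this.partition = partition;
            this.isrCount = isrCount;
        }

        @Override
        public String toString() {
            return topic + "-" + partition;
        }
    }

    /** A partition is under min ISR when it has fewer in-sync replicas than required. */
    static boolean isUnderMinIsr(int isrCount, int minInSyncReplicas) {
        return isrCount < minInSyncReplicas;
    }

    /** Keeps only the partitions whose ISR count is below the given minimum. */
    static List<PartitionInfo> underMinIsr(List<PartitionInfo> partitions, int minInSyncReplicas) {
        List<PartitionInfo> result = new ArrayList<>();
        for (PartitionInfo p : partitions) {
            if (isUnderMinIsr(p.isrCount, minInSyncReplicas)) {
                result.add(p);
            }
        }
        return result;
    }
}
```

Note that a partition whose ISR count equals the minimum still accepts acks=all writes, so the comparison is strict.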


Public Interfaces

This change would add an additional flag “--under-minisr-partitions” to TopicCommand, but the output will follow the same format as the “under-replicated-partitions” and “offline-partitions” options.


Proposed Changes

When a user specifies the --critical-partitions option, TopicCommand will print only the topic partitions whose ISR count equals 1 and whose topic has a replication factor greater than 1.

We will not include topic partitions with a replication factor of 1: they are intended to be single-replica partitions, so listing them in this command would not be useful.
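The selection rule above amounts to a two-part predicate. A minimal sketch, assuming the replication factor and ISR count are already available per partition (the class and method names are hypothetical):

```java
// Hypothetical sketch of the --critical-partitions selection rule.
final class CriticalPartitionSketch {

    /**
     * Critical means exactly one in-sync replica remains on a partition
     * that was configured with more than one replica.
     */
    static boolean isCritical(int replicationFactor, int isrCount) {
        return replicationFactor > 1 && isrCount == 1;
    }
}
```

Under this rule a partition with zero ISRs is not reported as critical, since it is already offline and would be covered by the existing unavailable-partitions output.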

...

The challenge with supporting this additional feature is that the “min.insync.replicas” configuration may be set at a broker or topic level. 


We can determine the configured “min.insync.replicas” for a topic by:

(1) Checking the topic-level configuration in ZooKeeper

(2) Using AdminClient to get the broker/cluster-level configuration


This means we must add an additional flag, --bootstrap-server, so that AdminClient can describe broker configurations when no topic-level override is found in ZooKeeper.
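The lookup order above can be sketched as a simple fallback. This is an illustration under assumptions rather than the proposed implementation: the two maps stand in for configuration that would already have been fetched from ZooKeeper and through AdminClient, and `MinIsrResolverSketch` and `resolveMinIsr` are hypothetical names.

```java
import java.util.Map;
import java.util.Optional;

// Hypothetical sketch; the maps stand in for already-fetched configuration.
final class MinIsrResolverSketch {

    static final String MIN_ISR = "min.insync.replicas";

    /**
     * Returns the effective min.insync.replicas for a topic:
     * the topic-level override if present, otherwise the broker-level value.
     * Empty if neither source defines it.
     */
    static Optional<Integer> resolveMinIsr(Map<String, String> topicOverrides,
                                           Map<String, String> brokerConfig) {
        String value = topicOverrides.get(MIN_ISR);   // (1) topic-level override
        if (value == null) {
            value = brokerConfig.get(MIN_ISR);        // (2) broker/cluster-level value
        }
        return value == null
                ? Optional.empty()
                : Optional.of(Integer.parseInt(value.trim()));
    }
}
```

Returning an empty Optional leaves it to the caller to decide what to do when neither source supplies a value, for example reporting an error instead of silently assuming a default.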


Compatibility, Deprecation, and Migration Plan

As this change adds a new option instead of modifying existing ones, there will not be any compatibility issues or a migration

...

Rejected Alternatives

single-replica-partitions option

We could add this option to list all topic partitions that have only one in-sync replica. This would include every partition with a single in-sync replica, regardless of replication factor (RF >= 1).

...

.