You are viewing an old version of this page. View the current version.

Compare with Current View Page History

« Previous Version 4 Next »

Status

Current stateUnder Discussion

Discussion thread: here [Change the link from the KIP proposal email archive to your own email thread]

JIRA:

Please keep the discussion on the mailing list rather than commenting on the wiki (wiki discussions get unwieldy fast).

Motivation

Leader and ISR information is stored in the `/brokers/topics/[topic]/partitions/[partitionId]/state` znode. It can be modified by both the controller and the current leader in the following circumstances:

  1. The controller creates the initial Leader and ISR on topic creation
  2. The controller can shrink the ISR as part of a controlled shutdown or replica reassignment
  3. The controller can elect new leaders at any time (e.g. preferred leader election)
  4. Leaders can expand or shrink the ISR as followers come in and out of sync.

Since the znode can be modified by both the controller and partition leaders, care must be taken to protect updates. We use the zkVersion of the corresponding znode to protect updates, which means we need a mechanism to propagate it between the controller and leaders and vice versa. Currently , controllers propagate the zkVersion to leaders through the LeaderAndIsr request. Leaders, on the other hand, propagate the zkVersion to the controller by creating a sequential notification znode that the controller watches.

In this KIP, we propose a new AlterIsr API to replace the notification znode in order to give the controller the exclusive ability to update Leader and ISR state. Leaders will use this API to request an ISR change from the controller rather than directly mutating the underlying state. This ensures that the controller will always have the latest Leader and ISR state for all partitions. Concretely, this has the following benefits:

  • It will not be possible for the controller to send stale metadata through the LeaderAndIsr and UpdateMetadata APIs. New requests will always reflect the latest state.
  • The controller can reject inconsistent leader and ISR changes. For example, if the controller sees a broker as offline, it can refuse to add it back to the ISR even though the leader still sees the follower fetching.
  • When updating leader and ISR state, it won't be necessary to reinitialize current state (see KAFKA-8585). Preliminary testing shows this can cut controlled shutdown time down by as much as 40% (take this with a big grain of salt).

Below we discuss the behavior of the new API in more detail.

Public Interfaces

We will introduce a new AlterIsr API which requires CLUSTER_ACTION permission (similar to other InterBroker APIs such as LeaderAndIsr). The request and response schema definitions are provided below:

AlterIsrRequest => BrokerId BrokerEpoch [Topic [PartitionId LeaderAndIsr]]
  BrokerId => Int32
  BrokerEpoch => Int64
  Topic => String
  PartitionId => Int32
  LeaderAndIsr => Leader LeaderEpoch Isr CurrentVersion
    Leader => Int32
    LeaderEpoch => Int32
    Isr => [Int32]
    CurrentVersion => Int32

AlterIsrResponse => ErrorCode [Topic [PartitionId Version ErrorCode]]
  ErrorCode => Int16
  Topic => String
  PartitionId => Int32
  Version => Int32

Possible top-level errors:

  • CLUSTER_AUTHORIZATION_FAILED
  • STALE_BROKER_EPOCH
  • NOT_CONTROLLER

Partition-level errors:

  • FENCED_LEADER_EPOCH: There is a new leader for the topic partition. The leader should not retry.
  • INVALID_REQUEST: The update was rejected due to some internal inconsistency (e.g. invalid replicas specified in the ISR)
  • INVALID_ISR_VERSION (NEW): The (Zk)version in the request is out of date. Likely there is a pending LeaderAndIsr request which the leader has not yet received.
  • UNKNOWN_TOPIC_OR_PARTITION: The topic no longer exists.

Proposed Changes

In this KIP, we propose to let AlterIsr requests be sent to the controller and handled asynchronously without blocking the leader. The basic idea is to always assume the most conservative current ISR when advancing the high watermark given a pending update.

ISR Shrink: we require successful acknowledgement from the controller before a follower can be taken out of the ISR and the high watermark advanced. The leader will send the AlterIsrRequest, but will keep the current ISR fixed until a successful response is received.

ISR Expand: followers are added to the local ISR immediately without waiting for acknowledgement from the controller. Even if the ISR expansion fails, the exposed high watermark will be less than or equal to the "true" value.

Compatibility, Deprecation, and Migration Plan

Use of the new AlterIsr API will be gated by the `inter.broker.protocol.version` config. Once the IBP has been bumped to the latest version, the controller will no longer register a watch for the leader and ISR notification znode.

Rejected Alternatives

None yet.

  • No labels