Versions Compared

Key

  • This line was added.
  • This line was removed.
  • Formatting was changed.

Table of Contents

Master KIP

KIP-500: Replace ZooKeeper with a Self-Managed Metadata Quorum (Accepted)

Status

Current state:  Under DiscussionAdopted

Discussion thread: here

JIRA:

Jira
serverASF JIRA
serverId5aa69414-a9e9-3523-82ec-879b028fb15b
keyKAFKA-8836

Please keep the discussion on the mailing list rather than commenting on the wiki (wiki discussions get unwieldy fast).

Motivation

Leader and ISR information is stored in the `/brokers/topics/[topic]/partitions/[partitionId]/state` znode. It can be modified by both the controller and the current leader in the following circumstances:

...

  • It will not be possible for the controller to send stale metadata through the LeaderAndIsr and UpdateMetadata APIs. New requests will always reflect the latest state.
  • The controller can reject inconsistent leader and ISR changes. For example, if the controller sees a broker as offline, it can refuse to add it back to the ISR even though the leader still sees the follower fetching.
  • When updating leader and ISR state, it won't be necessary to reinitialize current state (see KAFKA-8585). Preliminary testing shows this can cut controlled shutdown time down by as much as 40% (take this with a big grain of salt).
  • Partition reassignments will complete sooner since there is no delayed propagation of updated ISRsreassignments complete only when new replicas are added to the ISR. With this change, reassignments can complete sooner because the controller does not have to await change notification.

Below we discuss the behavior of the new API in more detail.

Public Interfaces

We will introduce a new AlterIsr API which requires CLUSTER_ACTION permission (similar to other InterBroker APIs such as LeaderAndIsr). The request and response schema definitions are provided below:

Code Block
AlterIsrRequest => BrokerId BrokerEpoch [Topic [PartitionId LeaderAndIsr]]
  BrokerId => Int32
  BrokerEpoch => Int64
  Topic => String
  PartitionId => Int32
  LeaderAndIsr => Leader LeaderEpoch Isr CurrentVersionCurrentZkVersion
    Leader => Int32
    LeaderEpoch => Int32
    Isr => [Int32]
    CurrentVersionCurrentZkVersion => Int32

AlterIsrResponse => ErrorCode [Topic [PartitionId Version ErrorCode]]
  ErrorCode => Int16
  Topic => String
  PartitionId => Int32
  Version => Int32

Possible top-level errors:

...

  • FENCED_LEADER_EPOCH: There is a new leader for the topic partition. The leader should not retry.
  • INVALID_REQUEST: The update was rejected due to some internal inconsistency (e.g. invalid replicas specified in the ISR)
  • INVALID_ISR_VERSION (NEW): The (Zk)version in the request is out of date. Likely there is a pending LeaderAndIsr request which the leader has not yet received.
  • UNKNOWN_TOPIC_OR_PARTITION: The topic no longer exists.

Proposed Changes

The main change in this KIP is to send the AlterIsr request to the controller when shrinking or expanding the ISR on the leader. The controller will update the state asynchronously and send a LeaderAndIsr request once the change has been completed.

...

Improved batching: We are also providing the opportunity for additional batching when updating Zookeeper. For example, ISR expansions following a rolling restart are often not time sensitive. In the future, the controller could delay the expansion in order to do multiple updates at once. Here we take a simpler approach: when the leader receives a Fetch request from a follower, some number of partitions will come into sync. The leader will send the AlterIsr for all of these partitions. The controller will respond as soon as it receives the request and apply the ISR updates. Once they are complete, it will send LeaderAndIsr updates for all affected partitions. This handles a common case today during a rolling restart when a broker comes back online and is added to many ISRs at once. In fact, the lack of batching for this case today introduces the risk of a replica falling out of ISR while all of these updates are being applied individually by the leader.

It can happen that a leader and ISR update fails after the AlterIsr was acknowledged. For example, if the controller fails before applying the changes, a new controller will be elected. It could also happen that there was already a pending change at the time the leader wanted to update ISR. In either case, the leader will receive a LeaderAndIsr update with the new version. The leader will check the status of current replicas at that time to see whether any additional changes are needed.

Compatibility, Deprecation, and Migration Plan

Use of the new AlterIsr API will be gated by the `inter.broker.protocol.version` config. Once the IBP has been bumped to the latest version, the controller will no longer register a watch for the leader and ISR notification znode.

Rejected Alternatives

None yet.