
Status

Current state: Draft

Discussion thread:

JIRA:

Please keep the discussion on the mailing list rather than commenting on the wiki (wiki discussions get unwieldy fast).

Motivation

To complete the plan for KIP-500: Replace ZooKeeper with a Self-Managed Metadata Quorum, we need a way to migrate Kafka clusters from a ZooKeeper quorum to a KRaft quorum. This must be done without impact to partition availability and with minimal impact to operators and client applications. 

In order to give users more confidence about undertaking the migration to KRaft, we will allow a rollback to ZooKeeper until the final step of the migration. This is accomplished by writing two copies of the metadata during the migration – one to the KRaft quorum, and one to ZooKeeper.

This KIP defines the behavior and set of new APIs for the “bridge release” as first mentioned in KIP-500. 



Public Interfaces

Metrics

MBean name: kafka.server:type=KafkaServer,name=MigrationState
Description: An enumeration of the possible migration states the broker is in. Each broker reports this metric. All migration states except "MigrationReady" can be reported by the brokers.

MBean name: kafka.controller:type=KafkaController,name=MigrationState
Description: An enumeration of the possible migration states the cluster can be in. This is only reported by the active controller. The "ZooKeeper" and "MigrationEligible" states are reported by the ZK controller, while the remaining states are reported by the KRaft controller.

MBean name: kafka.controller:type=KafkaController,name=ZooKeeperWriteBehindLag
Description: The amount of lag, in records, that ZooKeeper is behind the highest committed record in the metadata log. This metric is only reported by the active KRaft controller.

New metadata.version (IBP)

A new metadata.version will be used for a few things in this design.

All brokers must be running this metadata.version before the migration can begin. 

Migration-mode configuration

A new “kafka.metadata.migration.enable” config will be added for the broker and controller. Its default will be “false”. Setting this config to “true” on the brokers is a prerequisite to starting the migration. Setting this to "true" on the KRaft controllers is the trigger for starting the migration (more on that below).
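As a minimal sketch (the broker ID and ZooKeeper connection string are illustrative placeholders, not part of this KIP), a ZK-mode broker that is eligible for migration would carry something like the following in its server.properties:

broker.id=1
zookeeper.connect=zk1.example.com:2181/kafka
# Prerequisite for starting the migration
kafka.metadata.migration.enable=true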

MigrationCheck RPC

Brokers will use the new metadata.version to enable a new MigrationCheck RPC. This RPC will be used by the KRaft controller to determine if the cluster is ready to be migrated. The response will include the cluster ID and a boolean indicating if the migration mode config has been enabled statically on this broker.

The purpose of this RPC is to signal that a broker is able to be migrated. When the KRaft controller begins the migration process, it will first check that the live brokers are able to be migrated.

Request:

{
  "apiKey": TBD,
  "type": "request",
  "name": "MigrationCheckRequest",
  "validVersions": "0",
  "flexibleVersions": "0+",
  "fields": [ ]
}

Response:

{
  "apiKey": TBD,
  "type": "response",
  "name": "MigrationCheckResponse",
  "validVersions": "0",
  "flexibleVersions": "0+",
  "fields": [ 
    {"name": "clusterId": "type": "uuid", "versions": "0+"},
    {"name": "configEnabled": "type": "boolean", "versions": "0+"}
  ]
}

Migration State ZNode

As part of the propagation of KRaft metadata back to ZooKeeper while in dual-write mode, we need to keep track of what has been synchronized. A new ZNode will be introduced to keep track of which KRaft record offset has been written back to ZK. This will be used to recover the synchronization state following a KRaft controller failover.

ZNode /migration

{
  "lastOffset": 100,
  "lastTimestamp": "2022-01-01T00:00:00.000Z",
  "kraftControllerId": 3000,
  "kraftControllerEpoch": 1
}

Controller ZNodes

The two controller ZNodes "/controller" and "/controller_epoch" will be managed by the KRaft quorum during the migration. Rather than using ephemeral ZNodes, the KRaft controller will use a persistent ZNode for "/controller" to prevent ZK brokers from attempting to become the active controller. The "/controller_epoch" ZNode will be managed by the active KRaft controller and incremented anytime a new KRaft controller is elected.
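As an illustration only (the exact payload is not specified in this KIP), the persistent "/controller" ZNode written by a KRaft controller with node ID 3000 could reuse the existing controller registration format, while "/controller_epoch" continues to hold the integer controller epoch:

{
  "version": 1,
  "brokerid": 3000,
  "timestamp": "1672531200000"
}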

Operational Changes

Forwarding Enabled on Brokers

As detailed in KIP-500 and KIP-590, all brokers (ZK and KRaft) must forward administrative requests such as CreateTopics to the active KRaft controller once the migration has started. When running the new metadata.version defined in this KIP, all brokers will enable forwarding.

Migration Trigger

The migration from ZK to KRaft will be triggered by the cluster's state. To start a migration, the cluster must meet two requirements:

  1. The metadata.version is set to the version added by this KIP. This indicates the software is at a minimum version which includes the necessary logic to perform the migration
  2. All ZK brokers have kafka.metadata.migration.enable set to “true”. This indicates an operator has declared some intention to start the migration

Once these conditions are satisfied, an operator can start a KRaft quorum with kafka.metadata.migration.enable set to “true” to begin the migration.
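For illustration, a newly provisioned KRaft controller that triggers the migration might be started with a configuration along these lines (node IDs, host names, and the ZooKeeper connection string are assumptions for the example, not defined by this KIP):

process.roles=controller
node.id=3000
controller.quorum.voters=3000@controller1.example.com:9093,3001@controller2.example.com:9093,3002@controller3.example.com:9093
listeners=CONTROLLER://:9093
controller.listener.names=CONTROLLER
# Connection to the existing ZK ensemble so the controller can copy and dual-write metadata
zookeeper.connect=zk1.example.com:2181/kafka
# Trigger for starting the migration
kafka.metadata.migration.enable=true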

By utilizing configs and broker/controller restarts, we follow a paradigm that Kafka operators are familiar with.

Migration Overview

Here is a state machine description of the migration. 


State: ZooKeeper
Description: The cluster is in ZooKeeper mode.

State: MigrationEligible
Description: The cluster has been upgraded to a minimum software version and has set the necessary static configs.

State: MigrationReady
Description: The KRaft quorum has been started.

State: MigrationActive
Description: ZK state has been migrated, the controller is in dual-write mode, and brokers are being restarted in KRaft mode.

State: MigrationFinished
Description: All of the brokers have been restarted in KRaft mode; the controller is still in dual-write mode.

State: KRaft
Description: The cluster is in KRaft mode.

And a state machine diagram:


Preparing the Cluster

The first step of the migration is to upgrade the cluster to at least the bridge release version. This will also include setting the metadata.version to the one specified in this KIP. Upgrading the cluster to a well known starting point will reduce our compatibility matrix and ensure that the necessary logic is in place prior to the migration.
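For ZK-mode brokers, the metadata.version is carried by the inter-broker protocol version, as noted in the "New metadata.version (IBP)" section above. A sketch of the relevant server.properties line, assuming a hypothetical version string for the metadata.version added by this KIP:

# Hypothetical value; use the metadata.version defined by this KIP
inter.broker.protocol.version=3.4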

Controller Migration

A new set of nodes will be provisioned to host the controller quorum. These controllers will be started with kafka.metadata.migration.enable set to “true”. Once the quorum is established and a leader is elected, the migration process will begin. The ZK data migration will copy the existing ZK data into the KRaft metadata log and establish the new KRaft active controller as the active controller from a ZK perspective.

While in migration mode, the KRaft controller will write to the metadata log as well as to ZooKeeper.

At this point, all of the brokers are running in ZK mode and their broker-controller communication channels operate as they would with a ZK controller. From a broker’s perspective, the controller looks and behaves like a normal ZK controller.

The metadata migration process will cause controller downtime proportional to the total size of metadata in ZK. 

In order to ensure consistency of the metadata, we must stop making any writes to ZK while we are migrating the data.

Broker Migration

Following the migration of metadata and controller leadership to KRaft, the brokers are restarted one-by-one in KRaft mode. While this rolling restart is taking place, the cluster will be composed of both ZK and KRaft brokers. 

The broker migration phase does not cause downtime, but it is effectively unbounded in its total duration. 

There is likely no reasonable way to put a limit on how long a cluster stays in a mixed state since rolling restarts for large clusters may take several hours. We also allow the operator to revert back to ZK during this time.
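As an illustrative sketch (IDs, host names, and listener setup are placeholders, not specified by this KIP), a broker that has been restarted in KRaft mode during this phase would be configured roughly as follows:

process.roles=broker
node.id=1
controller.quorum.voters=3000@controller1.example.com:9093,3001@controller2.example.com:9093,3002@controller3.example.com:9093
listeners=PLAINTEXT://:9092
advertised.listeners=PLAINTEXT://broker1.example.com:9092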

Finalizing the Migration

Once the cluster has been fully upgraded to KRaft mode, the controller will still be running in migration mode. The operator still has a chance to revert back to ZK.

The time that the cluster is running all KRaft brokers/controllers, but still running in migration mode, is effectively unbounded.

Once the operator has decided to commit to KRaft mode, the final step is to restart the controller quorum and take it out of migration mode by unsetting kafka.metadata.migration.enable. Once the controller leaves migration mode, it will no longer perform writes to ZK and it will disable its special handling of ZK RPCs.

At this point, the cluster is fully migrated. A rollback to ZK is still possible after finishing the migration, but it must be done offline and it will cause metadata loss (which can also cause partition data loss).

Compatibility

Dual Metadata Writes

Metadata will be written to the KRaft metadata log as well as to ZooKeeper during the migration. This gives us two important guarantees: a safe path back to ZK mode, and compatibility with the ZK brokers' remaining metadata access that relies on ZK watches.

At any time during the migration, it should be possible for the operator to decide to revert back to ZK mode. This process should be safe and straightforward. By writing all metadata updates to both KRaft and ZK, we can ensure that the state stored in ZK is up-to-date.

By writing metadata changes to ZK, we also maintain compatibility with a few remaining direct ZK dependencies that exist on the ZK brokers. 

  • Broker Registration
  • ACLs
  • Dynamic Configs
  • Delegation Tokens

The ZK brokers still rely on the watch mechanism to learn about changes to this metadata. By performing dual writes, we cover these cases.

The controller will use a bounded write-behind approach for ZooKeeper updates. As we commit records to KRaft, we will asynchronously write data back to ZooKeeper. The number of pending ZK records will be bounded so that we can avoid excessive lag between the KRaft and ZooKeeper states.

This dual write approach ensures that any metadata seen in ZK will also be committed to KRaft.

ZK Broker RPCs

In order to support brokers that are still running in ZK mode, the KRaft controller will need to send out a few additional RPCs to keep things working in the broker. 

LeaderAndIsr: when the KRaft controller handles AlterPartitions or performs a leader election, we will need to send LeaderAndIsr requests to ZK brokers. 

UpdateMetadata: for certain metadata changes, the KRaft controller will need to send UpdateMetadataRequests to the ZK brokers. For the “ControllerId” field in this request, the controller should specify a random KRaft broker. Additionally, the controller must specify if a broker in “LiveBrokers” is KRaft or ZK.

StopReplicas: following reassignments and topic deletions, we will need to send StopReplicas to ZK brokers for them to stop managing certain replicas. 

Controller Leadership

In order to prevent further writes to ZK, the first thing the new KRaft quorum must do is take over leadership of the ZK controller. This can be achieved by unconditionally writing a value into the “/controller” and “/controller_epoch” ZNodes. The active KRaft controller will write its node ID (e.g., 3000) into the ZNode as a persistent value. By writing a persistent value (rather than ephemeral), we can prevent any ZK brokers from ever claiming controller leadership.

If a KRaft controller failover occurs, the new active controller will overwrite the values in “/controller” and “/controller_epoch”. 

Broker Registration

While running in migration mode, we must synchronize broker registration information bidirectionally between ZK and KRaft. 

The KRaft controller will send UpdateMetadataRequests to ZK brokers to inform them of the other brokers in the cluster. This information is used by the brokers for the replication protocols. Similarly, the KRaft controller must know about ZK and KRaft brokers when performing operations like assignments and leader election.

ZK brokers, KRaft brokers, and the KRaft controller must know about all brokers in the cluster.

In order to discover which ZK brokers exist, the KRaft controller will need to read the “/brokers” state from ZK and copy it into the metadata log. Inversely, as KRaft brokers register with the KRaft controller, we must write this data back to ZK to prevent ZK brokers from registering with the same node ID.
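For reference, an abbreviated and simplified example of the per-broker registration data under "/brokers/ids/<id>" that would be synchronized in both directions (the field set shown is illustrative, not exhaustive):

{
  "version": 5,
  "host": "broker1.example.com",
  "port": 9092,
  "endpoints": ["PLAINTEXT://broker1.example.com:9092"],
  "listener_security_protocol_map": {"PLAINTEXT": "PLAINTEXT"},
  "timestamp": "1672531200000"
}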

AdminClient, MetadataRequest, and Forwarding

When a client bootstraps metadata from the cluster, it must receive the same metadata regardless of the type of broker it is bootstrapping from. Normally, ZK brokers return the active ZK controller as the ControllerId and KRaft brokers return a random alive KRaft broker. In both cases, this ControllerId is internally read from the MetadataCache on the broker.

Since we require controller forwarding for this KIP, we can use the KRaft approach of returning a random broker (ZK or KRaft) as the ControllerId and rely on forwarding for write operations.

However, we do not want to add the overhead of forwarding for inter-broker requests such as AlterPartitions and ControlledShutdown. In the UpdateMetadataRequest sent by the KRaft controller to the ZK brokers, the ControllerId will point to the active controller which will be used for the inter-broker requests.

Topic Deletions

The ZK migration logic will need to deal with asynchronous topic deletions when migrating data. Normally, the ZK controller will complete these asynchronous deletions via TopicDeletionManager. If the KRaft controller takes over before a deletion has occurred, we will need to complete the deletion as part of the ZK to KRaft state migration. Once the migration is complete, we will need to finalize the deletion in ZK so that the state is consistent.

Failure Modes

There are a few failure scenarios to consider during the migration. The KRaft controller can crash while initially copying the data from ZooKeeper, the controller can crash some time after the initial migration, and the controller can fail to write new metadata back to ZK.

For the initial migration, the controller will utilize KIP-868 Metadata Transactions to write all of the ZK metadata in a single transaction. If the controller fails before this transaction is finalized, the next active controller will abort the transaction and restart the migration process.

Once the data has been migrated and the cluster is in the MigrationActive or MigrationFinished state, the KRaft controller may fail. If this happens, the Raft layer will elect a new leader, which will update the "/controller" and "/controller_epoch" ZNodes and take over controller leadership as usual.

It is also possible for a write to ZK to fail. In this case, 



Test Plan

Describe in few sentences how the KIP will be tested. We are mostly interested in system tests (since unit-tests are specific to implementation details). How will we know that the implementation works as expected? How will we know nothing broke?

Rejected Alternatives

Offline Migration

The main alternative to this design is to do an offline migration. While this would be much simpler, it would be a non-starter for many Kafka users who require minimal downtime of their cluster. By allowing for an online migration from ZK to KRaft, we can provide a path towards KRaft for all Kafka users – even ones where Kafka is critical infrastructure. 

No Dual Writes

Another simplifying alternative would be to only write metadata into KRaft while in the migration mode. This has a few disadvantages. Primarily, it makes rolling back to ZK much more difficult, if at all possible. Secondly, we still have a few remaining ZK read usages on the brokers that need the data in ZK to be up-to-date (see the above section on Dual Metadata Writes).

Command/RPC based trigger

Another way to start the migration would be to have an operator issue a special command or send a special RPC. Adding human-driven manual steps like this to the migration may make it more difficult to integrate with orchestration software such as Ansible, Chef, Kubernetes, etc. By sticking with a "config and reboot" approach, the migration trigger remains simple and is easier to integrate into other control systems.

Write-ahead ZooKeeper data synchronization

TODO
