
...

This field will only be set by the KRaft controller when sending ApiVersionsResponse to other KRaft controllers. Since this migration does not support combined mode KRaft nodes, this field will never be seen by clients when receiving ApiVersionsResponse sent by brokers.

The initial supported values will be:

  • 0: Not Ready
  • 1: Ready
  • unset: Not a ZK controller

Migration Metadata Record

...

A new version of the broker registration RPC will be added to support ZK brokers registering with the KRaft quorum. A new tagged field is added to signify that a ZK broker is ready for migration. The presence of this field indicates that the sending broker is a ZK broker. The use of this RPC by a ZK broker indicates that it has "zookeeper.metadata.migration.enable" and the quorum connection configs properly set. The values of this tagged field are the same as the equivalent field in ApiVersionsResponse.


Code Block
{
  "apiKey": 62,
  "type": "request",
  "listeners": ["controller"],
  "name": "BrokerRegistrationRequest",
  "validVersions": "0-1",
  "flexibleVersions": "0+",
  "fields": [
    ...
    { "name": "ZkMigrationReady", "type": "int8", "versions": "1+", "taggedVersions": "1+", "tag": 1, "ignorable": true,
      "about": "Set by a ZK broker if the required configurations for ZK migration are present." }   <--- new field
  ]
}

RegisterBrokerRecord

A new field is added to signify that a registered broker is a ZooKeeper broker.

Code Block
{
  "apiKey": 0,
  "type": "metadata",
  "name": "RegisterBrokerRecord",
  "validVersions": "0-2",
  "flexibleVersions": "0+",
  "fields": [
    ...
    { "name": "ZkMigrationReady", "type": "int8", "versions": "2+", "taggedVersions": "2+", "tag": 1, "ignorable": true,
      "about": "Set by a ZK broker if the required configurations for ZK migration are present." }   <--- new field
  ]
}

Migration State ZNode

As part of the propagation of KRaft metadata back to ZooKeeper while in dual-write mode, we need to keep track of what has been synchronized. A new ZNode will be introduced to keep track of which KRaft record offset has been written back to ZK. This will be used to recover the synchronization state following a KRaft controller failover. 

Code Block
ZNode /migration

{
  "version": 0,
  "kraft_controller_id": 3000,
  "kraft_controller_epoch": 1,
  "kraft_metadata_offset": 1234,
  "kraft_metadata_epoch": 10
}

By using conditional updates on this ZNode, we can fence old KRaft controllers from synchronizing data to ZooKeeper if there has been a new election.
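The fencing mechanism can be illustrated with a small in-memory model of ZooKeeper's versioned conditional writes (a sketch only; MigrationZNode and its methods are illustrative stand-ins, not Kafka or ZooKeeper client code):

```python
class MigrationZNode:
    """Toy stand-in for the /migration ZNode with ZooKeeper-style versioned writes."""
    def __init__(self, data):
        self.data = data
        self.version = 0  # ZooKeeper bumps the ZNode version on every successful write

    def conditional_set(self, expected_version, new_data):
        # Mirrors a conditional setData: the write succeeds only if the caller
        # still holds the version it last observed.
        if expected_version != self.version:
            return False  # a newer controller wrote first; this controller is fenced
        self.data = new_data
        self.version += 1
        return True

znode = MigrationZNode({"kraft_controller_epoch": 1, "kraft_metadata_offset": 1234})

# An old controller reads version 0, then a newly elected controller writes first.
stale_version = znode.version
znode.conditional_set(znode.version,
                      {"kraft_controller_epoch": 2, "kraft_metadata_offset": 1300})

# The old controller's write now fails, so it cannot overwrite newer state.
assert znode.conditional_set(stale_version,
                             {"kraft_controller_epoch": 1, "kraft_metadata_offset": 1250}) is False
assert znode.data["kraft_controller_epoch"] == 2
```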

Controller ZNodes

The two controller ZNodes "/controller" and "/controller_epoch" will be managed by the KRaft quorum during the migration. More details are given in the "Controller Leadership" section below.

A new version of the JSON schema for "/controller" will be added to include an "isKRaft" boolean field.

Code Block
{
  "version": 2,
  "brokerid": 3000,
  "timestamp": 1234567890,
  "isKRaft": true          <-- new field
}

This field is intended to be informational to aid with debugging.

Operational Changes

Forwarding Enabled on Brokers

As detailed in KIP-500 and KIP-590, all brokers (ZK and KRaft) must forward administrative requests such as CreateTopics to the active KRaft controller once the migration has started. When running the new metadata.version defined in this KIP, all brokers will enable forwarding.

Additional ZK Broker Configs 

To support connecting to a KRaft controller for requests such as AlterPartitions, the ZK brokers will need additional configs:

  • controller.quorum.voters: a comma-separated list of "node@host:port" entries (the same as KRaft brokers would set)
  • controller.listener.names: a comma-separated list of listeners used by the controller
  • Corresponding entries in listener.security.protocol.map for the listeners given in controller.listener.names
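Taken together, a migration-ready ZK broker's server.properties might look roughly like this (hostnames, listener names, node IDs, and the IBP value are illustrative, not prescribed by this KIP):

```properties
# Existing ZK-mode configs
zookeeper.connect=zk1:2181/kafka
inter.broker.protocol.version=3.4   # the metadata.version added by this KIP (illustrative value)
zookeeper.metadata.migration.enable=true

# New configs for reaching the KRaft quorum
controller.quorum.voters=3000@controller1:9093,3001@controller2:9093,3002@controller3:9093
controller.listener.names=CONTROLLER
listener.security.protocol.map=CONTROLLER:PLAINTEXT,PLAINTEXT:PLAINTEXT
```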

Additional KRaft Broker Configs 

To support connecting to ZooKeeper during the migration, the KRaft controllers will need additional configs:

  • zookeeper.connect (required)
  • zookeeper.connection.timeout.ms (optional)
  • zookeeper.session.timeout.ms (optional)
  • zookeeper.max.in.flight.requests (optional)
  • zookeeper.set.acl (optional)
  • ZooKeeper SSL configs (optional)

These configs should match the ZK configs in use by the ZK controller.
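A KRaft controller prepared for the migration might therefore be configured roughly as follows (node IDs, hosts, and listener names are illustrative):

```properties
# Standard KRaft controller configs
process.roles=controller
node.id=3000
controller.quorum.voters=3000@controller1:9093,3001@controller2:9093,3002@controller3:9093
controller.listener.names=CONTROLLER
listeners=CONTROLLER://controller1:9093

# Migration configs; the ZK settings should match those of the ZK controller
zookeeper.metadata.migration.enable=true
zookeeper.connect=zk1:2181/kafka
```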

Migration Trigger

The migration from ZK to KRaft will be triggered by the cluster's state. To start a migration, the cluster must meet some requirements:

  1. Brokers have inter.broker.protocol.version set to the version added by this KIP. This enables forwarding and indicates they are at the minimum software version.
  2. Brokers have zookeeper.metadata.migration.enable set to “true”. This indicates an operator has declared some intention to start the migration.
  3. Brokers have the configs in "Additional ZK Broker Configs" set. This allows them to connect to the KRaft controller.
  4. No brokers are offline (we will use offline replicas as a proxy for this).
  5. The KRaft quorum is online and all members have zookeeper.metadata.migration.enable set to "true" as well as ZK configs set.

The operator can prepare the ZK brokers or KRaft controller in either order. The migration will only begin once every node is ready.

By utilizing configs and broker/controller restarts, we follow a paradigm that Kafka operators are familiar with.
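The trigger conditions above can be condensed into a single eligibility check. The sketch below is a toy model with illustrative field names, not the controller's actual logic:

```python
def migration_eligible(zk_brokers, quorum_members, offline_replicas):
    """Toy version of the migration trigger check (criteria 1-5 above)."""
    # 1-3: every ZK broker has registered as migration-ready, which implies its
    #      IBP, migration flag, and quorum connection configs are all in place.
    if not all(b["zk_migration_ready"] for b in zk_brokers):
        return False
    # 4: no offline replicas, used as a proxy for "no brokers offline".
    if offline_replicas:
        return False
    # 5: every quorum member has signaled readiness (via ApiVersionsResponse).
    if not all(m["zk_migration_ready"] for m in quorum_members):
        return False
    return True

brokers = [{"zk_migration_ready": True}, {"zk_migration_ready": True}]
quorum = [{"zk_migration_ready": True}] * 3
assert migration_eligible(brokers, quorum, offline_replicas=[]) is True
assert migration_eligible(brokers, quorum, offline_replicas=["topic-0"]) is False
```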



Migration Overview

Here is a state machine description of the migration. There will likely be more internal states that the controller uses, but these five will be exposed as the ZkMigrationState metric.


State                  Enum   Description

None                   0      This cluster started out as KRaft and was not migrated.

MigrationIneligible    1      The brokers and controllers do not meet the migration criteria. The cluster is operating in ZooKeeper mode.

MigratingZkData        2      The controller is copying data from ZooKeeper into KRaft.

DualWriteMetadata      3      The controller is in KRaft mode making dual writes to ZooKeeper.

MigrationFinalized     4      The cluster has been migrated to KRaft mode.


The active ZooKeeper controller always reports "MigrationIneligible" while the active KRaft controller reports the state corresponding to the state of the migration.
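The exposed states map naturally onto an enum. The sketch below uses the enum values from the table; the forward-transition set is an assumption inferred from the migration flow described later, not something this KIP specifies:

```python
from enum import Enum

class ZkMigrationState(Enum):
    # Enum values match the ZkMigrationState table
    NONE = 0
    MIGRATION_INELIGIBLE = 1
    MIGRATING_ZK_DATA = 2
    DUAL_WRITE_METADATA = 3
    MIGRATION_FINALIZED = 4

# Plausible forward transitions implied by the migration flow; the controller
# may use additional internal states.
TRANSITIONS = {
    ZkMigrationState.MIGRATION_INELIGIBLE: {ZkMigrationState.MIGRATING_ZK_DATA},
    ZkMigrationState.MIGRATING_ZK_DATA: {ZkMigrationState.DUAL_WRITE_METADATA},
    ZkMigrationState.DUAL_WRITE_METADATA: {ZkMigrationState.MIGRATION_FINALIZED},
}

def can_transition(src, dst):
    return dst in TRANSITIONS.get(src, set())

assert can_transition(ZkMigrationState.MIGRATING_ZK_DATA, ZkMigrationState.DUAL_WRITE_METADATA)
assert not can_transition(ZkMigrationState.NONE, ZkMigrationState.MIGRATION_FINALIZED)
```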

Preparing the Cluster

The first step of the migration is to upgrade the cluster to at least the bridge release version. Upgrading the cluster to a well-known starting point will reduce our compatibility matrix and ensure that the necessary logic is in place prior to the migration. The brokers must also set the configs defined above in "Migration Trigger".

To proceed with the migration, all brokers should be online to ensure they satisfy the criteria for the migration. 

Controller Migration

This migration only supports dedicated KRaft controllers as the target deployment. There will be no support for migrating to a combined broker/controller KRaft deployment.

A new set of nodes will be provisioned to host the controller quorum. These controllers will be started with zookeeper.metadata.migration.enable set to “true”. Once the quorum is established and a leader is elected, the active controller will check that the whole quorum is ready to begin the migration. This is done by examining the new tagged field on ApiVersionsResponse that is exchanged between controllers. Following this, the controller will examine the state of the ZK broker registrations and wait for incoming BrokerRegistration requests. Once all ZK brokers have registered with the KRaft controller (and they are in a valid state) the migration process will begin.

There is no ordering dependency between configuring ZK brokers for the migration and bringing up the KRaft quorum. 

The first step in the migration is to copy the existing metadata from ZK and write it into the KRaft metadata log. The active controller will also establish itself as the active controller from a ZK perspective. While copying the ZK data, the controller will not handle any RPCs from brokers. 

The metadata migration process will cause controller downtime proportional to the total size of metadata in ZK. 

The metadata copied from ZK will be encapsulated in a single metadata transaction (KIP-868). A MigrationRecord will also be included in this transaction. 

At this point, all of the brokers are running in ZK mode and their broker-controller communication channels operate as they would with a ZK controller. The ZK brokers will learn about this new controller by receiving an UpdateMetadataRequest from the new KRaft controller. From a broker’s perspective, the controller looks and behaves like a normal ZK controller. 

Metadata changes are now written to the KRaft metadata log as well as ZooKeeper. 

This dual-write mode will write metadata to both the KRaft metadata log and ZooKeeper.

In order to ensure consistency of the metadata, we must stop making any writes to ZK while we are migrating the data. This is accomplished by making the new KRaft controller the active ZK controller, forcing a write to the "/controller" and "/controller_epoch" ZNodes.

Broker Migration

Following the migration of metadata and controller leadership to KRaft, the brokers are restarted one-by-one in KRaft mode. While this rolling restart is taking place, the cluster will be composed of both ZK and KRaft brokers. 

The broker migration phase does not cause downtime, but it is effectively unbounded in its total duration. 

There is likely no reasonable way to put a limit on how long a cluster stays in a mixed state since rolling restarts for large clusters may take several hours. It is also possible for the operator to revert back to ZK during this time.

Finalizing the Migration

Once the cluster has been fully upgraded to KRaft mode, the controller will still be running in migration mode and making dual writes to KRaft and ZK. Since the data in ZK is still consistent with that of the KRaft metadata log, it is still possible to revert back to ZK.

The time that the cluster is running all KRaft brokers/controllers, but still running in migration mode, is effectively unbounded.

Once the operator has decided to commit to KRaft mode, the final step is to restart the controller quorum and take it out of migration mode by setting zookeeper.metadata.migration.enable to "false" (or unsetting it). The active controller will only finalize the migration once it detects that all members of the quorum have signaled that they are finalizing the migration (again, using the tagged field in ApiVersionsResponse). Once the controller leaves migration mode, it will write a MigrationRecord to the log and no longer perform writes to ZK. It will also disable its special handling of ZK RPCs.

At this point, the cluster is fully migrated and is running in KRaft mode. A rollback to ZK is still possible after finalizing the migration, but it must be done offline and it will cause metadata loss (which can also cause partition data loss).

Implementation and Compatibility

Dual Metadata Writes

Metadata will be written to the KRaft metadata log as well as to ZooKeeper during the migration. This gives us two important guarantees: we have a safe path back to ZK mode and compatibility with ZK broker metadata that relies on ZK watches.

At any time during the migration, it should be possible for the operator to decide to revert back to ZK mode. This process should be safe and straightforward. By writing all metadata updates to both KRaft and ZK, we can ensure that the state stored in ZK is up-to-date.

By writing metadata changes to ZK, we also maintain compatibility with a few remaining direct ZK dependencies that exist on the ZK brokers. 

  • ACLs
  • Dynamic Configs
  • Delegation Tokens

The ZK brokers still rely on the watch mechanism to learn about changes to these metadata. By performing dual writes, we cover these cases.

The controller will use a bounded write-behind approach for ZooKeeper updates. As we commit records to KRaft, we will asynchronously write data back to ZooKeeper. The number of pending ZK records will be reported as a metric so we can monitor how far behind the ZK state is from KRaft. We may also determine a bound on the number of records not yet written to ZooKeeper to avoid excessive difference between the KRaft and ZooKeeper states.

In order to ensure consistency of the data written back to ZooKeeper, we will leverage ZooKeeper multi-operation transactions. With each "multi" op sent to ZooKeeper, we will include the data being written (e.g., topics, configs, etc) along with a conditional update to the "/migration" ZNode. The contents of "/migration" will be updated with each write to include the offset of the latest record being written back to ZooKeeper. By using the conditional update, we can avoid races between KRaft controllers during a failover and ensure consistency between the metadata log and ZooKeeper.
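A single write-back batch might be assembled along these lines. This is only a sketch of the op-list construction; the tuple-based op encoding and function name are illustrative, not the ZooKeeper client API:

```python
def build_migration_multi_op(expected_migration_version, metadata_ops, latest_offset):
    """Toy construction of a ZooKeeper 'multi' transaction for dual writes.

    Each batch pairs the actual metadata writes with a conditional update of the
    /migration ZNode, so the whole transaction fails if another KRaft controller
    has advanced /migration in the meantime (e.g., after a failover).
    """
    # Conditional check: fails the whole multi op if /migration has moved on.
    ops = [("check", "/migration", expected_migration_version)]
    # The actual metadata being synchronized (topics, configs, etc).
    ops.extend(metadata_ops)
    # Record the offset of the latest KRaft record written back to ZK.
    ops.append(("set_data", "/migration", {"kraft_metadata_offset": latest_offset}))
    return ops

ops = build_migration_multi_op(
    expected_migration_version=7,
    metadata_ops=[("set_data", "/config/topics/foo", {"retention.ms": "1000"})],
    latest_offset=4321,
)
assert ops[0] == ("check", "/migration", 7)
assert ops[-1] == ("set_data", "/migration", {"kraft_metadata_offset": 4321})
```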

Another benefit of using multi-operation transactions when synchronizing metadata to ZooKeeper is that we reduce the number of round-trips to ZooKeeper. This pipelining technique is also utilized by the ZK controller for performance reasons.

This dual write approach ensures that any metadata seen in ZK has also been committed to KRaft.

ZK Broker RPCs

In order to support brokers that are still running in ZK mode, the KRaft controller will need to send out additional RPCs to keep the metadata of the ZK brokers up-to-date. 

LeaderAndIsr: when the KRaft controller handles AlterPartitions or performs a leader election, we will need to send LeaderAndIsr requests to ZK brokers. 

UpdateMetadata: for metadata changes, the KRaft controller will need to send UpdateMetadataRequests to the ZK brokers. Instead of ControllerId, the KRaft controller will specify itself using KRaftControllerId field.

StopReplicas: following reassignments and topic deletions, we will need to send StopReplicas to ZK brokers for them to stop managing certain replicas. 

Each of these RPCs will include a new KRaftControllerId field that points to the active KRaft controller. When this field is present, it acts as a signal to the brokers that the controller is in KRaft mode. Using this field, and the zookeeper.metadata.migration.enable config, the brokers can enable migration-specific behavior. 

Controller Leadership

In order to prevent further writes to ZK, the first thing the new KRaft quorum must do is take over leadership of the ZK controller. This can be achieved by unconditionally overwriting two values in ZK. The "/controller" ZNode indicates the current active controller. By overwriting it, a watch will fire on all the ZK brokers to inform them of a new controller election. The active KRaft controller will write its node ID (e.g., 3000) into this ZNode to claim controller leadership. This write will be persistent rather than the usual ephemeral write used by the ZK controller election algorithm. This will ensure that no ZK broker can claim leadership during a KRaft controller failover.

The second ZNode we will write to is "/controller_epoch". This ZNode is used for fencing writes from old controllers in ZK mode. Each write from a ZK controller is actually a conditional multi-write with a "check" operation on the "/controller_epoch" ZNode's version. By altering this node, we can ensure any in-flight writes from the previous ZK controller epoch will fail.

Every time a KRaft controller election occurs, the newly elected controller will overwrite the values in "/controller" and "/controller_epoch". The first epoch generated by the KRaft quorum must be greater than the last ZK epoch in order to maintain the monotonic epoch invariant.
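One plausible way to satisfy the epoch invariant when claiming leadership is sketched below. The KIP only requires that the new epoch exceed the last ZK epoch; the max-based rule and function shape here are assumptions for illustration:

```python
def claim_zk_controller(zk_epoch, kraft_epoch, kraft_node_id):
    """Toy computation of the values a newly elected KRaft controller writes.

    The new /controller_epoch must exceed the last ZK epoch to preserve the
    monotonic epoch invariant. The /controller ZNode gets a persistent (not
    ephemeral) write of the KRaft node id, so ZK brokers cannot claim
    leadership during a KRaft controller failover.
    """
    new_epoch = max(kraft_epoch, zk_epoch + 1)
    controller_znode = {"version": 2, "brokerid": kraft_node_id, "isKRaft": True}
    return new_epoch, controller_znode

epoch, znode = claim_zk_controller(zk_epoch=42, kraft_epoch=1, kraft_node_id=3000)
assert epoch == 43  # strictly greater than the last ZK epoch
assert znode["isKRaft"] is True
```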

Broker Registration

While running in migration mode, the KRaft controller must know about KRaft brokers as well as ZK brokers. This will be accomplished by having the ZK brokers send the broker lifecycle RPCs to the KRaft controller.

A new version of the BrokerRegistration RPC will be used by the ZK brokers to register themselves with KRaft. The ZK brokers will set the new ZkMigrationReady field and populate the Features field with a "metadata.version" min and max supported equal to their IBP. The KRaft controller will only accept the registration if the given "metadata.version" is equal to the IBP/MetadataVersion of the quorum. The controller will also only accept the registration if the ZkMigrationReady has a valid value.

After successfully registering, the ZK brokers will send BrokerHeartbeat RPCs to indicate liveness. The ZK brokers will learn about other brokers in the usual way through UpdateMetadataRequest.

If a ZK broker attempts to register with an invalid node ID, cluster ID, or IBP, the KRaft controller will reject the registration and the broker will terminate.

If a KRaft broker attempts to register itself with the node ID of an existing ZK broker, the controller will reject the registration and the broker will terminate.
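The registration acceptance rules above can be sketched as a single check. This is a toy model; the field names, tuple encoding of the metadata.version range, and error strings are illustrative, not the wire format:

```python
def accept_zk_registration(req, quorum_metadata_version, cluster_id):
    """Toy acceptance check for a ZK broker's BrokerRegistrationRequest."""
    # The ZK broker advertises metadata.version with min == max == its IBP,
    # which must equal the quorum's IBP/MetadataVersion.
    mv = req["features"].get("metadata.version")
    if mv != (quorum_metadata_version, quorum_metadata_version):
        return "INVALID_REGISTRATION"
    # ZkMigrationReady must be present with a valid value (0: Not Ready, 1: Ready).
    if req.get("zk_migration_ready") not in (0, 1):
        return "INVALID_REGISTRATION"
    # The cluster ID must match; otherwise the broker is rejected and terminates.
    if req["cluster_id"] != cluster_id:
        return "INVALID_REGISTRATION"
    return "OK"

good = {"features": {"metadata.version": (8, 8)}, "zk_migration_ready": 1,
        "cluster_id": "L05pbYc6Q4qlvxLk3rTO9A"}
assert accept_zk_registration(good, 8, "L05pbYc6Q4qlvxLk3rTO9A") == "OK"
assert accept_zk_registration(dict(good, zk_migration_ready=None),
                              8, "L05pbYc6Q4qlvxLk3rTO9A") == "INVALID_REGISTRATION"
```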

KRaft Controller Soft Start

When the KRaft quorum is first established prior to starting a migration, it should not handle most RPCs until the initial data migration from ZooKeeper has completed. This is necessary to prevent divergence of metadata during the initial data migration. The controller will need to process RPCs related to Raft as well as BrokerRegistration and BrokerHeartbeat. Other RPCs (such as CreateTopics) will be rejected with a NOT_CONTROLLER error.

Once the metadata migration is complete, the KRaft controller will begin operating normally.
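The soft-start gating can be sketched as a small dispatch check. The allowed-RPC set below is an illustrative reading of "RPCs related to Raft as well as BrokerRegistration and BrokerHeartbeat", not an exhaustive list from the KIP:

```python
NOT_CONTROLLER = 41  # Kafka's NOT_CONTROLLER error code
NONE = 0             # Kafka's "no error" code

# Illustrative set: Raft RPCs plus the broker lifecycle RPCs.
ALLOWED_DURING_MIGRATION = {"Vote", "BeginQuorumEpoch", "EndQuorumEpoch", "Fetch",
                            "BrokerRegistration", "BrokerHeartbeat"}

def handle_rpc(api_name, migration_complete):
    """Before the initial ZK data copy finishes, only Raft and broker lifecycle
    RPCs are processed; everything else gets NOT_CONTROLLER."""
    if not migration_complete and api_name not in ALLOWED_DURING_MIGRATION:
        return NOT_CONTROLLER
    return NONE  # handled normally

assert handle_rpc("CreateTopics", migration_complete=False) == NOT_CONTROLLER
assert handle_rpc("BrokerHeartbeat", migration_complete=False) == NONE
assert handle_rpc("CreateTopics", migration_complete=True) == NONE
```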

AdminClient, MetadataRequest, and Forwarding

When a client bootstraps metadata from the cluster, it must receive the same metadata regardless of the type of broker it is bootstrapping from. Normally, ZK brokers return the active ZK controller as the ControllerId and KRaft brokers return a random alive KRaft broker. In both cases, this ControllerId is internally read from the MetadataCache on the broker.

Since we require controller forwarding for this KIP, we can use the KRaft approach of returning a random broker (ZK or KRaft) as the ControllerId for clients via MetadataResponse and rely on forwarding for write operations.

For inter-broker requests such as AlterPartitions and ControlledShutdown, we do not want to add the overhead of forwarding so we'll want to include the actual controller in the UpdateMetadataRequest. However, we cannot simply include the KRaft controller as the ControllerId. The ZK brokers connect to a ZK controller by using the "inter.broker.listener.name" config and the node information from LiveBrokers in the UpdateMetadataRequest. For connecting to a KRaft controller, the ZK brokers will need to use the "controller.listener.names" and "controller.quorum.voters" configs. To allow this, we will use the new KRaftControllerId field in UpdateMetadataRequest.

Topic Deletions

The ZK migration logic will need to deal with asynchronous topic deletions when migrating data. Normally, the ZK controller will complete these asynchronous deletions via TopicDeletionManager. If the KRaft controller takes over before a deletion has occurred, we will need to complete the deletion as part of the ZK to KRaft state migration. Once the migration is complete, we will need to finalize the deletion in ZK so that the state is consistent.

Meta.Properties

Both ZK and KRaft brokers maintain a meta.properties file in their log directories to store the ID of the node and the cluster. Each broker type uses a different version of this file.


v0 is used by ZK brokers:

Code Block
#
#Tue Nov 29 10:15:56 EST 2022
broker.id=0
version=0
cluster.id=L05pbYc6Q4qlvxLk3rTO9A


v1 is used by KRaft brokers and controllers:

Code Block
#
#Tue Nov 29 10:16:40 EST 2022
node.id=2
version=1
cluster.id=L05pbYc6Q4qlvxLk3rTO9A


Since these two versions contain the same data, but with different field names, we can simply support v0 and v1 in KRaft brokers and avoid modifying the file on disk. By leaving this file unchanged, we better facilitate a downgrade to ZK during the migration.
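Since both layouts are simple properties files, a reader that accepts either version and normalizes the id field is straightforward. This is a sketch, not Kafka's actual parser:

```python
def parse_meta_properties(text):
    """Parse a meta.properties file, accepting both the v0 (ZK) and v1 (KRaft)
    layouts and normalizing the id field name."""
    props = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip comments and the timestamp header
        key, _, value = line.partition("=")
        props[key] = value
    version = int(props.get("version", "0"))
    # v0 names the field broker.id; v1 names it node.id. Same data either way.
    node_id = props["broker.id"] if version == 0 else props["node.id"]
    return {"version": version, "node_id": int(node_id),
            "cluster_id": props["cluster.id"]}

v0 = "#\n#Tue Nov 29 10:15:56 EST 2022\nbroker.id=0\nversion=0\ncluster.id=L05pbYc6Q4qlvxLk3rTO9A"
v1 = "#\n#Tue Nov 29 10:16:40 EST 2022\nnode.id=2\nversion=1\ncluster.id=L05pbYc6Q4qlvxLk3rTO9A"
assert parse_meta_properties(v0)["node_id"] == 0
assert parse_meta_properties(v1)["node_id"] == 2
```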

The active ZooKeeper controller always reports "MigrationIneligible" while the active KRaft controller reports the state corresponding to the state of the migration.

Preparing the Cluster

The first step of the migration is to upgrade the cluster to at least the bridge release version. Upgrading the cluster to a well known starting point will reduce our compatibility matrix and ensure that the necessary logic is in place prior to the migration. The brokers must also set the configs defined above in "Migration Trigger".

To proceed with the migration, all brokers should be online to ensure they satisfy the criteria for the migration. 

Controller Migration

This migration only supports dedicated KRaft controllers as the target deployment. There will be no support for migrating to a combined broker/controller KRaft deployment.

A new set of nodes will be provisioned to host the controller quorum. These controllers will be started with zookeeper.metadata.migration.enable set to “true”. Once the quorum is established and a leader is elected, the active controller will check that the whole quorum is ready to begin the migration. This is done by examining the new tagged field on ApiVersionsResponse that is exchanged between controllers. Following this, the controller will examine the state of the ZK broker registrations and wait for incoming BrokerRegistration requests. Once all ZK brokers have registered with the KRaft controller (and they are in a valid state) the migration process will begin.

There is no ordering dependency between configuring ZK brokers for the migration and bringing up the KRaft quorum. 

The first step in the migration is to copy the existing metadata from ZK and write it into the KRaft metadata log. The active controller will also establish itself as the active controller from a ZK perspective. While copying the ZK data, the controller will not handle any RPCs from brokers. 

The metadata migration process will cause controller downtime proportional to the total size of metadata in ZK. 

The metadata copied from ZK will be encapsulated in a single metadata transaction (KIP-868). A MigrationRecord will also be included in this transaction. 

At this point, all of the brokers are running in ZK mode and their broker-controller communication channels operate as they would with a ZK controller. The ZK brokers will learn about this new controller by receiving an UpdateMetadataRequest from the new KRaft controller. From a broker’s perspective, the controller looks and behaves like a normal ZK controller. 

Metadata changes are now written to the KRaft metadata log as well as ZooKeeper. 

This dual-write mode will write metadata to both the KRaft metadata log and ZooKeeper.

In order to ensure consistency of the metadata, we must stop making any writes to ZK while we are migrating the data. This is accomplished by forcing the new KRaft controller to be the active ZK controller by forcing a write to the "/controller" and "/controller_epoch" ZNodes.

Broker Migration

Following the migration of metadata and controller leadership to KRaft, the brokers are restarted one-by-one in KRaft mode. While this rolling restart is taking place, the cluster will be composed of both ZK and KRaft brokers. 

The broker migration phase does not cause downtime, but it is effectively unbounded in its total duration. 

There is likely no reasonable way to put a limit on how long a cluster stays in a mixed state since rolling restarts for large clusters may take several hours. It is also possible for the operator to revert back to ZK during this time.

Finalizing the Migration

Once the cluster has been fully upgraded to KRaft mode, the controller will still be running in migration mode and making dual writes to KRaft and ZK. Since the data in ZK is still consistent with that of the KRaft metadata log, it is still possible to revert back to ZK.

The time that the cluster is running all KRaft brokers/controllers, but still running in migration mode, is effectively unbounded.

Once the operator has decided to commit to KRaft mode, the final step is to restart the controller quorum and take it out of migration mode by setting zookeeper.metadata.migration.enable to "false" (or unsetting it). The active controller will only finalize the migration once it detects that all members of the quorum have signaled that they are finalizing the migration (again, using the tagged field in ApiVersionsResponse). Once the controller leaves migration mode, it will write a MigrationRecord to the log and no longer perform writes to ZK. It will also disable its special handling of ZK RPCs.

At this point, the cluster is fully migrated and is running in KRaft mode. A rollback to ZK is still possible after finalizing the migration, but it must be done offline and it will cause metadata loss (which can also cause partition data loss).

Implementation and Compatibility

Dual Metadata Writes

Metadata will be written to the KRaft metadata log as well as to ZooKeeper during the migration. This gives us two important guarantees: we have a safe path back to ZK mode and compatibility with ZK broker metadata that relies on ZK watches.

At any time during the migration, it should be possible for the operator to decide to revert back to ZK mode. This process should be safe and straightforward. By writing all metadata updates to both KRaft and ZK, we can ensure that the state stored in ZK is up-to-date.

By writing metadata changes to ZK, we also maintain compatibility with a few remaining direct ZK dependencies that exist on the ZK brokers. 

  • ACLs
  • Dynamic Configs
  • Delegation Tokens

The ZK brokers still rely on the watch mechanism to learn about changes to these metadata. By performing dual writes, we cover these cases.

The controller will use a bounded write-behind approach for ZooKeeper updates. As we commit records to KRaft, we will asynchronously write data back to ZooKeeper. The number of pending ZK records will be reported as a metric so we can monitor how far behind the ZK state is from KRaft. We may also determine a bound on the number of records not yet written to ZooKeeper to avoid excessive difference between the KRaft and ZooKeeper states.

In order to ensure consistency of the data written back to ZooKeeper, we will leverage ZooKeeper multi-operation transactions. With each "multi" op sent to ZooKeeper, we will include the data being written (e.g., topics, configs, etc) along with a conditional update to the "/migration" ZNode. The contents of "/migration" will be updated with each write to include the offset of the latest record being written back to ZooKeeper. By using the conditional update, we can avoid races between KRaft controllers during a failover and ensure consistency between the metadata log and ZooKeeper.
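The conditional-update fencing described above can be modeled with a small simulation (this is not the ZooKeeper client API; a real implementation would bundle a check/setData pair into ZooKeeper's multi operation):

```python
# Simulation: each metadata write-back is a "multi" transaction bundling the
# data update with a conditional update of the "/migration" ZNode. The
# expected version acts as a fence: a write from a deposed controller,
# holding a stale version, fails atomically.

class FakeZk:
    def __init__(self):
        self.data = {"/migration": {"value": {"offset": -1}, "version": 0}}

    def multi(self, path, value, migration_offset, expected_migration_version):
        node = self.data["/migration"]
        if node["version"] != expected_migration_version:
            raise RuntimeError("check failed: stale controller fenced")
        # Both ops commit together, mimicking ZooKeeper's atomic multi.
        self.data[path] = {"value": value, "version": 0}
        node["value"] = {"offset": migration_offset}
        node["version"] += 1

zk = FakeZk()
zk.multi("/brokers/topics/foo", {"partitions": 8},
         migration_offset=100, expected_migration_version=0)
# A failed-over old controller still expects version 0, but it is now 1:
try:
    zk.multi("/config/topics/foo", {"retention.ms": "1000"}, 101, 0)
except RuntimeError as e:
    print(e)  # check failed: stale controller fenced
```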

Another benefit of using multi-operation transactions when synchronizing metadata to ZooKeeper is that we reduce the number of round-trips to ZooKeeper. This pipelining technique is also utilized by the ZK controller for performance reasons.

This dual write approach ensures that any metadata seen in ZK has also been committed to KRaft.

ZK Broker RPCs

In order to support brokers that are still running in ZK mode, the KRaft controller will need to send out additional RPCs to keep the metadata of the ZK brokers up-to-date. 

LeaderAndIsr: when the KRaft controller handles AlterPartitions or performs a leader election, we will need to send LeaderAndIsr requests to ZK brokers. 

UpdateMetadata: for metadata changes, the KRaft controller will need to send UpdateMetadataRequests to the ZK brokers. Instead of ControllerId, the KRaft controller will identify itself using the KRaftControllerId field.

StopReplicas: following reassignments and topic deletions, we will need to send StopReplicas to ZK brokers for them to stop managing certain replicas. 

Each of these RPCs will include a new KRaftControllerId field that points to the active KRaft controller. When this field is present, it acts as a signal to the brokers that the controller is in KRaft mode. Using this field, along with the zookeeper.metadata.migration.enable config, the brokers can enable migration-specific behavior.
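The broker-side decision reduces to a simple predicate (hypothetical helper, not Kafka code):

```python
# Sketch: a ZK broker enables its migration-specific handling only when the
# incoming request carries a KRaftControllerId AND the broker itself has
# zookeeper.metadata.migration.enable set.

def use_migration_behavior(kraft_controller_id, migration_enabled):
    # kraft_controller_id is None when a ZK-mode controller sent the request.
    return kraft_controller_id is not None and migration_enabled

print(use_migration_behavior(3000, True))   # True: KRaft controller + config set
print(use_migration_behavior(None, True))   # False: still under a ZK controller
```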

Controller Leadership

In order to prevent further writes to ZK, the first thing the new KRaft quorum must do is take over leadership of the ZK controller. This can be achieved by unconditionally overwriting two values in ZK. The "/controller" ZNode indicates the current active controller. By overwriting it, a watch will fire on all the ZK brokers to inform them of a new controller election. The active KRaft controller will write its node ID (e.g., 3000) into this ZNode to claim controller leadership. This write will be persistent rather than the usual ephemeral write used by the ZK controller election algorithm. This will ensure that no ZK broker can claim leadership during a KRaft controller failover.

The second ZNode we will write to is "/controller_epoch". This ZNode is used for fencing writes from old controllers in ZK mode. Each write from a ZK controller is actually a conditional multi-write with a "check" operation on the "/controller_epoch" ZNode's version. By altering this node, we can ensure any in-flight writes from the previous ZK controller epoch will fail.

Every time a KRaft controller election occurs, the newly elected controller will overwrite the values in "/controller" and "/controller_epoch". The first epoch generated by the KRaft quorum must be greater than the last ZK epoch in order to maintain the monotonic epoch invariant.
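The takeover steps above can be sketched with a simplified model (illustrative only; real code would issue ZooKeeper client calls, not mutate a dict):

```python
# Simulation of the leadership takeover: the new KRaft controller
# unconditionally overwrites "/controller" with a *persistent* node and bumps
# "/controller_epoch" so in-flight conditional writes from the old ZK
# controller fail their "check" op on the ZNode version.

zk_nodes = {
    "/controller": {"value": 1, "ephemeral": True},      # old ZK controller
    "/controller_epoch": {"value": 42, "version": 42},
}

def kraft_takeover(nodes, kraft_node_id, last_zk_epoch):
    # Persistent write: no ZK broker can claim leadership during a
    # KRaft controller failover.
    nodes["/controller"] = {"value": kraft_node_id, "ephemeral": False}
    # New epoch must be strictly greater than the last ZK epoch; the version
    # change causes any old-controller conditional multi-write to fail.
    old = nodes["/controller_epoch"]
    nodes["/controller_epoch"] = {"value": last_zk_epoch + 1,
                                  "version": old["version"] + 1}

kraft_takeover(zk_nodes, kraft_node_id=3000, last_zk_epoch=42)
print(zk_nodes["/controller"], zk_nodes["/controller_epoch"])
```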

Broker Registration

While running in migration mode, the KRaft controller must know about KRaft brokers as well as ZK brokers. The current set of live brokers will be sent to ZK brokers using UpdateMetadataRequest and sent to KRaft brokers using BrokerRegistration[Change]Record in the metadata log. 

A new version of the BrokerRegistration RPC will be used by the ZK brokers to register themselves with KRaft. The usage of this RPC by a ZK broker indicates that it is properly configured for the migration. The new InterBrokerProtocolVersion tagged field in the RPC is used by the KRaft controller to verify that the whole cluster is using the same IBP/MetadataVersion before starting the migration.

After registering, ZK brokers will send BrokerHeartbeat RPCs to indicate liveness. 

If a ZK broker comes online and registers itself with the nodeId of an existing KRaft broker, we will log an error and fence the errant ZK broker by not sending it UpdateMetadataRequests.

If a KRaft broker attempts to register itself with a nodeId of an existing ZK broker, the controller will refuse the registration and the broker will terminate.
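The two nodeId-collision rules can be summarized in one decision function (hypothetical sketch; the outcome labels are mine, not Kafka's):

```python
# Sketch of nodeId conflict handling during the migration:
# - a ZK broker colliding with a registered KRaft nodeId is fenced
#   (error logged, no UpdateMetadataRequests sent to it);
# - a KRaft broker colliding with a registered ZK nodeId is refused,
#   causing that broker to terminate.

def handle_registration(node_id, is_zk_broker, kraft_ids, zk_ids):
    if is_zk_broker and node_id in kraft_ids:
        return "fence"     # log an error, stop sending UpdateMetadataRequest
    if not is_zk_broker and node_id in zk_ids:
        return "refuse"    # registration rejected; broker terminates
    return "register"

print(handle_registration(1, True,  kraft_ids={1},   zk_ids=set()))  # fence
print(handle_registration(2, False, kraft_ids=set(), zk_ids={2}))    # refuse
print(handle_registration(3, True,  kraft_ids={1},   zk_ids={2}))    # register
```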

AdminClient, MetadataRequest, and Forwarding

When a client bootstraps metadata from the cluster, it must receive the same metadata regardless of the type of broker it is bootstrapping from. Normally, ZK brokers return the active ZK controller as the ControllerId and KRaft brokers return a random alive KRaft broker. In both cases, this ControllerId is internally read from the MetadataCache on the broker.

Since we require controller forwarding for this KIP, we can use the KRaft approach of returning a random broker (ZK or KRaft) as the ControllerId for clients via MetadataResponse and rely on forwarding for write operations.

For inter-broker requests such as AlterPartitions and ControlledShutdown, we do not want to add the overhead of forwarding so we'll want to include the actual controller in the UpdateMetadataRequest. However, we cannot simply include the KRaft controller as the ControllerId. The ZK brokers connect to a ZK controller by using the "inter.broker.listener.name" config and the node information from LiveBrokers in the UpdateMetadataRequest. For connecting to a KRaft controller, the ZK brokers will need to use the "controller.listener.names" and "controller.quorum.voters" configs. To allow this, we will use the new KRaftControllerId field in UpdateMetadataRequest.
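The two resolution paths can be contrasted in a short sketch (illustrative; the function names and the voter map are invented):

```python
# Sketch: clients get a random alive broker as ControllerId and rely on
# forwarding for writes, while ZK brokers resolve the actual KRaft controller
# from the KRaftControllerId field plus their controller.quorum.voters /
# controller.listener.names configs -- not from LiveBrokers.

import random

def controller_id_for_clients(alive_broker_ids):
    # Same approach KRaft uses today: any alive broker will do, since
    # write operations are forwarded to the real controller.
    return random.choice(sorted(alive_broker_ids))

def controller_endpoint_for_brokers(kraft_controller_id, quorum_voters):
    # quorum_voters models the parsed controller.quorum.voters config.
    return quorum_voters[kraft_controller_id]

voters = {3000: "controller0:9093"}
print(controller_id_for_clients({1, 2, 3}))          # any of 1, 2, 3
print(controller_endpoint_for_brokers(3000, voters)) # controller0:9093
```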

Topic Deletions

The ZK migration logic will need to deal with asynchronous topic deletions when migrating data. Normally, the ZK controller will complete these asynchronous deletions via TopicDeletionManager. If the KRaft controller takes over before a deletion has occurred, we will need to complete the deletion as part of the ZK to KRaft state migration. Once the migration is complete, we will need to finalize the deletion in ZK so that the state is consistent.

Rollback to ZK

As mentioned above, it should be possible for the operator to roll back to ZooKeeper at any point in the migration process prior to taking the KRaft controllers out of migration mode. The procedure for rolling back is to reverse the migration steps that have been completed so far.

...

Another way to start the migration would be to have an operator issue a special command or send a special RPC. Adding human-driven manual steps like this to the migration may make it more difficult to integrate with orchestration software such as Ansible, Chef, Kubernetes, etc. By sticking with a "config and reboot" approach, the migration trigger remains simple and is easier to integrate into other control systems.

Write-ahead ZooKeeper data synchronization

...

An alternative to write-behind for ZooKeeper would be to write first to ZooKeeper and then write to the metadata log. The main problem with this approach is that it will make KRaft writes much slower since ZK will always be in the write path. By doing a write-behind with offset tracking, we can amortize the ZK write latency and possibly be more efficient about making bulk writes to ZK. 

Combined Mode Migration Support

Since combined mode is primarily intended for developer environments, support for migrations under combined mode was not considered a priority for this design. By excluding it from this initial design, we can simplify the implementation and reduce the testing matrix by an entire dimension. The migration design is already very complex, so any reduction in scope is beneficial. In the future, it is possible that we could add support for combined mode migrations based on this design.