...

Current state: In Discussion

Discussion thread: https://lists.apache.org/thread/phnrz31dj0jz44kcjmvzrrmhhsmbx945

JIRA:

KAFKA-14304 (ASF JIRA)

...

MBean name and description:

  • kafka.server:type=KafkaServer,name=MetadataType: An enumeration of ZooKeeper (1), Dual (2), KRaft (3). Each broker reports this.
  • kafka.controller:type=KafkaController,name=Features,feature={feature},level={level}: The finalized set of features and their levels, as seen by the controller. Used to help operators see the cluster's current metadata.version.
  • kafka.controller:type=KafkaController,name=MigrationState: An enumeration of the possible migration states the cluster can be in. This is only reported by the active controller. The "ZooKeeper" and "MigrationEligible" states are reported by the ZK controller, while the remaining states are reported by the KRaft controller.
  • kafka.controller:type=KafkaController,name=ZooKeeperWriteBehindLag: The amount of lag, in records, by which ZooKeeper is behind the highest committed record in the metadata log. This metric is only reported by the active KRaft controller.
  • kafka.controller:type=KafkaController,name=ZooKeeperBlockingKRaftMillis: The number of milliseconds a write to KRaft has been blocked due to lagging ZooKeeper writes. This metric is only reported by the active KRaft controller.
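
For example, an operator could read these gauges over JMX with a few lines of standard Java. This is only a sketch: the host "controller1", the JMX port 9999, and the "Value" attribute name are assumptions, and any JMX-capable tool would work just as well.

Code Block
import javax.management.MBeanServerConnection;
import javax.management.ObjectName;
import javax.management.remote.JMXConnector;
import javax.management.remote.JMXConnectorFactory;
import javax.management.remote.JMXServiceURL;

public class MigrationMetricsProbe {
    public static void main(String[] args) throws Exception {
        // Hypothetical controller host and JMX port for this sketch.
        JMXServiceURL url = new JMXServiceURL(
                "service:jmx:rmi:///jndi/rmi://controller1:9999/jmxrmi");
        JMXConnector connector = JMXConnectorFactory.connect(url);
        try {
            MBeanServerConnection conn = connector.getMBeanServerConnection();
            Object state = conn.getAttribute(
                    new ObjectName("kafka.controller:type=KafkaController,name=MigrationState"),
                    "Value");
            Object lag = conn.getAttribute(
                    new ObjectName("kafka.controller:type=KafkaController,name=ZooKeeperWriteBehindLag"),
                    "Value");
            System.out.println("MigrationState=" + state + ", ZooKeeperWriteBehindLag=" + lag);
        } finally {
            connector.close();
        }
    }
}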

MetadataVersion (IBP)

A new MetadataVersion in the 3.4 line will be added. This version will be used for a few things in this design.

...

All brokers must be running at least this MetadataVersion before the migration can begin.

ZK brokers will specify their MetadataVersion using the inter.broker.protocol.version as usual. The KRaft controller will bootstrap with the same MetadataVersion (which is stored in the metadata log as a feature flag – see KIP-778: KRaft to KRaft Upgrades).

Configuration

A new "kafka.metadata.migration.enable" config will be added for the ZK broker and KRaft controller. Its default will be "false". Setting this config to "true" on each broker is a prerequisite to starting the migration. Setting this to "true" on the KRaft controllers is the trigger for starting the migration (more on that below).

Setting this to "true" (or "false") on a KRaft broker has no effect.

ZK Broker Registration JSON

In order to inform the KRaft controller that a ZK broker is ready for migration, a new version of the broker registration JSON will be added. This new version (6) will add a kraftMigration object. The object will include properties needed by the KRaft controller to begin the migration. The usage of this new version (and field) will be gated on the MetadataVersion introduced by this KIP.

Code Block
{
  "version": 6,
  "host": "broker01",
  "port": 9092,
  "jmx_port": 9999,
  "timestamp": 2233345666,
  "endpoints": [ ],
  "rack": "",
  "features": {},
  "kraftMigration": {                        // <-- New object
    "isReady": true,
    "clusterId": "uKMoqJEZRSWt0uDX44O5Wg",
    "ibp": "3.4-IV0"
  }
}

In addition to checking if a broker is ready for migration, these new properties are used by the KRaft controller to verify that the brokers and new KRaft controllers have valid configurations. 

UpdateMetadataRequest

A new RPC version will be added which adds the field KRaftControllerId. This field will point to the active KRaft controller. If this field is set, the ControllerId field should be -1.

Code Block
{
  "apiKey": 6,
  "type": "request",
  "listeners": ["zkBroker"],
  "name": "UpdateMetadataRequest",
  "validVersions": "0-8",  <-- New version 8
  "flexibleVersions": "6+",
  "fields": [
    { "name": "ControllerId", "type": "int32", "versions": "0+", "entityType": "brokerId",
      "about": "The controller id." },
--> { "name": "KRaftControllerId", "type": "int32", "versions": "8+", "entityType": "brokerId",
      "about": "The KRaft controller id, used during migration." }, <-- New field
    { "name": "ControllerEpoch", "type": "int32", "versions": "0+",
      "about": "The controller epoch." },
    ...
   ]
}

Migration State ZNode

As part of the propagation of KRaft metadata back to ZooKeeper while in dual-write mode, we need to keep track of what has been synchronized. A new ZNode will be introduced to keep track of which KRaft record offset has been written back to ZK. This will be used to recover the synchronization state following a KRaft controller failover. 

Code Block
ZNode /migration

{
  "lastOffset": 100,
  "lastTimestamp": "2022-01-01T00:00:00.000Z",
  "kraftControllerId": 3000,
  "kraftControllerEpoch": 1
}

By using conditional updates on this ZNode, we can fence old KRaft controllers from synchronizing data to ZooKeeper if there has been a new election.

Controller ZNodes

The two controller ZNodes "/controller" and "/controller_epoch" will be managed by the KRaft quorum during the migration. More details in "Controller Leadership" section below.

Operational Changes

Forwarding Enabled on Brokers

As detailed in KIP-500 and KIP-590, all brokers (ZK and KRaft) must forward administrative requests such as CreateTopics to the active KRaft controller once the migration has started. When running the new metadata.version defined in this KIP, all brokers will enable forwarding.

Additional ZK Broker Configs 

To support connecting to a KRaft controller for requests such as AlterPartitions, the ZK brokers will need additional configs (a sample configuration sketch follows this list):

  • controller.quorum.voters: a comma-separated list of "node@host:port" entries (the same as KRaft brokers would set)
  • controller.listener.names: a comma-separated list of listeners used by the controller
  • Corresponding entries in listener.security.protocol.map for the listeners given in controller.listener.names
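
As a sketch, a ZK broker participating in the migration might carry properties like the following. The listener names, hosts, ports, and version string are illustrative placeholders, not requirements of this KIP.

Code Block
# Existing ZK-mode settings
broker.id=1
zookeeper.connect=zk1:2181,zk2:2181,zk3:2181
inter.broker.protocol.version=3.4

# Migration prerequisite from this KIP
kafka.metadata.migration.enable=true

# New configs so the ZK broker can reach the KRaft controller quorum
controller.quorum.voters=3000@controller1:9093,3001@controller2:9093,3002@controller3:9093
controller.listener.names=CONTROLLER
listener.security.protocol.map=PLAINTEXT:PLAINTEXT,CONTROLLER:PLAINTEXT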

Migration Trigger

The migration from ZK to KRaft will be triggered by the cluster's state. To start a migration, the cluster must meet some requirements:

  1. Brokers have inter.broker.protocol.version set to the version added by this KIP to enable forwarding and indicate they are at the minimum software version
  2. Brokers have kafka.metadata.migration.enable set to “true”. This indicates an operator has declared some intention to start the migration.
  3. Brokers have the configs in "Additional ZK Broker Configs" set. This allows them to connect to the KRaft controller.
  4. No brokers are offline (we will use offline replicas as a proxy for this).

Once these conditions are satisfied, an operator can start a KRaft quorum with kafka.metadata.migration.enable set to “true” to begin the migration.
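
For illustration only, a dedicated KRaft controller joining the migration might be started with properties along these lines. The node IDs, listener names, and addresses are placeholders; the controller will also need ZooKeeper connection details in order to copy and dual-write metadata, which are not spelled out here.

Code Block
process.roles=controller
node.id=3000
controller.quorum.voters=3000@controller1:9093,3001@controller2:9093,3002@controller3:9093
listeners=CONTROLLER://controller1:9093
controller.listener.names=CONTROLLER
# Setting this to "true" on the KRaft controllers is the trigger for starting the migration
kafka.metadata.migration.enable=true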

By utilizing configs and broker/controller restarts, we follow a paradigm that Kafka operators are familiar with.

Migration Overview

Here is a state machine description of the migration. 


  • MigrationIneligible (1): The cluster is in ZooKeeper mode. The migration cannot yet begin.
  • MigrationEligible (2): The cluster has been upgraded to a minimum software version and has set the necessary static configs.
  • MigrationReady (3): The KRaft quorum has been started.
  • MigrationActive (4): ZK state has been migrated, the controller is in dual-write mode, and brokers are being restarted in KRaft mode.
  • MigrationFinished (5): All of the brokers have been restarted in KRaft mode; the controller is still in dual-write mode.
  • KRaft (6): The cluster is in KRaft mode.


And a state machine diagram:

(State machine diagram image)
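
For reference, the states and enum values above could be modeled as a simple enum. This is purely illustrative and mirrors the table, not any particular implementation.

Code Block
public enum MigrationState {
    MIGRATION_INELIGIBLE(1),   // Cluster is in ZooKeeper mode; the migration cannot yet begin
    MIGRATION_ELIGIBLE(2),     // Minimum software version and the static configs are in place
    MIGRATION_READY(3),        // The KRaft quorum has been started
    MIGRATION_ACTIVE(4),       // ZK state migrated; controller dual-writing; brokers restarting in KRaft mode
    MIGRATION_FINISHED(5),     // All brokers restarted in KRaft mode; controller still dual-writing
    KRAFT(6);                  // The cluster is in KRaft mode

    private final int value;

    MigrationState(int value) {
        this.value = value;
    }

    public int value() {
        return value;
    }
}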


Preparing the Cluster

The first step of the migration is to upgrade the cluster to at least the bridge release version. Upgrading the cluster to a well known starting point will reduce our compatibility matrix and ensure that the necessary logic is in place prior to the migration. The brokers must also set the configs defined above in "Migration Trigger".

To proceed with the migration, all brokers should be online to ensure they satisfy the criteria for the migration. 

Controller Migration

This migration only supports dedicated KRaft controllers as the target deployment. There will be no support for migrating to a combined broker/controller KRaft deployment.

A new set of nodes will be provisioned to host the controller quorum. These controllers will be started with kafka.metadata.migration.enable set to “true”. Once the quorum is established and a leader is elected, the controller will check the state of the brokers by examining the broker registrations in ZK. If all brokers are ready for migration, the migration process will begin. The ZK data migration will copy the existing ZK data into the KRaft metadata log and establish the new KRaft active controller as the active controller from a ZK perspective.

While in migration mode, the KRaft controller will write to the metadata log as well as to ZooKeeper.

At this point, all of the brokers are running in ZK mode and their broker-controller communication channels operate as they would with a ZK controller. The ZK brokers will learn about this new controller by receiving an UpdateMetadataRequest from the new KRaft controller. From a broker’s perspective, the controller looks and behaves like a normal ZK controller.

The metadata migration process will cause controller downtime proportional to the total size of metadata in ZK. 

In order to ensure consistency of the metadata, we must stop making any writes to ZK while we are migrating the data. This is accomplished by making the new KRaft controller the active ZK controller, which forces a write to the "/controller" and "/controller_epoch" ZNodes.

Broker Migration

Following the migration of metadata and controller leadership to KRaft, the brokers are restarted one-by-one in KRaft mode. While this rolling restart is taking place, the cluster will be composed of both ZK and KRaft brokers. 

The broker migration phase does not cause downtime, but it is effectively unbounded in its total duration. 

There is likely no reasonable way to put a limit on how long a cluster stays in a mixed state since rolling restarts for large clusters may take several hours. It is also possible for the operator to revert back to ZK during this time.

Finalizing the Migration

Once the cluster has been fully upgraded to KRaft mode, the controller will still be running in migration mode and making dual writes to KRaft and ZK. Since the data in ZK is still consistent with that of the KRaft metadata log, it is still possible to revert back to ZK.

The time that the cluster is running all KRaft brokers/controllers, but still running in migration mode, is effectively unbounded.

Once the operator has decided to commit to KRaft mode, the final step is to restart the controller quorum and take it out of migration mode by setting kafka.metadata.migration.enable to "false" (or unsetting it). Once the controller leaves migration mode, it will no longer perform writes to ZK and it will disable its special ZK handling of ZK RPCs.

At this point, the cluster is fully migrated and is running in KRaft mode. A rollback to ZK is still possible after finalizing the migration, but it must be done offline and it will cause metadata loss (which can also cause partition data loss).

Implementation and Compatibility

Dual Metadata Writes

Metadata will be written to the KRaft metadata log as well as to ZooKeeper during the migration. This gives us two important guarantees: we have a safe path back to ZK mode and compatibility with ZK broker metadata that relies on ZK watches.

At any time during the migration, it should be possible for the operator to decide to revert back to ZK mode. This process should be safe and straightforward. By writing all metadata updates to both KRaft and ZK, we can ensure that the state stored in ZK is up-to-date.

By writing metadata changes to ZK, we also maintain compatibility with a few remaining direct ZK dependencies that exist on the ZK brokers. 

  • Broker Registration
  • ACLs
  • Dynamic Configs
  • Delegation Tokens

The ZK brokers still rely on the watch mechanism to learn about changes to these metadata. By performing dual writes, we cover these cases.

The controller will use a bounded write-behind approach for ZooKeeper updates. As we commit records to KRaft, we will asynchronously write data back to ZooKeeper. The number of pending ZK records will be reported as a metric so we can monitor how far behind the ZK state is from KRaft. We may also determine a bound on the number of records not yet written to ZooKeeper to avoid excessive difference between the KRaft and ZooKeeper states.

In order to ensure consistency of the data written back to ZooKeeper, we will leverage ZooKeeper multi-operation transactions. With each "multi" op sent to ZooKeeper, we will include the data being written (e.g., topics, configs, etc) along with a conditional update to the "/migration" ZNode. The contents of "/migration" will be updated with each write to include the offset of the latest record being written back to ZooKeeper. By using the conditional update, we can avoid races between KRaft controllers during a failover and ensure consistency between the metadata log and ZooKeeper.
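
A minimal sketch of one such write using the ZooKeeper Java client is shown below. The topic-config path, the payload layout, and the helper signature are assumptions for illustration, not the actual controller code.

Code Block
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.List;
import org.apache.zookeeper.Op;
import org.apache.zookeeper.OpResult;
import org.apache.zookeeper.ZooKeeper;

public class DualWriteSketch {
    /**
     * Write one batch of metadata back to ZK and advance "/migration" in a single atomic multi-op.
     * expectedMigrationVersion is the ZNode version from the last read of "/migration"; if another
     * KRaft controller has advanced it since (i.e. a failover happened), the whole multi fails.
     */
    static List<OpResult> writeBack(ZooKeeper zk, byte[] topicConfigData,
                                    long lastOffset, int expectedMigrationVersion) throws Exception {
        byte[] migrationState = String.format(
                "{\"lastOffset\": %d, \"kraftControllerId\": 3000, \"kraftControllerEpoch\": 1}",
                lastOffset).getBytes(StandardCharsets.UTF_8);

        return zk.multi(Arrays.asList(
                // The metadata being synchronized (a topic config update, for example)
                Op.setData("/config/topics/foo", topicConfigData, -1),
                // Conditional update of "/migration": fails if another controller has written it
                Op.setData("/migration", migrationState, expectedMigrationVersion)));
    }
}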

Another benefit of using multi-operation transactions when synchronizing metadata to ZooKeeper is that we reduce the number of round-trips to ZooKeeper. This pipelining technique is also utilized by the ZK controller for performance reasons.

This dual write approach ensures that any metadata seen in ZK has also been committed to KRaft.

ZK Broker RPCs

In order to support brokers that are still running in ZK mode, the KRaft controller will need to send out additional RPCs to keep the metadata of the ZK brokers up-to-date. 

LeaderAndIsr: when the KRaft controller handles AlterPartitions or performs a leader election, we will need to send LeaderAndIsr requests to ZK brokers. 

UpdateMetadata: for metadata changes, the KRaft controller will need to send UpdateMetadataRequests to the ZK brokers. Instead of ControllerId, the KRaft controller will specify itself using KRaftControllerId field.

StopReplicas: following reassignments and topic deletions, we will need to send StopReplicas to ZK brokers for them to stop managing certain replicas. 

Controller Leadership

In order to prevent further writes to ZK, the first thing the new KRaft quorum must do is take over leadership of the ZK controller. This can be achieved by unconditionally overwriting two values in ZK. The "/controller" ZNode indicates the current active controller. By overwriting it, a watch will fire on all the ZK brokers to inform them of a new controller election. The active KRaft controller will write its node ID (e.g., 3000) into this ZNode to claim controller leadership. This write will be persistent rather than the usual ephemeral write used by the ZK controller election algorithm. This will ensure that no ZK broker can claim leadership during a KRaft controller failover.

The second ZNode we will write to is "/controller_epoch". This ZNode is used for fencing writes from old controllers in ZK mode. Each write from a ZK controller is actually a conditional multi-write with a "check" operation on the "/controller_epoch" ZNode's version. By altering this node, we can ensure any in-flight writes from the previous ZK controller epoch will fail.

Every time a KRaft controller election occurs, the newly elected controller will overwrite the values in “/controller” and “/controller_epoch”. The first epoch generated by the KRaft quorum must be greater than the last ZK epoch in order to maintain the monotonic epoch invariant.
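
A sketch of how a newly elected KRaft controller could perform this takeover with the ZooKeeper client follows. The payload formats and error handling are simplified assumptions (for instance, it assumes an existing "/controller" registration to delete).

Code Block
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.Op;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;
import org.apache.zookeeper.data.Stat;

public class ClaimZkControllerLeadership {
    static void claim(ZooKeeper zk, int kraftControllerId, int kraftControllerEpoch) throws Exception {
        Stat epochStat = new Stat();
        int lastZkEpoch = Integer.parseInt(new String(
                zk.getData("/controller_epoch", false, epochStat), StandardCharsets.UTF_8));
        if (kraftControllerEpoch <= lastZkEpoch) {
            throw new IllegalStateException("KRaft epoch must exceed the last ZK controller epoch");
        }

        byte[] controllerPayload = String.format("{\"brokerid\": %d}", kraftControllerId)
                .getBytes(StandardCharsets.UTF_8);
        byte[] epochPayload = Integer.toString(kraftControllerEpoch).getBytes(StandardCharsets.UTF_8);

        zk.multi(Arrays.asList(
                // Bump the epoch, conditioned on the version we just read, to fence in-flight
                // writes from the previous ZK controller
                Op.setData("/controller_epoch", epochPayload, epochStat.getVersion()),
                // Replace the old (ephemeral) registration with a persistent one so no ZK broker
                // can claim leadership during a KRaft controller failover
                Op.delete("/controller", -1),
                Op.create("/controller", controllerPayload,
                        ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT)));
    }
}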

Broker Registration

While running in migration mode, we must synchronize broker registration information bidirectionally between ZK and KRaft. 

...

Since we require controller forwarding for this KIP, we can use the KRaft approach of returning a random broker (ZK or KRaft) as the ControllerId for clients via MetadataResponse and rely on forwarding for write operations.

For inter-broker requests such as AlterPartitions and ControlledShutdown, we do not want to add the overhead of forwarding, so we'll want to include the actual controller in the UpdateMetadataRequest. However, we cannot simply include the KRaft controller as the ControllerId. The ZK brokers connect to a ZK controller by using the "inter.broker.listener.name" config and the node information from LiveBrokers in the UpdateMetadataRequest. For connecting to a KRaft controller, the ZK brokers will need to use the "controller.listener.names" and "controller.quorum.voters" configs. To allow this, we will use the new KRaftControllerId field in UpdateMetadataRequest.

Topic Deletions

The ZK migration logic will need to deal with asynchronous topic deletions when migrating data. Normally, the ZK controller will complete these asynchronous deletions via TopicDeletionManager. If the KRaft controller takes over before a deletion has occurred, we will need to complete the deletion as part of the ZK to KRaft state migration. Once the migration is complete, we will need to finalize the deletion in ZK so that the state is consistent.

Rollback to ZK

As mentioned above, it should be possible for the operator to rollback to ZooKeeper at any point in the migration process prior to taking the KRaft controllers out of migration mode. The procedure for rolling back is to reverse the steps of the migration that had been completed so far. 

  • Brokers should be restarted one by one in ZK mode
  • The KRaft controller quorum should be cleanly shutdown
  • Operator can remove the persistent "/controller" and "/controller_epoch" nodes allowing for ZK controller election to take place

A clean shutdown of the KRaft quorum is important because there may be uncommitted metadata waiting to be written to ZooKeeper. A forceful shutdown could let some metadata be lost, potentially leading to data loss.

Failure Modes

There are a few failure scenarios to consider during the migration. The KRaft controller can crash while initially copying the data from ZooKeeper, the controller can crash some time after the initial migration, and the controller can fail to write new metadata back to ZK.

Initial Data Migration

For the initial migration, the controller will utilize KIP-868 Metadata Transactions to write all of the ZK metadata in a single transaction. If the controller fails before this transaction is finalized, the next active controller will abort the transaction and restart the migration process.

Controller Crashes

Once the data has been migrated and the cluster is in the MigrationActive or MigrationFinished state, the KRaft controller may fail. If this happens, the Raft layer will elect a new leader, which will update the "/controller" and "/controller_epoch" ZNodes and take over controller leadership as usual.

Unavailable ZooKeeper

While in the dual-write mode, it is possible for a write to ZK to fail. In this case, we will want to stop making updates to the metadata log to avoid unbounded lag between KRaft and ZooKeeper. Since ZK brokers will be reading data like ACLs and dynamic configs from ZooKeeper, we should limit the amount of divergence between ZK and KRaft brokers by setting a bound on the amount of lag between KRaft and ZooKeeper.
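
A sketch of the kind of back-pressure check this implies is below. The threshold and where such a check would hook into the controller are assumptions, not part of this KIP.

Code Block
/**
 * Illustrative back-pressure check for dual-write mode: if ZooKeeper has fallen too far behind
 * the committed KRaft metadata, pause appending new metadata records until the write-behind
 * queue drains. The lag here corresponds to the ZooKeeperWriteBehindLag metric defined earlier.
 */
public final class WriteBehindBackpressure {
    private final long maxLagRecords;

    public WriteBehindBackpressure(long maxLagRecords) {
        this.maxLagRecords = maxLagRecords;
    }

    public boolean shouldPauseMetadataWrites(long highestCommittedKRaftOffset,
                                             long lastOffsetWrittenToZk) {
        return (highestCommittedKRaftOffset - lastOffsetWrittenToZk) > maxLagRecords;
    }
}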

Incompatible Brokers

At any time during the migration, it is possible for an operator to bring up an incompatible broker. This could be a new or existing broker. In this event, the KRaft controller will see the broker registration in ZK, but it will not send it any RPCs. By refusing to send it UpdateMetadata or LeaderAndIsr RPCs, this broker will be effectively fenced from the rest of the cluster.

Test Plan

In addition to basic "happy path" tests, we will also want to test that the migration can tolerate failures of brokers and KRaft controllers. We will also want to have tests for the correctness of the system if ZooKeeper becomes unavailable during the migration. Another class of tests for this process is metadata consistency at the broker level. Since we are supporting ZK and KRaft brokers simultaneously, we need to ensure their metadata does not stay inconsistent for very long.

...