...

The controller will use a bounded write-behind approach for ZooKeeper updates. As we commit records to KRaft, we will asynchronously write data back to ZooKeeper. The number of pending ZK records will be reported as a metric so that we can monitor how far behind the ZK state is from KRaft. We may also determine a bound on the number of records not yet written to ZooKeeper to avoid excessive divergence between the KRaft and ZooKeeper states.

In order to ensure consistency of the data written back to ZooKeeper, we will leverage ZooKeeper multi-operation transactions. With each "multi" op sent to ZooKeeper, we will include the data being written (e.g., topics, configs, etc.) along with a conditional update to the "/migration" ZNode. The contents of "/migration" will be updated with each write to include the offset of the latest record being written back to ZooKeeper. By using the conditional update, we can avoid races between KRaft controllers during a failover and ensure consistency between the metadata log and ZooKeeper.

This dual write approach ensures that any metadata seen in ZK has also been committed to KRaft.
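
For illustration, a single write-back could be issued through the ZooKeeper client's multi() API roughly as sketched below. This is only a sketch of the design above: the class name and the JSON layout of "/migration" (the "kraft_metadata_offset" field) are placeholders, not the final format.

    import java.nio.charset.StandardCharsets;
    import java.util.Arrays;
    import java.util.List;
    import org.apache.zookeeper.Op;
    import org.apache.zookeeper.ZooKeeper;

    public class ZkDualWriteSketch {
        /**
         * Writes one piece of metadata back to ZooKeeper together with a conditional
         * update of the "/migration" ZNode. The expected version acts as a fencing
         * token: if another KRaft controller has advanced "/migration" in the
         * meantime, the whole multi op fails and no metadata is written.
         */
        public static void writeBack(ZooKeeper zk, String path, byte[] data,
                                     long lastCommittedKRaftOffset,
                                     int expectedMigrationZNodeVersion) throws Exception {
            byte[] migrationState =
                ("{\"kraft_metadata_offset\":" + lastCommittedKRaftOffset + "}")
                    .getBytes(StandardCharsets.UTF_8);
            List<Op> ops = Arrays.asList(
                // Conditional part: fails if "/migration" was modified by another controller.
                Op.setData("/migration", migrationState, expectedMigrationZNodeVersion),
                // The metadata (topic, config, etc.) being written behind the KRaft log.
                Op.setData(path, data, -1));
            zk.multi(ops); // all-or-nothing: either both updates apply or neither does
        }
    }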

ZK Broker RPCs

In order to support brokers that are still running in ZK mode, the KRaft controller will need to send out a few additional RPCs to keep things working in the broker. 

LeaderAndIsr: when the KRaft controller handles AlterPartitions or performs a leader election, we will need to send LeaderAndIsr requests to ZK brokers. 

UpdateMetadata: for certain metadata changes, the KRaft controller will need to send UpdateMetadataRequests to the ZK brokers. For the “ControllerId” field in this request, the controller should specify a random KRaft broker. Additionally, the controller must specify if a broker in “LiveBrokers” is KRaft or ZK.

StopReplicas: following reassignments and topic deletions, we will need to send StopReplicas to ZK brokers for them to stop managing certain replicas. 
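
As a compact summary of the above (illustrative only; this enum does not exist in the Kafka code base):

    public enum LegacyRpcToZkBrokers {
        // Sent when the KRaft controller handles AlterPartitions or performs a leader election.
        LEADER_AND_ISR,
        // Sent on metadata changes; ControllerId names a random KRaft broker and each entry
        // in LiveBrokers is flagged as either a ZK broker or a KRaft broker.
        UPDATE_METADATA,
        // Sent after reassignments and topic deletions so ZK brokers stop managing those replicas.
        STOP_REPLICA
    }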

Controller Leadership

In order to prevent further writes to ZK, the first thing the new KRaft quorum must do is take over leadership of the ZK controller. This can be achieved by unconditionally writing a value into the “/controller” and “/controller_epoch” ZNodes. The active KRaft controller will write its node ID (e.g., 3000) into the ZNode as a persistent value. By writing a persistent value (rather than ephemeral), we can prevent any ZK brokers from ever claiming controller leadership.

If a KRaft controller failover occurs, the new active controller will overwrite the values in “/controller” and “/controller_epoch”. 
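
A minimal sketch of this takeover with the plain ZooKeeper client is shown below. It ignores error handling and races with a concurrently re-registering ZK controller, and the payload written to "/controller" is simplified relative to the real registration format.

    import java.nio.charset.StandardCharsets;
    import org.apache.zookeeper.CreateMode;
    import org.apache.zookeeper.ZooDefs;
    import org.apache.zookeeper.ZooKeeper;

    public class KRaftControllerZkClaimSketch {
        public static void claimLeadership(ZooKeeper zk, int kraftControllerId, int newEpoch)
                throws Exception {
            byte[] controllerPayload = ("{\"brokerid\":" + kraftControllerId + "}")
                .getBytes(StandardCharsets.UTF_8);

            // The ZK controller's registration is ephemeral; remove it if present so it can be
            // replaced with a persistent node that ZK brokers can never overwrite.
            if (zk.exists("/controller", false) != null) {
                zk.delete("/controller", -1);
            }
            zk.create("/controller", controllerPayload,
                ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);

            // Bump the controller epoch so ZK brokers fence off any stale controller.
            zk.setData("/controller_epoch",
                String.valueOf(newEpoch).getBytes(StandardCharsets.UTF_8), -1);
        }
    }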

Broker Registration

While running in migration mode, we must synchronize broker registration information bidirectionally between ZK and KRaft. 

The KRaft controller will send UpdateMetadataRequests to ZK brokers to inform them of the other brokers in the cluster. This information is used by the brokers for the replication protocols. Similarly, the KRaft controller must know about ZK and KRaft brokers when performing operations like assignments and leader election.

ZK brokers, KRaft brokers, and the KRaft controller must know about all brokers in the cluster.

In order to discover which ZK brokers exist, the KRaft controller will need to read the “/brokers” state from ZK and copy it into the metadata log. Inversely, as KRaft brokers register with the KRaft controller, we must write this data back to ZK to prevent ZK brokers from registering with the same node ID.
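
The discovery half of this could look roughly like the following sketch; converting each registration into a KRaft metadata record and writing KRaft broker registrations back into "/brokers/ids" are elided.

    import java.util.List;
    import org.apache.zookeeper.ZooKeeper;
    import org.apache.zookeeper.data.Stat;

    public class ZkBrokerDiscoverySketch {
        /** Reads every child of "/brokers/ids" and fetches its registration payload. */
        public static void discoverZkBrokers(ZooKeeper zk) throws Exception {
            List<String> brokerIds = zk.getChildren("/brokers/ids", false);
            for (String brokerId : brokerIds) {
                Stat stat = new Stat();
                byte[] registrationJson = zk.getData("/brokers/ids/" + brokerId, false, stat);
                // In the controller this would be translated into a broker registration
                // record and committed to the KRaft metadata log.
                System.out.printf("Discovered ZK broker %s (%d bytes of registration data)%n",
                    brokerId, registrationJson.length);
            }
        }
    }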

AdminClient, MetadataRequest, and Forwarding

When a client bootstraps metadata from the cluster, it must receive the same metadata regardless of the type of broker it is bootstrapping from. Normally, ZK brokers return the active ZK controller as the ControllerId and KRaft brokers return a random alive KRaft broker. In both cases, this ControllerId is internally read from the MetadataCache on the broker.

Since we require controller forwarding for this KIP, we can use the KRaft approach of returning a random broker (ZK or KRaft) as the ControllerId and rely on forwarding for write operations.

However, we do not want to add the overhead of forwarding for inter-broker requests such as AlterPartitions and ControlledShutdown. In the UpdateMetadataRequest sent by the KRaft controller to the ZK brokers, the ControllerId will point to the active controller which will be used for the inter-broker requests.
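
In other words, the ControllerId a broker advertises depends on who is asking. The sketch below is purely illustrative; these class and method names do not exist in Kafka.

    import java.util.List;
    import java.util.Random;

    public class ControllerIdSelectionSketch {
        private final Random random = new Random();

        /** For client MetadataResponses: any alive broker works, since writes are forwarded. */
        public int controllerIdForClients(List<Integer> aliveBrokerIds) {
            return aliveBrokerIds.get(random.nextInt(aliveBrokerIds.size()));
        }

        /**
         * For the UpdateMetadataRequest sent to ZK brokers: the active KRaft controller,
         * so inter-broker requests such as AlterPartitions and ControlledShutdown go
         * straight to the controller instead of being forwarded.
         */
        public int controllerIdForZkBrokers(int activeKRaftControllerId) {
            return activeKRaftControllerId;
        }
    }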

Topic Deletions

The ZK migration logic will need to deal with asynchronous topic deletions when migrating data. Normally, the ZK controller will complete these asynchronous deletions via TopicDeletionManager. If the KRaft controller takes over before a deletion has occurred, we will need to complete the deletion as part of the ZK to KRaft state migration. Once the migration is complete, we will need to finalize the deletion in ZK so that the state is consistent.

Failure Modes

There are a few failure scenarios to consider during the migration. The KRaft controller can crash while initially copying the data from ZooKeeper, the controller can crash some time after the initial migration, and the controller can fail to write new metadata back to ZK.

For the initial migration, the controller will utilize KIP-868 Metadata Transactions to write all of the ZK metadata in a single transaction. If the controller fails before this transaction is finalized, the next active controller will abort the transaction and restart the migration process.
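
A sketch of that recovery rule, assuming the transaction markers proposed in KIP-868; the surrounding class, enum, and method names are illustrative only.

    public class MigrationRecoverySketch {
        enum MigrationTxnState { NOT_STARTED, IN_PROGRESS, COMPLETE }

        /** Called when a KRaft controller becomes active while a migration may be underway. */
        void onBecomeActive(MigrationTxnState stateFromLog) {
            switch (stateFromLog) {
                case IN_PROGRESS:
                    // A previous controller died mid-copy: abort the partial transaction so
                    // no reader ever observes a half-copied ZK snapshot, then start over.
                    appendAbortTransactionMarker();
                    restartZkToKRaftMigration();
                    break;
                case NOT_STARTED:
                    restartZkToKRaftMigration();
                    break;
                case COMPLETE:
                    // Migration data already committed; continue in dual-write mode.
                    break;
            }
        }

        void appendAbortTransactionMarker() { /* append the KIP-868 abort marker */ }
        void restartZkToKRaftMigration()    { /* re-copy ZK metadata inside a new transaction */ }
    }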

Once the data has been migrated and the cluster is in the MigrationActive or MigrationFinished state, the KRaft controller may fail. If this happens, the Raft layer will elect a new leader, which will update the "/controller" and "/controller_epoch" ZNodes and take over controller leadership as usual.

It is also possible for a write to ZK to fail. In this case, 

Test Plan

Describe in few sentences how the KIP will be tested. We are mostly interested in system tests (since unit-tests are specific to implementation details). How will we know that the implementation works as expected? How will we know nothing broke?

Rejected Alternatives

Offline Migration

The main alternative to this design is to do an offline migration. While this would be much simpler, it would be a non-starter for many Kafka users who require minimal downtime of their cluster. By allowing for an online migration from ZK to KRaft, we can provide a path towards KRaft for all Kafka users – even ones where Kafka is critical infrastructure. 

No Dual Writes

Another simplifying alternative would be to only write metadata into KRaft while in the migration mode. This has a few disadvantages. Primarily, it makes rolling back to ZK much more difficult, if at all possible. Secondly, we still have a few remaining ZK read usages on the brokers that need the data in ZK to be up-to-date (see the above section on Dual Metadata Writes). 

Command/RPC based trigger

Another way to start the migration would be to have an operator issue a special command or send a special RPC. Adding human-driven manual steps like this to the migration may make it more difficult to integrate with orchestration software such as Ansible, Chef, Kubernetes, etc. By sticking with a "config and reboot" approach, the migration trigger remains simple and is easier to integrate into other control systems.

Write-ahead ZooKeeper data synchronization

An alternative to write-behind for ZooKeeper would be to write first to ZooKeeper and then write to the metadata log. The main problem with this approach is that it would make KRaft writes much slower, since ZK would always be in the write path. By doing a write-behind with offset tracking, we can amortize the ZK write latency and possibly be more efficient about making bulk writes to ZK.
