Status
Current state: Draft
Discussion thread:
JIRA:
Please keep the discussion on the mailing list rather than commenting on the wiki (wiki discussions get unwieldy fast).
Motivation
To complete the plan for KIP-500: Replace ZooKeeper with a Self-Managed Metadata Quorum, we need a way to migrate Kafka clusters from a ZooKeeper quorum to a KRaft quorum. This must be done without impact to partition availability and with minimal impact to operators and client applications.
To give users more confidence in undertaking the migration to KRaft, we will allow a rollback to ZooKeeper until the final step of the migration. This is accomplished by writing two copies of the metadata during the migration: one to the KRaft quorum and one to ZooKeeper.
This KIP defines the behavior and set of new APIs for the “bridge release” as first mentioned in KIP-500.
Public Interfaces
New metadata.version (IBP)
A new metadata.version will serve three purposes in this design:
- Gate the usage of a new MigrationCheck RPC
- Allow the migration to begin
- Enable forwarding on all brokers (KIP-590: Redirect Zookeeper Mutation Protocols to The Controller)
All brokers must be running this metadata.version before the migration can begin.
Migration-mode configuration
A new “kafka.metadata.migration.enable” config will be added for both the broker and the controller. Its default will be “false”. Setting this config to “true” on the brokers is a prerequisite to starting the migration. Setting it to “true” on the KRaft controllers is the trigger for starting the migration (more on that below).
MigrationCheck RPC
Brokers will use the new metadata.version to enable a new MigrationCheck RPC. This RPC will be used by the KRaft controller to determine if the cluster is ready to be migrated. The response will include the cluster ID and a boolean indicating if the migration mode config has been enabled statically on this broker.
The purpose of this RPC is to signal that a broker is able to be migrated. When the KRaft controller begins the migration process, it will first check that the live brokers are able to be migrated.
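As a rough illustration of how the controller might evaluate the responses, here is a minimal Python sketch. The class and function names are hypothetical, and two behaviors are assumptions rather than part of this KIP: comparing each broker's cluster ID against the controller's own, and treating an empty set of live brokers as not ready.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class MigrationCheckResponse:
    """Mirrors the two response fields proposed in this KIP."""
    cluster_id: str       # "clusterId" in the schema
    config_enabled: bool  # "configEnabled" in the schema


def brokers_ready(expected_cluster_id: str,
                  responses: Iterable[MigrationCheckResponse]) -> bool:
    """Return True only if every live broker is able to be migrated."""
    responses = list(responses)
    if not responses:
        # Assumption: no live brokers means the check cannot pass.
        return False
    # Assumption: a cluster ID mismatch disqualifies the broker.
    return all(
        r.cluster_id == expected_cluster_id and r.config_enabled
        for r in responses
    )
```

A single broker that has not enabled the migration config is enough to make the check fail, which matches the requirement that all live brokers be ready before the migration begins.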
Request:
{
  "apiKey": TBD,
  "type": "request",
  "name": "MigrationCheckRequest",
  "validVersions": "0",
  "flexibleVersions": "0+",
  "fields": [ ]
}
Response:
{
  "apiKey": TBD,
  "type": "response",
  "name": "MigrationCheckResponse",
  "validVersions": "0",
  "flexibleVersions": "0+",
  "fields": [
    {"name": "clusterId", "type": "uuid", "versions": "0+"},
    {"name": "configEnabled", "type": "bool", "versions": "0+"}
  ]
}
Migration State ZNode
As part of the propagation of KRaft metadata back to ZooKeeper while in dual-write mode, we need to keep track of what has been synchronized. A new ZNode will be introduced to keep track of which KRaft record offset has been written back to ZK. This will be used to recover the synchronization state following a KRaft controller failover.
ZNode /migration
{
  "lastOffset": 100,
  "lastTimestamp": "2022-01-01T00:00:00.000Z",
  "kraftControllerId": 3000,
  "kraftControllerEpoch": 1
}
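To make the recovery step concrete, the following Python sketch parses the ZNode payload shown above and derives the offset at which a newly elected KRaft controller would resume writing to ZK. The names `MigrationZNode`, `parse_migration_znode`, and `next_offset_to_sync` are hypothetical, and the sketch assumes "lastOffset" denotes the last record already written back, so synchronization resumes at the following offset.

```python
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class MigrationZNode:
    last_offset: int
    last_timestamp: str
    kraft_controller_id: int
    kraft_controller_epoch: int


def parse_migration_znode(data: bytes) -> MigrationZNode:
    """Decode the JSON payload stored in the /migration ZNode."""
    doc = json.loads(data.decode("utf-8"))
    return MigrationZNode(
        last_offset=doc["lastOffset"],
        last_timestamp=doc["lastTimestamp"],
        kraft_controller_id=doc["kraftControllerId"],
        kraft_controller_epoch=doc["kraftControllerEpoch"],
    )


def next_offset_to_sync(state: MigrationZNode) -> int:
    # Assumption: lastOffset is inclusive, so the new controller
    # resumes with the record immediately after it.
    return state.last_offset + 1
```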
Controller ZNodes
The two controller ZNodes "/controller" and "/controller_epoch" will be managed by the KRaft quorum during the migration. Rather than using an ephemeral ZNode, the KRaft controller will use a persistent ZNode for "/controller" to prevent ZK brokers from attempting to become the active controller. The "/controller_epoch" ZNode will be managed by the active KRaft controller and incremented every time a new KRaft controller is elected.
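The effect of using a persistent ZNode can be illustrated with a small in-memory model. This is only a sketch: `FakeZk` is a hypothetical stand-in and not a real ZooKeeper client, the ZNode payloads are simplified to plain integers, and starting the epoch at 1 when no epoch exists is an assumption.

```python
class FakeZk:
    """Tiny in-memory stand-in for ZooKeeper (hypothetical, for illustration)."""

    def __init__(self):
        self.nodes = {}  # path -> (data, is_ephemeral)

    def set_persistent(self, path, data):
        self.nodes[path] = (data, False)

    def set_ephemeral(self, path, data):
        self.nodes[path] = (data, True)

    def get(self, path):
        entry = self.nodes.get(path)
        return None if entry is None else entry[0]

    def expire_session(self):
        # Session expiry removes ephemeral nodes only.
        self.nodes = {p: v for p, v in self.nodes.items() if not v[1]}


def on_kraft_controller_elected(zk, controller_id):
    """Claim /controller persistently and bump /controller_epoch."""
    current = zk.get("/controller_epoch")
    new_epoch = 1 if current is None else current + 1
    zk.set_persistent("/controller_epoch", new_epoch)
    zk.set_persistent("/controller", controller_id)
    return new_epoch
```

Because "/controller" survives a session expiry in this model, a ZK broker watching for its deletion never sees an opening to elect itself.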
Operational Changes
Forwarding Enabled on Brokers
As detailed in KIP-500 and KIP-590, all brokers (ZK and KRaft) must forward administrative requests such as CreateTopics to the active KRaft controller once the migration has started. When running the new metadata.version defined in this KIP, all brokers will enable forwarding.
Migration Trigger
The migration from ZK to KRaft will be triggered by the cluster's state. To start a migration, the cluster must meet two requirements:
- The metadata.version is set to the version added by this KIP. This indicates the software is at a minimum version which includes the necessary logic to perform the migration.
- All ZK brokers have kafka.metadata.migration.enable set to “true”. This indicates an operator has declared an intention to start the migration.
Once these conditions are satisfied, an operator can start a KRaft quorum with kafka.metadata.migration.enable set to “true” to begin the migration.
By utilizing configs and broker/controller restarts, we follow a paradigm that Kafka operators are familiar with.
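The trigger conditions described in this section can be summarized as a single predicate. The sketch below is illustrative only: `ClusterView` and its fields are hypothetical names, metadata.version is modeled as a plain integer, and accepting any version at or above the required one is an assumption.

```python
from dataclasses import dataclass
from typing import Mapping


@dataclass(frozen=True)
class ClusterView:
    metadata_version: int                 # current metadata.version (modeled as int)
    required_metadata_version: int        # version introduced by this KIP
    zk_broker_migration_enabled: Mapping[int, bool]  # broker id -> config value
    controller_migration_enabled: bool    # config on the KRaft controllers


def migration_should_start(view: ClusterView) -> bool:
    """True when both prerequisites hold and the controller trigger is set."""
    version_ok = view.metadata_version >= view.required_metadata_version
    brokers_ok = (
        len(view.zk_broker_migration_enabled) > 0
        and all(view.zk_broker_migration_enabled.values())
    )
    return version_ok and brokers_ok and view.controller_migration_enabled
```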
Migration Overview
Here is a state machine description of the migration.
State | Description |
ZooKeeperMode | The cluster is in ZooKeeper mode |
MigrationEligible | The cluster has been upgraded to a minimum software version and has set the necessary static configs |
MigrationReady | The KRaft quorum has been started |
MigrationActive | ZK state has been migrated, controller is in dual-write mode, brokers are being restarted in KRaft mode |
MigrationFinished | All of the brokers have been restarted in KRaft mode, controller still in dual-write mode |
KRaftMode | The cluster is in KRaft mode |
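The states in the table above can be sketched as an ordered enumeration. This sketch assumes a strictly linear forward progression in table order, which the table implies but does not state, and it encodes the rollback rule from the Motivation section as "allowed in any state before KRaftMode". The names `MigrationPhase`, `next_phase`, and `rollback_allowed` are hypothetical.

```python
from enum import Enum


class MigrationPhase(Enum):
    ZOOKEEPER_MODE = "ZooKeeperMode"
    MIGRATION_ELIGIBLE = "MigrationEligible"
    MIGRATION_READY = "MigrationReady"
    MIGRATION_ACTIVE = "MigrationActive"
    MIGRATION_FINISHED = "MigrationFinished"
    KRAFT_MODE = "KRaftMode"


# Enum members iterate in definition order, which matches the table.
_ORDER = list(MigrationPhase)


def next_phase(phase: MigrationPhase) -> MigrationPhase:
    """Return the following phase, assuming a linear forward progression."""
    idx = _ORDER.index(phase)
    if idx + 1 >= len(_ORDER):
        raise ValueError(f"{phase.value} is the final phase")
    return _ORDER[idx + 1]


def rollback_allowed(phase: MigrationPhase) -> bool:
    # Rollback to ZooKeeper is permitted until the final step.
    return phase is not MigrationPhase.KRAFT_MODE
```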
And a state machine diagram:
Compatibility, Deprecation, and Migration Plan
- What impact (if any) will there be on existing users?
- If we are changing behavior how will we phase out the older behavior?
- If we need special migration tools, describe them here.
- When will we remove the existing behavior?
Test Plan
Describe in a few sentences how the KIP will be tested. We are mostly interested in system tests (since unit tests are specific to implementation details). How will we know that the implementation works as expected? How will we know nothing broke?
Rejected Alternatives
If there are alternative ways of accomplishing the same thing, what were they? The purpose of this section is to motivate why the design is the way it is and not some other way.