
...

Code Block
{
  "apiKey": 6,
  "type": "request",
  "listeners": ["zkBroker"],
  "name": "UpdateMetadataRequest",
  "validVersions": "0-8",  <-- New version 8
  "flexibleVersions": "6+",
  "fields": [
    { "name": "ControllerId", "type": "int32", "versions": "0+", "entityType": "brokerId",
      "about": "The controller id." },
--> { "name": "KRaftControllerId", "type": "int32", "versions": "8+", "entityType": "brokerId",
      "about": "The KRaft controller id, used during migration." }, <-- New field
    { "name": "ControllerEpoch", "type": "int32", "versions": "0+",
      "about": "The controller epoch." },
    ...
   ]
}


ApiVersionsResponse

A new tagged field on ApiVersionsResponse will be added to allow KRaft controllers to indicate their ability to perform the migration.

Code Block
{
  "apiKey": 18,
  "type": "response",
  "name": "ApiVersionsResponse",
  "validVersions": "0-4",   // <-- New version 4
  "flexibleVersions": "3+",
  "fields": [
    ...
    { "name": "ZkMigrationReady", "type": "int8", "versions": "4+", "taggedVersions": "4+", "tag": 3, "ignorable": true,
      "about": "Set by a KRaft controller if the required configurations for ZK migration are present" }
  ]
}


Migration Metadata Record

A new metadata record is added to indicate if a ZK migration has been started or finalized. 

Code Block
{
  "apiKey": <NEXT KEY>,
  "type": "metadata",
  "name": "MigrationRecord",
  "validVersions": "0",
  "flexibleVersions": "0+",
  "fields": [
    { "name": "MigrationState", "type": "int8", "versions": "0+",
      "about": "One of the possible migration states." },
  ]
}

The possible values for MigrationState are: Started (0) and Finalized (1). An int8 type is used to allow for additional states in the future.
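
For illustration, the MigrationState byte might be modeled by an enum along the following lines; the class name and helper methods here are assumptions for this sketch, not the final implementation.

Code Block
// Illustrative mapping of the MigrationState int8 values described above.
// The enum name and helpers are assumptions for this sketch.
public enum ZkMigrationState {
    STARTED((byte) 0),
    FINALIZED((byte) 1);

    private final byte value;

    ZkMigrationState(byte value) {
        this.value = value;
    }

    public byte value() {
        return value;
    }

    public static ZkMigrationState fromValue(byte value) {
        for (ZkMigrationState state : values()) {
            if (state.value == value) {
                return state;
            }
        }
        throw new IllegalArgumentException("Unknown MigrationState value " + value);
    }
}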

Migration State ZNode

As part of the propagation of KRaft metadata back to ZooKeeper while in dual-write mode, we need to keep track of what has been synchronized. A new ZNode will be introduced to keep track of which KRaft record offset has been written back to ZK. This will be used to recover the synchronization state following a KRaft controller failover. 

Code Block
ZNode /migration

{
  "version": 0,
  "last_update_time_ms": "2022-01-01T00:00:00.000Z",
  "kraft_controller_id": 3000,
  "kraft_controller_epoch": 1,
  "kraft_metadata_offset": 1234,
  "kraft_metadata_epoch": 10
}

By using conditional updates on this ZNode, we can fence old KRaft controllers from synchronizing data to ZooKeeper if there has been a new election.
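
As an illustrative sketch of this fencing (the helper class is hypothetical; only the ZooKeeper client call is real), a write to "/migration" can pass the expected ZNode version so that a controller which has lost leadership fails the conditional update:

Code Block
import org.apache.zookeeper.KeeperException;
import org.apache.zookeeper.ZooKeeper;

// Illustrative sketch: update /migration only if the ZNode version still matches the
// version this controller last read. A KRaft controller that has lost leadership will
// fail the conditional write and must stop synchronizing metadata to ZooKeeper.
public final class MigrationZNodeWriter {

    private final ZooKeeper zk;

    public MigrationZNodeWriter(ZooKeeper zk) {
        this.zk = zk;
    }

    public boolean tryWriteMigrationState(byte[] payload, int expectedZkVersion)
            throws KeeperException, InterruptedException {
        try {
            zk.setData("/migration", payload, expectedZkVersion);
            return true;
        } catch (KeeperException.BadVersionException e) {
            // A newer KRaft controller has updated /migration since this one last read
            // it; treat this controller as fenced and stop dual-writing.
            return false;
        }
    }
}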

Controller ZNodes

The two controller ZNodes "/controller" and "/controller_epoch" will be managed by the KRaft quorum during the migration. More details in "Controller Leadership" section below.

Operational Changes

Forwarding Enabled on Brokers

As detailed in KIP-500 and KIP-590, all brokers (ZK and KRaft) must forward administrative requests such as CreateTopics to the active KRaft controller once the migration has started. When running the new metadata.version defined in this KIP, all brokers will enable forwarding.

Additional ZK Broker Configs 

To support connecting to a KRaft controller for requests such as AlterPartitions, the ZK brokers will need additional configs (an illustrative example follows the list):

  • controller.quorum.voters: comma-separated list of "node@host:port" (the same as KRaft brokers would set)
  • controller.listener.names: a comma-separated list of listeners used by the controller
  • Corresponding entries in listener.security.protocol.map for the listeners given in controller.listener.names
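
For example, the additional ZK broker configuration might look like the following sketch; the host names, ports, node ids, and listener names are placeholders, not prescribed values.

Code Block
import java.util.Properties;

// Illustrative only: placeholder hosts, ports, and listener names showing the
// additional properties a ZK broker needs in order to reach the KRaft controllers.
public final class ZkBrokerMigrationConfigExample {
    public static Properties additionalConfigs() {
        Properties props = new Properties();
        // Same voter string the KRaft controllers themselves are configured with.
        props.setProperty("controller.quorum.voters",
            "3000@controller-1.example.com:9093,3001@controller-2.example.com:9093,3002@controller-3.example.com:9093");
        // Listener name(s) used to reach the controllers.
        props.setProperty("controller.listener.names", "CONTROLLER");
        // Map the controller listener to a security protocol.
        props.setProperty("listener.security.protocol.map",
            "PLAINTEXT:PLAINTEXT,CONTROLLER:PLAINTEXT");
        return props;
    }
}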

Additional KRaft Broker Configs 

To support connecting to ZooKeeper during the migration, the KRaft controllers will need additional configs (an illustrative sketch follows this section):

  • zookeeper.connect (required)
  • zookeeper.connection.timeout.ms (optional)
  • zookeeper.session.timeout.ms (optional)
  • zookeeper.max.in.flight.requests (optional)
  • zookeeper.set.acl (optional)
  • ZooKeeper SSL configs (optional)

These configs should match the ZK configs in use by the ZK controller.
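
As a sketch with placeholder values (connection string and timeouts are examples only), the corresponding controller-side configuration might look like:

Code Block
import java.util.Properties;

// Illustrative only: placeholder connection string and timeouts showing the ZK
// client properties a KRaft controller needs while the migration is in progress.
public final class KRaftControllerZkConfigExample {
    public static Properties additionalConfigs() {
        Properties props = new Properties();
        props.setProperty("zookeeper.connect",
            "zk-1.example.com:2181,zk-2.example.com:2181,zk-3.example.com:2181");
        props.setProperty("zookeeper.connection.timeout.ms", "18000");   // optional
        props.setProperty("zookeeper.session.timeout.ms", "18000");      // optional
        return props;
    }
}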

...

Migration Trigger

The migration from ZK to KRaft will be triggered by the cluster's state. To start a migration, the cluster must meet some requirements:

  1. Brokers have inter.broker.protocol.version set to the version added by this KIP to enable forwarding and to indicate they are at the minimum software version.
  2. Brokers have kafka.metadata.migration.enable set to “true”. This indicates an operator has declared some intention to start the migration.
  3. Brokers have the configs in "Additional ZK Broker Configs" set. This allows them to connect to the KRaft controller.
  4. No brokers are offline (we will use offline replicas as a proxy for this).

...

  1. The KRaft quorum is online and all members have kafka.metadata.migration.enable set to "true", as well as the ZK configs set.

The operator can prepare the ZK brokers or KRaft controllers in either order. The migration will only begin once every node is ready.

By utilizing configs and broker/controller restarts, we follow a paradigm that Kafka operators are familiar with.

...

A new set of nodes will be provisioned to host the controller quorum. These controllers will be started with kafka.metadata.migration.enable set to “true”. Once the quorum is established and a leader is elected, the active controller will check that the whole quorum is ready to begin the migration. This is done by examining the new tagged field on ApiVersionsResponse that is exchanged between controllers. Following this, the controller will examine the broker registrations in ZK. If all ZK brokers are ready for migration, the migration process will begin.
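
A rough sketch of this readiness check is shown below. The interfaces and method names are illustrative stand-ins for the ZkMigrationReady tagged field and the ZK broker registration data; they are assumptions, not the actual controller code.

Code Block
import java.util.Collection;

// Illustrative sketch of the readiness checks described above. Both interfaces are
// hypothetical stand-ins: one for the ZkMigrationReady tagged field a quorum peer
// reports in its ApiVersionsResponse, one for a broker registration read from ZK.
public final class MigrationTrigger {

    public interface QuorumPeerApiVersions {
        boolean zkMigrationReady();
    }

    public interface ZkBrokerRegistration {
        boolean isMigrationReady();
    }

    public static boolean quorumReady(Collection<QuorumPeerApiVersions> quorum) {
        return !quorum.isEmpty()
            && quorum.stream().allMatch(QuorumPeerApiVersions::zkMigrationReady);
    }

    public static boolean zkBrokersReady(Collection<ZkBrokerRegistration> zkBrokers) {
        return !zkBrokers.isEmpty()
            && zkBrokers.stream().allMatch(ZkBrokerRegistration::isMigrationReady);
    }

    // The migration only starts once every controller and every ZK broker is ready.
    public static boolean canStartMigration(Collection<QuorumPeerApiVersions> quorum,
                                            Collection<ZkBrokerRegistration> zkBrokers) {
        return quorumReady(quorum) && zkBrokersReady(zkBrokers);
    }
}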

The first step in the migration is to copy the existing metadata from ZK and write it into the KRaft metadata log. The active controller will also establish itself as the active controller from a ZK perspective. While copying the ZK data, the controller will not handle any RPCs from brokers.

The metadata migration process will cause controller downtime proportional to the total size of metadata in ZK. 

The metadata copied from ZK will be encapsulated in a single metadata transaction (KIP-868). A MigrationRecord will also be included in this transaction. 

At this point, all of the brokers are running in ZK mode and their broker-controller communication channels operate as they would with a ZK controller. The ZK brokers will learn about this new controller by receiving an UpdateMetadataRequest from the new KRaft controller. From a broker’s perspective, the controller looks and behaves like a normal ZK controller. 

Metadata changes are now written to the KRaft metadata log as well as ZooKeeper. 

In order to ensure consistency of the metadata, we must stop making any writes to ZK while we are migrating the data. This is accomplished by making the new KRaft controller the active controller from a ZK perspective, which is done by forcing a write to the "/controller" and "/controller_epoch" ZNodes.
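
A sketch of how this forced write could be performed atomically is shown below. Only the ZooKeeper multi-op API is taken from the ZooKeeper client; the payload formats and epoch handling are simplified assumptions (for example, it assumes any existing ephemeral "/controller" ZNode has already expired or been removed).

Code Block
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.List;
import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.KeeperException;
import org.apache.zookeeper.Op;
import org.apache.zookeeper.OpResult;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

// Illustrative sketch: claim ZK controllership in a single atomic multi-op by creating
// a persistent /controller ZNode and bumping /controller_epoch. If another controller
// has changed /controller_epoch in the meantime, the version check fails the whole op.
public final class ZkControllershipClaim {

    public static List<OpResult> claim(ZooKeeper zk,
                                       int kraftControllerId,
                                       int currentEpoch,
                                       int currentEpochZkVersion)
            throws KeeperException, InterruptedException {
        // Simplified placeholder payloads, not the exact ZNode formats.
        byte[] controllerPayload =
            ("{\"brokerid\":" + kraftControllerId + "}").getBytes(StandardCharsets.UTF_8);
        byte[] epochPayload =
            Integer.toString(currentEpoch + 1).getBytes(StandardCharsets.UTF_8);
        return zk.multi(Arrays.asList(
            Op.create("/controller", controllerPayload,
                ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT),
            Op.setData("/controller_epoch", epochPayload, currentEpochZkVersion)));
    }
}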

...

Once the operator has decided to commit to KRaft mode, the final step is to restart the controller quorum and take it out of migration mode by setting kafka.metadata.migration.enable to "false" (or unsetting it). The active controller will only finalize the migration once it detects that all members of the quorum have signaled that they are finalizing the migration (again, using the tagged field in ApiVersionsResponse). Once the controller leaves migration mode, it will write a MigrationRecord to the log, stop performing writes to ZK, and disable its special handling of ZK RPCs.

At this point, the cluster is fully migrated and is running in KRaft mode. A rollback to ZK is still possible after finalizing the migration, but it must be done offline and it will cause metadata loss (which can also cause partition data loss).

...

By writing metadata changes to ZK, we also maintain compatibility with a few remaining direct ZK dependencies that exist on the ZK brokers. 

  • Broker Registration
  • ACLs
  • Dynamic Configs
  • Delegation Tokens

...

While running in migration mode, we must synchronize broker registration information bidirectionally between ZK and KRaft. 

The KRaft controller will send UpdateMetadataRequests to ZK brokers to inform them of the other brokers in the cluster. This information is used by the brokers for the replication protocols. Similarly, the KRaft controller must know about ZK and KRaft brokers when performing operations like assignments and leader election.

...

In order to discover which ZK brokers exist, the KRaft controller will need to read the “/brokers” state from ZK and copy it into the metadata log. Conversely, as KRaft brokers register with the KRaft controller, we must write this data back to ZK to prevent ZK brokers from registering with the same node ID.

If a ZK broker comes online and registers itself with the nodeId of an existing KRaft broker, we will log an error and fence the errant ZK broker by not sending it UpdateMetadataRequests.

If a KRaft broker attempts to register itself with the nodeId of an existing ZK broker, the controller will refuse the registration and the broker will terminate.
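
A condensed sketch of these two conflict rules is shown below; the types and method names are illustrative, not the actual broker registration code.

Code Block
import java.util.Map;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

// Illustrative sketch of the node-id conflict rules: a ZK broker that reuses a
// KRaft broker's id is logged and fenced (never sent UpdateMetadataRequests), while
// a KRaft broker that reuses a ZK broker's id has its registration refused.
public final class BrokerIdConflictPolicy {

    private static final Logger log = LoggerFactory.getLogger(BrokerIdConflictPolicy.class);

    public enum Mode { ZK, KRAFT }

    public static boolean shouldAcceptRegistration(int nodeId,
                                                   Mode registeringMode,
                                                   Map<Integer, Mode> knownBrokers) {
        Mode existing = knownBrokers.get(nodeId);
        if (existing == null || existing == registeringMode) {
            // No cross-mode conflict; normal registration handling applies.
            return true;
        }
        if (registeringMode == Mode.ZK) {
            // Errant ZK broker: log an error and fence it by withholding UpdateMetadataRequests.
            log.error("ZK broker registered with node id {} already owned by a KRaft broker; fencing it.", nodeId);
            return false;
        }
        // KRaft broker colliding with a ZK broker: refuse the registration so the broker terminates.
        return false;
    }
}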

AdminClient, MetadataRequest, and Forwarding

...

The ZK migration logic will need to deal with asynchronous topic deletions when migrating data. Normally, the ZK controller will complete these asynchronous deletions via TopicDeletionManager. If the KRaft controller takes over before a deletion has occurred, we will need to complete the deletion as part of the ZK to KRaft state migration. Once the migration is complete, we will need to finalize the deletion in ZK so that the state is consistent.

Rollback to ZK

As mentioned above, it should be possible for the operator to roll back to ZooKeeper at any point in the migration process prior to taking the KRaft controllers out of migration mode. The procedure for rolling back is to reverse the steps of the migration that have been completed so far. 

  • Brokers should be restarted one by one in ZK mode
  • The KRaft controller quorum should be cleanly shutdown
  • The operator can remove the persistent "/controller" and "/controller_epoch" ZNodes, allowing a ZK controller election to take place

A clean shutdown of the KRaft quorum is important because there may be uncommitted metadata waiting to be written to ZooKeeper. A forceful shutdown could cause some of this metadata to be lost, potentially leading to data loss.

Failure Modes

There are a few failure scenarios to consider during the migration. The KRaft controller can crash while initially copying the data from ZooKeeper, the controller can crash some time after the initial migration, and the controller can fail to write new metadata back to ZK.

Initial Data Migration

For the initial migration, the controller will utilize KIP-868 Metadata Transactions to write all of the ZK metadata in a single transaction. If the controller fails before this transaction is finalized, the next active controller will abort the transaction and restart the migration process.

Controller Crashes

Once the data has been migrated and the cluster is in the MigrationActive or MigrationFinished state, the KRaft controller may fail. If this happens, the Raft layer will elect a new leader, which will update the "/controller" and "/controller_epoch" ZNodes and take over controller leadership as usual.

Unavailable ZooKeeper

While in the dual-write mode, it is possible for a write to ZK to fail. In this case, we will want to stop making updates to the metadata log to avoid unbounded lag between KRaft and ZooKeeper. Since ZK brokers will be reading data like ACLs and dynamic configs from ZooKeeper, we should limit the amount of divergence between ZK and KRaft brokers by setting a bound on the amount of lag between KRaft and ZooKeeper.

Incompatible Brokers

At any time during the migration, it is possible for an operator to bring up an incompatible broker. This could be a new or existing broker. In this event, the KRaft controller will see the broker registration in ZK, but it will not send it any RPCs. By refusing to send it UpdateMetadata or LeaderAndIsr RPCs, this broker will be effectively fenced from the rest of the cluster. 

Misconfigurations

A few misconfiguration scenarios exist which we can guard against.

If a migration has been started, but a KRaft controller is elected that is misconfigured (it does not have kafka.metadata.migration.enable set or is missing the ZK configs), this controller should resign. When replaying the metadata log during its initialization phase, this controller can see that a migration is in progress from the initial MigrationRecord. Since it does not have the required configs, it can resign leadership and throw an error.

If a migration has been finalized, but the KRaft quorum comes up with kafka.metadata.migration.enable set, we must not re-enter migration mode. In this case, while replaying the log, the controller can see the second MigrationRecord and know that the migration is finalized and should not be resumed. This should result in errors being thrown, but the quorum can continue operating as normal.

Other scenarios likely exist and will be examined as the migration feature is implemented.


Test Plan

In addition to basic "happy path" tests, we will also want to test that the migration can tolerate failures of brokers and KRaft controllers. We will also want to have tests for the correctness of the system if ZooKeeper becomes unavailable during the migration. Another class of tests for this process is metadata consistency at the broker level. Since we are supporting ZK and KRaft brokers simultaneously, we need to ensure their metadata does not stay inconsistent for very long.

...