Status

Current state: Under Discussion

Discussion thread: here

JIRA: Unable to render Jira issues macro, execution error.

Please keep the discussion on the mailing list rather than commenting on the wiki (wiki discussions get unwieldy fast).

Motivation

As part of the KIP-500 initiative, we need to build a bridge release version of Kafka that could isolate the direct Zookeeper write access only to the controller. Protocols that alter cluster/topic configurations, security configurations or quotas, topics etc, should be migrated for sure as they are still relying on arbitrary broker to Zookeeper write access.

Take config change protocol for example. The current AlterConfig request propagation path is:

The admin client issues an (Incremental)AlterConfig request to broker
Broker updates the zookeeper path storing the metadata
If #2 successful, returns to the client
All brokers refresh their metadata upon ZK notification

Here we use ZK as the persistent storage for all the config changes, and even some brokers are not able to get in sync with ZK due to transient failures, a successful update shall be eventually guaranteed. In this KIP we would like to maintain the same level of guarantee, and make the controller as the single writer to modify the config metadata in ZK.

Proposed Changes

Take AlterConfig as an example to understand the change we are making.

Change AlterConfig Request Routing

During the bridge release, the propagation path shall be revised as:

The admin client issues an (Incremental)AlterConfig request to the controller
The controller updates the config, and stores it in ZK
If #3 successful, returns to the client
ZK update will be propagated towards all affected brokers in the cluster

This simple routing change makes sure only controller needs to write to ZK, while other broker shall just wait for the metadata update from ZK notification eventually. As we have the source of truth configs stored in ZK still, any re-election of controller shall be safe.

The above path works with the new admin client we are going to introduce with next release. For older clients, they are not expected to forward the request to controller, so brokers should support the proxy of requests, with a revised update path:

The old admin client issues an (Incremental)AlterConfig request to a random broker
The broker redirects the request to the controller
The controller updates the config, and store it in ZK
If #3 successful, returns to the proxy broker
The proxy broker returns to the client as success
ZK update will be propagated towards all affected brokers in the cluster

This whole update strategy change would be applied to all the direct ZK mutation paths, including:

```
AlterConfig
```
```
IncrementalAlterConfig 
```
```
CreateAcls
```
```
DeleteAcls
```
```
AlterClientQuotas
```
```
CreateDelegationToken
```
```
RenewDelegationToken 
```
```
ExpireDelegationToken
```

Partially Change FindCoordinator Request Routing

One edge case we also would like to fix is for the FindCoordinator protocol. It has a special internal topic creation logic when the cluster receives the request for the first time as transaction log topic and consumer offset topic are lazily initialized. Currently the target broker shall just utilize its own ZK client to create topic, which is disallowed in the bridge release. For this scenario, non-controller broker shall just forward a CreateTopicRequest to the controller instead and let controller take care of the rest, while waiting for the response in the mean time.

Public Interfaces

We are going to bump all mentioned APIs above by one version, and new admin client was expected to only talk to the controller. For example we bump the AlterConfig API to v2.

AlterConfigRequest.json

{
  "apiKey": 44,
  "type": "request",
  "name": "IncrementalAlterConfigsRequest",
  // Version 1 is the first flexible version. For new binary deploy, this should always be forwarded to the controller.
  //
  // Version 2 the request shall always route to the controller.
  "validVersions": "0-2",
  "flexibleVersions": "1+",
   "fields": [
    { "name": "Resources", "type": "[]AlterConfigsResource", "versions": "0+",
      "about": "The incremental updates for each resource.", "fields": [
      { "name": "ResourceType", "type": "int8", "versions": "0+", "mapKey": true,
        "about": "The resource type." },
      { "name": "ResourceName", "type": "string", "versions": "0+", "mapKey": true,
        "about": "The resource name." },
      { "name": "Configs", "type": "[]AlterableConfig", "versions": "0+",
        "about": "The configurations.",  "fields": [
        { "name": "Name", "type": "string", "versions": "0+", "mapKey": true,
          "about": "The configuration key name." },
        { "name": "ConfigOperation", "type": "int8", "versions": "0+", "mapKey": true,
          "about": "The type (Set, Delete, Append, Subtract) of operation." },
        { "name": "Value", "type": "string", "versions": "0+", "nullableVersions": "0+",
          "about": "The value to set for the configuration key."}
      ]}
    ]},
}

If the request is v1, broker with new AlterConfig protocol enabled should always proxy the request to the controller by sending a v2 request. If the request is v2, the recipient broker should handle it or respond a NOT_CONTROLLER to ask for rediscovery if it is not the controller.

This effectively means the request shall never be forwarded twice after bumping from v1 to v2, which avoids the edge scenario for infinite redirection during unstable controller elections.

Same applies for all other mentioned requests as well. The new version request should always go to the controller. If the request is on an older version, broker shall redirect it to the controller.

```
AlterConfig to v2
```
```
IncrementalAlterConfig to v2
```
```
CreateAcls to v3
```
```
DeleteAcls to v3
```
```
AlterClientQuotas to v1
```
```
CreateDelegationToken to v3
```
```
RenewDelegationToken to v3
```
```
ExpireDelegationToken to v3
```

Compatibility, Deprecation, and Migration Plan

The upgrade path shall be guarded by the inter.broker.protocol(IBP) to make sure the routing behavior is consistent. After first rolling bounce to upgrade the binary version, all fellow brokers are still handling ZK mutation requests by themselves. With the second IBP bump rolling bounce, all upgraded brokers will be using the new routing algorithm effectively described in this KIP.

As we discussed in the request routing section, to work with an older client, the first contacted broker need to act as a proxy to redirect the write request to the controller. To support the proxy of requests, we need to build a channel for brokers to talk directly to the controller. This part of the design is internal change only and won’t block the KIP progress.

Rejected Alternatives

We discussed about the possibility of immediately building a metadata topic to propagate the changes. This seems aligned with the eventual metadata quorum path, but at a cost of blocking the current API migration towards the bridge release, since the metadata quorum design is much more complicated and requires more iterations. To avoid this extra dependency on other tracks, we should go ahead and migrate existing protocols to meet the bridge release goal sooner.

Future Works

We have also discussed about migrating the metadata read path to controller-only for read-after-write consistency. This sounds like a nice improvement but needs more discussions on trade-offs between overloading controller and the metadata consistency, also the progress of Raft quorum design as well.

Space shortcuts

Child pages

Status

Motivation

Proposed Changes

Change AlterConfig Request Routing

Partially Change FindCoordinator Request Routing

Public Interfaces

Compatibility, Deprecation, and Migration Plan

Rejected Alternatives

Future Works

Space shortcuts

Child pages

KIP-590: Redirect Zookeeper Mutation Protocols to The Controller

Status

Motivation

Proposed Changes

Change AlterConfig Request Routing

Partially Change FindCoordinator Request Routing

Public Interfaces

Compatibility, Deprecation, and Migration Plan

Rejected Alternatives

Future Works