Master KIP: KIP-500: Replace ZooKeeper with a Self-Managed Metadata Quorum (Accepted)
Status
Current state: Under Discussion
Discussion thread: here
JIRA:
Please keep the discussion on the mailing list rather than commenting on the wiki (wiki discussions get unwieldy fast).
Motivation
As part of the KIP-500 initiative, we need to build a bridge release version of Kafka that isolates direct ZooKeeper write access to the controller only. Protocols that alter cluster/topic configurations, security configurations, quotas, topics, etc. must be migrated, as they still rely on arbitrary brokers having ZooKeeper write access.
Take the config change protocol as an example. The current AlterConfig request propagation path is:
1. The admin client issues an (Incremental)AlterConfig request to a broker
2. The broker updates the ZooKeeper path storing the metadata
3. If #2 is successful, the broker returns a response to the client
4. All brokers refresh their metadata upon ZK notification
Here we use ZK as the persistent storage for all config changes; even if some brokers temporarily fail to sync with ZK due to transient failures, a successful update is eventually guaranteed. In this KIP we would like to maintain the same level of guarantee, while making the controller the single writer that modifies the config metadata in ZK.
Proposed Changes
Take AlterConfig as an example to understand the changes we are making.
Change AlterConfig Request Routing
During the bridge release, the propagation path shall be revised as:
1. The admin client issues an (Incremental)AlterConfig request to the controller
2. The controller updates the config and stores it in ZK
3. If #2 is successful, the controller returns a response to the client
4. The ZK update is propagated to all affected brokers in the cluster
This simple routing change ensures that only the controller needs to write to ZK, while other brokers simply wait for the metadata update via ZK notification. Since the source-of-truth configs are still stored in ZK, any re-election of the controller remains safe.
The above path works with the new admin client we are going to introduce in the next release. Older clients are not expected to forward requests to the controller, so brokers must proxy such requests, with a revised update path:
1. The old admin client issues an (Incremental)AlterConfig request to a random broker
2. The broker redirects the request to the controller
3. The controller updates the config and stores it in ZK
4. If #3 is successful, the controller returns a response to the proxy broker
5. The proxy broker returns success to the client
6. The ZK update is propagated to all affected brokers in the cluster
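The proxied path above can be sketched as follows. This is a minimal illustration only: the class, method, and channel names are hypothetical, not the actual broker implementation.

```java
import java.util.function.Function;

// Hypothetical sketch of the proxy path for old admin clients; all names
// here are illustrative, not the actual broker code.
class AlterConfigProxy {
    static final short NONE = 0;

    /** What the client ultimately receives, and which broker answered it. */
    static class Result {
        final short errorCode;
        final String respondedBy;
        Result(short errorCode, String respondedBy) {
            this.errorCode = errorCode;
            this.respondedBy = respondedBy;
        }
    }

    /**
     * Steps 2-5 of the revised update path: a non-controller broker redirects
     * the request to the controller and relays the controller's answer.
     */
    static Result handleOnReceivingBroker(boolean isController,
                                          Function<String, Short> controllerChannel,
                                          String request) {
        if (isController) {
            // The controller updates the config and persists it in ZK itself.
            return new Result(NONE, "controller");
        }
        // Redirect to the controller, wait for its answer, reply to the client.
        short error = controllerChannel.apply(request);
        return new Result(error, "proxy-broker");
    }
}
```

Note that the proxy broker merely relays the controller's result; it performs no ZK write of its own.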
This update strategy change will be applied to all the direct ZK mutation paths, including:
AlterConfig
IncrementalAlterConfig
CreateAcls
DeleteAcls
AlterClientQuotas
CreateDelegationToken
RenewDelegationToken
ExpireDelegationToken
Internal CreateTopicsRequest Routing
We would also like to fix certain edge cases in internal topic creation:
- The FindCoordinator protocol includes internal topic creation logic when the cluster receives the request for the first time, since the transaction log topic and consumer offset topic are lazily initialized.
- The MetadataRequest protocol also contains internal topic creation logic when metadata is requested for a non-existent internal topic and auto topic creation is enabled.
Currently the target broker simply uses its own ZK client to create internal topics, which is disallowed in the bridge release. In the above scenarios, a non-controller broker should instead forward a CreateTopicsRequest to the controller and let the controller take care of the rest, while waiting for the response in the meantime.
One thing to note is that at the moment the direct ZK access bypasses the CreateTopicPolicy. To maintain the same guarantee, we would consider adding a new field to the RPC to flag this bypass logic.
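A minimal sketch of this forwarding decision, under the assumption that the hypothetical bypass flag is added. The request struct and field names are illustrative only, not the generated Kafka RPC classes:

```java
// Hypothetical sketch: how a non-controller broker could turn a lazy internal
// topic creation into a forwarded CreateTopics request. Field names are
// illustrative only, not the generated Kafka RPC classes.
class InternalTopicCreation {
    static class CreateTopicsRequest {
        String topic;
        int numPartitions;
        short replicationFactor;
        boolean bypassTopicPolicy; // hypothetical flag proposed in this KIP
    }

    /** Build the request a non-controller broker would forward to the controller. */
    static CreateTopicsRequest forwardedRequest(String internalTopic,
                                                int numPartitions,
                                                short replicationFactor) {
        CreateTopicsRequest req = new CreateTopicsRequest();
        req.topic = internalTopic;
        req.numPartitions = numPartitions;
        req.replicationFactor = replicationFactor;
        // Direct ZK access used to bypass CreateTopicPolicy; setting the flag
        // keeps that behavior when the creation goes through the controller.
        req.bypassTopicPolicy = true;
        return req;
    }
}
```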
Routing Request Security
For older requests that need redirection, we shall create a new RPC called `Envelope` to embed the actual request. It fully wraps an older-version client request, including its header, security information, and actual data fields. The request requires ClusterAction on CLUSTER.
Public Interfaces
Protocol Bumps
We are going to bump all the mutation APIs mentioned above by one version, and the new admin client is expected to talk only to the controller. For example, we bump the AlterConfig API to v2.
{
  "apiKey": 44,
  "type": "request",
  "name": "IncrementalAlterConfigsRequest",
  // Version 1 is the first flexible version. For new binary deployments, this
  // should always be forwarded to the controller.
  //
  // Version 2 requests shall always route to the controller.
  "validVersions": "0-2",
  "flexibleVersions": "1+",
  "fields": [
    { "name": "Resources", "type": "[]AlterConfigsResource", "versions": "0+",
      "about": "The incremental updates for each resource.", "fields": [
      { "name": "ResourceType", "type": "int8", "versions": "0+", "mapKey": true,
        "about": "The resource type." },
      { "name": "ResourceName", "type": "string", "versions": "0+", "mapKey": true,
        "about": "The resource name." },
      { "name": "Configs", "type": "[]AlterableConfig", "versions": "0+",
        "about": "The configurations.", "fields": [
        { "name": "Name", "type": "string", "versions": "0+", "mapKey": true,
          "about": "The configuration key name." },
        { "name": "ConfigOperation", "type": "int8", "versions": "0+", "mapKey": true,
          "about": "The type (Set, Delete, Append, Subtract) of operation." },
        { "name": "Value", "type": "string", "versions": "0+", "nullableVersions": "0+",
          "about": "The value to set for the configuration key." }
      ]}
    ]}
  ]
}
If the request is v1, a broker with the new AlterConfig protocol enabled should always proxy the request to the controller. If the request is v2, the recipient broker should handle it if it is the controller, or respond with NOT_CONTROLLER to trigger rediscovery if it is not.
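The routing rule above can be sketched as a small decision function. This is an illustration under the AlterConfig example (v2 is the first controller-only version), not the actual broker code:

```java
// Sketch of the per-request routing decision; versions follow the AlterConfig
// example in this KIP, where v2 is the first controller-only version.
class RequestRouting {
    enum Action { HANDLE_LOCALLY, FORWARD_TO_CONTROLLER, RETURN_NOT_CONTROLLER }

    static final short FIRST_CONTROLLER_ONLY_VERSION = 2;

    static Action route(boolean isController, short requestVersion) {
        if (isController)
            return Action.HANDLE_LOCALLY;            // the controller serves every version
        if (requestVersion >= FIRST_CONTROLLER_ONLY_VERSION)
            return Action.RETURN_NOT_CONTROLLER;     // new client must rediscover the controller
        return Action.FORWARD_TO_CONTROLLER;         // proxy the old client's request
    }
}
```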
The same applies to all other mentioned requests: the new version always goes to the controller, and brokers redirect older-version requests to the controller.
AlterConfig to v2
IncrementalAlterConfig to v2
CreateAcls to v3
DeleteAcls to v3
AlterClientQuotas to v1
CreateDelegationToken to v3
RenewDelegationToken to v3
ExpireDelegationToken to v3
The CreateTopics routing change is purely inter-broker. Since CreateTopicsRequest is already handled only by the controller, no protocol change is needed on this side.
New Envelope RPC
We are also going to add a new RPC type that wraps the original request during forwarding. We will make corresponding changes to the `ApiMessageTypeGenerator` class to recognize the new fields `Header` and `ApiMessage` during auto-generation. For authentication and audit logging purposes, we propose adding the following fields:
- Serialized Principal information
- Client host ip address
- Listener name
- Security protocol being used
{
  "apiKey": N,
  "type": "request",
  "name": "EnvelopeRequest",
  "validVersions": "0",
  "flexibleVersions": "0+",
  "fields": [
    { "name": "RequestHeader", "type": "Header", "versions": "0+",
      "about": "The embedded request header." },
    { "name": "RequestData", "type": "ApiMessage", "versions": "0+",
      "about": "The embedded request data." },
    { "name": "PrincipalInfo", "type": "bytes", "versions": "0+",
      "about": "The serialized principal information." },
    { "name": "ClientHostIP", "type": "string", "versions": "0+" },
    { "name": "ListenerName", "type": "string", "versions": "0+" },
    { "name": "SecurityProtocol", "type": "string", "versions": "0+" }
  ]
}
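To illustrate how these fields travel together with the wrapped request, here is a hand-rolled, length-prefixed packing sketch. The real RPC uses Kafka's generated serialization; this layout is purely illustrative:

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

// Illustrative sketch of packing the envelope fields; the real RPC uses
// Kafka's generated serde, not this hand-rolled layout.
class EnvelopeBuilder {
    static ByteBuffer wrap(byte[] requestData, byte[] principal,
                           String clientHostIp, String listenerName,
                           String securityProtocol) {
        byte[] host = clientHostIp.getBytes(StandardCharsets.UTF_8);
        byte[] listener = listenerName.getBytes(StandardCharsets.UTF_8);
        byte[] proto = securityProtocol.getBytes(StandardCharsets.UTF_8);
        ByteBuffer buf = ByteBuffer.allocate(4 * 5 + requestData.length
                + principal.length + host.length + listener.length + proto.length);
        for (byte[] field : new byte[][] { requestData, principal, host, listener, proto }) {
            buf.putInt(field.length);   // length-prefix each field
            buf.put(field);
        }
        buf.flip();
        return buf;
    }
}
```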
EnvelopeRequest Handling
When receiving an EnvelopeRequest, the broker shall authorize the request with the forwarding broker's principal. If the outer request is verified, the broker will unwrap the inner request and handle it as normal, which means it will also perform authorization for the inner-layer principal. For the KIP-590 scope, the possible top-level error codes are:
- NOT_CONTROLLER as we are only forwarding admin write requests.
- CLUSTER_AUTHORIZATION_FAILED if the inter-broker verification failed.
The CLUSTER authorization for EnvelopeRequest takes place during request handling, similar to LeaderAndIsrRequest. This ensures the EnvelopeRequest is not sent from a malicious client pretending to be a fellow broker. Errors for the inner request are still embedded inside the `ResponseData` struct defined in EnvelopeResponse below.
{
  // Possible top level error codes:
  //
  // NOT_CONTROLLER
  // CLUSTER_AUTHORIZATION_FAILED
  //
  "apiKey": N,
  "type": "response",
  "name": "EnvelopeResponse",
  "validVersions": "0",
  "flexibleVersions": "0+",
  "fields": [
    { "name": "ResponseHeader", "type": "Header", "versions": "0+",
      "about": "The embedded response header." },
    { "name": "ResponseData", "type": "ApiMessage", "versions": "0+",
      "about": "The embedded response data." },
    { "name": "ErrorCode", "type": "int16", "versions": "0+",
      "about": "The error code, or 0 if there was no error." }
  ]
}
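The two-layer check described above can be sketched as follows. This is an illustration only; the numeric error codes match Kafka's Errors enum, but the authorizer interface here is a stand-in:

```java
import java.util.function.BiPredicate;

// Sketch of the two-layer check: CLUSTER authorization on the envelope first
// (with the forwarding broker's principal), then normal handling, including
// inner-principal authorization, on the unwrapped request. The numeric error
// codes match Kafka's Errors enum; the authorizer interface is a stand-in.
class EnvelopeHandling {
    static final short NONE = 0;
    static final short CLUSTER_AUTHORIZATION_FAILED = 31;
    static final short NOT_CONTROLLER = 41;

    /** Returns the top-level error code for an EnvelopeRequest. */
    static short topLevelError(boolean isController,
                               String forwardingBrokerPrincipal,
                               BiPredicate<String, String> authorizer) {
        // The envelope is authorized with the forwarding broker's principal,
        // not the end client's.
        if (!authorizer.test(forwardingBrokerPrincipal, "ClusterAction"))
            return CLUSTER_AUTHORIZATION_FAILED;
        if (!isController)
            return NOT_CONTROLLER;  // only admin writes are forwarded
        return NONE;                // unwrap and handle the inner request as usual
    }
}
```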
EnvelopeResponse Handling
When the response contains the NOT_CONTROLLER error code, the forwarding broker will keep trying to find the correct controller until the request eventually times out. CLUSTER_AUTHORIZATION_FAILED indicates an internal error in the broker security setup that has nothing to do with the client, so we have no choice but to return UNKNOWN_SERVER_ERROR to the admin client.
The forwarding broker does not inspect the controller's reply to the inner request. As long as the top level has no error, the forwarding broker will treat the request as successful and relay the inner response to the admin client for the rest of the error handling.
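A sketch of this retry-and-translate loop, using a bounded attempt count as a stand-in for the real deadline check (the numeric error codes match Kafka's Errors enum; everything else is illustrative):

```java
import java.util.function.Supplier;

// Sketch of the forwarding broker's retry loop: on NOT_CONTROLLER, keep
// re-locating the controller until the request deadline passes; translate
// CLUSTER_AUTHORIZATION_FAILED into UNKNOWN_SERVER_ERROR for the client.
// A bounded attempt count stands in for the real deadline check.
class ForwardingResponseHandling {
    static final short NONE = 0;
    static final short UNKNOWN_SERVER_ERROR = -1;
    static final short REQUEST_TIMED_OUT = 7;
    static final short CLUSTER_AUTHORIZATION_FAILED = 31;
    static final short NOT_CONTROLLER = 41;

    static short forwardWithRetry(Supplier<Short> sendToCurrentController,
                                  int maxAttemptsBeforeDeadline) {
        for (int attempt = 0; attempt < maxAttemptsBeforeDeadline; attempt++) {
            short error = sendToCurrentController.get();
            if (error == NOT_CONTROLLER)
                continue;  // rediscover the controller and try again
            if (error == CLUSTER_AUTHORIZATION_FAILED)
                return UNKNOWN_SERVER_ERROR;  // broker security misconfiguration
            return NONE;   // relay the inner response; its own errors are untouched
        }
        return REQUEST_TIMED_OUT;
    }
}
```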
KafkaPrincipal Serialization
We shall also bring in a backward-incompatible change that adds serializability to the KafkaPrincipalBuilder interface:
public interface KafkaPrincipalBuilder {
    ...
    ByteBuffer serialize();
    KafkaPrincipal deserialize(ByteBuffer serializedPrincipal);
}
which requires a non-default implementation to ensure the principal information is properly encoded into the forwarding request.
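One possible (hypothetical) implementation of such serialization: encode the principal as length-prefixed UTF-8 "type" and "name" fields. The stub KafkaPrincipal here stands in for Kafka's real class:

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

// Hypothetical non-default implementation of the new serialize/deserialize
// methods: encode the principal as length-prefixed UTF-8 "type" and "name"
// fields. The stub KafkaPrincipal stands in for Kafka's real class.
class PrincipalSerde {
    static class KafkaPrincipal {
        final String principalType;
        final String name;
        KafkaPrincipal(String principalType, String name) {
            this.principalType = principalType;
            this.name = name;
        }
    }

    static ByteBuffer serialize(KafkaPrincipal principal) {
        byte[] type = principal.principalType.getBytes(StandardCharsets.UTF_8);
        byte[] name = principal.name.getBytes(StandardCharsets.UTF_8);
        ByteBuffer buf = ByteBuffer.allocate(8 + type.length + name.length);
        buf.putInt(type.length).put(type);
        buf.putInt(name.length).put(name);
        buf.flip();
        return buf;
    }

    static KafkaPrincipal deserialize(ByteBuffer buf) {
        byte[] type = new byte[buf.getInt()];
        buf.get(type);
        byte[] name = new byte[buf.getInt()];
        buf.get(name);
        return new KafkaPrincipal(new String(type, StandardCharsets.UTF_8),
                                  new String(name, StandardCharsets.UTF_8));
    }
}
```

A round trip through serialize/deserialize must preserve both fields, which is the property the forwarding path depends on.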
Monitoring Metrics
To effectively monitor the admin request forwarding status, we will add two metrics, num-client-forwarding-to-controller-rate and num-client-forwarding-to-controller-count, to visualize the progress. They track the number of clients with a pending forwarded request on each broker.
As for the usage of the Envelope RPC, not all forwarded requests are necessarily due to an old admin client, nor do they necessarily go to the controller. To better track the forwarding load, we shall also monitor the total number of redirected messages with the metrics num-messages-redirected-rate and num-messages-redirected-count.
We could add an alerting metric called request-forwarding-to-controller-authorization-fail-count to help administrators detect a wrong security setup sooner. There should already be metrics monitoring request failures, so this metric could be optional.
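The count metrics above could be maintained roughly as follows. This is a bare sketch; the real implementation would use Kafka's Metrics/Sensor machinery rather than raw counters:

```java
import java.util.concurrent.atomic.AtomicLong;

// Minimal sketch of the proposed count metrics: the pending-forward counter
// moves up when a forward starts and down when it completes, while the
// redirected-message counter only grows. The real implementation would use
// Kafka's Metrics/Sensor machinery, not raw counters.
class ForwardingMetrics {
    final AtomicLong clientsForwardingToController = new AtomicLong();
    final AtomicLong messagesRedirected = new AtomicLong();

    void onForwardStarted() {
        clientsForwardingToController.incrementAndGet();
        messagesRedirected.incrementAndGet();
    }

    void onForwardCompleted() {
        clientsForwardingToController.decrementAndGet();
    }
}
```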
In the long term, after the bridge release, we could potentially make incompatible changes to the Raft quorum. To monitor old client connections and better capture the timing for a major version bump, KIP-511 has already exposed metrics such as an "unknown" client software name and an "unknown" client software version.
Compatibility, Deprecation, and Migration Plan
The upgrade path shall be guarded by the inter.broker.protocol.version (IBP) to make sure the routing behavior is consistent. After the first rolling bounce to upgrade the binary version, all brokers still handle ZK mutation requests by themselves. After the second rolling bounce that bumps the IBP, all upgraded brokers will use the new routing algorithm described in this KIP.
As discussed in the request routing section, to work with an older client the first contacted broker needs to act as a proxy and redirect the write request to the controller. To support this proxying, we need to build a channel for brokers to talk directly to the controller. This part of the design is an internal change only and won't block the KIP's progress.
Compatibility Breakage
The KafkaPrincipal builder serializability is a binary-incompatible change stated in this KIP. For a smooth rollout, we will defer this part of the implementation until the next major incompatible release, i.e., Apache Kafka 3.0. This means we will break down the effort as follows:
- For next 2.x release:
- Get new admin client forwarding changes
- Get the Envelope RPC implementation
- Get the forwarding path working and validate it with fake principals in a testing environment, without triggering it in the production system
- For next 3.0 release:
- Introduce serializability to PrincipalBuilder
- Turn on forwarding path in production and perform end-to-end testing
Rejected Alternatives
- We discussed the possibility of immediately building a metadata topic to propagate the changes. This aligns with the eventual metadata quorum path, but at the cost of blocking the current API migration towards the bridge release, since the metadata quorum design is much more complicated and requires more iterations. To avoid this extra dependency on other tracks, we should go ahead and migrate existing protocols to meet the bridge release goal sooner.
- We discussed adding a tagged field to the request header to carry the original request's principal name for controller audit logging. This approach has potential security concerns, and making a principal field optional also risks it being ignored by older brokers when the IBP is badly configured. Instead we chose the envelope approach, which is more old-fashioned, standard, and secure. Additionally, upgrading the request to a newer version in the middle of the proxy could produce unexpected missing fields, which would be tricky to handle for each RPC bump case by case.
- We also discussed having the forwarding broker do the authentication and only forward admin requests for valid resources. This decentralizes audit logging and adds complexity, so we decided not to perform this optimization and instead let the controller do the raw request authentication and logging.
Future Works
We have also discussed migrating the metadata read path to be controller-only for read-after-write consistency. This sounds like a nice improvement, but it needs more discussion on the trade-off between overloading the controller and metadata consistency, as well as on the progress of the Raft quorum design.