Parent KIP

KIP-500: Replace ZooKeeper with a Self-Managed Metadata Quorum (Accepted)

Status

Current state: Accepted

Discussion thread: here

Please keep the discussion on the mailing list rather than commenting on the wiki (wiki discussions get unwieldy fast).

Motivation

As part of the KIP-500 initiative, we need to build a bridge release of Kafka that restricts direct ZooKeeper write access to the controller. Protocols that alter cluster or topic configurations, security configurations, quotas, topics, and so on must be migrated, because they still rely on arbitrary brokers writing to ZooKeeper.

Take the config change protocol as an example. The current AlterConfig request propagation path is:

1. The admin client issues an (Incremental)AlterConfig request to a broker.
2. The broker updates the ZooKeeper path storing the metadata.
3. If step 2 succeeds, the broker responds to the client.
4. All brokers refresh their metadata upon ZK notification.

Here ZK is the persistent storage for all config changes: even if some brokers fall out of sync with ZK due to transient failures, a successful update is guaranteed to reach them eventually. In this KIP we want to maintain the same level of guarantee while making the controller the single writer of config metadata in ZK.

Proposed Changes

Take AlterConfig as an example to understand the changes we are making.

Change AlterConfig Request Routing

The routing change ensures that only the controller needs to write to ZK, while the other brokers simply wait for the metadata update to arrive via ZK notification. Since ZK remains the source of truth for configs, any controller re-election is safe.

For admin RPCs that are currently sent directly to the controller, brokers should support proxying such requests, with a revised update path during the bridge release:

1. The admin client issues an (Incremental)AlterConfig request to a random broker.
2. The broker redirects the request to the controller.
3. The controller updates the config and stores it in ZK.
4. If step 3 succeeds, the controller responds to the proxy broker.
5. The proxy broker returns success to the client.
6. The ZK update is propagated to all affected brokers in the cluster.
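
The redirected path above can be sketched as a small in-memory simulation. This is only an illustration under stated assumptions: `ForwardingSketch`, `FakeZk`, and `Node` are hypothetical names, and the map standing in for ZooKeeper is not how Kafka stores configs. It shows that a non-controller broker relays the request and that only the controller performs the write.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical in-memory model of the redirected update path; not Kafka code.
final class ForwardingSketch {

    // Stand-in for ZooKeeper: holds configs and counts writes per node id.
    static final class FakeZk {
        final Map<String, String> configs = new HashMap<>();
        final Map<Integer, Integer> writesByNode = new HashMap<>();

        void write(int nodeId, String key, String value) {
            configs.put(key, value);
            writesByNode.merge(nodeId, 1, Integer::sum);
        }
    }

    static final class Node {
        final int id;
        final FakeZk zk;
        final Node controller; // null when this node is itself the controller

        Node(int id, FakeZk zk, Node controller) {
            this.id = id;
            this.zk = zk;
            this.controller = controller;
        }

        // A non-controller broker proxies the request and relays the result;
        // only the controller performs the ZK write.
        boolean handleAlterConfig(String key, String value) {
            if (controller != null) {
                return controller.handleAlterConfig(key, value);
            }
            zk.write(id, key, value);
            return true;
        }
    }

    public static void main(String[] args) {
        FakeZk zk = new FakeZk();
        Node controller = new Node(0, zk, null);
        Node broker = new Node(1, zk, controller);
        boolean ok = broker.handleAlterConfig("retention.ms", "60000");
        System.out.println("success=" + ok + ", zk writes by node=" + zk.writesByNode);
    }
}
```

The sketch omits the final propagation step, in which the other brokers pick up the change from the ZK notification.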

This whole update strategy change would be applied to all the direct ZK mutation paths, including:

AlterConfig
IncrementalAlterConfig
CreateAcls
DeleteAcls
AlterClientQuotas
CreateDelegationToken
RenewDelegationToken
ExpireDelegationToken

Internal CreateTopicsRequest Routing

We would also like to fix some edge cases around internal topic creation:

The FindCoordinator protocol contains internal topic creation logic that runs when the cluster receives the request for the first time, because the transaction log topic and the consumer offset topic are lazily initialized.
The MetadataRequest protocol also contains internal topic creation logic, which runs when metadata is requested for a non-existent internal topic and auto topic creation is enabled.

Currently the receiving broker uses its own ZK client to create internal topics, which is disallowed in the bridge release. After this KIP, if the broker receiving the request is the active controller, it handles the creation itself; otherwise, the receiving broker sends a new CreateTopicsRequest to the active controller, lets the controller take care of the rest, and waits for the response in the meantime.

Note that the direct ZK access currently bypasses the CreateTopicPolicy. This is a hole in the topic creation logic that we should fix. From now on, if a MetadataRequest tries to create an internal topic and fails, the receiving broker will reply with a fatal error so that the client fails fast and propagates the message to the user.
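
As a rough sketch of this routing, assume a hypothetical `Controller` that is the only place topics are created, so the creation policy applies no matter which broker received the original request. All class, method, and field names below are illustrative and do not exist in Kafka; the policy is modeled as a simple deny set.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical model of internal topic creation routing; not Kafka code.
final class InternalTopicRoutingSketch {
    enum Result { CREATED, POLICY_VIOLATION }

    // Stand-in for the active controller, the only node that creates topics.
    static final class Controller {
        final Set<String> topics = new HashSet<>();
        final Set<String> deniedByPolicy = new HashSet<>(); // models CreateTopicPolicy

        Result createTopic(String name) {
            if (deniedByPolicy.contains(name)) {
                return Result.POLICY_VIOLATION;
            }
            topics.add(name);
            return Result.CREATED;
        }
    }

    static final class Broker {
        final boolean isActiveController;
        final Controller controller;
        int forwardedCreateRequests = 0;

        Broker(boolean isActiveController, Controller controller) {
            this.isActiveController = isActiveController;
            this.controller = controller;
        }

        // Called when FindCoordinator or Metadata handling needs a missing internal topic.
        Result autoCreateInternalTopic(String name) {
            if (!isActiveController) {
                // Models sending a new CreateTopicsRequest to the active controller.
                forwardedCreateRequests++;
            }
            // Either way the controller applies the policy and creates the topic.
            return controller.createTopic(name);
        }
    }
}
```

A `POLICY_VIOLATION` result here corresponds to the fatal error that the receiving broker returns to the client.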

Routing Request Security

For older requests that need redirection, the forwarding broker uses its own authorizer to verify the principal. If the request passes, the broker forwards it with its own credentials, so the controller only validates the broker principal in the forwarded request. The one exception is the controller audit log, which needs the principal name of the original request, so we will add an optional tagged field called "InitialPrincipalName" to the header of the proxied request.

In addition, to avoid exposing this forwarding power to admin clients, the routed request shall be sent to the controller's internal endpoint, which in the KIP-500 controller should be visible only to other brokers inside the cluster. Any admin configuration request carrying a broker principal should not come through the public endpoint and will be rejected for security reasons.
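
The two checks described in this section can be sketched as follows. This is a minimal model, not the Kafka implementation: the `Request` type, the `Predicate`-based authorizer, and the method names are assumptions made for illustration, and principals are plain strings.

```java
import java.util.Optional;
import java.util.function.Predicate;

// Hypothetical model of the forwarding security checks; not Kafka code.
final class ForwardSecuritySketch {

    static final class Request {
        final String principal;                      // principal the receiver authenticates
        final Optional<String> initialPrincipalName; // optional tag for the controller audit log
        final boolean viaInternalEndpoint;           // arrived on the broker-only endpoint?

        Request(String principal, Optional<String> initialPrincipalName, boolean viaInternalEndpoint) {
            this.principal = principal;
            this.initialPrincipalName = initialPrincipalName;
            this.viaInternalEndpoint = viaInternalEndpoint;
        }
    }

    // Forwarding broker: authorize the original principal locally, then re-send the
    // request under the broker's own principal, tagging the original principal name.
    static Optional<Request> forward(Request clientRequest,
                                     Predicate<String> localAuthorizer,
                                     String brokerPrincipal) {
        if (!localAuthorizer.test(clientRequest.principal)) {
            return Optional.empty();
        }
        return Optional.of(new Request(brokerPrincipal, Optional.of(clientRequest.principal), true));
    }

    // Controller: an admin request carrying the broker principal is rejected
    // unless it arrived on the internal endpoint.
    static boolean passesEndpointCheck(Request request, String brokerPrincipal) {
        boolean fromBrokerPrincipal = request.principal.equals(brokerPrincipal);
        return !fromBrokerPrincipal || request.viaInternalEndpoint;
    }
}
```

For example, a request from `User:alice` that passes the local authorizer is re-sent as the broker principal with `InitialPrincipalName` set to `User:alice`, while a request that claims the broker principal on the public endpoint fails `passesEndpointCheck`.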

Public Interfaces

Deprecate Client Side Controller Access

Starting from the first release version of KIP-590, the following RPCs shall be forwarded to the controller from any broker:

AlterPartitionReassignments
CreatePartitions
CreateTopics
DeleteTopics
UpdateFeatures (ongoing with KIP-584)

And they would follow the same configuration request forwarding strategy discussed in the previous section.

The reason is that we shall remove "ControllerNodeProvider" from the admin client, so that clients no longer have direct access to the controller. The active controller is thereby properly isolated from the outside world, in line with KIP-631. To be stricter, the "ControllerId" field in MetadataResponse shall be set to -1 when the original request comes from a non-broker client. We shall use the request listener name to distinguish whether a given request is inter-broker or from a client.
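
The listener-based masking of "ControllerId" could look roughly like the sketch below. The class and method names are hypothetical, and listener names are compared as plain strings for simplicity.

```java
// Hypothetical sketch of hiding the controller id from non-broker clients; not Kafka code.
final class ControllerIdSketch {
    static final int NO_CONTROLLER_ID = -1;

    // Returns the real controller id only for requests on the inter-broker listener.
    static int controllerIdFor(String requestListener,
                               String interBrokerListener,
                               int actualControllerId) {
        return requestListener.equals(interBrokerListener)
                ? actualControllerId
                : NO_CONTROLLER_ID;
    }
}
```

With an inter-broker listener named `INTERNAL`, a request arriving on `EXTERNAL` would see -1, while a request arriving on `INTERNAL` would see the actual controller id.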

Protocol Bump

We also need to bump the Metadata RPC to v10 to propagate internal topic creation policy violation. Specifically:

1. For newer clients, return POLICY_VIOLATION when the topic creation policy is violated. At the application level, we should replace the error message with the actual failure reason, such as "violation of topic creation policy when attempting to auto create internal topic through MetadataRequest."

2. For older clients, return AUTHORIZATION_FAILED so that the client also fails quickly. This is not a perfect solution, as we don't have a notification path for older clients, but at least the system admin can check the broker log when hitting this issue.
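
The version-dependent choice of error can be sketched as below. The enum and method are hypothetical; the only facts taken from the text are the v10 threshold and the two error names.

```java
// Hypothetical sketch of choosing the Metadata error by request version; not Kafka code.
final class MetadataErrorSketch {
    enum ErrorCode { POLICY_VIOLATION, AUTHORIZATION_FAILED }

    // First Metadata version that can carry the policy violation, per this KIP.
    static final short FIRST_VERSION_WITH_POLICY_VIOLATION = 10;

    static ErrorCode policyViolationErrorFor(short metadataRequestVersion) {
        return metadataRequestVersion >= FIRST_VERSION_WITH_POLICY_VIOLATION
                ? ErrorCode.POLICY_VIOLATION
                : ErrorCode.AUTHORIZATION_FAILED;
    }
}
```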

Security Access Changes

Broker Authorization Override During Forwarding

To support the authorization of RPCs during redirection, we would let CLUSTER_ACTION override the following operations:

Operation	Resource	API
ALTER	Cluster	CreateAcls/DeleteAcls/AlterPartitionReassignments/UpdateFeatures
ALTER	Topic	CreatePartitions
ALTER_CONFIGS	Topic/Cluster	AlterConfig/IncrementalAlterConfig
CREATE	Topic	CreateTopics
token authentication	token	Create/Renew/ExpireDelegationToken
DELETE	Topic	DeleteTopics

This ensures that the forwarding broker can use its own principal to authenticate and proceed with these ZK mutation operations. To identify forwarded requests, the controller differentiates between requests arriving on the inter-broker listener and those arriving on the advertised listener. If a request comes from the inter-broker listener, we treat it as a forwarded request and apply the override authorization.

Some users may configure the same listener name for both client and inter-broker communication, which invalidates this differentiation. Even so, the override approach introduces no extra security breach, since CLUSTER_ACTION implies either a broker or a super user.
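
A minimal sketch of the override check follows, assuming the set of operations granted to the requesting principal is already known. The `Operation` enum and `authorize` method are illustrative names rather than the Kafka authorizer API, and the sketch also lets a forwarded request pass when the required operation was granted directly.

```java
import java.util.Set;

// Hypothetical sketch of the CLUSTER_ACTION override on the controller; not Kafka code.
final class OverrideAuthSketch {
    enum Operation { ALTER, ALTER_CONFIGS, CREATE, DELETE, CLUSTER_ACTION }

    static boolean authorize(Set<Operation> granted,
                             Operation required,
                             String requestListener,
                             String interBrokerListener) {
        if (granted.contains(required)) {
            return true; // regular authorization succeeded
        }
        // Only requests on the inter-broker listener are treated as forwarded,
        // and only for those may CLUSTER_ACTION stand in for the required operation.
        boolean forwarded = requestListener.equals(interBrokerListener);
        return forwarded && granted.contains(Operation.CLUSTER_ACTION);
    }
}
```

If both listeners share one name, every request counts as forwarded in this sketch, but the override still applies only to principals that hold CLUSTER_ACTION.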

If the authorization still fails on the controller side, it indicates an internal security setup error that should be addressed on the broker cluster, not the client. We shall propagate a new error code to the original client to tell users where to fix it:

Errors.java

BROKER_AUTHORIZATION_FAILURE(92, "Authorization failed for the request during forwarding, this indicates an internal error on the broker cluster security setup.", BrokerAuthorizationFailureException::new);

Unfortunately, older admin clients cannot interpret this code, so they will surface an UNKNOWN_SERVER_ERROR instead. This is less ideal but still good enough to prompt users to check the broker-side log for the authorization failure. We intentionally avoid returning an authorization failure to old clients so that users don't waste time debugging the client-side security setup.
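
The fallback for older clients can be illustrated with a toy error table. This is a sketch under the assumption that a client maps any code it does not recognize to UNKNOWN_SERVER_ERROR; the class and methods are hypothetical, and 92 is the code proposed above.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical model of how clients interpret the new error code; not Kafka code.
final class ErrorCodeCompatSketch {
    static final short BROKER_AUTHORIZATION_FAILURE = 92; // code proposed in this KIP

    // Builds a client-side error table; only upgraded clients know the new code.
    static Map<Short, String> errorTable(boolean knowsNewCode) {
        Map<Short, String> table = new HashMap<>();
        table.put((short) 0, "NONE");
        if (knowsNewCode) {
            table.put(BROKER_AUTHORIZATION_FAILURE, "BROKER_AUTHORIZATION_FAILURE");
        }
        return table;
    }

    // Codes missing from the table fall back to UNKNOWN_SERVER_ERROR.
    static String interpret(Map<Short, String> table, short code) {
        return table.getOrDefault(code, "UNKNOWN_SERVER_ERROR");
    }
}
```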

New Tag for Principal Name

We are also going to add a tagged field to the request header that carries the original request principal name, for controller audit logging.

RequestHeader.json

{
  "type": "header",
  "name": "RequestHeader",
  // Version 0 of the RequestHeader is only used by v0 of ControlledShutdownRequest.
  //
  // Version 1 is the first version with ClientId.
  //
  // Version 2 is the first flexible version.
  "validVersions": "0-2",
  "flexibleVersions": "2+",
  "fields": [
    { "name": "RequestApiKey", "type": "int16", "versions": "0+",
      "about": "The API key of this request." },
    { "name": "RequestApiVersion", "type": "int16", "versions": "0+",
      "about": "The API version of this request." },
    { "name": "CorrelationId", "type": "int32", "versions": "0+",
      "about": "The correlation ID of this request." },
    ...
    // ----- new optional field ----
    { "name": "InitialPrincipalName", "type": "string", "tag": 0, "taggedVersions": "2+", "ignorable": true,
      "about": "Optional value of the initial principal name when the request is redirected by a broker." },
    // ----- end new field ---------
  ]
}

Monitoring Metrics

To effectively monitor the admin request forwarding status, we would add the following metered metric:

MBean:kafka.server:type=RequestMetrics,name=NumRequestsForwardingToControllerPerSec,clientId=([-.\w]+)

to visualize how many RPCs are in flight from each admin client. It will be added via Yammer metrics.
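
As a rough illustration of the per-client bookkeeping behind such a metric, the sketch below keeps one counter per client id using only the JDK. It is not the Yammer-based implementation: a real meter would report a rate rather than a raw count, and the class and method names are hypothetical.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.LongAdder;

// Hypothetical per-client forwarding counter; a plain-JDK stand-in for a Yammer meter.
final class ForwardingMetricsSketch {
    private final ConcurrentHashMap<String, LongAdder> forwardedByClient = new ConcurrentHashMap<>();

    // Called each time a request from the given client is forwarded to the controller.
    void markForwarded(String clientId) {
        forwardedByClient.computeIfAbsent(clientId, id -> new LongAdder()).increment();
    }

    long forwardedCount(String clientId) {
        LongAdder counter = forwardedByClient.get(clientId);
        return counter == null ? 0L : counter.sum();
    }

    // Builds the MBean name in the shape given above for a concrete client id.
    static String mbeanName(String clientId) {
        return "kafka.server:type=RequestMetrics,"
                + "name=NumRequestsForwardingToControllerPerSec,clientId=" + clientId;
    }
}
```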

Compatibility, Deprecation, and Migration Plan

The upgrade path shall be guarded by inter.broker.protocol.version (IBP) to make sure the routing behavior is consistent. After the first rolling bounce, which upgrades the binary version, all brokers still handle ZK mutation requests by themselves. After the second rolling bounce, which bumps the IBP, all upgraded brokers use the new routing algorithm described in this KIP.
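
The two-phase upgrade can be sketched as a simple gate. The integer IBP levels and all names below are assumptions for illustration only; this text does not specify which IBP version enables forwarding.

```java
// Hypothetical sketch of gating the new routing on the IBP; not Kafka code.
final class IbpGateSketch {

    static final class BrokerState {
        final boolean binarySupportsForwarding; // true after the first rolling bounce
        final int ibpLevel;                     // raised by the second rolling bounce

        BrokerState(boolean binarySupportsForwarding, int ibpLevel) {
            this.binarySupportsForwarding = binarySupportsForwarding;
            this.ibpLevel = ibpLevel;
        }
    }

    // A broker forwards ZK mutation requests only once both conditions hold;
    // otherwise it keeps handling them by itself.
    static boolean forwardsToController(BrokerState broker, int forwardingIbpLevel) {
        return broker.binarySupportsForwarding && broker.ibpLevel >= forwardingIbpLevel;
    }
}
```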

Rejected Alternatives

We discussed the possibility of immediately building a metadata topic to propagate the changes. This is aligned with the eventual metadata quorum path, but it would block the current API migration towards the bridge release, since the metadata quorum design is much more complicated and requires more iterations. To avoid this extra dependency on other tracks, we should go ahead and migrate existing protocols to meet the bridge release goal sooner.
We considered adding an alerting metric called request-forwarding-to-controller-authorization-fail-count to help administrators detect a wrong security setup sooner. However, there should already be metrics monitoring request failures, so this metric is optional.
We considered monitoring older client connections in the long term after the bridge release, when we make incompatible changes to the Raft quorum, to better time a major version bump. However, KIP-511 already exposes metrics such as an "unknown" software name and an "unknown" software version, which can serve this purpose.
We discussed adding a new RPC type called Envelope to wrap the original request during forwarding. Although the Envelope API offers benefits such as data embedding and principal embedding, it creates a security hole by letting a malicious user impersonate any resending broker. Passing the principal around also increases the vulnerability compared with standard alternatives such as passing a verified token, which is unfortunately not fully supported by Kafka security. Given these security concerns, we are abandoning the Envelope approach and falling back to forwarding the raw admin requests.

Future Works

We have also discussed migrating the metadata read path to controller-only for read-after-write consistency. This sounds like a nice improvement, but it needs more discussion of the trade-off between overloading the controller and metadata consistency, and it also depends on the progress of the Raft quorum design.

New Secure Endpoint

To maintain the same level of security in the post-ZK world, broker-to-controller communication should have an extra security guarantee. To make that happen, we will introduce a separate `ControllerEndpoint` that users can configure so that forwarded requests go exclusively through this channel. A separate communication channel also helps differentiate whether a request comes from an admin client or was forwarded, which means the forwarding brokers don't have to bump the request version unnecessarily.

This part of the design depends on the Controller refactoring effort, and more details will be provided in subsequent KIPs. It does not block the acceptance of this KIP, since the forwarding behavior will be the same.

KIP-590: Redirect Zookeeper Mutation Protocols to The Controller
