
Status

Current state: Under Discussion

Discussion thread: here

JIRA: KAFKA-7352

Please keep the discussion on the mailing list rather than commenting on the wiki (wiki discussions get unwieldy fast).

Motivation

The adoption of KIP-255: OAuth Authentication via SASL/OAUTHBEARER in release 2.0.0 creates the possibility of using information in the bearer token to make authorization decisions.  Unfortunately, however, Kafka connections are long-lived, so there is no ability to change the bearer token associated with a particular connection.  Allowing SASL connections to periodically re-authenticate would resolve this.  In addition to this motivation there are two others, both security-related.

First, the only way to eliminate Kafka access for connected clients today is to remove all of their authorizations (i.e. remove all ACLs); this is necessary because of the long-lived nature of the connections.  It is operationally simpler to shut off access at the point of authentication, and with the release of KIP-86: Configurable SASL Callback Handlers it is increasingly likely that installations will authenticate users against external directories (e.g. via LDAP).  The ability to stop Kafka access by simply disabling an account in an LDAP directory (for example) is desirable.

Second, the use of short-lived tokens is a common OAuth security recommendation, but issuing a short-lived token to a Kafka client (or a broker when OAUTHBEARER is the inter-broker protocol) currently provides no benefit: once a client is connected to a broker it is never challenged again, and the connection may remain intact beyond the token expiration time (and may remain intact indefinitely under perfect circumstances).

This KIP proposes adding the ability for SASL clients (and brokers when a SASL mechanism is the inter-broker protocol) to re-authenticate their connections to brokers.  If OAUTHBEARER is the SASL mechanism then a new bearer token will appear on the session, replacing the old one.
This KIP also proposes to add the ability for brokers to close connections that continue to use expired sessions.

Public Interfaces

This KIP proposes the addition of two configuration options to enable the client-side re-authentication and server-side expired-connection-kill features (both option defaults result in no functionality change, of course, so there is no change to existing behavior in the absence of explicit opt-ins).  This KIP also proposes bumping the version number for the SASL_AUTHENTICATE API to 1 (with no change in wire format since the payload is a flexible byte buffer already) so that servers can indicate the session expiration time to clients via an additional round-trip/response.  Clients also use the max version number supported by the server to determine if they are connected to a broker that supports re-authentication (true if version > 0).  This KIP also adds new metrics as described below.

The configuration option this KIP proposes to add to enable client-side re-authentication is 'sasl.login.refresh.reauthenticate.enable' – it defaults to false, and when explicitly set to true in the context of a client (including a broker when it acts as an inter-broker client) the client-side re-authentication feature will be enabled for any SASL connection.  As mentioned above, the SASL_AUTHENTICATE API will have its version number bumped so that any client with the above configuration value set to true will not try to re-authenticate to a broker that has not been upgraded to the necessary version and therefore does not support responding to such requests.
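As a sketch, enabling the client-side feature is a one-line addition to the client's configuration.  The property key is as named in this KIP; the bootstrap address and security settings below are placeholders, not values from the KIP:

```java
import java.util.Properties;

public class ReauthClientConfig {
    public static Properties clientProps() {
        Properties props = new Properties();
        // Placeholder connection settings
        props.put("bootstrap.servers", "broker1:9093");
        props.put("security.protocol", "SASL_SSL");
        props.put("sasl.mechanism", "OAUTHBEARER");
        // Opt in to client-side re-authentication (defaults to false)
        props.put("sasl.login.refresh.reauthenticate.enable", "true");
        return props;
    }

    public static void main(String[] args) {
        System.out.println(clientProps().getProperty("sasl.login.refresh.reauthenticate.enable"));
    }
}
```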

The configuration option this KIP proposes to add to enable server-side expired-connection-kill is 'connections.max.reauth.ms' – it must be prefixed with the listener prefix and the SASL mechanism name in lower case. For example, "sasl_ssl.oauthbearer.connections.max.reauth.ms=3600000".  The value defaults to 0.  When explicitly set to a non-zero value the server will reject any authentication or re-authentication attempt from a client that presents a bearer token whose remaining lifetime exceeds the time at which the (re-)authentication occurs plus a number of milliseconds equal to the absolute value of the configured value (for example, the remaining token lifetime at the time of (re-)authentication must not exceed one hour if the configured value is either -3600000 or +3600000).  When explicitly set to a positive number, in addition to the lifetime check for SASL/OAUTHBEARER, the server will disconnect any SASL connection that does not re-authenticate and subsequently uses the connection for any purpose other than re-authentication at any point beyond the expiration point.  For example, if the configured value is 3600000 and the remaining token lifetime at the time of authentication is 45 minutes, the server would kill the connection if it is not re-authenticated within 45 minutes and it is then actively used for anything other than re-authentication.  As a further example, if the configured value is 3600000 and the mechanism is not OAUTHBEARER (e.g. it is PLAIN, SCRAM-related, or GSSAPI) then the server would kill the connection if it is not re-authenticated within 1 hour and it is then actively used for anything other than re-authentication.

The 'connections.max.reauth.ms' configuration option supports positive and negative values to facilitate migration; typically the value will first be set to a negative value and then it will be converted to its absolute value to fully enable the feature once metrics indicate all clients are upgraded and re-authenticating (see Migration Plan for details).
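The sign/absolute-value semantics above can be sketched as follows.  The helper names are hypothetical (this is not the actual broker code); only the arithmetic mirrors the KIP:

```java
public class ReauthPolicy {
    /** Maximum allowed session lifetime: the absolute value of connections.max.reauth.ms. */
    public static long maxSessionMs(long configuredValue) {
        return Math.abs(configuredValue);
    }

    /** OAUTHBEARER only: reject (re-)authentication if the token outlives the allowed session. */
    public static boolean rejectToken(long configuredValue, long tokenRemainingLifetimeMs) {
        return configuredValue != 0 && tokenRemainingLifetimeMs > maxSessionMs(configuredValue);
    }

    /** Expired connections are killed only when the configured value is positive. */
    public static boolean killEnabled(long configuredValue) {
        return configuredValue > 0;
    }

    public static void main(String[] args) {
        // -3600000 and +3600000 impose the same one-hour cap on token lifetime...
        System.out.println(rejectToken(-3600000L, 90 * 60 * 1000L)); // 90 min > 1 hour: rejected
        // ...but a negative value (the migration phase) never kills connections.
        System.out.println(killEnabled(-3600000L));
    }
}
```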

Neither of the above configuration options will be dynamically changeable; restarts will be required if either value is to be changed.

From a behavior perspective on the client side (again, including the broker when it is acting as an inter-broker client), when the client-side re-authentication option is enabled and a SASL client connects to a broker that supports re-authentication, the broker will communicate the session expiration time as part of the authentication "dance".  The client will then automatically re-authenticate on or after that point before sending anything else unrelated to re-authentication.  If the re-authentication attempt fails then the connection will be closed by the broker; retries are not supported.  If re-authentication succeeds then any requests that queued up during re-authentication will subsequently be able to flow through, and eventually the connection will re-authenticate again, etc.

From a behavior perspective on the server (broker) side, when the broker-side expired-connection-kill feature is fully enabled with a positive value the broker will close a connection authenticated via the indicated SASL mechanism when the connection is used past the expiration time and the specific API request is not directly related to re-authentication (ApiVersionsRequest, SaslHandshakeRequest, and SaslAuthenticateRequest).  In other words, if a connection sits idle, it will not be closed – something unrelated to re-authentication must traverse the connection before a disconnect will occur.
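A sketch of that broker-side decision follows.  The three permitted request names come from this KIP; the method and class names are illustrative rather than actual broker internals:

```java
import java.util.Set;

public class ExpiredConnectionCheck {
    // The three request types that remain permitted on an expired session
    private static final Set<String> REAUTH_APIS =
            Set.of("ApiVersionsRequest", "SaslHandshakeRequest", "SaslAuthenticateRequest");

    /** True if the broker should close the connection upon receiving this request. */
    public static boolean shouldKill(String apiName, long nowMs, long sessionExpirationMs) {
        // An idle expired connection is left alone; only a request unrelated to
        // re-authentication arriving past the expiration point triggers a disconnect.
        return nowMs > sessionExpirationMs && !REAUTH_APIS.contains(apiName);
    }

    public static void main(String[] args) {
        System.out.println(shouldKill("ProduceRequest", 2000L, 1000L));          // disconnect
        System.out.println(shouldKill("SaslAuthenticateRequest", 2000L, 1000L)); // allowed
    }
}
```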

Metrics documenting re-authentications will be maintained.  They will mirror existing metrics that document authentications.  For example: failed-reauthentication-{rate,total} and successful-reauthentication-{rate,total}.

A broker metric will be created that documents the number of API requests unrelated to re-authentication that are made over a connection that is considered expired.  This metric may be non-zero only when the configuration value is negative.  It helps operators ensure that all clients are properly upgraded and re-authenticating before fully turning on server-side expired-connection-kill functionality (by changing the negative configuration value to its absolute value): the metric will be unchanging across all brokers when it is safe to fully enable the feature with the absolute value.

A broker metric will be created that documents the number of connections killed by the server-side expired-connection-kill functionality.  This metric may be non-zero only when the configuration value is positive, and it indicates that a client is connecting to the broker with re-authentication either unavailable (i.e. an older client) or disabled.

A client metric will be created that documents the latency imposed by re-authentication.  It is unclear if this latency will be problematic, and the data collected via this metric may be useful as we consider this issue in the future.

Proposed Changes

Implementation Overview

The description of this KIP is actually quite straightforward from a behavior perspective – turn the feature on with the configuration options in both the client and the broker and it just works.  From an implementation perspective, though, the KIP is not so straightforward; a description of how it works therefore follows below.  Note that this description applies to the implementation only – none of this is part of the public API.

This implementation works at a very low level in the Kafka stack, at the level of the network code.  It is therefore transparent to all clients – it just works with no knowledge or accommodation required on their part.  When a client makes a request to a broker the request is intercepted at the level of the Selector class and a check is done to see if re-authentication is enabled; if it is, and the broker supports re-authentication, then the connection is re-authenticated at that point before the request is allowed to flow through.  The solution is elegant because it re-uses existing code paths while requiring no code changes higher up in the stack.

This KIP transparently adds re-authentication support for all uses, which at this point includes the following:

  • org.apache.kafka.clients.consumer.internals.ConsumerNetworkClient
    • org.apache.kafka.clients.consumer.KafkaConsumer
    • org.apache.kafka.connect.runtime.distributed.WorkerGroupMember
  • kafka.controller.ControllerChannelManager
  • org.apache.kafka.clients.admin.KafkaAdminClient
  • org.apache.kafka.clients.producer.KafkaProducer
  • kafka.coordinator.transaction.TransactionMarkerChannelManager
  • kafka.server.ReplicaFetcherBlockingSend (kafka.server.ReplicaFetcherThread)
  • kafka.admin.AdminClient
  • kafka.tools.ReplicaVerificationTool
  • kafka.server.KafkaServer

  • org.apache.kafka.trogdor.workload.ConnectionStressWorker

The final issue to describe is how/when a KafkaChannel instance (each of which corresponds to a unique network connection) is told to re-authenticate.  Each KafkaChannel instance will remember the session expiration time communicated during (re-)authentication (if any); the code in the Selector class will check to see if that time has passed and will start the re-authentication process if so.
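The per-channel check described above can be sketched as follows.  The class and field names are illustrative stand-ins, not the actual KafkaChannel/Selector internals:

```java
public class ChannelReauthCheck {
    /** Illustrative stand-in for the state a KafkaChannel would remember. */
    static final class Channel {
        final Long sessionExpirationMs; // null if the broker never communicated one
        Channel(Long sessionExpirationMs) { this.sessionExpirationMs = sessionExpirationMs; }
    }

    /** The Selector-level check: start re-authenticating once the expiration time passes. */
    public static boolean needsReauthentication(Channel channel, long nowMs) {
        return channel.sessionExpirationMs != null && nowMs >= channel.sessionExpirationMs;
    }

    public static void main(String[] args) {
        System.out.println(needsReauthentication(new Channel(1000L), 1500L)); // time to re-authenticate
        System.out.println(needsReauthentication(new Channel(null), 1500L));  // no expiration communicated
    }
}
```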

Compatibility, Deprecation, and Migration Plan

With respect to compatibility, there is no impact to existing installations because the default is for the feature to be completely turned off on both the client and server.

With respect to migration, the approach would be as follows:

  1. Upgrade all brokers to v2.1.0 or later
  2. After (1) is complete, turn on re-authentication for brokers (as inter-broker clients, via 'sasl.login.refresh.reauthenticate.enable') at whatever rate is desired -- just eventually, at some point, get the client-side feature turned on for all brokers so that inter-broker connections are re-authenticating. (Skip this step and consider it complete if SASL/OAUTHBEARER is not used for inter-broker communication.)
  3. After (2) is complete, partially enable the server-side kill functionality with a negative value for '[listener].[mechanism].connections.max.reauth.ms' on all brokers.  The metric documenting the number of API requests made over expired connections will begin to increase until the next step (4) is completed.  No connections will be killed.
  4. In parallel with (1), (2), and (3) above, upgrade non-broker clients to v2.1.0 or later and turn their re-authentication feature on.  SASL clients will check the SASL_AUTHENTICATE API version and only re-authenticate to a broker that has been upgraded to 2.1.0 or later (note that the ability of a broker to respond to a re-authentication cannot be turned off -- it is always on beginning with version 2.1.0, and it just sits there doing nothing if it isn't exercised by an enabled client).
  5. After (3) and (4) are complete, check the broker metric documenting the number of API requests made over expired connections to confirm that it is no longer increasing.  Once you are satisfied that (1), (2), (3), and (4) are indeed complete you can fully enable the server-side expired-connection-kill feature on each broker by changing the '[listener].[mechanism].connections.max.reauth.ms' value from its negative value to its absolute value and restarting the broker.
  6. Monitor the metric that documents the number of killed connections – it will remain at 0 unless an older client or one that does not have re-authentication enabled connects to the broker via the SASL mechanism.
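The two server-side phases of the plan (steps 3 and 5) differ only in the sign of the configured value.  A sketch, with the listener/mechanism prefix taken from the earlier example (your prefix will differ):

```java
import java.util.Properties;

public class MigrationPhases {
    // Step 3: negative value -- lifetime checks and metrics only, no connections killed
    public static Properties monitoringPhase() {
        Properties p = new Properties();
        p.put("sasl_ssl.oauthbearer.connections.max.reauth.ms", "-3600000");
        return p;
    }

    // Step 5: flip to the absolute value to fully enable expired-connection-kill
    public static Properties enforcingPhase() {
        Properties p = new Properties();
        p.put("sasl_ssl.oauthbearer.connections.max.reauth.ms", "3600000");
        return p;
    }

    public static void main(String[] args) {
        System.out.println(monitoringPhase().getProperty("sasl_ssl.oauthbearer.connections.max.reauth.ms"));
        System.out.println(enforcingPhase().getProperty("sasl_ssl.oauthbearer.connections.max.reauth.ms"));
    }
}
```

Remember that neither option is dynamically changeable, so each flip requires a broker restart.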

Rejected Alternatives

Delaying Support for Brokers Killing Connections

It was initially proposed that we defer adding the ability for brokers to kill connections using expired credentials to a future KIP.  This functionality is actually easier to add than re-authentication, and re-authentication without this feature doesn't really improve security (because it can't be enforced).  Adding the ability to kill connections using an expired bearer token without the ability for the client to re-authenticate also does not make sense as a general feature – it forces the client to essentially "recover" from what looks like a network error on a periodic basis.  So we will implement both features at the same time.

Delaying Support for non-OAUTHBEARER SASL Mechanisms

It was initially proposed that we defer adding the ability for SASL clients to re-authenticate when using a non-OAUTHBEARER mechanism (e.g. PLAIN, GSSAPI, and SCRAM-related).  We were able to identify how all mechanisms could be readily and easily supported.

Highest-Level Approach: Inserting Requests into Clients' Queues

The original implementation injected requests directly into both asynchronous and synchronous I/O clients' queues.  This implementation was too high up in the stack and required significantly more work and code than any other solution.  It also was much harder to maintain because it became entwined in the implementation of every client.

High-Level Approach: One-Size-Fits-All With Additional KafkaClient Methods

One implementation injected requests into the KafkaClient via a new method (shown below).  This one-size-fits-all approach worked as long as synchronous I/O clients periodically checked for and sent injected requests related to re-authentication (via a new method, also shown below).  This implementation was harder to maintain than the chosen approach because it crossed existing module boundaries related to security code, networking code, and code higher up in the stack – it imposed requirements (however slight) on code higher up in the stack in order for re-authentication to work.  This violation of existing modularity caused concern.  The code was also 2-3 times bigger (at least) relative to the accepted implementation, and it was incrementally (though not dramatically) more difficult to test.  It did create the possibility of interleaving requests related to re-authentication with requests that the client was otherwise sending, which minimized latency spikes due to re-authentication, but that advantage was difficult to quantify and therefore did not tip the balance in favor of this option.

org.apache.kafka.clients.KafkaClient additions
    /**
     * Return true if any node has a re-authentication request either enqueued and
     * waiting to be sent or already in-flight. A call to {@link #poll(long, long)}
     * is required to send and receive/process the results of such requests. <b>An
     * owner of this instance that does not implement a run loop to repeatedly call
     * {@link #poll(long, long)} but instead only sends requests synchronously
     * on-demand to a single node must call this method periodically -- and invoke
     * {@link #poll(long, long)} if the return value is {@code true} -- to ensure
     * that any re-authentication requests that have been injected are sent and
     * processed in a timely fashion.</b>
     * <p>
     * Example code to re-authenticate a connection across several
     * requests/responses is as follows:
     * 
     * <pre>
     * // Send multiple requests related to re-authentication in the synchronous
     * // use case, completing the re-authentication exchange.
     * while (kafkaClient.hasReauthenticationRequest())
     *     // Returns an empty list in synchronous use case.
     *     kafkaClient.poll(Long.MAX_VALUE, time.milliseconds());
     * // The connection is ready for use, and if there originally was a
     * // re-authentication request then as many requests as required to
     * // complete the exchange have been sent.
     * </pre>
     * 
     * Alternatively, to only send one re-authentication request and receive its
     * response (which allows us to interleave other requests to the single node to
     * which we are connected before subsequent requests related to the multi-step
     * re-authentication exchange are sent):
     * 
     * <pre>
     * // Send a single request related to re-authentication in the synchronous
     * // use case, potentially (but not necessarily) completing the
     * // re-authentication exchange.
     * while (kafkaClient.hasReauthenticationRequest()) {
     *     // Returns an empty list in synchronous use case.
     *     kafkaClient.poll(Long.MAX_VALUE, time.milliseconds());
     *     if (!kafkaClient.hasInFlightRequests())
     *         break; // Response has been received.
     * }
     * // The connection is ready for use, and if there was a
     * // re-authentication request then either the exchange is finished or
     * // there is another re-authentication request available to be sent.
     * </pre>
     * 
     * @return if any node has a re-authentication request either enqueued and
     *         waiting to be sent or already in-flight
     * @see #enqueueAuthenticationRequest(ClientRequest)
     */
    default boolean hasReauthenticationRequest() {
        return false;
    }


    /**
     * Enqueue the given request related to re-authenticating a connection. This
     * method is guaranteed to be thread-safe even if the class implementing this
     * interface is generally not.
     * 
     * @param clientRequest
     *            the request to enqueue
     * @see #hasReauthenticationRequest()
     */
    default void enqueueAuthenticationRequest(ClientRequest clientRequest) {
        // empty
    }

Adding an ExpiringCredential Public API

It was initially proposed that we make an existing, non-public ExpiringCredential interface part of the public API and leverage the background login refresh thread's refresh event to kick-start re-authentication on the client side for the refreshed credential.  This is unnecessary due to the combination of a few factors.  First, the server (broker) indicates to the client what the expiration time is, and the low-level mechanism we have chosen on the client side can insert itself into the flow at the correct time – it does not need an external mechanism; second, the server will choose the token expiration time as the session expiration time, which means the refresh thread on the client side will have already refreshed the token (or, if it hasn't, the client can't make new connections anyway); third, the server will reject tokens whose remaining lifetime exceeds the maximum allowable session time.

Authenticating a Separate Connection and Transferring Credentials

One alternative idea is to add two new request types: "ReceiveReauthenticationNonceRequest" and "ReauthenticateWithNonceRequest".  When re-authentication needs to occur the client would make a separate, new connection to the broker and send a "ReceiveReauthenticationNonceRequest" to the broker to have it associate a nonce with the authenticated credentials and return the nonce to the client.  Then the client would send a "ReauthenticateWithNonceRequest" with the returned nonce over the connection that it wishes to re-authenticate; the broker would then replace the credentials on that connection with the credentials it had previously associated with the nonce.  I don't know if this would work (might there be some issue with advertised vs. actual addresses and maybe the possibility of there being a load balancer?  Could we be guaranteed the ability to connect to the exact same broker as our existing connection?).  If it could work then it does have the advantage of requiring the injection of just a single request over an existing connection that would return very quickly rather than 3 separate requests of which at least one might take a while to return (to potentially retrieve a public key for token signature validation, for example; the validation itself isn't exactly free, either, even if the public key is already cached).  One disadvantage of the alternative, nonce-based approach is that it requires the creation of a separate connection, including TLS negotiation, and that is very expensive compared to sending 3 requests over an existing connection (which of course already has TLS negotiated).

Brute-Force Client-Side Kill

A brute-force alternative is to simply kill the connection on the client side when the background login thread refreshes the credential.  The advantage is that we don't need a code path for re-authentication – the client simply connects again to replace the connection that was killed.  There are many disadvantages, though.  The approach is harsh – having connections pulled out from underneath the client will introduce latency while the client reconnects; it introduces non-trivial resource utilization on both the client and server as TLS is renegotiated; and it forces the client to periodically "recover" from what essentially looks like a failure scenario.  While these are significant disadvantages, the most significant disadvantage of all is that killing connections on the client side adds no security – trusting the client to kill its connection in a timely fashion is a blind and unjustifiable trust.

Brute-Force Server-Side Kill

We could kill the connection from the server side instead, when the token expires.  But in this case, if there is no ability for the client to re-authenticate to avoid the killing of the connection in the first place, then we still have all of the harsh approach disadvantages mentioned above.

