Versions Compared

Key

  • This line was added.
  • This line was removed.
  • Formatting was changed.
Comment: Adopted in 2.2

Table of Contents

Status

Current stateUnder DiscussionAdopted (in 2.2)

Discussion thread: here

JIRA: KAFKA-7352

...

The configuration option this KIP proposes to add to enable server-side expired-connection-kill is 'connections.max.reauth.ms' (not required to be prefixed with listener prefix or SASL mechanism name – such a value would be used across the cluster – but it may be as mentioned below). For example, "connections.max.reauth.ms=3600000".  The value represents the maximum value that could potentially be communicated as part of the new V1 SaslAuthenticateResponse.  The default value is 0, which means there is effectively no maximum communicated (0 will be sent, meaning "none"), server-side kill of expired connections is disabled, clients are not required to re-authenticate, and whether clients re-authenticate or not and at what interval is entirely up to them.  Existing SASL clients upgraded to v2.12.0 will be coded to not re-authenticate in this scenario.  The default value of 0 therefore results in no change whatsoever.

...

From a behavior perspective on the client side (including the broker when it is acting as an inter-broker client), when a v2.12.0-or-later SASL client connects to a v2.12.0 or later broker that supports re-authentication, the broker will communicate the session expiration time as part of the final SASL_AUTHENTICATE response.  If this value is positive, then the client will automatically re-authenticate before anything else unrelated to re-authentication is sent beyond that expiration point.  If the re-authentication attempt fails then the connection will be closed by the broker; retries are not supported.  If re-authentication succeeds then any requests received responses that queued up during re-authentication along with the Send that triggered the re-authentication to occur in the first place will subsequently be able to flow through (back to the client and along to the broker, respectively), and eventually the connection will re-authenticate again, etc.  Note also that the client cannot queue up additional send requests beyond the one that triggers re-authentication to occur until re-authentication succeeds and the triggering one is sent.

From a behavior perspective on the server (broker) side, when the From a behavior perspective on the server (broker) side, when the expired-connection-kill feature is enabled with a positive value the broker will communicate a session time via SASL_AUTHENTICATE and will close a connection when the connection is used past the expiration time and the specific API request is not directly related to re-authentication (SaslHandshakeRequest and SaslAuthenticateRequest).  In other words, if a connection sits idle, it will not be closed – something unrelated to re-authentication must traverse the connection before a disconnect will occur.

Metrics documenting re-authentications will be maintained in the Selector codeThey SOme will mirror existing metrics that document authentications.  For example: failed-reauthentication-{rate,total} and successful-reauthentication-{rate,total}.  There will also be a separate successful-authentication-total metric tagged with "auth=v0" to indicate it tracks no-reauth-{rate,total} metrics to indicate the subset of clients that successfully authenticate with a V0 SaslAuthenticateRequest.   This metric is helpful during migration (see Migration Plan(or no such request, which can happen with very old clients)   These metrics are helpful during migration (see Migration Plan for details) as it they will identify if/when all clients are properly upgraded before server-side expired-connection-kill functionality is enabled: the rate metric will be zero across all brokers when it is appropriate to enable the feature, and the total metric will be unchanging (zero after a restart).  There will also be reauthentication-latency-{avg,max} metrics that document the latency imposed by re-authentication.  It is unclear if this latency will be problematic, and the data collected via these metrics may be useful as we consider this issue in the future.

An additional server-side metric expired-connections-killed-totalmetric ExpiredConnectionsKilledCount will be created and maintained by the server-side Processor code to count the number of such events.  It should remain at zero if all clients and brokers are upgraded to v2.12.0 or later.  If it is non-zero then either an older client is connecting (which would be evidenced in the tagged auth=v0 metric successful-authentication-no-reauth metrics mentioned above) or a newer client is not re-authenticating correctly (which would indicate a bug).

A client-side metric will be created that documents the latency imposed by re-authentication.  It is unclear if this latency will be problematic, and the data collected via this metric may be useful as we consider this issue in the future.

Proposed Changes

Implementation Overview

The description of this KIP is actually quite straightforward from a behavior perspective – turn the feature on with the configuration option in the broker and it just works.  From an implementation perspective, though, the KIP is not so straightforward; a description of how it works therefore follows below.  Note that this description applies to the implementation only – none of this is part of the public API.

...

  1. Upgrade all brokers to v2.12.0 or later at whatever rate is desired with 'connections.max.reauth.ms' allowed to default to 0.  If SASL is used for the inter-broker protocol then brokers will check the SASL_AUTHENTICATE API version and use a V1 request when communicating to a broker that has been upgraded to 2.12.0, but the client will see the "0" session max lifetime and will not re-authenticate.  Their connections will not be killed.
  2. In parallel with (1) above, upgrade non-broker clients to v2.12.0 or later at whatever rate is desired.  SASL clients will check the SASL_AUTHENTICATE API version and use a V1 request when communicating to a broker that has been upgraded to 2.12.0, but the client will see the "0" session max lifetime and will not re-authenticate.  Their connections will not be killed.
  3. After (1) and (2) are complete, perform a rolling restart of all brokers and check the broker v0-tagged metrics successful-authentication-no-reauth-{rate,total} to confirm that they remain at zero.  This gives confidence that (1) and (2) are indeed complete.
  4. Update 'connections.max.reauth.ms' to a positive value and perform a rolling restart of brokers again. 
  5. Monitor the v0-tagged metrics successful-authentication-no-total – reauth-{rate,total} metrics – they will remain at 0 unless an older client connects to the broker.

...

It was initially proposed that we would have the broker reject (re-)authentications that occurred with a bearer token credential having a lifetime longer than the maximum allowed.  This was decided to be unnecessary here because the same thing can be done as part of token validation.

...

It was initially proposed that we make an existing, non-public ExpiringCredential interface part of the public API and leverage the background login refresh thread's refresh event to kick-start re-authentication on the client side for the refreshed credential.  This is unnecessary due to the combination a couple of a few factors.  First, the server (broker) indicates to the client what the expiration time is, and the low-level mechanism we have chosen on the client side can insert itself into the flow at the correct time – it does not ned an need an external mechanism; and second, the server will chose the token expiration time as the session expiration time if does not exceed the maximum allowable value, which means the refresh thread on the client side will have already refreshed the token (or, if it hasn't, the client can't make new connections anyway); third, .  We had at one time considered that the server will reject rejecting tokens whose remaining lifetime exceeds the maximum allowable session time was a third factor, but that functionality was rejected because it can be done as part of token validation as mentioned above.

Authenticating a Separate Connection and Transferring Credentials

...