...

Additionally, although not part of the public API, the POST /connectors/<name>/tasks endpoint will be effectively disabled for public use. This endpoint should never be called by users, but since until now there has been nothing to prevent them from doing so, it should still be noted that anything that relies on that endpoint will no longer work after these changes are made. The expected impact of this is low, however; the Connect framework (and the connectors it runs) handle the generation and storage of task configurations, and there's no discernible reason for using that endpoint directly instead of going through the public Connect REST API.

Proposed Changes

Additional Connect subprotocol

A new Connect subprotocol, sessioned, will be implemented that will be identical to the cooperative incremental protocol but with a higher protocol version number (2, instead of the current version for cooperative incremental rebalancing, which is 1). One downside of this approach is that the use of cooperative incremental assignments will be required in order to enable this new security behavior; however, given the lack of any serious complaints about the new rebalancing protocol thus far, this seems preferable to trying to enable this behavior across both assignment styles. If the connect.protocol property is set to sessioned, the worker will advertise this new sessioned protocol to the Kafka group coordinator as a supported (and, currently, most preferable) protocol.

If the sessioned protocol is then agreed on by the cluster during group coordination, a session key will be randomly generated by the leader and distributed to the cluster via the config topic. This key will be used by followers to sign requests to the internal endpoint, and by the leader to verify that each request came from a current group member. It is imperative that inter-worker communication have some kind of transport layer security; otherwise, this session key will be leaked during rebalance to anyone who can eavesdrop on request traffic.
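The sign-and-verify flow described above can be sketched roughly as follows. This is only an illustration, not the Connect implementation: the class and method names are hypothetical, and the choice of HmacSHA256 with a base-64-encoded signature is an assumption made for the example.

```java
import javax.crypto.KeyGenerator;
import javax.crypto.Mac;
import javax.crypto.SecretKey;
import java.nio.charset.StandardCharsets;
import java.security.GeneralSecurityException;
import java.security.MessageDigest;
import java.util.Base64;

// Hypothetical sketch of session key generation, request signing, and verification.
final class SessionKeySketch {
    // Assumption: HMAC-SHA256; the algorithm is not fixed by the text above.
    static final String ALGORITHM = "HmacSHA256";

    // Leader side: randomly generate a new session key.
    static SecretKey generateSessionKey() {
        try {
            return KeyGenerator.getInstance(ALGORITHM).generateKey();
        } catch (GeneralSecurityException e) {
            throw new IllegalStateException("Algorithm unavailable: " + ALGORITHM, e);
        }
    }

    // Follower side: compute a base-64 signature over the request body.
    static String sign(SecretKey key, byte[] requestBody) {
        return Base64.getEncoder().encodeToString(rawMac(key, requestBody));
    }

    // Leader side: recompute the MAC and compare in constant time.
    static boolean verify(SecretKey key, byte[] requestBody, String signature) {
        byte[] claimed;
        try {
            claimed = Base64.getDecoder().decode(signature);
        } catch (IllegalArgumentException e) {
            return false; // not valid base-64
        }
        return MessageDigest.isEqual(rawMac(key, requestBody), claimed);
    }

    private static byte[] rawMac(SecretKey key, byte[] data) {
        try {
            Mac mac = Mac.getInstance(ALGORITHM);
            mac.init(key);
            return mac.doFinal(data);
        } catch (GeneralSecurityException e) {
            throw new IllegalStateException("Unable to compute MAC", e);
        }
    }

    public static void main(String[] args) {
        SecretKey key = generateSessionKey();
        byte[] body = "[{\"task\":\"config\"}]".getBytes(StandardCharsets.UTF_8);
        String signature = sign(key, body);
        System.out.println("valid with current key: " + verify(key, body, signature));
        System.out.println("valid with rotated key: " + verify(generateSessionKey(), body, signature));
    }
}
```

A signature produced with one key fails verification against a freshly generated key, which is the property that lets the leader reject requests from anyone who has not read the current session key.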

Key rotation, request signing, request verification

Periodically (with frequency dictated by the internal.request.key.rotation.interval.ms property), the leader will compute a new session key and distribute it to the cluster.

...

Valid requests will be met with an HTTP 200 response. Invalid requests will be met with either HTTP 400 (bad request), if they lack the required signature and signature algorithm headers, specify an invalid (non-base-64-decodable) signature, or specify a signature algorithm that isn't permitted by the leader; or HTTP 403 (forbidden), if they contain well-formed values for the signature and signature algorithm headers but fail request verification.
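The response rules above amount to a short decision sequence, sketched below. The class name, method signature, and the use of a BiPredicate to stand in for the actual cryptographic check are all hypothetical; only the mapping from conditions to status codes comes from the text.

```java
import java.util.Base64;
import java.util.Set;
import java.util.function.BiPredicate;

// Hypothetical sketch of how the leader might map a signed request to an HTTP status.
final class SignatureResponseSketch {

    // signature and algorithm are the raw header values (null if the header is absent).
    // verifier stands in for the real check of the decoded signature against the session key.
    static int statusFor(String signature,
                         String algorithm,
                         Set<String> permittedAlgorithms,
                         BiPredicate<byte[], String> verifier) {
        // Missing headers: bad request.
        if (signature == null || algorithm == null) {
            return 400;
        }
        // Signature that cannot be base-64 decoded: bad request.
        byte[] decoded;
        try {
            decoded = Base64.getDecoder().decode(signature);
        } catch (IllegalArgumentException e) {
            return 400;
        }
        // Algorithm not permitted by the leader: bad request.
        if (!permittedAlgorithms.contains(algorithm)) {
            return 400;
        }
        // Well-formed headers that fail verification: forbidden.
        if (!verifier.test(decoded, algorithm)) {
            return 403;
        }
        return 200;
    }

    public static void main(String[] args) {
        Set<String> permitted = Set.of("HmacSHA256"); // assumed algorithm name
        System.out.println(statusFor(null, "HmacSHA256", permitted, (s, a) -> true));
        System.out.println(statusFor("AAAA", "HmacSHA256", permitted, (s, a) -> false));
        System.out.println(statusFor("AAAA", "HmacSHA256", permitted, (s, a) -> true));
    }
}
```

The ordering matters in this sketch: the malformed-request checks all run before verification, so a 403 is only ever returned for a request whose headers were well-formed.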

Requests with expired keys

The leader will only accept requests signed with the most current key. This should not cause any major problems, since workers already engage in an infinite retry loop when requests to forward tasks to the leader fail, with a short (250 millisecond) backoff period between retries. If a follower attempts to make a request with an expired key (which should be quite rare and only occur if the request is made by a follower that is not fully caught up to the end of the config topic), the initial request will fail, but will be subsequently retried after a backoff period. This backoff period should leave sufficient room for the follower to read the new session key from the config topic. If the first 240 requests fail with HTTP 403, it will be assumed that this is due to an out-of-date session key; a debug-level message about the subsequent retry will be logged in place of the current error-level log message of "Failed to reconfigure connector's tasks, retrying after backoff: " followed by a stack trace. Since the backoff period is 250 milliseconds, this should give at least one minute of leeway (240 × 250 ms = 60 seconds) for an outdated key to be updated. If longer than that is required, the usual error-level log messages will begin to be generated by the worker.
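The choice of log level during the grace period can be sketched as a small counter. The class and method names are hypothetical, and treating any non-403 failure as ending the grace period is one interpretation of "the first 240 requests fail with HTTP 403", not something the text spells out.

```java
// Hypothetical sketch of choosing a log level for failed task-forwarding requests.
final class RetryLogLevelSketch {
    static final long BACKOFF_MS = 250;     // backoff between retries, from the text
    static final int GRACE_ATTEMPTS = 240;  // 403 failures tolerated at debug level

    private int failures = 0;
    private boolean onlyForbiddenSoFar = true;

    // Total leeway before error-level logging starts: 240 * 250 ms.
    static long graceMillis() {
        return GRACE_ATTEMPTS * BACKOFF_MS;
    }

    // Records one failed request and returns the level to log the retry at.
    String recordFailure(int httpStatus) {
        failures++;
        if (httpStatus != 403) {
            // Assumption: a non-403 failure is not explained by a stale key.
            onlyForbiddenSoFar = false;
        }
        return (onlyForbiddenSoFar && failures <= GRACE_ATTEMPTS) ? "DEBUG" : "ERROR";
    }

    public static void main(String[] args) {
        RetryLogLevelSketch tracker = new RetryLogLevelSketch();
        String level = "";
        for (int i = 0; i < GRACE_ATTEMPTS; i++) {
            level = tracker.recordFailure(403);
        }
        System.out.println("grace period (ms): " + graceMillis());
        System.out.println("level at attempt 240: " + level);
        System.out.println("level at attempt 241: " + tracker.recordFailure(403));
    }
}
```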

New JMX worker metric

Finally, a new worker JMX metric will be exposed that can be used to determine whether the new behavior proposed by this KIP is enabled:

...