...

The leader will only accept requests signed with the most current key. This should not cause any major problems; if a follower attempts to make a request with an expired key (which should be quite rare, and only occur if the request is made by a follower that is not fully caught up to the end of the config topic), the initial request will fail, but will subsequently be retried after a backoff period. This backoff period should leave sufficient room for the rebalance to complete. If the first 240 requests fail with HTTP 403, it will be assumed that this is due to an out-of-date session key; a debug-level message about the subsequent retry will be logged in place of the current error-level log message of "Failed to reconfigure connector's tasks, retrying after backoff: " followed by a stack trace. Since the backoff period is 250 milliseconds, this should give at least one minute of leeway for an outdated key to be updated. If longer than that is required, the usual error-level log messages will be generated by the worker.
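
For illustration, a sketch of this retry heuristic is shown below. The class, method, and field names are purely illustrative and are not taken from the actual herder implementation; only the 250 millisecond backoff, the 240-request threshold, and the log levels come from the behavior described above.

    import org.slf4j.Logger;
    import org.slf4j.LoggerFactory;

    // Illustrative sketch of the forwarding retry heuristic; names are hypothetical.
    public class TaskConfigForwarder {
        private static final Logger log = LoggerFactory.getLogger(TaskConfigForwarder.class);

        private static final long FORWARDING_BACKOFF_MS = 250;
        // 240 quiet retries * 250 ms backoff gives roughly one minute of leeway
        // for an outdated session key to be refreshed from the config topic.
        private static final int MAX_QUIET_FORBIDDEN_RETRIES = 240;

        private int consecutiveForbiddenResponses = 0;

        // Called after each failed attempt to forward task configurations to the leader.
        void onForwardingFailure(int httpStatus, Throwable cause) throws InterruptedException {
            if (httpStatus == 403 && ++consecutiveForbiddenResponses <= MAX_QUIET_FORBIDDEN_RETRIES) {
                // Probably just a request signed with an outdated session key; log quietly.
                log.debug("Request rejected with HTTP 403; retrying after backoff");
            } else {
                log.error("Failed to reconfigure connector's tasks, retrying after backoff: ", cause);
            }
            Thread.sleep(FORWARDING_BACKOFF_MS);
        }

        // Called once a forwarded request succeeds.
        void onForwardingSuccess() {
            consecutiveForbiddenResponses = 0;
        }
    }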

Finally, a new worker JMX metric will be exposed that can be used to determine whether the new behavior proposed by this KIP is enabled:

  • MBean: kafka.connect:type=connect-worker-metrics
  • Metric name: connect-protocol
  • Description: The Connect protocol used by this cluster
  • Value: The Connect subprotocol in use, based on the latest join group response for this worker joining the Connect cluster.
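
For operators who want to check this metric programmatically rather than through a monitoring dashboard, a minimal JMX client along the following lines could be used. The JMX port and the exact attribute name are assumptions here and should be verified against the worker's actual MBean registration.

    import javax.management.MBeanServerConnection;
    import javax.management.ObjectName;
    import javax.management.remote.JMXConnector;
    import javax.management.remote.JMXConnectorFactory;
    import javax.management.remote.JMXServiceURL;

    public class ConnectProtocolCheck {
        public static void main(String[] args) throws Exception {
            // Assumes the worker was started with remote JMX enabled on port 9999.
            JMXServiceURL url =
                new JMXServiceURL("service:jmx:rmi:///jndi/rmi://localhost:9999/jmxrmi");
            JMXConnector connector = JMXConnectorFactory.connect(url);
            try {
                MBeanServerConnection server = connector.getMBeanServerConnection();
                ObjectName workerMetrics = new ObjectName("kafka.connect:type=connect-worker-metrics");
                // A value of "sessioned" (or a later protocol) indicates that internal
                // request verification is enabled for the cluster.
                Object protocol = server.getAttribute(workerMetrics, "connect-protocol");
                System.out.println("Connect protocol in use: " + protocol);
            } finally {
                connector.close();
            }
        }
    }

A value other than sessioned after a rolling upgrade completes would suggest that at least one worker is still holding the cluster back to an earlier protocol.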

Compatibility, Deprecation, and Migration Plan

...

The only seriously bad scenario is if a follower worker is configured to use a request signing algorithm that isn't allowed by the leader. In this case, a failure will only occur if/when that follower starts up a connector and then has to forward task configurations for that connector to the leader, which may not happen immediately. Once that failure occurs, the follower will enter a failure loop, endlessly retrying to send those task configurations to the leader and pausing for the backoff interval between each failed request.
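
To make this failure mode concrete, the sketch below shows roughly how a leader could verify the signature on a forwarded request: if the follower's signing algorithm is not in the leader's permitted set, the request is rejected outright (HTTP 403) regardless of whether the signature itself would otherwise match. All names in the sketch are illustrative and do not reflect the actual implementation.

    import java.security.MessageDigest;
    import java.util.Set;
    import javax.crypto.Mac;
    import javax.crypto.SecretKey;

    public class RequestSignatureVerifier {

        /**
         * Illustrative check only: returns true if the request body was signed with the
         * current session key using an algorithm the leader permits. A follower configured
         * with an algorithm outside {@code permittedAlgorithms} will always fail this check,
         * producing the retry loop described above.
         */
        public static boolean verify(
                byte[] requestBody,
                byte[] requestSignature,
                String signatureAlgorithm,
                Set<String> permittedAlgorithms,
                SecretKey currentSessionKey
        ) throws Exception {
            if (!permittedAlgorithms.contains(signatureAlgorithm)) {
                return false; // leader responds with HTTP 403
            }
            Mac mac = Mac.getInstance(signatureAlgorithm); // e.g. "HmacSHA256"
            mac.init(currentSessionKey);
            byte[] expectedSignature = mac.doFinal(requestBody);
            // Constant-time comparison to avoid leaking signature bytes through timing
            return MessageDigest.isEqual(expectedSignature, requestSignature);
        }
    }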

There will be two symptoms that could indicate to the user that this has occurred:

  1. Failure of connectors hosted by the follower worker to spawn tasks
  2. Error-level log messages emitted by the follower worker

There are two ways to rectify this situation: either shut down the follower and restart it after editing its configuration to use a request signing algorithm permitted by the leader, or shut down all other workers in the cluster that do not permit the request signing algorithm used by the follower, reconfigure them to permit it, and then restart them.
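
As a rough sketch of what those two remediation paths amount to in terms of worker configuration, assuming the inter.worker.* property names proposed elsewhere in this KIP (the names and algorithm values below should be checked against the final configuration reference):

    import java.util.HashMap;
    import java.util.Map;

    public class SigningAlgorithmRemediation {
        public static void main(String[] args) {
            // Option 1: reconfigure the follower to sign with an algorithm the leader
            // already permits, then restart the follower.
            Map<String, String> followerOverride = new HashMap<>();
            followerOverride.put("inter.worker.signature.algorithm", "HmacSHA256");

            // Option 2: reconfigure every other worker to also permit the follower's
            // algorithm (here, hypothetically, HmacSHA512), then restart them.
            Map<String, String> otherWorkersOverride = new HashMap<>();
            otherWorkersOverride.put("inter.worker.verification.algorithms", "HmacSHA256,HmacSHA512");

            followerOverride.forEach((k, v) -> System.out.println(k + "=" + v));
            otherWorkersOverride.forEach((k, v) -> System.out.println(k + "=" + v));
        }
    }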

...

Neither of these scenarios warrants error-level log messages, as either could theoretically be brought about by an intentional downgrade.

The newly-proposed connect-protocol JMX metric can be used to monitor whether internal request verification is enabled for a cluster; if its value is sessioned (or, presumably, a later protocol), then request verification should be enabled.

Reverting an upgrade

The group coordination protocol will be used to ensure that all workers in a cluster support verification of internal requests before this behavior is enabled; therefore, a rolling upgrade of the cluster will be possible. In line with the regression plan for KIP-415: Incremental Cooperative Rebalancing in Kafka Connect, if it becomes desirable to disable this behavior for some reason, the connect.protocol configuration can be set to compatible or default for one (or more) workers, and the behavior will automatically be disabled.
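
As a concrete illustration of that rollback path, reverting a single worker could look roughly like the following; the properties file location is only an example, and any mechanism for editing the worker's distributed configuration works equally well.

    import java.io.FileInputStream;
    import java.io.FileOutputStream;
    import java.io.InputStream;
    import java.io.OutputStream;
    import java.util.Properties;

    public class DisableSessionedProtocol {
        public static void main(String[] args) throws Exception {
            // Load the worker's existing distributed config (path is an example).
            Properties workerProps = new Properties();
            try (InputStream in = new FileInputStream("config/connect-distributed.properties")) {
                workerProps.load(in);
            }

            // Setting connect.protocol to "compatible" (or "default") on one or more
            // workers causes the cluster to fall back to an earlier Connect protocol,
            // which disables internal request verification.
            workerProps.setProperty("connect.protocol", "compatible");

            try (OutputStream out = new FileOutputStream("config/connect-distributed.properties")) {
                workerProps.store(out, "Reverted to the compatible Connect protocol");
            }
            // Restart the worker for the change to take effect.
        }
    }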

...