...

An ideal restriction of this endpoint would guarantee that requests made to it come exclusively from another worker in the same cluster. Although mutual authentication via TLS, for example, may seem like a viable approach, it only accomplishes authentication and not authorization; that is, it verifies that the request comes from a trusted party with a given identity, but it makes no distinction about whether that party should be allowed to perform actions on the cluster. If mutual authentication is used for the Connect REST API, then this endpoint is still effectively unsecured, since any user of the public REST API may also access this endpoint.
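As a concrete illustration, a worker configured for mutual TLS on its REST listener might look roughly like the sketch below (the values are placeholders). Every caller that presents a trusted certificate is authenticated, but every authenticated caller can still reach every resource, including the internal endpoint.

    listeners=https://connect-worker-1:8443
    listeners.https.ssl.keystore.location=/path/to/keystore.jks
    listeners.https.ssl.keystore.password=<keystore-password>
    listeners.https.ssl.truststore.location=/path/to/truststore.jks
    listeners.https.ssl.truststore.password=<truststore-password>
    # Require clients to present a trusted certificate (mutual TLS). This authenticates
    # callers but does not restrict which endpoints an authenticated caller may use.
    listeners.https.ssl.client.auth=required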

It should be noted that the goal here is not to completely secure any Kafka Connect cluster, but rather to patch an existing security hole for clusters that are already intended to be secure. Examples of steps that should be taken to secure a Kafka Connect cluster include securing the public REST API (which can be done using a Connect REST extension), securing the worker group (which can be done with ACLs on the Kafka broker), and securing the internal topics used by Connect to store configurations, statuses, and offsets for connectors (which can also be done with ACLs on the Kafka broker). If any of these steps are not taken, the cluster is insecure anyway; it is therefore acceptable to rely on these precautions being in place when implementing a fix for the problem posed by the internal REST endpoint used by Connect.
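For illustration only, broker-side ACLs for the worker group and the internal topics might be created with the Java AdminClient along the lines of the following sketch; the principal name, group ID, and topic names are placeholders rather than values defined by this KIP.

    import org.apache.kafka.clients.admin.AdminClient;
    import org.apache.kafka.common.acl.AccessControlEntry;
    import org.apache.kafka.common.acl.AclBinding;
    import org.apache.kafka.common.acl.AclOperation;
    import org.apache.kafka.common.acl.AclPermissionType;
    import org.apache.kafka.common.resource.PatternType;
    import org.apache.kafka.common.resource.ResourcePattern;
    import org.apache.kafka.common.resource.ResourceType;

    import java.util.ArrayList;
    import java.util.List;
    import java.util.Properties;

    public class ConnectClusterAcls {
        public static void main(String[] args) throws Exception {
            Properties props = new Properties();
            props.put("bootstrap.servers", "broker:9092");
            // Client-side security settings (SASL/SSL) for talking to the broker would go here.

            String principal = "User:connect-worker"; // placeholder worker principal
            List<AclBinding> acls = new ArrayList<>();

            // Limit membership in the Connect worker group to the worker principal.
            acls.add(new AclBinding(
                new ResourcePattern(ResourceType.GROUP, "connect-cluster", PatternType.LITERAL),
                new AccessControlEntry(principal, "*", AclOperation.READ, AclPermissionType.ALLOW)));

            // Limit read/write access to the internal config, offset, and status topics.
            for (String topic : new String[] {"connect-configs", "connect-offsets", "connect-status"}) {
                for (AclOperation op : new AclOperation[] {AclOperation.READ, AclOperation.WRITE}) {
                    acls.add(new AclBinding(
                        new ResourcePattern(ResourceType.TOPIC, topic, PatternType.LITERAL),
                        new AccessControlEntry(principal, "*", op, AclPermissionType.ALLOW)));
                }
            }

            try (AdminClient admin = AdminClient.create(props)) {
                admin.createAcls(acls).all().get();
            }
        }
    }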

Public Interfaces

There will be four new configurations added for distributed workers:

...

Summary: The REST endpoint could be removed entirely and replaced with a Kafka topic. Either an existing internal Connect topic (such as the configs topic) could be used, or a new topic could be added to handle all non-forwarded follower-to-leader communication.

Rejected because: Achieving consensus in a Connect cluster about whether to begin engaging in this new topic-based protocol would require either reworking the Connect group coordination protocol or installing several new configurations and a multi-stage rolling upgrade in order to enable it. Requiring new configurations and a multi-stage rolling upgrade for the default use case of a simple version bump for a cluster would be a much worse user experience, and if the group coordination protocol is going to be reworked, we might as well just use the group coordination protocol to distribute keys instead. Additionally, the added complexity of switching from a synchronous to an asynchronous means of communication for relaying task configurations to the leader would complicate the implementation enough that reworking the group coordination protocol might even be a simpler approach with smaller changes required.

Open Questions

  • Will it be necessary to support multiple keys at once, in the event that a follower worker makes a request to the internal endpoint during a rebalance (in which case the follower and the leader may be using different keys)? Is this event even possible?
    • The DistributedHerder class appears to retry infinitely when failures are encountered in task reconfiguration. If this happens on a separate thread from (or simply doesn't block) the rebalance logic (which would be responsible for updating the key used by the herder), then it's possible this is fine. However, if it happens on the same thread as (and effectively blocks) the rebalance logic, then there will be a deadlock: the worker must successfully complete the request for task reconfiguration before receiving its new key, and it must receive its new key before it can successfully complete the request for task reconfiguration. A sketch of how a worker might verify requests against multiple candidate keys follows this list.
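If multiple keys do need to be supported at once, one way to handle the rebalance window is to verify each incoming request against a small set of candidate keys (for example, the current session key and the most recently retired one). The sketch below is purely illustrative and is not part of this KIP; the class name, method signature, and choice of HmacSHA256 are assumptions rather than anything specified here.

    import javax.crypto.Mac;
    import javax.crypto.spec.SecretKeySpec;

    import java.security.MessageDigest;
    import java.util.Collection;

    public class MultiKeyRequestValidator {

        private static final String ALGORITHM = "HmacSHA256"; // example signature algorithm

        // Returns true if the request signature matches the signature computed with any
        // of the candidate keys (e.g. the current session key and the previous one).
        public boolean isValid(byte[] requestBody, byte[] signature, Collection<SecretKeySpec> candidateKeys)
                throws Exception {
            for (SecretKeySpec key : candidateKeys) {
                Mac mac = Mac.getInstance(ALGORITHM);
                mac.init(key);
                byte[] expected = mac.doFinal(requestBody);
                // Constant-time comparison to avoid leaking information about the expected signature.
                if (MessageDigest.isEqual(expected, signature)) {
                    return true;
                }
            }
            return false;
        }
    }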