Status
Current state: Adopted
Discussion thread: here
Vote thread: here
JIRA:
...
The response of this method will be changed slightly, but will still be compatible with the old behavior (where `includeTasks=false` and `onlyFailed=false`). When these new query parameters are used with `includeTasks` and/or `onlyFailed` set to true, a successful response will be `202 ACCEPTED`, signaling that the request to restart some subset of the `Connector` and/or `Task` instances was accepted and will be performed asynchronously.
...
...
- `202 ACCEPTED` when the named connector exists and the server has successfully and durably recorded the request to stop and begin restarting at least one failed or running `Connector` object and/or `Task` instances (e.g., `includeTasks=true` or `onlyFailed=true`). A response body will be returned, and it is similar to the `GET /connectors/{connectorName}/status` response except that the "state" field is set to RESTARTING for all instances that will eventually be restarted.
- `204 NO CONTENT` when the named connector exists and the server has successfully stopped and begun restarting only the `Connector` object (e.g., `includeTasks=false` and `onlyFailed=false`). No response body will be returned (to align with the existing behavior).
- `404 NOT FOUND` when the named connector does not exist.
- `409 CONFLICT` when a rebalance is needed, forthcoming, or underway while restarting any of the `Connector` and/or `Task` objects; the reason may mention that the Connect cluster's leader is not known, or that the worker assigned the `Connector` cannot be found.
- `500 Internal Server Error` when the request timed out (took more than 90 seconds), which means the request could not be durably recorded, perhaps because the worker or cluster is shutting down or because the worker receiving the request has temporarily lost contact with the Kafka cluster.
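As an illustration of the contract above, here is a hypothetical client-side sketch (Python; the function and variable names are assumptions for this example, not part of the KIP) showing how a caller might build the restart URL from the new query parameters and interpret the documented response codes:

```python
from urllib.parse import urlencode

def restart_url(base, connector, include_tasks=False, only_failed=False):
    """Build the restart URL; query parameters are omitted when False,
    which keeps the request identical to the old endpoint's behavior."""
    params = {}
    if include_tasks:
        params["includeTasks"] = "true"
    if only_failed:
        params["onlyFailed"] = "true"
    query = ("?" + urlencode(params)) if params else ""
    return f"{base}/connectors/{connector}/restart{query}"

def interpret(status_code):
    """Map the documented response codes to a coarse outcome."""
    return {
        202: "restart request durably recorded; body lists RESTARTING instances",
        204: "connector restarted; no body (old behavior)",
        404: "named connector does not exist",
        409: "rebalance pending; retry later",
        500: "request timed out before it could be durably recorded",
    }.get(status_code, "unexpected status")

print(restart_url("http://localhost:8083", "my-sink",
                  include_tasks=True, only_failed=True))
# → http://localhost:8083/connectors/my-sink/restart?includeTasks=true&onlyFailed=true
```

Note that a `409 CONFLICT` is retriable once the rebalance completes, whereas a `404` is not.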
...
The `202 ACCEPTED` response signifies that the "restart request" has been durably written to the config topic and all the workers in the Connect cluster will (eventually) see the restart request. If a worker reads the restart request as part of worker startup, it can ignore the restart request, since the worker will subsequently attempt to start all of its assigned `Connector` and `Task` instances, effectively achieving the goal of restarting the instances assigned to that worker. If a worker reads the restart request after worker startup, then the `DistributedHerder` will enqueue the request to be processed within its next `tick()` invocation, the herder's main thread loop. During this main thread loop, the herder will dequeue all pending restart requests and, for each request, use the current connector status and the herder's current assignments to determine which of its `Connector` and `Task` instances are to be restarted, and will then stop and restart them. Note that because this is done within the main thread loop, the herder will not concurrently process any assignment changes while it is executing the restart requests.
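A minimal sketch of the decision described above (Python, illustrative only; the real logic lives in the `DistributedHerder`'s Java `tick()` loop, and these names are assumptions): given a dequeued restart request, the worker's current statuses, and its assignments, compute which locally assigned instances to restart.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class RestartRequest:
    connector: str
    only_failed: bool
    include_tasks: bool

def plan_restart(request, assigned_tasks, statuses, connector_assigned):
    """Decide which locally assigned instances this worker should restart.

    statuses: dict mapping "connector" or a task id to "RUNNING"/"FAILED"/...
    assigned_tasks: ids of request.connector's tasks assigned to this worker.
    connector_assigned: whether the Connector object runs on this worker.
    """
    def wanted(state):
        # onlyFailed=true narrows the restart to FAILED instances.
        return state == "FAILED" if request.only_failed else True

    restart_connector = connector_assigned and wanted(statuses.get("connector"))
    restart_tasks = []
    if request.include_tasks:
        restart_tasks = [t for t in assigned_tasks if wanted(statuses.get(t))]
    return restart_connector, restart_tasks

# Example: restart only failed instances, including tasks.
req = RestartRequest("my-sink", only_failed=True, include_tasks=True)
print(plan_restart(req, ["task-0", "task-1"],
                   {"connector": "RUNNING", "task-0": "FAILED",
                    "task-1": "RUNNING"},
                   connector_assigned=True))
# → (False, ['task-0'])
```

Because each worker evaluates the request against its own assignments, no coordination beyond the config-topic record is required.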
The "restart request" is written to the config topic, which is already where the connector and task config records, task state change records, and session key records are written. This topic also makes sense since all records related to restarts and configuration changes are totally ordered and are all processed within the herder's main thread loop. The "restart request" records will not conflict with any other types of config records, will be compatible with the compacted topic, and will look like:
...
On the other hand, the current approach is more reliable, since once the restart request is written to the config topic it will eventually be consumed by all workers. The current proposal also builds upon and reuses much more of the existing functionality in the worker, making the overall implementation more straightforward. There is also no chance for changing worker assignments to interfere with the restarts, since the current approach performs the restarts during the same herder main thread loop that reacts to all rebalance changes. And the new approach is more efficient, as some restart requests can be ignored if the worker will subsequently (re)start its assigned instances. For example, if a restart for a connector is requested but one of the workers is itself restarted (or joins the cluster), that worker will, as part of startup, start all of its assigned `Connector` and `Task` instances, making the restart unnecessary.
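The startup optimization in the example above can be sketched as follows (Python, illustrative; the function and parameter names are assumptions, not Connect APIs):

```python
def on_restart_request_record(request, worker_starting_up, pending_requests):
    """Handle a restart-request record read from the config topic.

    During worker startup every assigned Connector and Task will be
    (re)started anyway, so the request can safely be dropped; otherwise
    it is queued for the herder's next main-loop iteration.
    """
    if worker_starting_up:
        return pending_requests  # startup will restart everything assigned
    return pending_requests + [request]

print(on_restart_request_record("restart my-sink", True, []))   # → []
print(on_restart_request_record("restart my-sink", False, []))  # → ['restart my-sink']
```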
...