

Status

Current state: Adopted

Discussion thread: here

Vote thread: here

JIRA: KAFKA-4793

...

The response of this method will be changed slightly, but will remain compatible with the old behavior (i.e., when "includeTasks=false" and "onlyFailed=false"). When these new query parameters are used with "includeTasks" and/or "onlyFailed" set to true, a successful response will be 202 ACCEPTED, signaling that the request to restart some subset of the Connector and/or Task instances was accepted and will be performed asynchronously.
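For illustration, this is roughly how a client might invoke the new form of the endpoint. The worker URL and connector name below are hypothetical; only the endpoint path and query parameters come from this proposal:

    import java.net.URI;
    import java.net.http.HttpClient;
    import java.net.http.HttpRequest;
    import java.net.http.HttpResponse;

    public class RestartExample {
        public static void main(String[] args) throws Exception {
            // Hypothetical worker URL and connector name; the endpoint and
            // query parameters are the ones defined by this KIP.
            URI uri = URI.create(
                "http://localhost:8083/connectors/my-connector/restart"
                    + "?includeTasks=true&onlyFailed=true");

            HttpRequest request = HttpRequest.newBuilder(uri)
                .POST(HttpRequest.BodyPublishers.noBody())
                .build();

            HttpResponse<String> response = HttpClient.newHttpClient()
                .send(request, HttpResponse.BodyHandlers.ofString());

            // 202 ACCEPTED is expected when includeTasks=true and/or onlyFailed=true
            System.out.println(response.statusCode());
            System.out.println(response.body());
        }
    }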

...


...

  • 202 ACCEPTED when the named connector exists and the server has successfully and durably recorded the request to stop and begin restarting at least one failed or running Connector object and/or Task instance (e.g., "includeTasks=true" or "onlyFailed=true"). A response body will be returned; it is similar to the GET /connectors/{connectorName}/status response, except that the "state" field is set to RESTARTING for all instances that will eventually be restarted (see the example after this list).

  • 204 NO CONTENT when the named connector exists and the server has successfully stopped and begun restarting only the Connector object (e.g., "includeTasks=false" and "onlyFailed=false"). No response body will be returned (to align with the existing behavior).

  • 404 NOT FOUND when the named connector does not exist.

  • 409 CONFLICT when a rebalance is needed, forthcoming, or underway while restarting any of the Connector and/or Task objects; the reason may mention that the Connect cluster’s leader is not known, or that the worker assigned the Connector cannot be found.

  • 500 Internal Server Error when the request times out (takes longer than 90 seconds), meaning the request could not be durably recorded, perhaps because the worker or cluster is shutting down or because the worker receiving the request has temporarily lost contact with the Kafka cluster.
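As a sketch of the 202 ACCEPTED case, the response body for a connector whose Connector object and one failed task are being restarted might look like the following. The connector name, worker IDs, and task count are made up; the shape mirrors the existing GET /connectors/{connectorName}/status response, with "state" set to RESTARTING as described above:

    {
      "name": "my-connector",
      "connector": {
        "state": "RESTARTING",
        "worker_id": "10.0.0.1:8083"
      },
      "tasks": [
        {
          "id": 0,
          "state": "RESTARTING",
          "worker_id": "10.0.0.2:8083"
        }
      ]
    }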

...

The 202 ACCEPTED response signifies that the “restart request” has been durably written to the config topic and all the workers in the Connect cluster will (eventually) see the restart request. If the worker reads the restart request as part of worker startup, it can ignore the restart request, since the worker will subsequently attempt to start all of its assigned Connector and Task instances, effectively achieving the goal of restarting the instances assigned to that worker. If the worker reads the restart request after worker startup, then the DistributedHerder will enqueue the request to be processed within the herder's main thread loop. During this main thread loop, the herder will dequeue all pending restart requests and, for each request, use the current connector status and the herder’s current assignments to determine which of its Connector and Task instances are to be restarted, and will then stop and restart them. Note that because this is done within the main thread loop, the herder will not concurrently process any assignment changes while it is executing the restart requests.
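The following is a minimal, self-contained sketch of that enqueue-then-drain flow. It is not the actual DistributedHerder code, and every name in it is hypothetical; it only mirrors the behavior described above:

    import java.util.List;
    import java.util.Queue;
    import java.util.concurrent.ConcurrentLinkedQueue;

    // Simplified sketch only; the real DistributedHerder is far more involved.
    public class HerderLoopSketch {

        // Hypothetical stand-in for a restart request read from the config topic.
        record RestartRequest(String connectorName, boolean includeTasks, boolean onlyFailed) {}

        private final Queue<RestartRequest> pendingRestartRequests = new ConcurrentLinkedQueue<>();

        // Called when a restart request record is read after worker startup;
        // the request is queued for the main thread loop.
        public void onRestartRequest(RestartRequest request) {
            pendingRestartRequests.add(request);
        }

        // One iteration of the herder's main thread loop.
        public void tick() {
            // ... react to rebalances, config updates, etc. ...

            // Drain all pending restart requests and act only on the instances
            // this worker is currently assigned. Because this runs inside the
            // same loop that reacts to rebalances, no assignment changes are
            // processed concurrently with the restarts.
            RestartRequest request;
            while ((request = pendingRestartRequests.poll()) != null) {
                for (String taskId : assignedTasksToRestart(request)) {
                    stopTask(taskId);
                    startTask(taskId);
                }
                if (assignedConnectorToRestart(request)) {
                    stopConnector(request.connectorName());
                    startConnector(request.connectorName());
                }
            }
        }

        // Stubs standing in for the worker's existing start/stop machinery and
        // the status-based filtering (e.g., only failed instances when onlyFailed=true).
        private List<String> assignedTasksToRestart(RestartRequest r) { return List.of(); }
        private boolean assignedConnectorToRestart(RestartRequest r) { return false; }
        private void stopTask(String id) {}
        private void startTask(String id) {}
        private void stopConnector(String name) {}
        private void startConnector(String name) {}
    }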

The “restart request” will be written to the config topic, which is already where the connector and task config records, task state change records, and session key records are written. This topic also makes sense because all records related to restarts and configuration changes are totally ordered and are all processed within the herder's main thread loop. The "restart request" records will not conflict with any other types of config records, will be compatible with the compacted topic, and will look like:

...

On the other hand, the current approach is more reliable, since once the restart request is written to the config topic it will eventually be consumed by all workers. The current proposal also builds upon and reuses much more of the existing functionality in the worker, making the overall implementation more straightforward. There is also no chance for changing worker assignments to interfere with the restarts, since the current approach performs the restarts during the same herder main thread loop that reacts to all rebalance changes. And the new approach is more efficient, as some restart requests can be ignored if the worker will subsequently (re)start its assigned instances. For example, if a restart for a connector is requested but one of the workers is itself restarted (or joins the cluster), that worker will, as part of startup, start all of its assigned Connector and Task instances, making the explicit restart unnecessary.

...