Status
Current state: Under discussion
Discussion thread: here
JIRA: here
Please keep the discussion on the mailing list rather than commenting on the wiki (wiki discussions get unwieldy fast).
...
- Creating the offsets file if it did not already exist
- Reading and parsing the offsets file
- Starting all connectors whose configs were specified on the command line
- Generating task configs for all of these connectors
- Starting tasks for all of these connectors
...
- Remaining in contact with the group coordinator for the cluster
- Reading to the end of the config topic after a rebalance
- If exactly-once support for source connectors is enabled and the worker is the leader of the cluster (see KIP-618: Exactly-Once Support for Source Connectors), instantiating a transactional producer for the config topic
- If session keys are enabled (see KIP-507: Securing Internal Connect REST Endpoints), writing a new session key to the config topic
...
{ "status": "healthy", "message": "Worker has completed startup and is ready to handle requests" }
If the worker has not yet completed startup, the response will have a 503 status code and its body will have a different message:
{
"status": "starting",
"message": "Worker is still starting up"
}
If the worker has completed startup but is unable to respond in time, the response will have a 500 status code and its body will have this message:
{ "status": "unhealthy", "message": "Worker was unable to handle this request and may be unable to handle other requests" }
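As an illustration of how monitoring tooling might interpret the three responses above, here is a minimal Python sketch. The function name `classify_health` and the `STATE_BY_CODE` table are hypothetical and not part of this proposal; the sketch assumes only the status codes and JSON bodies described in this section, and it treats every unexpected status code as unhealthy.

```python
import json

# Status codes described in this section, mapped to the "status" value
# that the corresponding response body carries.
STATE_BY_CODE = {200: "healthy", 503: "starting", 500: "unhealthy"}


def classify_health(status_code, body=""):
    """Return a (state, message) pair for one health check response.

    Any status code other than 200 or 503 is reported as "unhealthy",
    since only a 200 response indicates a healthy worker.
    """
    state = STATE_BY_CODE.get(status_code, "unhealthy")
    try:
        # The body is expected to be a JSON object with a "message" field.
        message = json.loads(body).get("message", "")
    except (ValueError, TypeError, AttributeError):
        # Missing, non-JSON, or non-object bodies yield an empty message.
        message = ""
    return state, message
```

For example, `classify_health(503, '{"status": "starting", "message": "Worker is still starting up"}')` returns `("starting", "Worker is still starting up")`.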
Unlike other endpoints, the timeout for the health check endpoint will not be 90 seconds. If N consecutive failures reported by this endpoint are required before automated tooling declares the worker unhealthy, then waiting N * 1.5 minutes for an issue with worker health to be detected is likely to be too long. Instead, the timeout for this endpoint will be 10 seconds. In the future, the timeout may be made user-configurable if, for example, KIP-882: Kafka Connect REST API configuration validation timeout improvements or something like it is adopted. Anything except a 200 response should be considered indicative that the worker is not healthy.
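The arithmetic behind the shorter timeout can be sketched as follows. This is a simplified model: it assumes every failed check runs for the full timeout and ignores any pause the tooling adds between checks. The function name is hypothetical.

```python
def worst_case_detection_seconds(consecutive_failures, timeout_seconds):
    """Time spent waiting on N timed-out health checks in a row."""
    return consecutive_failures * timeout_seconds


if __name__ == "__main__":
    for n in (1, 3, 5):
        slow = worst_case_detection_seconds(n, 90)  # existing 90 second timeout
        fast = worst_case_detection_seconds(n, 10)  # 10 second timeout for this endpoint
        print(f"N={n}: {slow / 60:.1f} min at 90 s vs {fast} s at 10 s")
```

With N = 3, for instance, the 90 second timeout gives 4.5 minutes before the worker is declared unhealthy, while the 10 second timeout gives 30 seconds.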
Proposed Changes
Distributed mode
...
This change should be fully backwards compatible. Users who already have their own strategies for monitoring worker health/liveness can continue to employ them. Users who would like to use the endpoint introduced by this KIP need only upgrade to a newer version of Kafka Connect.
Test Plan
Tests should be fairly lightweight. We will cover the three possible states (starting, healthy, unhealthy) in the two different modes (standalone, distributed) through either integration or system testing.
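One way to enumerate the cases named above is a small test matrix. The sketch below is illustrative only: the names `STATES`, `MODES`, `EXPECTED_CODE`, and `health_check_test_matrix` are hypothetical, and the actual integration or system test harness is not specified here.

```python
from itertools import product

STATES = ("starting", "healthy", "unhealthy")
MODES = ("standalone", "distributed")

# Expected HTTP status code for each worker state, per the responses described above.
EXPECTED_CODE = {"starting": 503, "healthy": 200, "unhealthy": 500}


def health_check_test_matrix():
    """Return one (mode, state, expected_status_code) tuple per test case."""
    return [(mode, state, EXPECTED_CODE[state]) for mode, state in product(MODES, STATES)]
```

This yields six cases, one per combination of mode and state.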
Rejected Alternatives
Reuse existing endpoints
Summary: reuse the existing GET / endpoint, or a similar endpoint, as the health check endpoint.
Rejected because:
- It's not clear that users are already employing this endpoint as a health check endpoint, so it's not guaranteed that this would ease the migration path significantly
- Altering the semantics for existing endpoints may have unanticipated effects on existing deployments (users may want to be able to query the version of the worker via the REST API as soon as the worker process has started, without waiting for the worker to complete startup; this may be especially valuable if the worker is unhealthy and that information is not easily accessible otherwise)
- It's easier to add new, user-friendly response bodies that cover the worker state, without worrying about breaking usages of existing endpoints
- By using a new endpoint for health checks, users can be guaranteed that the endpoint comes with the desired semantics; if an existing endpoint were reused, it would be unclear (without prior knowledge of worker version and Kafka Connect changelog history) whether it would be suitable for health checks
Delay REST endpoint availability until worker has completed startup
Summary: issue 503 responses from all existing endpoints (including GET /, GET /connectors, POST /connectors, etc.) if the worker hasn't completed startup yet, or delay responses from these endpoints until the worker has completed startup (using the existing 90 second timeout)
Rejected because:
- All concerns outlined above for reusing existing endpoints as health check endpoints apply
- This does not cover worker liveness, and only covers worker readiness