Status
Current state: Under Discussion
Discussion thread: here
JIRA: here
Please keep the discussion on the mailing list rather than commenting on the wiki (wiki discussions get unwieldy fast).
Motivation
Kafka Connect currently uses a default REST API request timeout of 90 seconds, which is not configurable. If a REST API request takes longer than this, a 500 Internal Server Error response is returned with the message "Request timed out". In exceptional scenarios, a longer timeout may be required for operations such as connector config validation or connector creation / update (both of which internally do a config validation first) to complete successfully. Consider a database / data warehouse connector with elaborate validation logic that queries the information schema for a list of tables / views in order to validate the user's connector configuration. If the database / data warehouse has a very large number of tables / views and is under heavy query load, such information schema queries can take longer than 90 seconds, which causes connector config validation / creation REST API calls to time out.
Public Interfaces
This KIP proposes adding a new Kafka Connect worker configuration, rest.api.request.timeout.ms, which will default to the existing REST API request timeout of 90 seconds.
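As a rough illustration of the defaulting behavior, the sketch below reads the proposed property from a set of worker properties and falls back to 90 seconds when it is absent. The class and method names are hypothetical and use only the Java standard library; the real worker would define the config through Connect's own configuration classes, and the assumption that the value is a long number of milliseconds is ours, not the KIP's.

```java
import java.time.Duration;
import java.util.Properties;

// Hypothetical helper for illustration only; not part of Kafka Connect.
final class RestTimeoutConfigSketch {
    // Config name proposed in this KIP.
    static final String REST_API_REQUEST_TIMEOUT_MS_CONFIG = "rest.api.request.timeout.ms";

    // The existing hard-coded timeout of 90 seconds is kept as the default.
    static final long DEFAULT_REST_API_REQUEST_TIMEOUT_MS = Duration.ofSeconds(90).toMillis();

    // Returns the configured timeout in milliseconds, or the 90 second default
    // when the worker properties do not set the config.
    static long requestTimeoutMs(Properties workerProps) {
        String raw = workerProps.getProperty(REST_API_REQUEST_TIMEOUT_MS_CONFIG);
        return raw == null ? DEFAULT_REST_API_REQUEST_TIMEOUT_MS : Long.parseLong(raw.trim());
    }
}
```

A worker that leaves the property unset would keep today's behavior, while one that sets it to, say, 300000 would allow REST requests up to five minutes.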
Proposed Changes
The value of the new worker config rest.api.request.timeout.ms will be read in the RestServer class and used to configure the request timeout of each of its resources (each resource essentially represents a group of related Connect REST APIs under a common top level path) via ConnectResource::requestTimeout. Note that this doesn't change how long requests actually run in the herder: currently, if a request exceeds the default timeout of 90 seconds, we simply return the 500 response, but the request isn't interrupted or cancelled and is allowed to run to completion. Furthermore, each connector config validation is done on its own thread via a cached thread pool executor in the herder (create / update connector calls are done asynchronously by simply writing a record to the Connect cluster's config topic, so config validations are the only relevant operation here).
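The "time out the response, but let the work finish" behavior described above can be sketched with plain Java concurrency primitives. This is a minimal illustration under stated assumptions, not the herder's actual code: RequestTimeoutSketch, validateAsync and awaitResponse are hypothetical names, and the sleep stands in for a slow config validation.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;
import java.util.concurrent.atomic.AtomicBoolean;

// Hypothetical sketch; names do not correspond to real Kafka Connect classes.
final class RequestTimeoutSketch {
    // Stand-in for the herder's cached thread pool executor (daemon threads so the JVM can exit).
    static final ExecutorService VALIDATION_EXECUTOR = Executors.newCachedThreadPool(r -> {
        Thread t = new Thread(r, "validation-sketch");
        t.setDaemon(true);
        return t;
    });

    // Simulates a config validation that takes workMs to complete on its own thread.
    static CompletableFuture<String> validateAsync(long workMs, AtomicBoolean finished) {
        return CompletableFuture.supplyAsync(() -> {
            try {
                Thread.sleep(workMs);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException(e);
            }
            finished.set(true);
            return "config is valid";
        }, VALIDATION_EXECUTOR);
    }

    // Waits up to requestTimeoutMs for the result. On timeout the caller gets a 500,
    // but the validation future is neither cancelled nor interrupted.
    static String awaitResponse(CompletableFuture<String> validation, long requestTimeoutMs) {
        try {
            return "200 " + validation.get(requestTimeoutMs, TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            return "500 Request timed out";
        } catch (ExecutionException e) {
            return "500 " + e.getCause().getMessage();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return "500 interrupted";
        }
    }
}
```

In this sketch, raising the timeout passed to awaitResponse only changes how long the REST layer waits; the validation itself takes the same amount of time either way.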
Compatibility, Deprecation, and Migration Plan
The proposed changes are fully backward compatible since we're just introducing a new worker config for REST API request timeouts which will default to the existing REST API request timeout of 90 seconds.
Test Plan
A simple integration test will be added to ensure that a validate REST API request for a connector that takes longer than the default REST API request timeout (90 seconds) doesn't fail on a worker configured with rest.api.request.timeout.ms set to a higher value.
Rejected Alternatives
Allow configuring timeouts for each REST resource
Summary: The Kafka Connect REST server initializes multiple "resources" including the ConnectorsResource (serving APIs with the path /connectors) and the ConnectorPluginsResource (serving APIs with the path /connector-plugins) among others. We could allow configuring the request timeouts for each of these resources individually via separate Connect worker properties.
Rejected because: This would require the introduction of multiple new Kafka Connect worker properties with negligible additional value.
Allow configuring timeouts for ConnectClusterStateImpl
Summary: Currently, ConnectClusterStateImpl is configured in the RestServer and passed to REST extensions via the context object (see here). ConnectClusterStateImpl takes a request timeout parameter for its operations such as list connectors and get connector config (implemented as herder requests). This timeout is set to the minimum of ConnectResource.DEFAULT_REST_REQUEST_TIMEOUT_MS (90 seconds) and DistributedConfig.REBALANCE_TIMEOUT_MS_CONFIG (defaults to 60 seconds). We could use the value of the new worker config proposed in this KIP instead of ConnectResource.DEFAULT_REST_REQUEST_TIMEOUT_MS in the minimum calculation.
Rejected because: The overall behavior would be confusing to end users (they'll need to tweak two configs to increase the overall timeout) and there is seemingly no additional value here (as the herder requests should not take longer than the current configured timeout anyway).
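To make the "two configs" concern concrete, here is a small sketch of the minimum calculation, assuming the values quoted above (90 seconds and a 60 second rebalance timeout default). The class and method names are hypothetical and only mirror the arithmetic, not the actual ConnectClusterStateImpl wiring.

```java
// Hypothetical sketch of the timeout arithmetic; not actual Kafka Connect code.
final class ClusterStateTimeoutSketch {
    // Values as quoted in this KIP.
    static final long DEFAULT_REST_REQUEST_TIMEOUT_MS = 90_000L;
    static final long DEFAULT_REBALANCE_TIMEOUT_MS = 60_000L;

    // Current behavior: minimum of the fixed 90 second timeout and the rebalance timeout.
    static long currentTimeoutMs(long rebalanceTimeoutMs) {
        return Math.min(DEFAULT_REST_REQUEST_TIMEOUT_MS, rebalanceTimeoutMs);
    }

    // Rejected alternative: substitute the new worker config for the fixed 90 seconds.
    static long rejectedAlternativeTimeoutMs(long restApiRequestTimeoutMs, long rebalanceTimeoutMs) {
        return Math.min(restApiRequestTimeoutMs, rebalanceTimeoutMs);
    }
}
```

With the rebalance timeout left at its default, raising the new config to 300000 would still yield an effective timeout of 60000 under the rejected alternative, so a user would have to raise both settings to see any change.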
Allow configuring producer zombie fencing admin request timeout
Summary: ConnectResource.DEFAULT_REST_REQUEST_TIMEOUT_MS is also used as the timeout for producer zombie fencings done in the worker for exactly once source tasks (see here). We could allow configuring this as well via the new worker config proposed in this KIP.
Rejected because: Zombie fencing is an internal operation for Kafka Connect and users shouldn't be able to configure it.