Please keep the discussion on the mailing list rather than commenting on the wiki (wiki discussions get unwieldy fast).

Motivation

When reading from remote storage, we reuse the fetch.max.wait.ms config as the delay timeout for DelayedRemoteFetchPurgatory. The purpose of fetch.max.wait.ms is to wait for the given amount of time when there is no local data available to serve the FETCH request.

Using the same timeout in DelayedRemoteFetchPurgatory can confuse users about how to configure the optimal value for each purpose. Moreover, the config is of LOW importance, so most users won't configure it and will use the default value of 500 ms. A delay timeout of 500 ms in DelayedRemoteFetchPurgatory can lead to a higher number of expired delayed remote fetch requests when the remote storage experiences any degradation.

Public Interfaces

A new dynamic broker configuration, fetch.remote.max.wait.ms, will be added. The delayed remote fetch purgatory will wait up to this timeout to fetch the data from the remote storage.

Proposed Changes

Remote storage read latencies are non-deterministic. Suppose the user configures 20 remote-log-reader threads, and it takes 100 ms to serve one request. When a backfill job runs and reads data from the head of the log for multiple partitions (let's say 1000), the remote-fetch requests get queued, potentially exceeding the default timeout of 500 ms. Additionally, the time taken to serve the P99 remote storage fetch requests can extend into seconds. We propose introducing a new timeout parameter, fetch.remote.max.wait.ms, to offer users the option to configure the timeout based on their workload.
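
The queueing effect in the example above can be put into numbers. The following is a back-of-the-envelope sketch, not Kafka code: the class and method names are hypothetical, and it assumes one remote-fetch request per partition, all requests enqueued at the same time, and a constant service time per request.

```java
// Hypothetical estimate of remote-fetch queueing delay; not part of Kafka.
final class RemoteFetchQueueEstimate {

    /**
     * Worst-case wait in ms for the last queued request, assuming every
     * request takes serviceTimeMs and all reader threads work in parallel.
     */
    static long worstCaseWaitMs(int pendingRequests, int readerThreads, long serviceTimeMs) {
        checkPositive(readerThreads, serviceTimeMs);
        // Ceiling division: number of "rounds" needed to drain the queue.
        long rounds = (pendingRequests + readerThreads - 1) / readerThreads;
        return rounds * serviceTimeMs;
    }

    /** Number of requests that can complete before timeoutMs elapses. */
    static long completedWithin(int pendingRequests, int readerThreads,
                                long serviceTimeMs, long timeoutMs) {
        checkPositive(readerThreads, serviceTimeMs);
        long rounds = timeoutMs / serviceTimeMs; // full rounds that fit in the timeout
        return Math.min(pendingRequests, rounds * readerThreads);
    }

    private static void checkPositive(int readerThreads, long serviceTimeMs) {
        if (readerThreads <= 0 || serviceTimeMs <= 0) {
            throw new IllegalArgumentException("threads and service time must be positive");
        }
    }

    public static void main(String[] args) {
        // Numbers taken from the example: 1000 partitions, 20 threads, 100 ms per request.
        System.out.println("worst-case wait ms: " + worstCaseWaitMs(1000, 20, 100));
        System.out.println("served within 500 ms: " + completedWithin(1000, 20, 100, 500));
    }
}
```

Under these assumptions, draining 1000 requests takes 5000 ms, and only 100 of them can finish within the 500 ms default; the rest would expire in the purgatory.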

Under the current behavior, when the remote-fetch request times out, the client receives an empty response and retries by sending the FETCH request for the same fetch-offset. This adds further pressure on the remote storage: the reader thread is interrupted before the read completes, and the retry fetches the same data again.

fetch.remote.max.wait.ms:

This parameter represents the maximum time the server will block before responding to the remote fetch request. Note that this timeout should be set to a value less than request.timeout.ms; otherwise, the requests will time out on the client side. The default value is 500 ms.

Setting the default value to 500 ms matches the default of fetch.max.wait.ms, so that clients continue to receive a response within about 500 ms and see no surprises. Note, however, that there is no guarantee that a FETCH request will always be served within 500 ms: the time taken can exceed fetch.max.wait.ms due to factors such as a slow hard disk, sector errors on the disk, and so on.
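
The wait semantics of fetch.remote.max.wait.ms can be pictured with a small sketch. This is an illustration only, not how the broker is implemented: the real purgatory does not block a thread like this, and the class and method names are hypothetical. It shows the two outcomes described in this proposal: a remote read that finishes in time is returned right away, and one that exceeds the timeout yields an empty response.

```java
import java.util.Optional;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical illustration of a bounded wait on a remote read; not Kafka code.
final class RemoteFetchWaitSketch {

    /**
     * Waits up to maxWaitMs (standing in for fetch.remote.max.wait.ms) for the
     * remote read. Returns the data if it arrives in time, otherwise empty.
     */
    static Optional<byte[]> awaitRemoteRead(CompletableFuture<byte[]> remoteRead, long maxWaitMs) {
        try {
            // Returns as soon as the data is available, without waiting out the timeout.
            return Optional.ofNullable(remoteRead.get(maxWaitMs, TimeUnit.MILLISECONDS));
        } catch (TimeoutException e) {
            // Timeout expired: the client would get an empty response and retry.
            return Optional.empty();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // preserve the interrupt status
            return Optional.empty();
        } catch (ExecutionException e) {
            // The remote read itself failed; treated as empty in this sketch.
            return Optional.empty();
        }
    }

    public static void main(String[] args) {
        CompletableFuture<byte[]> ready = CompletableFuture.completedFuture(new byte[] {1, 2, 3});
        CompletableFuture<byte[]> slow = new CompletableFuture<>(); // never completes

        System.out.println("ready has data: " + awaitRemoteRead(ready, 500).isPresent());
        System.out.println("slow has data: " + awaitRemoteRead(slow, 50).isPresent());
    }
}
```

Raising the timeout in this picture gives a slow read more time to complete instead of being abandoned and retried, which is the trade-off the new configuration exposes.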

Why does the configuration need to be dynamic?

By maintaining the flexibility to update this configuration dynamically, operators have more control over when to increase the timeout in case of any remote storage degradation. If there is enough remote data (max.partition.fetch.bytes) to respond, the remote-fetch request will be returned immediately.

Compatibility, Deprecation, and Migration Plan

...
