Status

Current state: Under Discussion

Discussion thread: here

JIRA: KAFKA-15265

...

The Remote Log Manager (RLM) is one of the critical components in Kafka that helps support tiered storage. One of its responsibilities is to periodically upload rolled-over log segments from the local disk to the remote storage. It discovers the log segments eligible for upload and then uploads them. The RLM creates one such task (RLMTask) for each topic partition managed by the broker and schedules them to run at a fixed frequency (every 30 seconds) using a ScheduledThreadPoolExecutor.

When an RLMTask identifies a list of segments to upload, it tries to upload all segments in the same run. As a result, if there is a large number of eligible segments, the RLMTask processes them as quickly as possible. Since multiple such tasks run concurrently within the thread pool executor, they end up consuming significant CPU, thereby affecting producer latencies. This phenomenon is often noticed when we enable tiered storage for existing topics on a cluster.

...

Similarly, the RLM also plays a role in serving read requests for log segments residing in the remote storage. When it receives a remote read request, the RLM submits an async read task to its internal thread pool executor. The async task reads data from the remote storage, and that data is then sent back in the fetch response.

A large number of read requests requesting data residing on the remote storage could cause degradation of the remote storage. The large number of read requests may also cause over-utilization of CPU on the broker. This may happen when the majority of the consumers start reading from the earliest offset of their respective Kafka topics. To prevent such degradation, we should have a mechanism to control the rate at which log segments are read from the remote storage.

Goals

The administrator should be able to configure an "upper bound" on the rate at which log segments are uploaded to the remote storage.
The administrator should be able to configure an "upper bound" on the rate at which log segments are read from the remote storage.

...

We add a new component, RLM WriteQuotaManager, to manage quotas for remote writes. It is similar to other existing QuotaManagers (e.g. ClientQuotaManager). It can be configured with the desired write quota. It keeps track of the current usage and can be used to check whether the quota is exhausted. We will use it to record the segment size when a segment is uploaded to the remote storage, so that it can track the rate of bytes uploaded. We can use it to check whether the current upload rate is within the quota before uploading new segments to the remote storage.

...

61 buckets of size 1 second (default). There are 60 whole buckets, and one additional bucket tracks usage for the current window.
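
To make the bucketed window concrete, here is a minimal sketch of a rate tracker built on fixed-size time buckets. The class name WindowedQuotaTracker, its methods, and the explicit nowMs parameters are hypothetical and exist only for illustration; this is not Kafka's quota implementation. As a simplification, the sketch always divides by the full window length, which is an assumption of this example.

```java
import java.util.Arrays;

// Hypothetical sketch for illustration; not Kafka's actual quota manager.
final class WindowedQuotaTracker {
    private final double quotaBytesPerSec; // configured upper bound
    private final int numBuckets;          // e.g. 61: 60 whole buckets + 1 current
    private final long bucketSizeMs;       // e.g. 1000 ms
    private final long[] bucketBytes;      // bytes recorded in each slot
    private final long[] bucketId;         // absolute bucket number held by each slot

    WindowedQuotaTracker(double quotaBytesPerSec, int numBuckets, long bucketSizeMs) {
        this.quotaBytesPerSec = quotaBytesPerSec;
        this.numBuckets = numBuckets;
        this.bucketSizeMs = bucketSizeMs;
        this.bucketBytes = new long[numBuckets];
        this.bucketId = new long[numBuckets];
        Arrays.fill(bucketId, -1L); // no slot holds real data yet
    }

    // Record bytes transferred at time nowMs.
    synchronized void record(long bytes, long nowMs) {
        long id = nowMs / bucketSizeMs;
        int slot = (int) (id % numBuckets);
        if (bucketId[slot] != id) { // slot still holds an expired bucket: reuse it
            bucketId[slot] = id;
            bucketBytes[slot] = 0L;
        }
        bucketBytes[slot] += bytes;
    }

    // Average rate in bytes/sec over the whole window ending at nowMs.
    synchronized double currentRate(long nowMs) {
        long newest = nowMs / bucketSizeMs;
        long oldest = newest - numBuckets + 1;
        long total = 0L;
        for (int i = 0; i < numBuckets; i++) {
            if (bucketId[i] >= oldest && bucketId[i] <= newest) {
                total += bucketBytes[i];
            }
        }
        double windowSec = numBuckets * bucketSizeMs / 1000.0;
        return total / windowSec;
    }

    // True when the observed rate is strictly above the configured quota.
    synchronized boolean isQuotaExceeded(long nowMs) {
        return currentRate(nowMs) > quotaBytesPerSec;
    }
}
```

Passing the timestamp explicitly keeps the sketch deterministic; a real component would more likely read the time from a clock abstraction.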

QuotaManager supports the following two operations:

...

In each run, RLMTask identifies all the log segments for the given topic partition that are eligible for upload. The task then attempts to upload the segments to remote storage in a loop.
Before uploading the next log segment, it checks whether the write quota has already been violated. If the quota has not been violated, it first uploads the log segment to the remote storage and then records the number of bytes uploaded with the WriteQuotaManager. Thus, the quota manager always sees an up-to-date view of quota utilization.
If the quota is already exhausted, the RLMTask waits until the write rate falls below the specified quota. Once the write rate falls, the task uploads the segment, records the number of bytes uploaded with the WriteQuotaManager, and moves on to the next segment.
This approach may cause starvation for low-throughput topics, since the RLMTask for a high-throughput topic may not give up the thread (the task waits until the write rate falls below the quota). Starvation may not be a problem, because the RLM is still running at maximum capacity to offload segments to the remote storage, thus preventing the local disk usage from growing. However, if fairness is desirable, the RLMTask should exit if it runs into a 'quota exceeded' error after it has uploaded at least one segment in its run. This gives other RLMTasks a chance to be executed. They may run into the same error but will run once the quota utilization subsides.
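
The upload loop described above can be sketched as follows. The interfaces WriteQuota and SegmentUploader, the class ThrottledUploadLoop, and the polling interval are all hypothetical names introduced for this illustration, and waiting by polling is an assumption of the sketch rather than something the proposal specifies. The exitWhenThrottled flag models the optional fairness behaviour: when set, the task gives up the thread once it has uploaded at least one segment and then finds the quota exhausted.

```java
import java.util.List;

// Hypothetical interfaces for illustration; not Kafka's actual API.
interface WriteQuota {
    boolean isQuotaExceeded();
    void record(long bytes);
}

interface SegmentUploader {
    void upload(long segmentSizeBytes);
}

final class ThrottledUploadLoop {
    private final WriteQuota quota;
    private final SegmentUploader uploader;
    private final boolean exitWhenThrottled; // fairness variant
    private final long pollIntervalMs;       // how often to re-check the quota while waiting

    ThrottledUploadLoop(WriteQuota quota, SegmentUploader uploader,
                        boolean exitWhenThrottled, long pollIntervalMs) {
        this.quota = quota;
        this.uploader = uploader;
        this.exitWhenThrottled = exitWhenThrottled;
        this.pollIntervalMs = pollIntervalMs;
    }

    // Uploads the given segments in order; returns how many were uploaded in this run.
    int run(List<Long> segmentSizes) {
        int uploaded = 0;
        for (long size : segmentSizes) {
            while (quota.isQuotaExceeded()) {
                if (exitWhenThrottled && uploaded > 0) {
                    return uploaded; // give other tasks a chance to run
                }
                try {
                    Thread.sleep(pollIntervalMs); // wait for the write rate to fall
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                    return uploaded;
                }
            }
            uploader.upload(size); // upload first ...
            quota.record(size);    // ... then record the bytes with the quota manager
            uploaded++;
        }
        return uploaded;
    }
}
```

Because the quota is checked before each upload and bytes are recorded only afterwards, a single large segment can push the observed rate above the quota; the next check then holds back further uploads until the rate subsides.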

An RLMTask is also responsible for handling expired remote log segments for the associated topic partition. It cleans up remote log segments that have expired and are no longer required. With the change in behavior to block when the remote write quota is exhausted, clean-up of expired segments may stall if the segment upload rate across the cluster is high, causing excessive throttling. To solve this problem, we can break down the RLMTask into two smaller tasks: one for segment upload and the other for handling expired segments. The two tasks shall be executed in separate ThreadPoolExecutors. This removes the impact of segment upload throttling on segment expiration.
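
A minimal sketch of the split, assuming one scheduled pool per task type. The class SplitRlmScheduling and its method names are hypothetical, and the use of scheduleWithFixedDelay is a choice made for this example only.

```java
import java.util.concurrent.ScheduledThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

// Hypothetical sketch for illustration; not the actual RemoteLogManager code.
final class SplitRlmScheduling {
    private final ScheduledThreadPoolExecutor uploadPool;
    private final ScheduledThreadPoolExecutor expirationPool;

    SplitRlmScheduling(int uploadThreads, int expirationThreads) {
        this.uploadPool = new ScheduledThreadPoolExecutor(uploadThreads);
        this.expirationPool = new ScheduledThreadPoolExecutor(expirationThreads);
    }

    // Schedules the two halves of the former RLMTask on separate pools, so an
    // upload task blocked on the write quota cannot hold up expiration work.
    void schedule(Runnable uploadTask, Runnable expirationTask, long periodMs) {
        uploadPool.scheduleWithFixedDelay(uploadTask, 0L, periodMs, TimeUnit.MILLISECONDS);
        expirationPool.scheduleWithFixedDelay(expirationTask, 0L, periodMs, TimeUnit.MILLISECONDS);
    }

    void shutdown() {
        uploadPool.shutdownNow();
        expirationPool.shutdownNow();
    }
}
```

Even with a single thread in each pool, an upload task that is waiting on the quota leaves the expiration pool free to run.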

...

We can decrease the pool size for the ThreadPoolExecutor. This would prevent too many concurrent uploads from running, thus preventing over-utilization of CPU.

Pros:

- Simplicity. No new feature needs to be built. We will only need to adjust the thread pool size which can be done dynamically.

Cons:

- This approach relies on reducing concurrency to achieve lower upload speed. In practice, we would know the maximum upload rate our remote storage can support. It is not straightforward to translate this into the concurrency of upload threads, and doing so requires a trial-and-error approach.
- Reducing concurrency can also introduce unfairness while uploading segments. When an RLMTask runs for a topic partition, it uploads all the eligible log segments in the same run, preventing uploads for other topic partitions. This can cause delays and lag buildup for other topic partitions, particularly when the thread pool size is small. If some topics have a high message-in rate, the corresponding RLMTasks would end up using all threads in the thread pool, preventing uploads for smaller topics.

We could use the QuotaManager differently. Instead of tracking the global upload rate, we could track the upload rate per thread in the writer thread pool. Each thread records its upload speed with the quota manager and checks for quota violations before uploading the next log segment.

...

- It requires a trial-and-error approach to set the thread pool size appropriately so as not to exceed a certain read rate from the remote storage.
- The setting is dependent on the broker hardware and needs to be tuned accordingly.

We could use the QuotaManager differently. It will be used by the worker threads in the reader thread pool to check whether the read quota has been exceeded. If it isn't exceeded, the read task is processed. Otherwise, the read task must wait. This can be implemented in two ways:
- The read task computes the throttle time using the quota framework and sleeps for that amount of time before executing the request. When the task wakes up, the read rate would have fallen within the specified quota. The drawback of this approach is that even though the threads in the thread pool could be waiting, they would all look busy. This would create confusion for the Kafka administrator.
- To avoid the above problem, instead of waiting, the RLMTask execution can be deferred and scheduled to run with some delay, i.e. the throttle time.
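
To illustrate the second option, the sketch below computes a throttle time and schedules the read with that delay instead of sleeping in a worker thread. The class DeferredRemoteRead and both method names are hypothetical. The formula is one common way to derive a throttle time and is an assumption of this example: if a rate O was observed over a window of length W, then waiting an extra X with no further traffic brings the average down to the quota T when O * W / (W + X) = T, which gives X = (O - T) / T * W.

```java
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.ScheduledFuture;
import java.util.concurrent.TimeUnit;

// Hypothetical sketch for illustration; not the actual RemoteLogManager code.
final class DeferredRemoteRead {
    private DeferredRemoteRead() { }

    // Delay after which the average rate over (window + delay) is back at the quota.
    static long throttleTimeMs(double observedRate, double quota, long windowMs) {
        if (observedRate <= quota) {
            return 0L; // within quota: no delay needed
        }
        return Math.round((observedRate - quota) / quota * windowMs);
    }

    // Defers the read task by the throttle time; no worker thread sleeps meanwhile.
    static ScheduledFuture<?> submit(ScheduledExecutorService pool, Runnable readTask, long throttleMs) {
        return pool.schedule(readTask, throttleMs, TimeUnit.MILLISECONDS);
    }
}
```

For example, an observed rate of 150 bytes/sec against a quota of 100 bytes/sec over a 60-second window yields a delay of 30 seconds.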

However, this approach of delaying the remote read task comes with a drawback: the consumer fetch request is stalled during the waiting period. The fetch request could have been served with data for other partitions in the request that do not need remote data, allowing the consumer to make progress. However, because we block the reader thread, no data can be served while the thread is blocked.

The drawback of the above two approaches is that if the remote read quota has been exhausted, the RLM will keep accepting more read tasks to be executed later. The fetch request corresponding to a read task may have already timed out by the time the read task gets executed. This will lead to a waste of resources for the broker.

We could use the throttle time (computed from the quota framework) to throttle the client instead. The throttle time can be propagated back in the response payload and used to throttle further requests from the client. However, this approach prevents the client from reading any partition, even when it is not requesting data for a partition with remote data.

...
