Status

Current state: Under Discussion

Discussion thread: here

JIRA:   

Please keep the discussion on the mailing list rather than commenting on the wiki (wiki discussions get unwieldy fast).

Motivation

The Remote Log Manager (RLM) is one of the critical components in Kafka that supports tiered storage. One of its responsibilities is to periodically upload rolled-over log segments from the local disk to remote storage. It discovers the log segments eligible for upload and then uploads them. The RLM creates one such task (RLMTask) for each topic partition managed by the broker and schedules the tasks to run at a fixed frequency (every 30 secs) using a ScheduledThreadPoolExecutor.

When an RLMTask identifies a list of segments to upload, it tries to upload all of them in the same run. As a result, if there is a large number of eligible segments, the task processes them as quickly as possible. Since multiple such tasks run concurrently within the thread-pool executor, they can consume significant CPU, thereby affecting producer latencies. This is often noticed when tiered storage is enabled for existing topics on a cluster.

Controlling the rate of segment uploads to remote storage would prevent over-utilization of CPU and cluster degradation. 

Similarly, the RLM also plays a role in serving read requests for log segments residing in the remote storage. When receiving a request for remote read, the RLM submits an async read task to its internal thread pool executor. The async task reads data from the remote storage that is then sent back in the fetch response. 

A large number of read requests requesting data residing on the remote storage could cause degradation of the remote storage. The large number of read requests may also cause over-utilization of CPU on the broker. This may happen when the majority of the consumers start reading from the earliest offset of their respective Kafka topics. To prevent such degradation, we should have a mechanism to control the rate at which log segments are read from the remote storage. 

Goals

Public Interfaces

Broker Configuration Options

New broker properties will be added to configure quotas for both reading from remote storage and writing to remote storage. We will use the existing quota framework to track the current read and write rates.

Configuration for write quotas:

Configuration for read quotas:

Proposed Changes

This is an enhancement to the existing mechanism for reading/writing log segments from/to the remote storage. In the absence of a throttling mechanism, the remote storage can degrade when many fetch requests request remote data or when a large volume of writes goes to the remote storage. Without such a mechanism, broker CPU can also become overly busy, degrading the Kafka cluster and impacting producer latencies.

Throttling Data Writes

We add a new component, the RLM WriteQuotaManager, to manage quotas for remote writes. It is similar to other existing QuotaManagers (e.g., ClientQuotaManager). It can be configured with the desired write quota. It keeps track of the current usage and can be used to check if the quota is exhausted. We will use it to record the segment size when a segment is uploaded to the remote storage, so that it can track the rate of bytes uploaded. We can use it to check whether the current upload rate is within the quota before uploading new segments.

RLM WriteQuotaManager

The WriteQuotaManager allows users to specify the following properties:

The QuotaManager utilizes a rolling window to track the bytes uploaded every second. There is one time bucket per sample, each storing the total bytes uploaded during that bucket's interval.

[T1] [T2] ... [T60] [T61]

61 buckets of size 1 second (default): 60 whole buckets, plus one additional bucket to track usage for the current, in-progress window.
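The rolling-window bookkeeping described above can be sketched as follows. This is a minimal, illustrative implementation, not Kafka's actual quota code; the class and method names are hypothetical. It keeps 61 one-second buckets, where the bucket for the current second accumulates in-progress usage, and computes the average byte rate over the window.

```java
// Minimal sketch (not Kafka's implementation) of rolling-window rate tracking:
// 61 one-second buckets; the newest bucket accumulates the in-progress second.
class RollingWindowQuota {
    private final long bucketSizeMs;
    private final long[] buckets;      // bytes recorded per time bucket
    private final long[] bucketStart;  // start timestamp of each bucket

    RollingWindowQuota(int numBuckets, long bucketSizeMs) {
        this.buckets = new long[numBuckets];
        this.bucketStart = new long[numBuckets];
        this.bucketSizeMs = bucketSizeMs;
    }

    private int index(long nowMs) {
        return (int) ((nowMs / bucketSizeMs) % buckets.length);
    }

    // Record bytes uploaded at time nowMs (e.g. after a segment upload).
    synchronized void record(long nowMs, long bytes) {
        int i = index(nowMs);
        long start = (nowMs / bucketSizeMs) * bucketSizeMs;
        if (bucketStart[i] != start) { // bucket slot has rolled over; reset it
            bucketStart[i] = start;
            buckets[i] = 0;
        }
        buckets[i] += bytes;
    }

    // Average bytes/sec over the buckets falling inside the window,
    // used to decide whether the configured quota is exhausted.
    synchronized double ratePerSecond(long nowMs) {
        long windowStart = nowMs - (long) buckets.length * bucketSizeMs;
        long total = 0;
        for (int i = 0; i < buckets.length; i++) {
            if (bucketStart[i] > windowStart) total += buckets[i];
        }
        return total / (buckets.length * (bucketSizeMs / 1000.0));
    }
}
```

A WriteQuotaManager configured with a bytes-per-second quota would compare `ratePerSecond(now)` against the quota before permitting the next segment upload, and call `record` after each upload completes.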


QuotaManager supports the following two operations:

The WriteQuotaManager is shared by all the threads of the ScheduledThreadPoolExecutor that is used to execute RLMTasks and keeps track of the total segment upload rate on the broker. The RLMTask will do the following to ensure the remote write quota specified for the cluster is respected:

An RLMTask is also responsible for handling expired remote log segments for the associated topic partition: it cleans up remote log segments that have expired and are no longer required. With the change in behavior to block when the remote write quota is exhausted, clean-up of expired segments may be stalled whenever a high cluster-wide upload rate causes excessive throttling. To solve this problem, we can break the RLMTask into two smaller tasks - one for segment upload and the other for handling expired segments. The two tasks shall be executed in separate ThreadPoolExecutors, removing the impact of segment-upload throttling on segment expiration.
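The split described above could be wired up roughly as follows. This is an illustrative sketch, not the actual RLM code; pool sizes, method names, and the counters (added here only so the behavior is observable) are all hypothetical. The point is that upload and expiration run on separate scheduled executors, so a blocked upload task cannot stall expiration.

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

// Sketch: per-partition upload and expiration work scheduled on separate
// executors, so write-quota throttling of uploads cannot stall cleanup.
class RlmScheduling {
    private final ScheduledExecutorService uploadPool = Executors.newScheduledThreadPool(4);
    private final ScheduledExecutorService expirationPool = Executors.newScheduledThreadPool(2);
    final AtomicInteger uploadRuns = new AtomicInteger();      // for observability only
    final AtomicInteger expirationRuns = new AtomicInteger();  // for observability only

    void schedule(String topicPartition) {
        // Upload task: checks the write quota and blocks/skips while exhausted.
        uploadPool.scheduleWithFixedDelay(
            () -> copyEligibleSegments(topicPartition), 0, 30, TimeUnit.SECONDS);
        // Expiration task: unaffected by upload throttling.
        expirationPool.scheduleWithFixedDelay(
            () -> cleanExpiredSegments(topicPartition), 0, 30, TimeUnit.SECONDS);
    }

    void shutdown() {
        uploadPool.shutdownNow();
        expirationPool.shutdownNow();
    }

    private void copyEligibleSegments(String tp) { uploadRuns.incrementAndGet(); }
    private void cleanExpiredSegments(String tp) { expirationRuns.incrementAndGet(); }
}
```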

Throttling Data Reads

We shall add a new component, the RLM ReadQuotaManager, to manage quotas for remote reads. It is similar to the WriteQuotaManager discussed in the previous section, but tracks reads instead of writes. It can be configured with the permitted read quota, used to record the quota consumed after a data read, and queried to check whether the quota has already been exceeded.

RLM ReadQuotaManager

The ReadQuotaManager allows users to specify the following properties:

When a fetch request requires remote data for a topic partition, the ReplicaManager will query the ReadQuotaManager to check whether the quota for reading remote data has already been exhausted. If the quota has not been exhausted, the ReplicaManager will continue to return the RemoteStorageFetchInfo that can be used to read the requested data from the remote storage. Once data from the remote storage is read, RLM records the number of bytes read with the ReadQuotaManager. Hence, the quota manager always has an updated view of quota utilization.

However, if the quota has already been exhausted, the ReplicaManager returns an empty RemoteStorageFetchInfo. As a result, the fetch request can still read data for the other topic partitions, but there will be no data for the topic partition requiring remote data as long as the read quota has been exhausted. Once the read rate falls below the permitted quota, fetch requests can resume serving remote data.
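The read-path decision above can be sketched as a simple gate. The names here are illustrative (the real interaction is between ReplicaManager, RLM, and RemoteStorageFetchInfo); the sketch only shows the control flow: when the quota is exhausted, the partition contributes no remote data, otherwise the fetch proceeds and the bytes read are recorded afterwards.

```java
// Sketch of the read-path gating described above (names hypothetical).
class RemoteFetchGate {
    interface ReadQuota {
        boolean isExceeded();        // has the remote read quota been exhausted?
        void record(long bytesRead); // record bytes after a remote read completes
    }

    // Returns the fetch info to execute, or null (an "empty" fetch info)
    // when the partition is throttled for this request.
    static String maybeRemoteFetch(ReadQuota quota, String remoteFetchInfo) {
        if (quota.isExceeded()) {
            return null; // no remote data for this partition while throttled
        }
        // Caller performs the remote read, then calls quota.record(bytesRead)
        // so the quota manager keeps an updated view of utilization.
        return remoteFetchInfo;
    }
}
```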

It is worth noting that a rogue client could try to read large amounts of data from remote storage, exhausting the broker-level remote read quota. Once the quota is exhausted, all remote reads are throttled, which also prevents well-behaved clients from reading remote data. To address this, we can set client-level fetch quotas (lower than the broker-level remote read quota). This prevents the rogue client from reading large amounts of remote data, which in turn prevents the throttling of remote reads for other clients.
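For illustration, a client-level fetch quota can be set with Kafka's existing quota mechanism via kafka-configs.sh. Note this is an assumption about the mitigation, not part of this proposal: consumer_byte_rate caps the client's overall fetch throughput (local and remote), not remote reads specifically, and the client name and rate below are examples.

```shell
# Cap the fetch throughput of a specific (example) client to 10 MB/s,
# below the broker-level remote read quota, using the existing client quotas.
bin/kafka-configs.sh --bootstrap-server localhost:9092 --alter \
  --add-config 'consumer_byte_rate=10485760' \
  --entity-type clients --entity-name example-client
```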

Test Plan

Unit tests for both read and write quotas

Rejected Alternatives

Throttling Data Writes

In this section, we discuss some other approaches for throttling remote writes.

Pros:

Cons:

Cons:

The alternative architectures are rejected for the disadvantages discussed above.

Throttling Data Reads

In this section, we discuss some other approaches for throttling remote reads.

Cons:

The drawback of the above two approaches is that, if the remote read quota has been exhausted, the RLM keeps accepting more read tasks to be executed later. The fetch request corresponding to a read task may have already timed out by the time the task executes, wasting broker resources.

The alternative architectures are rejected for the disadvantages discussed above.