Status

Current state: Under Discussion

...

Please keep the discussion on the mailing list rather than commenting on the wiki (wiki discussions get unwieldy fast).

Motivation

Kafka can be used in a stream processing pipeline to pass intermediate data between processing jobs. The intermediate data generated by stream processing jobs can take up a large amount of disk space in Kafka. It is important that we can delete this data soon after it is consumed by downstream applications; otherwise we have to pay a significant cost to purchase disks for Kafka clusters to keep this data.

However, Kafka doesn’t provide any mechanism to delete data after it is consumed by downstream jobs. It provides only time-based and size-based log retention policies, both of which are agnostic to consumer behavior. If we set a small time-based log retention for intermediate data, the data may be deleted even before it is consumed by downstream jobs. If we set a large time-based log retention, the data will take up a large amount of disk space for a long time. Neither solution is good for Kafka users. To address this problem, we propose to add a new admin API which can be called by the user to purge data that is no longer needed.


Public Interfaces

1) Java API

- Add the following API in Admin Client. This API returns a future object whose result will be available within RequestTimeoutMs, which is configured when the user constructs the AdminClient.

...

PurgeDataResult(low_watermark: long, error: Exception)

2) Protocol

Create PurgeRequest

 

PurgeRequest => topics timeout
  topics => [PurgeRequestTopic]
  timeout => int32
 
PurgeRequestTopic => topic partitions
  topic => str
  partitions => [PurgeRequestPartition]
 
PurgeRequestPartition => partition offset
  partition => int32
  offset => int64
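
To make the layout above concrete, here is a minimal sketch of how a client might serialize a PurgeRequest body. It is illustrative only: the class and method names are hypothetical, and it assumes the usual Kafka wire conventions (big-endian integers, strings prefixed by an int16 length, arrays prefixed by an int32 element count) and that timeout is written after topics, as the field listing suggests.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.Map;

// Hypothetical helper, not part of the proposal: encodes only the request body.
final class PurgeRequestEncoder {

    // offsetsByTopic: topic name -> (partition id -> offset to purge before)
    static ByteBuffer encode(Map<String, Map<Integer, Long>> offsetsByTopic, int timeoutMs) {
        // First pass: compute the exact body size.
        int size = 4; // int32 element count of the topics array
        for (Map.Entry<String, Map<Integer, Long>> topic : offsetsByTopic.entrySet()) {
            int nameLength = topic.getKey().getBytes(StandardCharsets.UTF_8).length;
            size += 2 + nameLength;                   // int16 length + topic name bytes
            size += 4;                                // int32 element count of partitions
            size += topic.getValue().size() * (4 + 8); // partition (int32) + offset (int64)
        }
        size += 4; // timeout (int32)

        // Second pass: write the fields in schema order.
        ByteBuffer buf = ByteBuffer.allocate(size); // big-endian by default
        buf.putInt(offsetsByTopic.size());
        for (Map.Entry<String, Map<Integer, Long>> topic : offsetsByTopic.entrySet()) {
            byte[] name = topic.getKey().getBytes(StandardCharsets.UTF_8);
            buf.putShort((short) name.length);
            buf.put(name);
            buf.putInt(topic.getValue().size());
            for (Map.Entry<Integer, Long> partition : topic.getValue().entrySet()) {
                buf.putInt(partition.getKey());
                buf.putLong(partition.getValue());
            }
        }
        buf.putInt(timeoutMs);
        buf.flip(); // make the buffer readable from the start
        return buf;
    }
}
```

A request for one topic with one partition therefore occupies 4 + (2 + name length + 4 + 12) + 4 bytes.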

 

Create PurgeResponse

...

FetchResponsePartitionHeader => partition error_code high_watermark low_watermark
  partition => int32
  error_code => int16
  high_watermark => int64
  low_watermark => int64  <-- NEW. This is the low_watermark of this partition on the leader.

Proposed Changes

The idea is to add new APIs in Admin Client (see KIP-4) that can be called by the user to purge data that is no longer needed. A new request and response need to be added to communicate this operation between client and broker. Given the impact of this API on the data, the API should be protected by Kafka’s authorization mechanism described in KIP-11 to prevent malicious or unintended data deletion. Furthermore, we adopt the soft delete approach because it is expensive to purge data in the middle of a segment. Segments whose maximum offset < offset-to-purge can be deleted safely. Brokers can increment the low_watermark of a partition above offset-to-purge so that data with offset < offset-to-purge will not be exposed to consumers even if it is still on disk. The low_watermark will be checkpointed periodically, in the same way as high_watermark, so that it is persistent.

...

Please refer to the Public Interfaces section for our design of the API, request and response. In this section we describe how the broker maintains the low watermark per partition, how the client communicates with the broker to purge old data, and how this API can be protected by authorization.

1) Interaction between user application and brokers

1) The user application determines the maximum offset of data that can be purged per partition. This information is provided to purgeDataBefore() as Map<TopicPartition, Long>. If the user application only knows the timestamp of data that can be purged per partition, it can use the offsetsForTimes() API to convert the cutoff timestamp into a cutoff offset per partition before providing the map to the purgeDataBefore() API.

...

9) If the admin client does not receive a PurgeResponse from a broker within RequestTimeoutMs, the PurgeDataResult of the partitions on that broker will be PurgeDataResult(low_watermark = -1, error = TimeoutException). Otherwise, the PurgeDataResult of each partition will be constructed using the low_watermark and the error of the corresponding partition, which are read from the PurgeResponse received from its leader broker. purgeDataBefore(...).get() will unblock and return Map<TopicPartition, PurgeDataResult> when the PurgeDataResult of all partitions specified in the offsetForPartition param are available.
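
The result-assembly rule in the last step can be sketched as follows. This is a simplified illustration, not the proposed implementation: partitions are identified by plain strings instead of TopicPartition, the Result class stands in for PurgeDataResult, a null error is assumed to mean success, and a broker that did not answer within RequestTimeoutMs is modelled as a missing entry in the responses map. All class and method names are hypothetical.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.TimeoutException;

// Hypothetical sketch of how the admin client could build the per-partition results.
final class PurgeResultAssembly {

    // Stand-in for PurgeDataResult(low_watermark, error); null error = success (assumption).
    static final class Result {
        final long lowWatermark;
        final Exception error;

        Result(long lowWatermark, Exception error) {
            this.lowWatermark = lowWatermark;
            this.error = error;
        }
    }

    // partitionsByBroker: leader broker id -> partitions sent to that broker.
    // responses: broker id -> per-partition results read from its PurgeResponse;
    //            a broker with no entry did not respond in time.
    static Map<String, Result> assemble(Map<Integer, List<String>> partitionsByBroker,
                                        Map<Integer, Map<String, Result>> responses) {
        Map<String, Result> results = new HashMap<>();
        for (Map.Entry<Integer, List<String>> entry : partitionsByBroker.entrySet()) {
            Map<String, Result> fromBroker = responses.get(entry.getKey());
            for (String partition : entry.getValue()) {
                if (fromBroker == null) {
                    // No PurgeResponse within the timeout: low_watermark = -1 plus TimeoutException.
                    results.put(partition, new Result(-1L,
                            new TimeoutException("no PurgeResponse from broker " + entry.getKey())));
                } else {
                    // Otherwise use the low_watermark and error reported by the leader.
                    results.put(partition, fromBroker.get(partition));
                }
            }
        }
        return results;
    }
}
```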

2) Routine operation in the broker

- Broker will delete those segments whose largest offset < low_watermark.

...

- Broker will checkpoint low_watermark for all replicas periodically, in the same way it checkpoints high_watermark of replicas.
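
As a rough illustration of the soft delete behaviour, the sketch below selects the segments that are safe to remove for a given low_watermark and decides whether an offset is still exposed to consumers. The Segment class and all method names are hypothetical, and treating low_watermark as the first offset that remains visible is an assumption made for this example.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical model of the broker-side rule; not the actual log implementation.
final class LowWatermarkSketch {

    static final class Segment {
        final long baseOffset;
        final long largestOffset;

        Segment(long baseOffset, long largestOffset) {
            this.baseOffset = baseOffset;
            this.largestOffset = largestOffset;
        }
    }

    // Segments whose largest offset is strictly below low_watermark can be deleted whole.
    static List<Segment> deletableSegments(List<Segment> segments, long lowWatermark) {
        List<Segment> deletable = new ArrayList<>();
        for (Segment segment : segments) {
            if (segment.largestOffset < lowWatermark) {
                deletable.add(segment);
            }
        }
        return deletable;
    }

    // Data below low_watermark is hidden from consumers even if it is still on disk
    // (assumption: low_watermark itself is the first visible offset).
    static boolean isExposed(long offset, long lowWatermark) {
        return offset >= lowWatermark;
    }
}
```

With segments covering offsets 0-99, 100-199 and 200-299 and a low_watermark of 150, only the first segment is deleted; offsets 100-149 stay on disk in the second segment but are no longer served.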

3) API Authorization

Given the potential damage that can be caused if this API is used by mistake, it is important that we limit its usage to authorized users only. For this, we can take advantage of the existing authorization framework implemented in KIP-11. purgeDataBefore() will have the same authorization setting as deleteTopic(). Its operation type is DELETE and its resource type is TOPIC.

...

The KIP changes the inter-broker protocol. Therefore the migration requires two rolling bounces. In the first rolling bounce we will deploy the new code, but brokers will still communicate using the existing protocol. In the second rolling bounce we will change the config so that brokers start to communicate with each other using the new protocol.

Test Plan

- Unit tests to validate that all the individual components work as expected.
- Integration tests to ensure that the feature works correctly end-to-end. 

Rejected Alternatives


- Using committed offset instead of an extra API to trigger data purge operation. Purge data if its offset is smaller than committed offset of all consumer groups that need to consume from this partition.
The advantage of this approach is that it doesn't need coordination of user applications to determine when purgeDataBefore() can be called, which can be hard to do if there are multiple consumer groups interested in consuming this topic. The disadvantage of this approach is that it is less flexible than the purgeDataBefore() API because it re-uses the committed offset to trigger the data purge operation. Also, it would be more complex to implement in the broker than the purgeDataBefore() API. An alternative approach is to implement this logic by running an external service which calls the purgeDataBefore() API based on the committed offsets of consumer groups.

- Leader sends PurgeResponse without waiting for low_watermark of all followers to increase above the cutoff offset
This approach would be simpler to implement since it doesn't require DelayedOperationPurgatory for PurgeRequest. The leader can reply to PurgeRequest faster since it doesn't need to wait for followers. However, the purgeDataBefore() API would provide a weaker guarantee in this approach because the data may not be deleted if the leader crashes right after it sends PurgeResponse. It is useful to know for sure whether the data has been deleted, e.g. when the user wants to delete problematic data from upstream so that downstream applications can re-consume clean data, or when the user wants to delete sensitive data.

- Purge data on only one partition by each call to purgeDataBefore(...)
This approach would make the implementation of this API simpler, and would be consistent with the existing seek(TopicPartition partition, long offset) API. The downside of this approach is that it either increases the time to purge data if the number of partitions is large, or it requires the user to take extra effort to parallelize the purgeDataBefore(...) calls. This API may take longer than seek() for a given partition since the broker needs to wait for the followers' action before responding to PurgeRequest. Thus we allow the user to specify a map of partitions to make this API easy to use.
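
For the first alternative above, the external-service variant would need to derive a purge cutoff per partition from the committed offsets of every consumer group that reads the topic. A minimal sketch of that computation is shown below. It is illustrative only: partitions are plain strings, the committed offsets are passed in as maps rather than fetched from Kafka, every listed group is assumed to consume every partition, and the class and method names are hypothetical.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper for an external purge service; not part of the proposal.
final class CommittedOffsetCutoff {

    // committedByGroup: one map per consumer group, partition -> committed offset.
    // Returns partition -> smallest committed offset across all groups. A partition is
    // omitted if any group has no committed offset for it, since nothing is known to be
    // safe to purge in that case.
    static Map<String, Long> purgeCutoffs(List<Map<String, Long>> committedByGroup) {
        Map<String, Long> cutoffs = new HashMap<>();
        if (committedByGroup.isEmpty()) {
            return cutoffs;
        }
        for (String partition : committedByGroup.get(0).keySet()) {
            long minCommitted = Long.MAX_VALUE;
            boolean allGroupsCommitted = true;
            for (Map<String, Long> group : committedByGroup) {
                Long committed = group.get(partition);
                if (committed == null) {
                    allGroupsCommitted = false;
                    break;
                }
                minCommitted = Math.min(minCommitted, committed);
            }
            if (allGroupsCommitted) {
                cutoffs.put(partition, minCommitted);
            }
        }
        return cutoffs;
    }
}
```

The resulting map has the shape that purgeDataBefore() expects once the string keys are replaced by TopicPartition objects.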