...

We create one more checkpoint file, named "log-begin-offset-checkpoint", in every log directory. The checkpoint file will have the same format as existing checkpoint files (e.g. replication-offset-checkpoint), which map a TopicPartition to a Long.
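
For illustration, assuming the new file follows the same layout as replication-offset-checkpoint (a format-version line, an entry-count line, then one "topic partition offset" line per entry), a log-begin-offset-checkpoint file might look like this:

Code Block
0
2
my-topic 0 4500
my-topic 1 3712

Here the first line is the checkpoint format version, the second line is the number of entries, and each remaining line maps a TopicPartition to its log begin offset.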

4) Script

Add kafka-purge-data.sh that allows users to purge data from the command line. The script accepts the following arguments:

- bootstrap-server. This argument is required. It is used to identify the Kafka cluster.
- command-config. This is an optional property file containing configs to be passed to the Admin Client.
- purge-offset-json-file. This argument is required. It allows the user to specify the offsets of the partitions to be purged. The file has the following format:

Code Block
{
  "version" : int,
  "partitions" : [
    {
      "topic": str,
      "partition": int,
      "offset": long
    },
    ...
  ]
}
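
For example, a hypothetical invocation of the script (the host, port, and file name below are placeholders, and the flag spelling assumes the usual Kafka tool convention of prefixing each argument with "--") could look like:

Code Block
bin/kafka-purge-data.sh --bootstrap-server localhost:9092 --purge-offset-json-file purge-offsets.json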


Proposed Changes

The idea is to add new APIs to the Admin Client (see KIP-4) that users can call to purge data that is no longer needed. A new request and response need to be added to communicate this request between client and broker. Given the impact of this API on the data, the API should be protected by Kafka's authorization mechanism described in KIP-11 to prevent malicious or unintended data deletion. Furthermore, we adopt a soft-delete approach because it is expensive to purge data in the middle of a segment. Segments whose maximum offset is smaller than offset-to-purge can be deleted safely. For the remaining data, brokers can increment the log_start_offset of a partition to offset-to-purge so that data with offset < offset-to-purge is not exposed to consumers even if it is still on disk. The log_start_offset will be checkpointed periodically, similar to the high_watermark, so that it survives broker restarts.
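
To make the soft-delete semantics concrete, below is a minimal Java sketch of the broker-side logic described above. The names (PartitionLog, LogSegment, purgeDataBefore) are hypothetical illustrations, not the actual broker code, which lives in the log layer and differs in detail:

Code Block
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

class PartitionLog {
    // Hypothetical stand-in for a log segment; only the fields needed here.
    static class LogSegment {
        final long baseOffset;
        final long maxOffset; // largest offset contained in this segment

        LogSegment(long baseOffset, long maxOffset) {
            this.baseOffset = baseOffset;
            this.maxOffset = maxOffset;
        }
    }

    private final List<LogSegment> segments; // ordered by baseOffset
    private long logStartOffset;

    PartitionLog(List<LogSegment> segments, long logStartOffset) {
        this.segments = new ArrayList<>(segments);
        this.logStartOffset = logStartOffset;
    }

    // Soft delete: whole segments below offset-to-purge are removed outright;
    // for the rest, only log_start_offset is advanced, so the data is hidden
    // from consumers even though it may still be on disk.
    void purgeDataBefore(long offsetToPurge) {
        Iterator<LogSegment> it = segments.iterator();
        while (it.hasNext()) {
            if (it.next().maxOffset < offsetToPurge) {
                it.remove(); // safe: every offset in this segment is < offset-to-purge
            }
        }
        logStartOffset = Math.max(logStartOffset, offsetToPurge);
    }

    long logStartOffset() {
        return logStartOffset;
    }
}

Advancing log_start_offset rather than rewriting a segment is the key design choice: the broker never pays the cost of splicing records out of the middle of a segment, and the hidden prefix is reclaimed later when the whole segment falls below log_start_offset.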

...