Status

Current state: Under Discussion

Discussion thread:

JIRA:

Please keep the discussion on the mailing list rather than commenting on the wiki (wiki discussions get unwieldy fast).

Motivation

While Kafka Connect's primary use case is probably to move data in and out of Kafka, it is also useful for building pipelines where Kafka is just incidental. We can move data between Sources and Sinks while leveraging Kafka's durability, even when the resulting Kafka topics are not otherwise used. In such scenarios, it doesn't make sense to keep data around in Kafka longer than the pipeline needs it.

In order to reduce the storage space required for these pipelines, we could use a shorter retention period. However, there is no way to guarantee that SinkConnectors can keep up with the retention period, so records may be deleted before they are delivered. This yields a problematic and unnecessary trade-off between storage space and durability.

Instead, the Connect runtime could automatically delete records it knows it has already processed. This would be challenging for individual SinkConnector implementations to handle, since multiple SinkConnectors could be reading the same topics, and we don't want one SinkConnector to truncate a topic that another SinkConnector is still reading. Since the Connect runtime knows the state of all Connectors, it is safe for the runtime to delete records once they are read by all SinkConnectors.

Public Interfaces

We'll add a pair of new Worker configuration properties:
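The property table was not reproduced here. As a hedged sketch, a Worker configuration enabling the feature might look like the following; delete.committed is the toggle described under Proposed Changes (its default is assumed to be false, since the feature must be explicitly enabled), and the name of the second property is not specified in this excerpt:

```properties
# Hypothetical worker configuration fragment.
# delete.committed: assumed boolean toggle for automatic record deletion.
delete.committed=true
# Standard Connect worker setting that influences how often truncation runs.
offset.flush.interval.ms=60000
```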

Proposed Changes

When delete.committed is enabled, each Worker will asynchronously truncate topic-partitions after they are flushed. To avoid deleting records prematurely, the following conditions must be met:

The frequency at which Workers truncate topics will be influenced primarily by offset.flush.interval.ms, but deletion is not guaranteed to occur at this frequency.
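The safety condition above can be sketched as follows: for each flushed topic-partition, a Worker may only delete records below the minimum offset committed across all SinkConnectors consuming that partition. The class and method names here are hypothetical illustrations, not the KIP's actual implementation:

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of the truncation decision. Records may only be
// deleted below the minimum offset committed by every sink connector
// consuming a partition; otherwise a lagging connector could lose
// records it has not yet read.
public class TruncationPlanner {

    /**
     * @param committedOffsets partition -> (connector name -> committed offset)
     * @return partition -> offset below which records are safe to delete
     */
    public static Map<String, Long> planTruncation(
            Map<String, Map<String, Long>> committedOffsets) {
        Map<String, Long> plan = new HashMap<>();
        committedOffsets.forEach((partition, byConnector) ->
                // The minimum committed offset across connectors bounds deletion.
                byConnector.values().stream()
                        .mapToLong(Long::longValue)
                        .min()
                        .ifPresent(min -> plan.put(partition, min)));
        return plan;
    }
}
```

The resulting offsets could then be sent to the brokers via the existing Admin#deleteRecords(Map&lt;TopicPartition, RecordsToDelete&gt;) API, which deletes all records with offsets strictly below the requested offset for each partition.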

Compatibility, Deprecation, and Migration Plan

This feature is opt-in and must be explicitly enabled; existing deployments are unaffected by default.

Rejected Alternatives