Status

Current state: Accepted (vote)

Discussion thread: https://lists.apache.org/thread.html/67fcfe37169bdbabdbecc30686ccba0f5f27e193c468a1fe5d0062ed@%3Cdev.kafka.apache.org%3E

Old Discussion thread: https://lists.apache.org/thread.html/f44317eb6cd34f91966654c80509d4a457dbbccdd02b86645782be67@%3Cdev.kafka.apache.org%3E

JIRA

PULL REQUEST: https://github.com/apache/kafka/pull/8103

Motivation

Log compaction currently relies on the server-side view of a partition: records are compacted by offset, and offsets are assigned in the order in which the broker receives the records. For a given key, only the record with the highest offset survives compaction, which lets Kafka reconstruct the current state of the events as a "most recent snapshot". The problem arises when insertion order is not guaranteed, because compaction can then keep the wrong state. This is easy to reproduce with a multi-threaded producer, with multiple producers, or when sending events asynchronously. The following is an example:

Producer 1 tries to send a message <K1, V1> to topic A partition p1. Producer 2 tries to send a message <K1, V2> to the same topic and partition. On the producer side the intended order is clear: <K1, V1> followed by <K1, V2>. On the server side, however, the arrival order is not guaranteed, so <K1, V1> can end up with the higher offset if it is received after <K1, V2>. When compaction runs, <K1, V1> is kept, which is clearly not what was intended.
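The scenario above can be reproduced with a small simulation. This is only an illustrative sketch, not Kafka's log cleaner: the `Record` type and the `compact_by_offset` function are hypothetical names introduced here, and the log is modelled as a plain list.

```python
from typing import NamedTuple


class Record(NamedTuple):
    offset: int  # assigned by the broker in arrival order
    key: str
    value: str


def compact_by_offset(log):
    """Keep, for each key, only the record with the highest offset."""
    latest = {}
    for record in log:
        current = latest.get(record.key)
        if current is None or record.offset > current.offset:
            latest[record.key] = record
    return sorted(latest.values(), key=lambda r: r.offset)


# Producer 2's <K1, V2> reaches the broker first and gets offset 0;
# Producer 1's <K1, V1> arrives later and gets offset 1.
log = [Record(0, "K1", "V2"), Record(1, "K1", "V1")]
compacted = compact_by_offset(log)
print(compacted)
```

The only surviving record is <K1, V1> at offset 1, even though the producers intended V2 to be the latest value for K1.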

To resolve this issue, we propose adding support for compaction based on a producer signal, that is, two additional compaction strategies (record timestamp and header sequence/version), while keeping the current offset-based compaction as the default for backward compatibility. This gives the producer the option to own and control record ordering. Because log compaction operates at the topic level and a broker can host multiple topics, it is ideal to keep the compaction strategy configuration at the topic level. Since the configuration is available at both the topic and the broker level, a user can enable a different compaction strategy for a subset of compacted topics, or set it at the broker level for all topics on the broker. While this proposal supports only two new compaction strategies, it leaves the option open to add more strategies in the future.
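A minimal sketch of how the three strategies could pick the surviving record for a key is shown below. It is not the proposed broker implementation. The strategy names `"offset"`, `"timestamp"` and `"header"`, the header name `version`, the integer header values, and the tie-break on offset are all assumptions made for illustration; the sketch also assumes every record carries the header when the header strategy is used.

```python
from dataclasses import dataclass, field


@dataclass
class Record:
    offset: int
    key: str
    value: str
    timestamp: int = 0
    headers: dict = field(default_factory=dict)


def rank(record, strategy, header_key=None):
    """Return a comparable rank; the highest-ranked record per key is kept."""
    if strategy == "offset":
        return (record.offset,)
    if strategy == "timestamp":
        # Assumption: ties on timestamp fall back to the offset.
        return (record.timestamp, record.offset)
    if strategy == "header":
        # Assumption: the header holds an integer sequence/version.
        return (record.headers[header_key], record.offset)
    raise ValueError(f"unknown compaction strategy: {strategy}")


def compact(log, strategy="offset", header_key=None):
    winners = {}
    for record in log:
        best = winners.get(record.key)
        if best is None or rank(record, strategy, header_key) > rank(best, strategy, header_key):
            winners[record.key] = record
    return sorted(winners.values(), key=lambda r: r.offset)


# <K1, V2> arrived first (offset 0) but is the newer value from the producers' point of view.
log = [
    Record(0, "K1", "V2", timestamp=200, headers={"version": 2}),
    Record(1, "K1", "V1", timestamp=100, headers={"version": 1}),
]

by_offset = [r.value for r in compact(log, "offset")]
by_timestamp = [r.value for r in compact(log, "timestamp")]
by_header = [r.value for r in compact(log, "header", header_key="version")]
```

With this input, offset-based compaction keeps V1, while the timestamp and header strategies both keep V2, which is the value the producers intended to be current.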

Special case: retaining the LEO (log-end-offset) record, or creating an empty message batch, for non-offset-based compaction strategies:

Today, with the offset-only compaction strategy, the last record of the log (we call it the log-end-record, whose offset is the log-end-offset) is always preserved and never compacted. This is important for replication, because followers reason about the log-end-offset on the leader. Consider this case: a partition has three replicas, leader 1 and followers 2 and 3.

Leader 1 has records a, b, c, d. Record d is currently the last record of the partition, so the log-end-offset is 3 (assuming record a's offset is 0).

Follower 2 has replicated a, b, c, d. Log-end-offset is 3.

Follower 3 has replicated a, b, c but not yet replicated d. Log-end-offset is 2.

NOTE: compaction is triggered independently on each broker, so it is possible that leader 1 triggers compaction and deletes record d while the followers have not yet triggered compaction. At that moment the leader's log becomes a, b, c. If follower 3 now fetches from the leader after the compaction, it will never see record d.

Now suppose there is a leader migration and follower 3 becomes the new leader. It accepts new appends (say, record e), and e is appended at *offset 3* in new leader 3's log. On follower 2, however, offset 3 still holds record d. Suppose follower 2 later triggers compaction as well and fetches the new record e from new leader 3:

Follower 2's log would be *a(0), b(1), c(2), e(4)*, where the numbers in parentheses are offsets, while leader 3's log would be *a(0), b(1), c(2), e(3)*. The two logs now diverge in offsets, although their entries are the same.

One way to resolve this is simply never to remove the last message during compaction. Another way (suggested by Jason in the old VOTE thread) is to create an empty message batch to "take up" that offset slot.
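The first option (never removing the last message) can be sketched as follows. This is a simplified model under stated assumptions, not the broker's cleaner: records are plain `(offset, key, value, timestamp)` tuples, the function name `compact_by_timestamp` and the `keep_last` flag are hypothetical, and record d is assumed to share a key with record c while carrying an older timestamp.

```python
def compact_by_timestamp(log, keep_last=True):
    """Compact a log of (offset, key, value, timestamp) tuples by timestamp.

    When keep_last is True, the record with the highest offset is always
    retained so that the log-end-offset stays the same on every replica.
    """
    winners = {}
    for rec in log:
        offset, key, _value, ts = rec
        best = winners.get(key)
        # Highest timestamp wins; ties fall back to the offset (an assumption).
        if best is None or (ts, offset) > (best[3], best[0]):
            winners[key] = rec
    kept = set(winners.values())
    if keep_last and log:
        kept.add(log[-1])  # retain the log-end-record unconditionally
    return sorted(kept)  # tuples sort by offset first


# Records a, b, c, d at offsets 0..3; d has the same key as c but an older timestamp.
log = [
    (0, "ka", "a", 10),
    (1, "kb", "b", 20),
    (2, "kc", "c", 30),
    (3, "kc", "d", 5),
]

without_guard = compact_by_timestamp(log, keep_last=False)
with_guard = compact_by_timestamp(log)
```

Without the guard, d is removed and the last remaining offset drops to 2, which is the situation that lets the replicas diverge. With the guard, d stays in place and the last offset remains 3.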

Acknowledgement: we thank the previous author of this KIP proposal, Luís Cabral.

Public Interfaces

The following new configuration properties are added at both the broker level and the topic level:

Broker Level:

  1. log.cleaner.compaction.strategy
  2. log.cleaner.compaction.strategy.header

Topic Level:

  1. compaction.strategy
  2. compaction.strategy.header
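To illustrate how the broker-level and topic-level properties could relate, the sketch below resolves the effective settings for one topic. It assumes the usual Kafka convention that a topic-level setting overrides the broker-level default. The function name `effective_compaction_config`, the values `"offset"` and `"header"`, and the header name `version` are assumptions for illustration and are not defined in this section.

```python
# Maps each broker-level property to its topic-level counterpart.
BROKER_TO_TOPIC = {
    "log.cleaner.compaction.strategy": "compaction.strategy",
    "log.cleaner.compaction.strategy.header": "compaction.strategy.header",
}


def effective_compaction_config(broker_config, topic_config):
    """Return the compaction settings that apply to one topic."""
    effective = {}
    for broker_name, topic_name in BROKER_TO_TOPIC.items():
        if topic_name in topic_config:
            effective[topic_name] = topic_config[topic_name]  # topic override
        elif broker_name in broker_config:
            effective[topic_name] = broker_config[broker_name]  # broker default
    # Offset-based compaction stays the default when nothing is configured.
    effective.setdefault("compaction.strategy", "offset")
    return effective


broker_config = {"log.cleaner.compaction.strategy": "offset"}
topic_config = {
    "compaction.strategy": "header",          # assumed value
    "compaction.strategy.header": "version",  # assumed header name
}

overridden = effective_compaction_config(broker_config, topic_config)
inherited = effective_compaction_config(broker_config, {})
```

Here the first topic uses the header strategy with the `version` header, while a topic without overrides inherits the broker's offset-based default.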

Proposed Changes

Compatibility, Deprecation, and Migration Plan

With the proposed changes above, there are no compatibility issues. However, to migrate an existing topic to the header strategy, we propose the following sequence to avoid inconsistency during migration:

Recommendations

Rejected Alternatives

(This section remains the same as in the previous proposal.)