Status
Current state: Under Discussion
Discussion thread: https://lists.apache.org/thread.html/79aa6e50d7c737ddf83455dd8063692a535a1afa558620fe1a1496d3@<dev.kafka.apache.org>
JIRA: ---
PULL REQUEST: https://github.com/apache/kafka/pull/4822
Motivation
To use Kafka as the message broker in an Event Sourcing architecture, Kafka must be able to reconstruct the current state of the events using a "most recent snapshot" approach.
This is where log compaction becomes an essential part of the workflow, as only the latest state is of interest. At the moment, Kafka accomplishes this by considering the insertion order (or highest offset) as a representation of the latest state.
The problem arises when insertion order is not guaranteed, which can cause log compaction to keep the wrong state. This is easy to reproduce with a multi-threaded producer (or simply multiple producers), or when sending events asynchronously.
Public Interfaces
There are no changes to the public interfaces.
Proposed Changes
- Enhance log compaction to support more than just offset comparison, so that insertion order does not always dictate which records to keep (in effect, allowing a form of optimistic concurrency control, OCC);
- The current behavior should remain as the default in order to minimize impact on already existing clients and avoid any migration efforts;
- Add a new broker-level Kafka configuration, "log.cleaner.compaction.strategy", to select the compaction strategy;
- Add a new topic-level configuration, "compaction.strategy", with the same meaning as above;
- The default value of these configurations should be "offset", which keeps the current behavior;
- Setting this to anything other than "offset" causes the record headers to be scanned for a key matching this value. If such a header is found and its value can be converted to a "long", that value is used to decide which record to keep, in a 'keep-highest' approach;
Details of the change can be viewed in the pull request.
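The 'keep-highest' rule described in the list above can be illustrated with a small, self-contained sketch. This is not the code from the pull request: the `Rec` class, the `shouldReplace` and `compact` helpers, the "version" header name, and the 8-byte big-endian encoding of the header value are all assumptions made for illustration. The proposal does not state here what happens when the header is missing or unreadable, so the sketch assumes a fallback to offset comparison in that case.

```java
import java.nio.ByteBuffer;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class CompactionStrategySketch {

    /** Hypothetical, simplified record; not Kafka's internal Record type. */
    static final class Rec {
        final String key;
        final long offset;
        final Map<String, byte[]> headers;
        final String value;

        Rec(String key, long offset, Map<String, byte[]> headers, String value) {
            this.key = key;
            this.offset = offset;
            this.headers = headers;
            this.value = value;
        }
    }

    /** Encodes a long as 8 big-endian bytes (an assumed header encoding). */
    static byte[] longBytes(long v) {
        return ByteBuffer.allocate(Long.BYTES).putLong(v).array();
    }

    /** Returns the header value as a Long, or null if absent or not exactly 8 bytes. */
    static Long headerVersion(Rec r, String headerKey) {
        byte[] raw = r.headers.get(headerKey);
        if (raw == null || raw.length != Long.BYTES) {
            return null;
        }
        return ByteBuffer.wrap(raw).getLong();
    }

    /** Decides whether candidate should replace current under the given strategy. */
    static boolean shouldReplace(Rec current, Rec candidate, String strategy) {
        if (!"offset".equals(strategy)) {
            Long cur = headerVersion(current, strategy);
            Long cand = headerVersion(candidate, strategy);
            // Compare header values only when both are readable and differ.
            if (cur != null && cand != null && !cur.equals(cand)) {
                return cand > cur;
            }
        }
        // Default behavior, and the assumed fallback: highest offset wins.
        return candidate.offset > current.offset;
    }

    /** Keeps one record per key, chosen according to the strategy. */
    static Map<String, Rec> compact(List<Rec> log, String strategy) {
        Map<String, Rec> latest = new LinkedHashMap<>();
        for (Rec r : log) {
            Rec cur = latest.get(r.key);
            if (cur == null || shouldReplace(cur, r, strategy)) {
                latest.put(r.key, r);
            }
        }
        return latest;
    }

    public static void main(String[] args) {
        // Version 2 was appended before version 1 (out-of-order insertion).
        List<Rec> log = List.of(
            new Rec("user-1", 0L, Map.of("version", longBytes(2L)), "v2"),
            new Rec("user-1", 1L, Map.of("version", longBytes(1L)), "v1"));

        System.out.println(compact(log, "offset").get("user-1").value);  // prints "v1"
        System.out.println(compact(log, "version").get("user-1").value); // prints "v2"
    }
}
```

With the default "offset" strategy the stale record survives because it has the higher offset, whereas the header-based strategy keeps the record carrying the higher version regardless of insertion order.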
Compatibility, Deprecation, and Migration Plan
Following the proposed changes, there are no compatibility issues and no migration is required.
Rejected Alternatives
- Stream the data out of Kafka and perform Event Sourcing there
  - This would mean creating an in-house solution, which makes Kafka irrelevant in the design, so it's best left as a last resort in case no solution is found on the Kafka side
- Guarantee insertion order on the producer
  - Not viable, as keeping this logic synchronized greatly reduces event throughput
- Check the version before sending the event to Kafka
  - Similar to the previous point, though it adds even more complexity, as race conditions may arise