...
The following is a list of potential improvements to the log cleaner.
Simple Things
Integrate with ... system test framework
...
We have a good integration test; it would be nice to hook it into the nightly test run.
Add
...
Add a tool to measure the duplication in a log
It would be nice to have an operational tool to check the duplication within a log. This could be built as a simple consumer that takes a particular topic/partition, consumes that log sequentially, and estimates the duplication. Each key consumed would be checked against a Bloom filter. If the key is present we would count a duplicate; otherwise we would add it to the filter. Because a Bloom filter can report false positives but never false negatives, the count can only overestimate duplication, so a large enough Bloom filter could probably produce an accurate-enough estimate of the duplication rate.
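The counting step described above can be sketched roughly as follows. This is only an illustration under stated assumptions, not the proposed tool: the `BloomFilter` class, the `estimate_duplication` function, and the default sizes are all hypothetical, and the keys are taken from a plain iterable of bytes rather than from a real topic/partition consumer.

```python
import hashlib
from typing import Iterable, Iterator


class BloomFilter:
    """Hypothetical fixed-size Bloom filter using double hashing over SHA-256."""

    def __init__(self, num_bits: int, num_hashes: int) -> None:
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, key: bytes) -> Iterator[int]:
        # Derive two 64-bit hashes from one digest, then combine them
        # (h1 + i * h2) to get num_hashes bit positions.
        digest = hashlib.sha256(key).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1  # force a non-zero step
        for i in range(self.num_hashes):
            yield (h1 + i * h2) % self.num_bits

    def add_if_absent(self, key: bytes) -> bool:
        """Set the key's bits; return True if the key was probably already present."""
        present = True
        for pos in self._positions(key):
            byte_index, mask = pos >> 3, 1 << (pos & 7)
            if not self.bits[byte_index] & mask:
                present = False
                self.bits[byte_index] |= mask
        return present


def estimate_duplication(
    keys: Iterable[bytes],
    num_bits: int = 1 << 24,  # assumed default: 2 MiB of filter state
    num_hashes: int = 4,      # assumed default
) -> float:
    """Return the estimated fraction of messages whose key was seen earlier."""
    seen = BloomFilter(num_bits, num_hashes)
    total = 0
    duplicates = 0
    for key in keys:
        total += 1
        if seen.add_if_absent(key):
            duplicates += 1
    return duplicates / total if total else 0.0
```

In a real tool the `keys` iterable would be fed by a consumer reading the partition from its first offset, and the filter would need to be sized against the expected number of distinct keys, since an undersized filter inflates the estimate through false positives.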
Improve dedupe buffer efficiency
...