...

It would be nice to have an operational tool to check the duplication within a log. This could be built as a simple consumer that takes a particular topic/partition, consumes that log sequentially, and estimates the duplication. Each key consumed would be checked against a Bloom filter: if it is present we would count a duplicate; otherwise we would add it to the filter. A large enough Bloom filter could probably produce an accurate-enough estimate of the duplication rate.
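As a rough illustration of the idea, here is a minimal Python sketch of the counting logic only. It is not an existing Kafka tool: the names `BloomFilter`, `add_if_absent`, and `estimate_duplication` are hypothetical, the keys are assumed to arrive as an iterable of `bytes` (standing in for the keys a consumer would read from one topic/partition), and the filter sizing uses the standard Bloom filter formulas with an assumed target false-positive rate.

```python
import hashlib
import math


class BloomFilter:
    """Fixed-size Bloom filter using double hashing over a SHA-256 digest.

    Hypothetical helper for illustration; not part of Kafka.
    """

    def __init__(self, expected_items: int, false_positive_rate: float = 0.01):
        # Standard sizing: m = -n ln(p) / (ln 2)^2 bits, k = (m / n) ln 2 hashes.
        n = max(1, expected_items)
        self.num_bits = max(8, int(-n * math.log(false_positive_rate) / (math.log(2) ** 2)))
        self.num_hashes = max(1, round(self.num_bits / n * math.log(2)))
        self.bits = bytearray((self.num_bits + 7) // 8)

    def _positions(self, key: bytes):
        digest = hashlib.sha256(key).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1  # force a non-zero step
        for i in range(self.num_hashes):
            yield (h1 + i * h2) % self.num_bits

    def add_if_absent(self, key: bytes) -> bool:
        """Set the key's bits; return True if all were already set (probable duplicate)."""
        seen = True
        for pos in self._positions(key):
            byte, mask = pos >> 3, 1 << (pos & 7)
            if not self.bits[byte] & mask:
                seen = False
                self.bits[byte] |= mask
        return seen


def estimate_duplication(keys, expected_items, false_positive_rate=0.01):
    """Return (total, probable_duplicates, duplicate_ratio) for a stream of keys."""
    bloom = BloomFilter(expected_items, false_positive_rate)
    total = duplicates = 0
    for key in keys:
        total += 1
        if bloom.add_if_absent(key):
            duplicates += 1
    ratio = duplicates / total if total else 0.0
    return total, duplicates, ratio


if __name__ == "__main__":
    sample = [b"a", b"b", b"a", b"c", b"a"]
    print(estimate_duplication(sample, expected_items=1000))
```

A Bloom filter has no false negatives, so every true repeat is counted; false positives can only push the estimate upward, which is why the filter needs to be sized generously relative to the number of distinct keys in the log.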

KAFKA-1336

Improve dedupe buffer efficiency

...