...
The 'distinct' operation is common in data processing, e.g.:
- SQL DISTINCT keyword,
- In standard libraries for programming languages:
  - Java Stream distinct(),
  - Scala Seq distinct(),
  - .NET LINQ Distinct method,
- In data processing frameworks:
  - Apache Spark's distinct(),
  - Apache Flink's distinct(),
  - Apache Beam's Distinct(),
  - Hazelcast Jet's distinct(),
  - etc.
Hence it is natural to expect similar functionality from Kafka Streams.
Although Kafka Streams Tutorials contains an example of how distinct can be emulated, that example is complicated: it involves low-level coding with a local state store and a custom transformer. It would be much more convenient to have distinct as a first-class DSL operation.
Due to the 'infinite' nature of a KStream, the distinct operation should be windowed, similar to windowed joins and aggregations for KStreams.
...
The following methods are added to the corresponding interfaces:
```java
KTable<Windowed<K>, V> distinct(final Named named);
KTable<Windowed<K>, V> distinct(final Materialized<K, V, WindowStore<Bytes, byte[]>> materialized);
KTable<Windowed<K>, V> distinct(final Named named,
                                final Materialized<K, V, WindowStore<Bytes, byte[]>> materialized);
```
The distinct operation returns only the first record that falls into a new window, and filters out all other records that fall into an already existing window.
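The semantics described above can be illustrated with a small plain-Java simulation that does not use Kafka Streams at all. It is only a sketch under several assumptions that the text does not state: windows are fixed-size tumbling windows aligned to zero, records are identified by key, timestamps are non-negative, and there is no retention or grace period. The `Event` class and the `WindowedDistinctSketch` name are hypothetical and exist only for this example.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class WindowedDistinctSketch {

    /** Hypothetical record type for illustration; not part of Kafka Streams. */
    static final class Event {
        final String key;
        final long timestampMs;
        final String value;

        Event(String key, long timestampMs, String value) {
            this.key = key;
            this.timestampMs = timestampMs;
            this.value = value;
        }
    }

    /**
     * Keeps only the first event per (key, window) pair.
     * Assumes tumbling windows of windowSizeMs aligned to timestamp 0.
     */
    static List<Event> distinct(List<Event> events, long windowSizeMs) {
        Set<String> seen = new HashSet<>();
        List<Event> out = new ArrayList<>();
        for (Event e : events) {
            // Start of the tumbling window that this event falls into.
            long windowStart = e.timestampMs - (e.timestampMs % windowSizeMs);
            // Set.add returns true only the first time a (key, window) pair is seen,
            // i.e. when the event opens a new window for its key.
            if (seen.add(e.key + "@" + windowStart)) {
                out.add(e);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<Event> input = List.of(
                new Event("a", 100, "first"),
                new Event("a", 900, "duplicate in the same window"),
                new Event("b", 950, "other key"),
                new Event("a", 1200, "next window"));
        for (Event e : distinct(input, 1000)) {
            System.out.println(e.key + " " + e.timestampMs + " " + e.value);
        }
    }
}
```

With a 1000 ms window, the second record for key `a` (timestamp 900) is filtered out because the window starting at 0 already exists for that key, while the record at timestamp 1200 opens a new window and is emitted. An unbounded `seen` set is acceptable only in a toy like this; a real stream needs windows to expire, which is one reason the operation is tied to windowing.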
...