Versions Compared

Key

  • This line was added.
  • This line was removed.
  • Formatting was changed.

...

A self-join is a join whose left and right-side arguments are the same entity (a stream  stream reading from the same topic). Although self-joins are currently supported in Streams, their implementation is inefficient as they are implemented like regular joins where a state store is created for both left and right join arguments. Since both of these arguments represent the same entity, we don't need to create two state stores (as they will contain the exact same data) but only one. This optimization is only suitable for inner joins whose join condition is on the primary key. We do not consider foreign-key joins as we would need to create a state store for both arguments in order to be able to do efficient lookups. Hence, we will handle foreign-key self joins as regular inner foreign-key joins. Moreover, we do not consider outer joins since we are focusing on primary key joins and there will always be at least one join result, the current record joining with itself. 

...

This optimization only applies to Stream-Stream inner joins. A Table-Table inner joins have both sides already materialized into state stores that the user created. Since this optimization aims to remove a state created specifically by Streams to perform the join, it is not applicable to table-table joinsself-join would not yield interesting results since a row would always join with itself. Moreover, the optimization does not apply to N-way self-joins as this notion doesn't exist. Joins are binary operators and once a self-join has been applied, the result contains records whose columns are the concatenation of the columns of the left and right join arguments. A subsequent join will be applied on the result of the previous join and not on the inputs of the previous join which means it won't be a self-join anymore. 

...