Versions Compared

Key

  • This line was added.
  • This line was removed.
  • Formatting was changed.

...

The algorithm to identify whether two streams (or tables) represent the same entity will first check if the StreamSourceNode nodes that are the join arguments are reading from the same topic(s) and then, will do a traversal of the graph and will check if the path from the StreamSourceNode node to the root does not have nodes that can change the semantics of the stream. Initially, most (all) nodes will be disallowed to be on the path, only peek and print should be allowed. Later, we can look into adding some smarts into the algorithm to identify whether a mapper node actually changes the data in the stream or not.  If both conditions hold, then the join can be implemented as a self join. This algorithm is future-proof to support the addition of new DSL operators without affecting the behavior of the self-join. 

  • Implement the self-join operator 

The self-join operator will maintain one state store. For every new record, the record will first get added to the state store and then the operator will perform a lookup into the state store to do the actual join. This means that every record will at least join with itself.

  • Add a rule to the optimizer that will rewrite an inner join to a self-join. The graph (logical plan) created from the DSL excerpts above are the same (after the optimization mergeDuplicateSourceNodes) and look as follows:

...