...

Each of the Gelly use cases makes use only of distinct element pairs, so emitting the full Cartesian product requires the UDF to ignore the unwanted half of the data by comparing the non-grouped fields. This is less efficient and requires that types implement Comparable. Given distinct pairs of elements, the full Cartesian product can be simulated by the UDF processing each pair both forwards and reversed, unioned with a map on the grouped DataSet; however, this eliminates any potential ordering on forwarded fields.
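The relationship between distinct pairs and the full Cartesian product can be illustrated with a minimal plain-Java sketch. It does not use Flink or Gelly, and the class and method names (DistinctPairsSketch, distinctPairs, fullProduct) are hypothetical. It assumes the "map on the grouped DataSet" supplies the self-pairs (x, x), which is one reading of the text above.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only; none of these names exist in Flink or Gelly.
class DistinctPairsSketch {

    /** Emits each distinct pair of one group exactly once, by position (i < j). */
    static <T> List<List<T>> distinctPairs(List<T> group) {
        List<List<T>> pairs = new ArrayList<>();
        for (int i = 0; i < group.size(); i++) {
            for (int j = i + 1; j < group.size(); j++) {
                pairs.add(List.of(group.get(i), group.get(j)));
            }
        }
        return pairs;
    }

    /**
     * Simulates the full Cartesian product from the distinct pairs:
     * each pair forwards and reversed, plus the self-pairs (assumed to be
     * what the separate map over the grouped data contributes).
     */
    static <T> List<List<T>> fullProduct(List<T> group) {
        List<List<T>> out = new ArrayList<>();
        for (List<T> pair : distinctPairs(group)) {
            out.add(pair);                                   // forwards
            out.add(List.of(pair.get(1), pair.get(0)));      // reversed
        }
        for (T element : group) {
            out.add(List.of(element, element));              // self-pairs
        }
        return out;
    }
}
```

Note that distinctPairs compares positions rather than values, so the element type does not need to implement Comparable. For a group of n elements it emits n(n-1)/2 pairs, and fullProduct emits n^2.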

The crossGroup implementation will be similar to many existing Flink operators. For a uniform distribution, the CrossGroupDriver requires a spillable iterator which tracks two elements; in addition, when only distinct elements are emitted, the iterator can discard elements that precede the outer iterator. For a skewed distribution, the operator will compile into three nodes. Similar to JaccardIndex, the first node will reduceGroup the Grouping and wrap each element in a Tuple3 with a group count and an increasing index. The second node will rebalance and flatMap each element and index into 1..count groups. The third node will implement a partial crossGroup in which the initial group_size elements are emitted pairwise with each other and with each following element.
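The third node of the skewed plan can be sketched in plain Java under one possible reading of the description: elements carry an index in 0..count-1, the index range is cut into blocks of group_size, and each sub-group holds one block followed by all later elements. This is an assumption about the intended partitioning, not the proposal's implementation; the names PartialCrossGroupSketch, partialCross, and crossSkewed are hypothetical, and elements are represented only by their indices.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only; this is not Flink code.
class PartialCrossGroupSketch {

    /**
     * Partial crossGroup over one sub-group: the initial groupSize elements
     * are paired with each other and with every following element.
     */
    static List<int[]> partialCross(List<Integer> subGroup, int groupSize) {
        List<int[]> pairs = new ArrayList<>();
        int head = Math.min(groupSize, subGroup.size());
        for (int i = 0; i < head; i++) {
            for (int j = i + 1; j < subGroup.size(); j++) {
                pairs.add(new int[] {subGroup.get(i), subGroup.get(j)});
            }
        }
        return pairs;
    }

    /**
     * Assumed fan-out: the sub-group starting at index `start` receives the
     * elements start..count-1, so the work of one large group is spread
     * across several smaller partial crosses.
     */
    static List<int[]> crossSkewed(int count, int groupSize) {
        if (groupSize < 1) {
            throw new IllegalArgumentException("groupSize must be positive");
        }
        List<int[]> all = new ArrayList<>();
        for (int start = 0; start < count; start += groupSize) {
            List<Integer> subGroup = new ArrayList<>();
            for (int index = start; index < count; index++) {
                subGroup.add(index);
            }
            all.addAll(partialCross(subGroup, groupSize));
        }
        return all;
    }
}
```

Under this reading, every distinct pair (i, j) with i < j is emitted exactly once, by the sub-group whose initial block contains i. For example, crossSkewed(5, 2) yields the 10 distinct pairs of five elements.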

Compatibility, Deprecation, and Migration Plan

...