Status
Current state: Pre-release
Discussion thread: here (<- link to https://mail-archives.apache.org/mod_mbox/flink-dev/)
JIRA:
Released: unreleased
Please keep the discussion on the mailing list rather than commenting on the wiki (wiki discussions get unwieldy fast).
Motivation
A crossGroup operation is performed in most non-iterative Gelly algorithms including the similarity measures AdamicAdar and JaccardIndex and as well as TriangleListing as a basis for clustering algorithms. Work is in-progress to add bipartite support and each of the four projection methods will perform a crossGroup. A built-in operator will greatly reduce the complexity of these and future algorithm implementations both in Gelly and for all Flink users.
crossGroup as a new operator will (as noted in FLINK-1267) work in lieu of reduceGroup (as in TriangleListing), which is memorybound and requires disabling object reuse or a type parameter implementing CopyableValue, or a self-join, which duplicates the input and builds the full Cartesian product. The crossGroup operator can also optionally reduce the data skew caused by the quadratic expansion of group pairs (for a group of size n this is either n^2 for the full Cartesian product or (n choose 2) for distinct pairs).
Public Interfaces
GroupCrossFunction similar to CrossFunction but bound to one input type parameter.
GroupCrossHint enum with values OPTIMIZER_CHOOSES, UNIFORM_DISTRIBUTION, SKEWED_DISTRIBUTION.
groupCross methods on SortedGrouping and UnsortedGrouping.
Proposed Changes
Compatibility, Deprecation, and Migration Plan
None required as this is a new feature.
Test Plan
JaccardIndex (skewed distribution) and TriangleListing (uniform distribution) will be ported to use the new operator and the performance compared with the current ad hoc implementations.
Rejected Alternatives
(none so far)