Status
Current state: Pre-release
Discussion thread: here (<- link to https://mail-archives.apache.org/mod_mbox/flink-dev/)
JIRA:
Released: unreleased
Please keep the discussion on the mailing list rather than commenting on the wiki (wiki discussions get unwieldy fast).
Motivation
A crossGroup operation is performed in most non-iterative Gelly algorithms including the similarity measures AdamicAdar and JaccardIndex and as well as TriangleListing as a basis for clustering algorithms. Work is in-progress to add bipartite support and the four projection methods will perform a crossGroup. A built-in operator will greatly reduce the complexity of these and future algorithms both in Gelly and for all Flink users.
Flink provides many join operators and the crossGroup operator will essentially implement a self-join.
Public Interfaces
GroupCrossFunction similar to CrossFunction but bound to one input type parameter.
GroupCrossHint enum with values OPTIMIZER_CHOOSES, UNIFORM_DISTRIBUTION, SKEWED_DISTRIBUTION.
groupCross methods on SortedGrouping and UnsortedGrouping.
Proposed Changes
Compatibility, Deprecation, and Migration Plan
None required as this is a new feature.
Test Plan
JaccardIndex (skewed distribution) and TriangleListing (uniform distribution) will be ported to use the new operator and the performance compared with the current ad hoc implementations.
Rejected Alternatives
(none so far)