You are viewing an old version of this page. View the current version.

Compare with Current View Page History

« Previous Version 3 Next »

Status

Current state: Pre-release

Discussion threadhere (<- link to https://mail-archives.apache.org/mod_mbox/flink-dev/)

JIRA: Unable to render Jira issues macro, execution error. algorithms

Released: unreleased

Please keep the discussion on the mailing list rather than commenting on the wiki (wiki discussions get unwieldy fast).

Motivation

A crossGroup operation is performed in most non-iterative Gelly algorithms including the similarity measures AdamicAdar and JaccardIndex and as well as TriangleListing as a basis for clustering algorithms. Work is in-progress to add bipartite support and each of the four projection methods will perform a crossGroup. A built-in operator will greatly reduce the complexity of these and future algorithm implementations both in Gelly and for all Flink users.

crossGroup as a new operator will (as noted in FLINK-1267) work in lieu of reduceGroup (as in TriangleListing), which is memorybound and requires disabling object reuse or a type parameter implementing CopyableValue, or a self-join, which duplicates the input and builds the full Cartesian product. The crossGroup operator can also optionally reduce the data skew caused by the quadratic expansion of group pairs (for a group of size n this is either n^2 for the full Cartesian product or (n choose 2) for distinct pairs).

Public Interfaces

GroupCrossFunction similar to CrossFunction but bound to one input type parameter.

GroupCrossHint enum with values OPTIMIZER_CHOOSES, UNIFORM_DISTRIBUTION, SKEWED_DISTRIBUTION.

groupCross methods on SortedGrouping and UnsortedGrouping.

Proposed Changes

 

Compatibility, Deprecation, and Migration Plan

None required as this is a new feature.

Test Plan

JaccardIndex (skewed distribution) and TriangleListing (uniform distribution) will be ported to use the new operator and the performance compared with the current ad hoc implementations.

Rejected Alternatives

(none so far)

  • No labels