You are viewing an old version of this page. View the current version.

Compare with Current View Page History

Version 1 Next »

Status

Current state: Pre-release

Discussion threadhere (<- link to https://mail-archives.apache.org/mod_mbox/flink-dev/)

JIRA: Unable to render Jira issues macro, execution error.

Released: unreleased

Please keep the discussion on the mailing list rather than commenting on the wiki (wiki discussions get unwieldy fast).

Motivation

A crossGroup operation is performed in most non-iterative Gelly algorithms including the similarity measures AdamicAdar and JaccardIndex and as well as TriangleListing as a basis for clustering algorithms. Work is in-progress to add bipartite support and the four projection methods will perform a crossGroup. A built-in operator will greatly reduce the complexity of these and future algorithms both in Gelly and for all Flink users.

Flink provides many join operators and the crossGroup operator will essentially implement a self-join.

Public Interfaces

GroupCrossFunction similar to CrossFunction but bound to one input type parameter.

GroupCrossHint enum with values OPTIMIZER_CHOOSES, UNIFORM_DISTRIBUTION, SKEWED_DISTRIBUTION.

groupCross methods on SortedGrouping and UnsortedGrouping.

Proposed Changes

 

Compatibility, Deprecation, and Migration Plan

None required as this is a new feature.

Test Plan

JaccardIndex (skewed distribution) and TriangleListing (uniform distribution) will be ported to use the new operator and the performance compared with the current ad hoc implementations.

Rejected Alternatives

(none so far)

  • No labels