This is needed to maintain correspondence between upstream output buffers and downstream input buffers.

Looks like indices are assigned arbitrarily, so we need to use sub-task-id in SubPartition (and associate it with output buffers file).

Depends on throughput and latency - running POCs.

With continuous spilling, upstream saves all its output which includes what was already processed/buffered by the downstream.

Possible solutions:

downstream loads upstream output buffers file and filters out what it has already processed; files are named deterministically and indexed
same, but offset in the file is calculated based on buffer sizedownstream reads upstream buffers and filters out what it has already processed; files are indexed
downstream communicates its offset to upstream during snapshotting
downstream communicates its offset to upstream during recovery

...

Single-channel pros:

Single-channel cons:

tight coupling between upstream and downstream (naming should be consistent across all nodes)
doesn't work with existing persistence mechanisms

Depends on:

ad-hoc or cont. spilling (cont. spilling can require special handling because of higher volumes)
single-task channel processing (isn't compatible with current implementation)

Compatibility, Deprecation, and Migration Plan

...

Page tree