...
This is needed to maintain correspondence between upstream output buffers and downstream input buffers.
Looks like indices are assigned arbitrarily, so we need to use sub-task-id in SubPartition (and associate it with output buffers file).
Ad-hoc vs continuous spilling
Depends on throughput and latency - running POCs.
Avoiding double processing in downstream (if continuous spilling)
With continuous spilling, upstream saves all its output which includes what was already processed/buffered by the downstream.
Possible solutions:
- downstream loads upstream output buffers file and filters out what it has already processed; files are named deterministically and indexed
- same, but offset in the file is calculated based on buffer sizedownstream reads upstream buffers and filters out what it has already processed; files are indexed
- downstream communicates its offset to upstream during snapshotting
- downstream communicates its offset to upstream during recovery
...
Whether to process in and out buffers of a single channel in a single task (downstream?) or not (upstream + downstream)
Single-channel pros:
- solves the problem of multi-buffer records
- solves the problem of double-processing with cont. spilling
- ...
Single-channel cons:
- tight coupling between upstream and downstream (naming should be consistent across all nodes)
- doesn't work with existing persistence mechanisms
Existing persistence mechanisms vs custom
Depends on:
- ad-hoc or cont. spilling (cont. spilling can require special handling because of higher volumes)
- single-task channel processing (isn't compatible with current implementation)
Compatibility, Deprecation, and Migration Plan
...