Versions Compared

Key

  • This line was added.
  • This line was removed.
  • Formatting was changed.

...

This is needed to maintain correspondence between upstream output buffers and downstream input buffers.

Looks like indices are assigned arbitrarily, so we need to use sub-task-id in SubPartition (and associate it with output buffers file).

Ad-hoc vs continuous spilling

Depends on throughput and latency - running POCs.

Avoiding double processing in downstream (if continuous spilling)

With continuous spilling, upstream saves all its output which includes what was already processed/buffered by the downstream.

Possible solutions:

  1. downstream loads upstream output buffers file and filters out what it has already processed; files are named deterministically and indexed
  2. same, but offset in the file is calculated based on buffer sizedownstream reads upstream buffers and filters out what it has already processed; files are indexed
  3. downstream communicates its offset to upstream during snapshotting
  4. downstream communicates its offset to upstream during recovery

...

Whether to process in and out buffers of a single channel in a single task (downstream?) or not (upstream + downstream)

Single-channel pros:

  1. solves the problem of multi-buffer records (question)
  2. solves the problem of double-processing with cont. spilling (question)
  3. ...

Single-channel cons:

  1. tight coupling between upstream and downstream (naming should be consistent across all nodes)
  2. doesn't work with existing persistence mechanisms (question)

Existing persistence mechanisms vs custom

Depends on:

  1. ad-hoc or cont. spilling (cont. spilling can require special handling because of higher volumes)
  2. single-task channel processing (isn't compatible with current implementation)

Compatibility, Deprecation, and Migration Plan

...