Versions Compared

Key

  • This line was added.
  • This line was removed.
  • Formatting was changed.

...

A single record can end up in different tasks while checkpointing.

Therefore, 

  1. without rescaling, even with fan-in topology, downstream needs to match incomplete record with the upstream; but this can be handled by having separate input channels
  2. output buffers need to be sent to the right channel; this is a problem if we rely on keys and don't 
  3. when scaling-in upstream,  buffers of incomplete records need to be matched
  4. when scaling-in downstream, there can be multiple incomplete records in input buffers (it can be a single input channel after recovery)

...

Looks like indices are assigned arbitrarily, so we need to use sub-task-id in SubPartition (and associate it with output buffers file).

Ad-hoc vs continuous spilling

Depends on throughput and latency - running POCs.

Avoiding double processing in downstream (if continuous spilling)

...

  1. downstream loads upstream output buffers file and filters out what it has already processed; files are named deterministically and indexed
  2. same, but offset in the file is calculated based on buffer size
  3. downstream communicates its offset to upstream during snapshotting
  4. downstream communicates its offset to upstream during recovery

Existing persistence mechanisms vs custom

Depends on:

  1. ad-hoc or cont. spilling (cont. spilling can require special handling because of higher volumes)
  2. single-task channel processing (isn't compatible with current implementation)

Decisions made

Ad-hoc vs continuous spilling

Ad-hoc POC showed much better efficiency so discard continuous spilling for now (2020-02-19).

Whether to process in and out buffers Whether to process in and out buffers of a single channel in a single task (downstream?) or not (upstream + downstream)

...

  1. tight coupling between upstream and downstream (naming should be consistent across all nodes)
  2. doesn't work with existing persistence mechanisms (question)

Existing persistence mechanisms vs custom

Depends on:

...

Tentatively chosen not to read upstream data in downstream because of efficiency (ready many small files) and easier implementation.

Compatibility, Deprecation, and Migration Plan

...