...

A single record can end up in different tasks while checkpointing.

Therefore,

without rescaling, even with fan-in topology, downstream needs to match incomplete record with the upstream; but this can be handled by having separate input channels
output buffers need to be sent to the right channel; this is a problem if we rely on keys and don't
when scaling-in upstream, buffers of incomplete records need to be matched
when scaling-in downstream, there can be multiple incomplete records in input buffers (it can be a single input channel after recovery)

...

Looks like indices are assigned arbitrarily, so we need to use sub-task-id in SubPartition (and associate it with output buffers file).

Ad-hoc vs continuous spilling

Depends on throughput and latency - running POCs.

Avoiding double processing in downstream (if continuous spilling)

...

downstream loads upstream output buffers file and filters out what it has already processed; files are named deterministically and indexed
same, but offset in the file is calculated based on buffer size
downstream communicates its offset to upstream during snapshotting
downstream communicates its offset to upstream during recovery

Existing persistence mechanisms vs custom

Depends on:

ad-hoc or cont. spilling (cont. spilling can require special handling because of higher volumes)
single-task channel processing (isn't compatible with current implementation)

Decisions made

Ad-hoc vs continuous spilling

Ad-hoc POC showed much better efficiency so discard continuous spilling for now (2020-02-19).

Whether to process in and out buffers Whether to process in and out buffers of a single channel in a single task (downstream?) or not (upstream + downstream)

...

tight coupling between upstream and downstream (naming should be consistent across all nodes)
doesn't work with existing persistence mechanisms

Existing persistence mechanisms vs custom

Depends on:

...

Tentatively chosen not to read upstream data in downstream because of efficiency (ready many small files) and easier implementation.

Compatibility, Deprecation, and Migration Plan

...

Page tree

Versions Compared

Old Version 21

New Version 22

Key

Ad-hoc vs continuous spilling

Avoiding double processing in downstream (if continuous spilling)

Existing persistence mechanisms vs custom

Decisions made

Ad-hoc vs continuous spilling

Whether to process in and out buffers Whether to process in and out buffers of a single channel in a single task (downstream?) or not (upstream + downstream)

Existing persistence mechanisms vs custom

Compatibility, Deprecation, and Migration Plan

Page tree

Page History

Versions Compared

Old Version 21

New Version 22

Key

Ad-hoc vs continuous spilling

Avoiding double processing in downstream (if continuous spilling)

Existing persistence mechanisms vs custom

Decisions made

Ad-hoc vs continuous spilling

Whether to process in and out buffers Whether to process in and out buffers of a single channel in a single task (downstream?) or not (upstream + downstream)

Existing persistence mechanisms vs custom

Compatibility, Deprecation, and Migration Plan