...

Since rescaling is the rarest of the processes that touch the data (in order of frequency: normal processing > checkpointing > recovery > rescaling), we opted for a non-optimized version of rescaling. When, after rescaling, two operator instances need to load data from the same state handle, each of them needs to deserialize all records, apply the key selector, and compute the key group index in order to filter out the relevant records. We expect only a rather small amount of data to be processed multiple times, and we think that storing the key group index inside the data would hurt performance for small records.
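The filtering step described above can be illustrated with a minimal sketch. It is not the actual Flink implementation: the JSON line format of the state handle, the CRC32-based hash, the value of `MAX_PARALLELISM`, and all function names are assumptions made for illustration only. The point is merely that every instance reads the whole handle and keeps only the records whose key group maps to it.

```python
import json
import zlib

MAX_PARALLELISM = 128  # number of key groups; assumed value for this sketch


def key_group_of(key, max_parallelism=MAX_PARALLELISM):
    # Stand-in hash (assumption): CRC32 is used only because it is stable
    # across processes; the real hash function may differ.
    return zlib.crc32(key.encode("utf-8")) % max_parallelism


def operator_index_of(key_group, parallelism, max_parallelism=MAX_PARALLELISM):
    # Map a key group to the operator instance that owns it (assumed scheme:
    # contiguous ranges of key groups per instance).
    return key_group * parallelism // max_parallelism


def restore_for_instance(state_handle, key_selector, instance_index, parallelism):
    """Read the whole state handle and keep only the records for one instance."""
    relevant = []
    for line in state_handle.splitlines():
        record = json.loads(line)       # 1. deserialize every record
        key = key_selector(record)      # 2. apply the key selector
        key_group = key_group_of(key)   # 3. compute the key group index
        if operator_index_of(key_group, parallelism) == instance_index:
            relevant.append(record)     # 4. keep only the relevant records
    return relevant


# Hypothetical usage: one state handle written before rescaling is now
# read by two operator instances after rescaling.
records = [{"user": f"u{i}", "value": i} for i in range(20)]
state_handle = "\n".join(json.dumps(r) for r in records).encode("utf-8")


def select_user(record):
    return record["user"]


restored = [restore_for_instance(state_handle, select_user, i, 2) for i in range(2)]
```

Both instances deserialize all 20 records, which is the duplicated work that the non-optimized approach accepts. The alternative of storing the key group index next to each record would avoid steps 2 and 3, at the cost of extra bytes per record.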

[Diagram: rescale.drawio]

For non-keyed data, rescaling semantics are unfortunately a bit fuzzy. For this FLIP, we assume that, on forward channels, no data of a given input split can overtake prior data during processing. Any fan-out or reshuffling already destroys that ordering guarantee, so we can disregard these cases in this FLIP. If we focus on forward channels, however, we quickly run into situations where the ordering is violated (see Fig. 3).

...