Versions Compared

Key

  • This line was added.
  • This line was removed.
  • Formatting was changed.

...

One interesting question is: For exactly once delivery, would local snapshots be good enough? To answer this, imagine the following topology

Image RemovedImage Added

In the topology each node has two local snapshots. The snapshots of the same color means they are in the same global snapshot. Now if processing node(task) n1,n2 and n3 load snapshot_n1_1, snapshot_n2_2 and snapshot_n3_2 respectively. The state of n3 would be corresponding to the original source {(S1,1500), (S2,200)}.  In this case, n1 will reprocess [1001-1500] from S1 and emit some output as S_OUT. Assuming those output are [40-45] in S3. Notice that due to the unrepeatablity, the generated output may not necessarily be the same as the [21-30] in S3, which was the output from the previous run.

...

In Fig.1, there is a dedicated channel between two processing nodes, this means that inside each channel the boundary of the snapshots is only controlled by a single upstream processing node. The boundary can be indicated by a marker in the stream. In this case any message before the marker should be included into the previous snapshot, and anything after the marker should be included into the next snapshot. Therefore there is a clear lineage of the snapshot versions in the channel divided by the markers. However, if the communication channel between two nodes is shared, the problem would be much more complicated. Let's take a look at Fig.2:

Image RemovedImage Added

In the interest of explanation, we give the following definition:

...

In the processin pipeline, the S_OUT of a sink node is no longer controlled by the stream processing framework. So exactly once semantic will need the support from the state receiving system. (e.g. Kafka with KIP-98).

Flow Overview

Image RemovedImage Added

Processing Node Start (1.x)

...