Page properties



Discussion thread
here (<- link to https://lists.apache.org/list.html?dev@flink.apache.org)

Vote thread
here (<- link to https://lists.apache.org/list.html?dev@flink.apache.org)

JIRA
here (<- link to https://issues.apache.org/jira/browse/FLINK-XXXX)

Release
<Flink Version>
TBD


Introduction

As the first sub-FLIP for DataStream API V2, we'd like to discuss and try to answer some of the most fundamental questions in stream processing.

...

Above we defined the stream and how it is partitioned. The next topic to discuss is how to convert between different partition types. We call these transformations partitioning. For example, a non-keyed partition stream can be transformed to a keyed partition stream via a `KeyBy` partitioning.

Overall, we have the following four partitionings:

  • KeyBy: Let all data be repartitioned according to the specified key.

  • Shuffle: Repartition and shuffle all data.

  • Global: Merge all partitions into one.

  • Broadcast: Force each partition to broadcast its data to all downstream partitions.

The specific transformation relationship is shown in the following table:

| Partitioning (Input \ Output) | Global | Keyed | NonKeyed | Broadcast |
|---|---|---|---|---|
| Global | ❎ | KeyBy | Shuffle | Broadcast |
| Keyed | Global | KeyBy | Shuffle | Broadcast |
| NonKeyed | Global | KeyBy | Shuffle | Broadcast |
| Broadcast | ❎ | ❎ | ❎ | ❎ |

(A crossed box indicates that it is not supported or not required)
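The table can also be read as a lookup from an (input, output) pair to the partitioning that performs the conversion. The following is a minimal sketch of that reading, not part of the proposed API: `PartitioningTable`, `Kind`, `Partitioning` and `between` are hypothetical names, and it assumes the crossed cells are Global to Global and every cell in the Broadcast input row.

```java
import java.util.Optional;

/** Hypothetical helper that encodes the partitioning table; not part of the proposed API. */
final class PartitioningTable {
    /** Stream kinds, named after the row and column headers of the table. */
    enum Kind { GLOBAL, KEYED, NON_KEYED, BROADCAST }

    /** The four partitionings listed in the text. */
    enum Partitioning { KEY_BY, SHUFFLE, GLOBAL, BROADCAST }

    private PartitioningTable() {}

    /**
     * Returns the partitioning that converts {@code input} into {@code output},
     * or empty for cells assumed to be crossed out (not supported or not required).
     */
    static Optional<Partitioning> between(Kind input, Kind output) {
        // Assumption: the Broadcast input row has no supported conversion.
        if (input == Kind.BROADCAST) {
            return Optional.empty();
        }
        // Assumption: Global -> Global is the crossed cell of the Global row.
        if (input == Kind.GLOBAL && output == Kind.GLOBAL) {
            return Optional.empty();
        }
        // For every remaining cell, the partitioning depends only on the output kind.
        switch (output) {
            case GLOBAL:
                return Optional.of(Partitioning.GLOBAL);
            case KEYED:
                return Optional.of(Partitioning.KEY_BY);
            case NON_KEYED:
                return Optional.of(Partitioning.SHUFFLE);
            case BROADCAST:
                return Optional.of(Partitioning.BROADCAST);
            default:
                throw new IllegalArgumentException("Unknown kind: " + output);
        }
    }
}
```

Under these assumptions, `PartitioningTable.between(Kind.NON_KEYED, Kind.KEYED)` yields `KEY_BY`, while any lookup with a `BROADCAST` input yields an empty result.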

...

Classification of ProcessFunction

...

According to the number of inputs and outputs, they are classified as follows:

| Process Function | Number of inputs | Number of outputs |
|---|---|---|
| OneInputStreamProcessFunction | 1 | 1 |
| TwoInputStreamProcessFunction | 2 | 1 |
| TwoOutputStreamProcessFunction | 1 | 2 |

Logically, process functions that support more inputs and outputs can be achieved by combining these, but such an implementation might be inefficient. If demand for this grows, we will consider supporting an arbitrary number of output edges through a mechanism like OutputTag. However, this loses the explicit generic type information that comes with using ProcessFunction.
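To illustrate what "combining" could look like, the sketch below builds a three-way split out of two chained two-output functions while keeping every output explicitly typed. It is only an illustration: `TwoOutputSketch` is a hypothetical, heavily simplified stand-in for a two-output function (plain `Consumer`s instead of the real collectors and contexts), and `ThreeWaySplit` is an invented example class.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

/** Hypothetical, simplified shape of a one-input, two-output function. */
interface TwoOutputSketch<IN, OUT1, OUT2> {
    void processRecord(IN record, Consumer<OUT1> first, Consumer<OUT2> second);
}

/** Splits integers three ways by chaining two two-output functions. */
final class ThreeWaySplit {
    final List<Integer> negatives = new ArrayList<>();
    final List<Integer> zeros = new ArrayList<>();
    final List<Integer> positives = new ArrayList<>();

    // Stage 1: negatives go to the first output, everything else to the second.
    private final TwoOutputSketch<Integer, Integer, Integer> stage1 =
            (value, first, second) -> {
                if (value < 0) {
                    first.accept(value);
                } else {
                    second.accept(value);
                }
            };

    // Stage 2: zeros go to the first output, positives to the second.
    private final TwoOutputSketch<Integer, Integer, Integer> stage2 =
            (value, first, second) -> {
                if (value == 0) {
                    first.accept(value);
                } else {
                    second.accept(value);
                }
            };

    /** Feeds the second output of stage 1 into stage 2 to obtain three outputs. */
    void process(int value) {
        stage1.processRecord(
                value,
                negatives::add,
                rest -> stage2.processRecord(rest, zeros::add, positives::add));
    }
}
```

Every non-negative record passes through two functions here, which hints at the kind of overhead the text refers to when it says a combined implementation might be inefficient.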

...

For OneInputStreamProcessFunction:

| Input Stream | Output Stream |
|---|---|
| Global | Global |
| Keyed | Keyed / Non-Keyed |
| NonKeyed | NonKeyed |
| Broadcast | Not Supported |

When a KeyedPartitionStream is used as input, the output can be either a KeyedPartitionStream or a NonKeyedPartitionStream. For general data processing logic, how the output data is partitioned is uncertain, so we can only expect a NonKeyedPartitionStream. If we do need a deterministic partition, we can follow it with a KeyBy partitioning. However, there are times when we know for sure that the partition of records will not change before and after processing, in which case the shuffle cost of the extra partitioning can be avoided. To be safe, in this case we ask for a KeySelector for the output data, and the framework checks at runtime whether this invariant is broken. The same is true for the two-output and two-input counterparts. For a more detailed explanation, see the API definition of KeyedPartitionStream in the Proposed Changes section below.
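The runtime check described above could look roughly like the following sketch. It is not the framework's implementation: `KeyPreservingFunction` is a hypothetical wrapper, plain `java.util.function.Function` stands in for KeySelector and for the user logic, and comparing the input key with the output key for equality is an assumed simplification of whatever partition check the framework actually performs.

```java
import java.util.Objects;
import java.util.function.Function;

/** Hypothetical wrapper that verifies user logic does not change a record's key. */
final class KeyPreservingFunction<IN, OUT, K> {
    private final Function<IN, K> inputKeySelector;
    private final Function<OUT, K> outputKeySelector;
    private final Function<IN, OUT> userLogic;

    KeyPreservingFunction(
            Function<IN, K> inputKeySelector,
            Function<OUT, K> outputKeySelector,
            Function<IN, OUT> userLogic) {
        this.inputKeySelector = inputKeySelector;
        this.outputKeySelector = outputKeySelector;
        this.userLogic = userLogic;
    }

    /** Applies the user logic and fails fast if the output key differs from the input key. */
    OUT process(IN record) {
        K inputKey = inputKeySelector.apply(record);
        OUT result = userLogic.apply(record);
        K outputKey = outputKeySelector.apply(result);
        // Assumed invariant: the output must still belong to the input record's key.
        if (!Objects.equals(inputKey, outputKey)) {
            throw new IllegalStateException(
                    "Output key " + outputKey + " differs from input key " + inputKey);
        }
        return result;
    }
}
```

With such a check in place, the output can be treated as keyed without an extra KeyBy, because any record that would land in a different partition is rejected instead of being silently misrouted.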


For TwoOutputStreamProcessFunction:

| Input Stream | Output Stream |
|---|---|
| Global | Global + Global |
| Keyed | Keyed + Keyed / Non-Keyed + Non-Keyed |
| NonKeyed | NonKeyed + NonKeyed |
| Broadcast | Not Supported |

There are two points to note here:

  1. A broadcast stream cannot be used as a single input.

...

  2. Generally speaking, when a keyed stream is used as input, its output should be a non-keyed stream, because the original partition may change during processing. But if we provide a specific KeySelector, its output can be keyed partitioned.


The two-input case is a little more complicated. The following table lists which streams are compatible with each other and the types of streams they output.

A cross (❎) indicates that the combination is not supported.

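| Output (Input1 \ Input2) | Global | Keyed | NonKeyed | Broadcast |
|---|---|---|---|---|
| Global | Global | ❎ | ❎ | ❎ |
| Keyed | ❎ | Non-Keyed / Keyed | ❎ | Non-Keyed |
| NonKeyed | ❎ | ❎ | Non-Keyed | Non-Keyed |
| Broadcast | ❎ | Non-Keyed | Non-Keyed | ❎ |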

  1. The connection between a Global Stream and a Non-Global Stream is not supported because the number of partitions of a Global Stream is forced to be 1, whereas it is generally not 1 for a Non-Global Stream. This would cause conflicts when determining the number of partitions of the output stream. If necessary, they should be transformed into mutually compatible streams and then connected.
  2. Connecting two broadcast streams doesn't really make sense, because each parallel instance would receive exactly the same input data from both streams and any processing would be duplicated.
  3. The reason why the output of two keyed partition streams can be keyed or non-keyed is the same as described above for the single-input case.
  4. When we connect two KeyedPartitionStreams, they must have the same key type, otherwise we can't decide how to merge the partitions of the two streams. At the same time, operations such as accessing state and registering timers are restricted to the partition itself, so cross-partition interaction is not meaningful.
  5. The reasons why the connection between KeyedPartitionStream and NonKeyedPartitionStream is not supported are as follows:
    1. The partitioning of data on a KeyedPartitionStream is deterministic, but on a NonKeyedPartitionStream it is not. It is difficult to think of a scenario where the two need to be connected.
    2. This will complicate the state declaration and access rules. A more detailed discussion can be found in the subsequent state-related sub-FLIP.
    3. If we see that most people have clear demands for this, we can support it in the future.
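The compatibility rules for two inputs can be condensed into a small decision function. The sketch below is one reading of the table and the notes above, not part of the proposed API: `TwoInputCompatibility`, `Kind` and `outputOf` are hypothetical names, and the `outputKeySelectorGiven` flag is an assumed way of modelling whether a KeySelector was supplied for the output.

```java
import java.util.Optional;

/** Hypothetical helper that encodes the two-input compatibility rules. */
final class TwoInputCompatibility {
    /** Stream kinds, named after the row and column headers of the table. */
    enum Kind { GLOBAL, KEYED, NON_KEYED, BROADCAST }

    private TwoInputCompatibility() {}

    /** Returns the output kind for the two inputs, or empty if they cannot be connected. */
    static Optional<Kind> outputOf(Kind in1, Kind in2, boolean outputKeySelectorGiven) {
        // Note 1: a global stream can only be connected with another global stream.
        if (in1 == Kind.GLOBAL || in2 == Kind.GLOBAL) {
            return in1 == in2 ? Optional.of(Kind.GLOBAL) : Optional.empty();
        }
        // Note 2: connecting two broadcast streams is not supported.
        if (in1 == Kind.BROADCAST && in2 == Kind.BROADCAST) {
            return Optional.empty();
        }
        // Note 5: keyed and non-keyed streams cannot be connected with each other.
        boolean keyedWithNonKeyed =
                (in1 == Kind.KEYED && in2 == Kind.NON_KEYED)
                        || (in1 == Kind.NON_KEYED && in2 == Kind.KEYED);
        if (keyedWithNonKeyed) {
            return Optional.empty();
        }
        // Note 3: two keyed inputs stay keyed only when an output KeySelector is given.
        if (in1 == Kind.KEYED && in2 == Kind.KEYED && outputKeySelectorGiven) {
            return Optional.of(Kind.KEYED);
        }
        // Every other supported combination produces a non-keyed stream.
        return Optional.of(Kind.NON_KEYED);
    }
}
```

The key-type requirement from note 4 is deliberately left out of this sketch; in a typed API it would more naturally be expressed through generics than through a runtime lookup.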

Lifecycle of Process Function

...