Discussion thread	here (<- link to https://lists.apache.org/list.html?dev@flink.apache.org)
Vote thread	TBD
JIRA	TBD
Release	<Flink Version>

Please keep the discussion on the mailing list rather than commenting on the wiki (wiki discussions get unwieldy fast).

Motivation

In DataStream API, a DataStream can be transformed into a KeyedStream through keyBy, and then further transformed into a WindowedStream through window operations. WindowedStream enables window processing on records with the same key. With the improvement of FLIP-331, WindowedStream will support the full window processing by the window assigner EndOfStreamWindows for which the window is only triggered at the end of inputs.

However, full window processing is not supported directly by DataStream. This means the DataStream cannot collect all records of each subtask (these records have no keys) separately into a full window and process them at the end of inputs. DataSet API already supports processing and sorting all records within each subtask through mapPartition API and sortPartition API. As DataSet API has been deprecated in Flink 1.18, it is necessary to enhance the DataStream to support handling full window processing on individual subtasks.

In this FLIP, we propose two main enhancements. Firstly, we propose enabling DataStream to directly transform into a PartitionWindowedStream. The PartitionWindowedStream represents collecting all records of each subtask separately into a full window. Secondly, we propose supporting four APIs on PartitionWindowedStream, including mapPartition, sortPartition, aggregate and reduce.

Public Interfaces

1. We introduce the fullWindowPartition method to the DataStream class.

2. We introduce the PartitionWindowedStream that extends the DataStream.

3. We add four APIs to PartitionWindowedStream, including mapPartition, sortPartition, aggregate and reduce.

Proposed Changes

Support full window processing on DataStream

We propose to only support full window processing on DataStream. It's difficult to support arbitrary types of windows on DataStream, including count windows, sliding windows, and session windows. The main reason is that the DataStream is non-keyed and does not support keyed statebackend and keyed raw state. This issue results in two conflicts on the usage of various windows:

1. The storage of window state relies on the InternalKvState provided by KeyedStateBackend. The storage of timers in time service also relies on keyed raw state.

2. The recovery of window state and timers relies on the assumption that every record has a key. The window state and timers will be stored and distinguished by the key. In fault tolerance and rescaling scenarios, each subtask must be assigned a key range to recover the window state and timers correctly.

Furthermore, based on community feedbacks, there is currently no demand for arbitrary window processing on DataStream. However, it is worth noting that the DataSet API already offers full window processing capabilities. Integrating this existing capability into DataStream would enhance its functionality and better meet the needs of users.

Therefore, we propose to only support full window processing on DataStream. This feature is designed to work only when RuntimeExecutionMode=BATCH and does not support checkpoint. If user specifies RuntimeExecutionMode=STREAMING, the job will be failed to submit. Without the requirement for checkpoint, the underlying implementation can no longer rely on state and avoid the aforementioned conflict issues.

The definition of PartitionWindowedStream

We first propose to add a new method fullWindowPartition() in DataStream class, which represents openning a full window on DataStream

The PartitionWindowedStream does not support setting the parallelism and has the same parallelism with previous operator.

The PartitionWindowedStream class will be annotated by @PublicEnvolving as more APIs may be added in the future.

Here is the fullWindowPartition method and PartitionWindowedStream class.

public class DataStream<T> {
  
  ...
  
  /**
   * Collect records in each subtask of this data stream separately into a full 
   * window. The window emission will be triggered at the end of inputs.
   *
   * @return The data stream full windowed on each subtask.
   */
  public PartitionWindowedStream<T> fullWindowPartition() {
    ...
  }
}

/**
 * {@link PartitionWindowedStream} represents a data stream that collects all 
 * records of each subtask separately into a full window. Window emission will 
 * be triggered at the end of inputs.
 *
 * @param <T> The type of the elements in this stream.
 */
public class PartitionWindowedStream<T> extends DataStream<T> {
  ...
}

API Implementation

MapPartition

We introduce the mapPartition API in the PartitionWindowedStream.

public class PartitionWindowedStream<T> extends DataStream<T> {

    /**
     * Process the records of the window by {@link MapPartitionFunction}. The 
     * records will be available in the given {@link Iterator} function 
     * parameter of {@link MapPartitionFunction}.
     *
     * @param mapPartitionFunction The {@link MapPartitionFunction} that is 
     * called for the records in the full window.
     * @return The resulting data stream.
     * @param <R> The type of the elements in the resulting stream, equal to the
     *     MapPartitionFunction's result type.
     */
    public <R> DataStream<R> mapPartition(MapPartitionFunction<T, R> mapPartitionFunction) {
      ...
    }
}

We also show the definition of MapPartitionFunction.

public interface MapPartitionFunction<T, O> extends Function, Serializable {

    /**
     * A user-implemented function that modifies or transforms an incoming 
     * object.
     *
     * @param values All records for the mapper
     * @param out The collector to hand results to.
     * @throws Exception This method may throw exceptions. Throwing an 
     * exception will cause the operation to fail and may trigger recovery.
     */
    void mapPartition(Iterable<T> values, Collector<O> out) throws Exception;
}

Next, we will propose the operator implementation for the mapPartition API .

In our implementation, the operator will execute the MapPartitionFunction while receiving records, rather than waiting for the entire window of records to be collected. To achieve this, we add a seperate UDFExecutionThread inside the operator.

The TaskMainThread will cyclically add records to a fixed-size queue. The UDFExecutionThread will invoke user-defined MapPartitionFunction and cyclically poll records from the queue in the Iterator parameter of MapPartitionFunction. If there is no records in the queue, the UDFExecutionThread blocks and waits on the hasNext() and next() methods of the Iterator. Once the UDFExecutionThread has processed all the data, the operator completes its execution.

The following diagram illustrates the interaction between the TaskMainThread and the UDFExecutionThread：

SortPartition

We introduce the sortPartition API in the PartitionWindowedStream, including three methods to sort records by different key extraction logics.

public class PartitionWindowedStream<T> extends DataStream<T> {

    /**
     * Sorts the records of the window on the specified field in the 
     * specified order. The type of records must be {@link Tuple}.
     *
     * @param field The field index on which records is sorted.
     * @param order The order in which records is sorted.
     * @return The resulting data stream with sorted records in each subtask.
     */
    public DataStream<T> sortPartition(int field, Order order) {
      ...
    }

    /**
     * Sorts the records of the window on the specified field in the 
     * specified order. The type of records must be {@link Tuple} or POJO 
     * class. The POJO class must be public and have getter and setter methods 
     * for each field. It mustn't implement any interfaces or extend any 
     * classes.
     * 
     * @param field The field expression referring to the field on which 
     * records is sorted.
     * @param order The order in which records is sorted.
     * @return The resulting data stream with sorted records in each subtask.
     */
    public DataStream<T> sortPartition(String field, Order order) {
      ...
    }

    /**
     * Sorts the records of the window on the extracted key in the specified order.
     *
     * @param keySelector The KeySelector function which extracts the key 
     * from records.
     * @param order The order in which records is sorted.
     * @return The resulting data stream with sorted records in each subtask.
     */
    public <K> DataStream<T> sortPartition(KeySelector<T, K> keySelector, Order order) {
      ...
    }
  
}

We also show the definition of Order, which defines the rules for sorting the records.

public enum Order {
    /** Indicates an ascending order. */
    ASCENDING,

    /** Indicates a descending order. */
    DESCENDING
}

Next, we will propose the operator implementation for the sortPartition API.

The TaskMainThread will add records to the ExternalSorter, which is a multi-way merge sorter for sorting large amounts of data that cannot totally fit into memory. The ExternalSorter will sort the records according to the Order and send the sorted records to output at the end of inputs. The following diagram illustrates the interaction between the TaskMainThread and the ExternalSorter:

Aggregate

We introduce the aggregate API in the PartitionWindowedStream.

public class PartitionWindowedStream<T> extends DataStream<T> {
    /**
     * Applies the given aggregate function to the records of the window. The 
     * aggregate function is called for each element, aggregating values 
     * incrementally in the window.
     *
     * @param aggregateFunction The aggregation function.
     * @return The resulting data stream.
     * @param <ACC> The type of the AggregateFunction's accumulator.
     * @param <R> The type of the elements in the resulting stream, equal to 
     * the AggregateFunction's result type.
     */
    public <ACC, R> DataStream<R> aggregate(AggregateFunction<T, ACC, R> aggregateFunction) {
      ...
    }
  
}

Next, we will propose the operator implementation for the aggregate API.

The TaskMainThread will first create a accumulator by invoking the AggregateFunction#createAccumulator, and then compute each record by the accumulator in AggregateFunction#add. At the end of inputs, the TaskMainThread will get the result from the accumulator and send it to output.

Reduce

We introduce the reduce API in the PartitionWindowedStream.

public class PartitionWindowedStream<T> extends DataStream<T> {

    /**
     * Applies a reduce transformation on the records of the window. The 
     * {@link ReduceFunction} will be called for every record in the window.
     *
     * @param reduceFunction The reduce function.
     * @return The resulting data stream.
     */
    public DataStream<T> reduce(ReduceFunction<T> reduceFunction) {
      ...
    }
}

Next, we will propose the operator implementation for the reduce API.

The TaskMainThread will invoke ReduceFunction#reduce for each record in the window and only send the final result to output at the end of inputs.

Rejected Alternatives

1. In the implementation of mapPartition API, make operator do not cache records. The UDFExecutionThread must execute synchronously with TaskMainThread to get the next record and process it in MapPartitionFunction.

We choose not to use this approach because it will result in inefficient execution due to frequent thread context switching between TaskMainThread and UDFExecutionThread. Caching records enables concurrent execution of the two threads and reduces the frequency of context switching.

Compatibility, Deprecation, and Migration Plan

This FLIP proposes a new feature in Flink. It is fully backward compatible.

Test Plan

We will provide unit and integration tests to validate the proposed changes.

Page tree

FLIP-380: Support Full Partition Processing On Non-keyed DataStream