...

To address these issues and improve the relevant Flink design as much as possible, this FLIP proposes to add a couple of APIs in the flink-ml repository to achieve the following goals:

...

We define a few terms below to make this doc easier to read.

1) Feedback streams

We consider a stream to be a feedback stream if the computation of records in this stream depends on records in the stream itself. In other words, there is a cycle in the Flink graph that generates this stream.

Note that the Flink core runtime supports only directed acyclic graphs (DAGs) of operators. Thus, supporting a cyclic graph of operators requires some "magic", which we describe in the rest of this doc.

2) Iteration body

An iteration body is a subgraph of operators that implements the computation logic of e.g. an iterative machine learning algorithm. In particular, the iteration body will output values into some feedback streams and take values from the same feedback streams as part of its inputs.

Target Use-cases

The target use-cases (i.e. algorithms) can be described w.r.t. the properties described below. In the following, we first describe the definitions of properties, followed by the combinations of the property choices supported by the existing APIs and the proposed APIs, respectively.

Definitions of properties

Different algorithms might require different properties for the input datasets (bounded or unbounded), the synchronization between parallel subtasks (sync or async), and the amount of data processed for every variable update (a batch/subset or the entire dataset). We describe each of these properties below.

1) Algorithms have different needs for whether the input data streams should be bounded or unbounded. We classify those algorithms into online algorithms and offline algorithms as below.

For online training algorithms, the training samples are unbounded streams of data. The corresponding iteration body should ingest these unbounded streams of data, read each value in each stream once, and update the machine learning model repeatedly in near real-time. The iteration never terminates in this case. The algorithm should be executed as a streaming job.

For offline training algorithms, the training samples are bounded streams of data. The corresponding iteration body should read these bounded data streams for an arbitrary number of rounds and update the machine learning model repeatedly until a termination criterion is met (e.g. a given number of rounds is reached or the model has converged). The algorithm should be executed as a batch job.
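The contrast between the two classes can be sketched in plain Java without any Flink dependency. This is only an illustration under stated assumptions: the class `OnlineOfflineSketch`, its methods, and the toy model (a single number fitted to the mean of the samples) are hypothetical and are not part of the proposed APIs.

```java
import java.util.List;

/** Toy contrast between offline and online training (hypothetical names, no Flink dependency). */
class OnlineOfflineSketch {

    /**
     * Offline: the bounded samples are re-read once per round until the model
     * converges or maxRounds is reached. Assumes a non-empty sample list.
     * Returns {final model, number of rounds executed}.
     */
    static double[] offlineFit(List<Double> samples, double learningRate,
                               int maxRounds, double tolerance) {
        double model = 0.0;
        int round = 0;
        while (round < maxRounds) {
            double gradient = 0.0;
            for (double x : samples) {      // one full pass over the bounded data
                gradient += (model - x);
            }
            gradient /= samples.size();
            double updated = model - learningRate * gradient;
            round++;
            boolean converged = Math.abs(updated - model) < tolerance;
            model = updated;
            if (converged) {
                break;                      // termination criterion: model has converged
            }
        }
        return new double[] {model, round};
    }

    /**
     * Online: each sample is seen exactly once and the model (a running mean)
     * is updated immediately. There is no termination criterion.
     */
    static double onlineUpdate(double model, long samplesSeenSoFar, double sample) {
        return model + (sample - model) / (samplesSeenSoFar + 1);
    }
}
```

In this sketch the offline loop owns the termination decision, while the online update is a pure per-record function that a never-ending streaming job could call for each arriving sample.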

2) Algorithms (either online or offline) have different needs for how their parallel subtasks synchronize with one another.

In the sync mode, the parallel subtasks that execute the iteration body update the model variables in a coordinated manner. There exist global epochs, such that all subtasks read the shared model variables at the beginning of an epoch, calculate variable updates based on the fetched variable values, and write updates of the variable values at the end of this epoch.

In the async mode, each parallel subtask that executes the iteration body can read/update the shared model variables without waiting for variable updates from other subtasks. For example, a subtask could have updated model variables 10 times when another subtask has updated model variables only 3 times.

The sync mode is useful when an algorithm should be executed deterministically to achieve the best possible accuracy, and the straggler issue (i.e. a subtask that is considerably slower than the others) does not slow down the algorithm execution too much. In comparison, the async mode is useful for algorithms that should be parallelized and executed across many subtasks as fast as possible, without performance problems caused by stragglers, at the possible cost of reduced accuracy.
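The difference between the two modes can be illustrated with a small single-threaded simulation. This is a sketch under stated assumptions: the class `SyncAsyncSketch`, the toy update rule (each subtask pulls a shared scalar model toward its own target), and the explicit `schedule` list that stands in for arbitrary subtask timing are all hypothetical and not part of the proposed APIs.

```java
import java.util.List;

/** Single-threaded simulation of sync vs. async variable updates (hypothetical names). */
class SyncAsyncSketch {

    /**
     * Sync mode: in every epoch all subtasks read the same snapshot of the model,
     * and their updates are written together at the end of the epoch.
     */
    static double runSync(double model, double[] targets, double lr, int epochs) {
        for (int epoch = 0; epoch < epochs; epoch++) {
            double snapshot = model;                 // value read by every subtask
            double sumOfUpdates = 0.0;
            for (double target : targets) {          // one entry per subtask
                sumOfUpdates += lr * (target - snapshot);
            }
            model = snapshot + sumOfUpdates;         // applied at the end of the epoch
        }
        return model;
    }

    /**
     * Async mode: subtasks read and update the model whenever they are scheduled,
     * without waiting for each other. The schedule lists subtask indices in the
     * order in which they happen to run.
     */
    static double runAsync(double model, double[] targets, double lr, List<Integer> schedule) {
        for (int subtask : schedule) {
            model += lr * (targets[subtask] - model); // read and write immediately
        }
        return model;
    }

    /** Counts how often each subtask updated the model under a given schedule. */
    static int[] countUpdates(List<Integer> schedule, int numSubtasks) {
        int[] counts = new int[numSubtasks];
        for (int subtask : schedule) {
            counts[subtask]++;
        }
        return counts;
    }
}
```

With the same inputs, `runSync` always yields the same result, whereas `runAsync` depends on the schedule: swapping the order of two subtasks changes the final model. This mirrors the determinism versus speed trade-off described above.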

3) An algorithm may have additional requirements on how much data should be consumed each time before a subtask can update variables. There are two categories of choices here:

Per-batch variable update: The algorithm wants to update variables every time an arbitrary subset of the user-provided data streams (either bounded or unbounded) is processed.

Per-round variable update: The algorithm wants to update variables every time all data of the user-provided bounded data streams is processed.

In the machine learning domain, some algorithms allow users to configure a batch size, and the model is updated every time each subtask processes a batch of data. Those algorithms fit into the first category. Such an algorithm can be either online or offline.

Other algorithms only update variables every time the entire dataset has been consumed for one round. Those algorithms fit into the second category. Such an algorithm must be offline because, by this definition, the user-provided dataset must be bounded.
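As a rough illustration of the two categories, the following sketch counts how many variable updates a single subtask would perform over a bounded dataset. The class `UpdateGranularitySketch` and its methods are hypothetical, and the counting model (one update per batch, or one update per full pass) is a simplifying assumption rather than behavior defined by this FLIP.

```java
/** Counts variable updates for per-batch vs. per-round policies (hypothetical names). */
class UpdateGranularitySketch {

    /** Per-batch: one update after each batch of up to batchSize records, in every round. */
    static int perBatchUpdates(int numRecords, int batchSize, int rounds) {
        if (batchSize <= 0) {
            throw new IllegalArgumentException("batchSize must be positive");
        }
        int updates = 0;
        for (int round = 0; round < rounds; round++) {
            for (int start = 0; start < numRecords; start += batchSize) {
                // Records [start, min(start + batchSize, numRecords)) form one batch.
                updates++;
            }
        }
        return updates;
    }

    /** Per-round: one update only after the whole bounded dataset has been read. */
    static int perRoundUpdates(int rounds) {
        return rounds;
    }
}
```

For example, 10 records with a batch size of 4 give 3 batches per round, so 3 rounds produce 9 per-batch updates but only 3 per-round updates.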

Combinations of property choices supported by the existing APIs

The existing DataSet::iterate supports algorithms that process bounded streams + sync mode + per-round variable update.

The existing DataStream::iterate has a few bugs that prevent it from being used in production (see FLIP-15). Apart from that, it is expected to support algorithms that process unbounded streams + async mode + per-batch variable update.

Combinations of property choices supported by the proposed APIs

As described above, there are 3 properties, each with 2 choices, so there are 8 combinations of property choices in total. Since "per-round variable update" cannot be used with unbounded data streams, only 6 of the 8 combinations are valid.

The APIs proposed in this FLIP support all 6 valid combinations of property choices.
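The count of valid combinations can be checked by enumeration. The sketch below is plain Java; the enum and class names are hypothetical labels for the three properties defined above, not identifiers from the proposed APIs.

```java
import java.util.ArrayList;
import java.util.List;

/** Enumerates the combinations of the three properties (hypothetical names). */
class PropertyCombinations {

    enum Boundedness { BOUNDED, UNBOUNDED }
    enum SyncMode { SYNC, ASYNC }
    enum UpdateGranularity { PER_BATCH, PER_ROUND }

    /** Per-round variable update requires bounded input streams. */
    static boolean isValid(Boundedness b, SyncMode s, UpdateGranularity u) {
        return !(b == Boundedness.UNBOUNDED && u == UpdateGranularity.PER_ROUND);
    }

    /** Returns a readable label for every valid combination. */
    static List<String> validCombinations() {
        List<String> result = new ArrayList<>();
        for (Boundedness b : Boundedness.values()) {
            for (SyncMode s : SyncMode.values()) {
                for (UpdateGranularity u : UpdateGranularity.values()) {
                    if (isValid(b, s, u)) {
                        result.add(b + " + " + s + " + " + u);
                    }
                }
            }
        }
        return result;
    }
}
```

The two excluded combinations are UNBOUNDED + SYNC + PER_ROUND and UNBOUNDED + ASYNC + PER_ROUND, which leaves 6 of the 8.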

Summary

The following table summarizes the use-cases supported by the existing APIs and proposed APIs, respectively, with respect to the properties defined above.

...

, whose outputs might be fed back as the inputs of this subgraph. Therefore there is a cycle in the Flink graph if the Flink program has an iteration body.

Note that not all outputs of an iteration body have to be fed back as the inputs of this subgraph.

2) Feedback stream

For a given iteration body, a stream is said to be a feedback stream if it connects an output of this iteration body back to the input of this iteration body.

3) Epoch of records

In the proposed APIs, for any given record generated by the iteration body, we define the epoch of this record to be the number of times the iteration body has been invoked in the computation history of this record. The exact definition of epoch can be found in the Javadoc of the IterationUtils class below.

Note that epoch is also a term commonly used in the context of machine learning to indicate the number of passes over the entire training dataset that the machine learning algorithm has completed. We refer to this as the "classic definition of epoch" below.

Our definition of the term "epoch" is a natural extension of the "classic definition of epoch" to the context of asynchronous machine learning on unbounded streams. We make the following observations regarding their comparison:

The classic definition of epoch is well-defined only when the machine learning algorithm processes bounded streams AND all subtasks of the algorithm update the model variables synchronously.
Our definition of epoch can also be applied in the cases where the machine learning algorithm processes unbounded streams or subtasks of the algorithm update the model variables asynchronously.
In the cases where the classic definition of epoch is well-defined (as described above), if the machine learning algorithm updates the model variables once after making a pass over the training dataset, then the two definitions of epoch are exactly the same.
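The record-level notion of epoch can be illustrated with a small plain-Java model of a feedback loop. This is a simplified reading of the definition above, not the exact one from the IterationUtils Javadoc. The class `EpochSketch`, the `Record` type, the decrement-until-zero iteration body, and the choice to start incoming records at epoch 0 are all assumptions made for illustration.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

/** Tracks the epoch of records flowing through a toy feedback loop (hypothetical names). */
class EpochSketch {

    /** A value together with the number of iteration-body invocations in its history. */
    static final class Record {
        final int value;
        final int epoch;

        Record(int value, int epoch) {
            this.value = value;
            this.epoch = epoch;
        }
    }

    /** Toy iteration body: decrements the value; the output's epoch is the input's epoch + 1. */
    static Record iterationBody(Record in) {
        return new Record(in.value - 1, in.epoch + 1);
    }

    /** Feeds positive outputs back into the body and collects the records that leave the loop. */
    static List<Record> run(List<Integer> initialValues) {
        Deque<Record> inputs = new ArrayDeque<>();
        for (int v : initialValues) {
            inputs.add(new Record(v, 0));       // assumption: records enter with epoch 0
        }
        List<Record> finalOutputs = new ArrayList<>();
        while (!inputs.isEmpty()) {
            Record out = iterationBody(inputs.poll());
            if (out.value > 0) {
                inputs.add(out);                // feedback stream
            } else {
                finalOutputs.add(out);          // output that is not fed back
            }
        }
        return finalOutputs;
    }
}
```

In this model, records in the same run can carry different epochs at the same time, which is why the record-level definition remains meaningful without global synchronization.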

Target Use-cases

The target use-cases (i.e. algorithms) can be described w.r.t. the properties described below. In the following, we first describe the definitions of properties, followed by the combinations of the property choices supported by the existing APIs and the proposed APIs, respectively.

Properties of machine learning algorithms

Different algorithms might require different properties for the input datasets (bounded or unbounded), the synchronization between parallel subtasks (sync or async), and the amount of data processed for every variable update (a batch/subset or the entire dataset). We describe each of these properties below.

1) Algorithms have different needs for whether the input data streams should be bounded or unbounded. We classify those algorithms into online algorithms and offline algorithms as below.

For online training algorithms, the training samples are unbounded streams of data. The corresponding iteration body should ingest these unbounded streams of data, read each value in each stream once, and update the machine learning model repeatedly in near real-time. The iteration never terminates in this case. The algorithm should be executed as a streaming job.

For offline training algorithms, the training samples are bounded streams of data. The corresponding iteration body should read these bounded data streams for an arbitrary number of rounds and update the machine learning model repeatedly until a termination criterion is met (e.g. a given number of rounds is reached or the model has converged). The algorithm should be executed as a batch job.

2) Algorithms (either online or offline) have different needs for how their parallel subtasks synchronize with one another.

In the sync mode, the parallel subtasks that execute the iteration body update the model variables in a coordinated manner. There exist global epochs, such that all subtasks read the shared model variables at the beginning of an epoch, calculate variable updates based on the fetched variable values, and write updates of the variable values at the end of this epoch.

In the async mode, each parallel subtask that executes the iteration body can read/update the shared model variables without waiting for variable updates from other subtasks. For example, a subtask could have updated model variables 10 times when another subtask has updated model variables only 3 times.

The sync mode is useful when an algorithm should be executed deterministically to achieve the best possible accuracy, and the straggler issue (i.e. a subtask that is considerably slower than the others) does not slow down the algorithm execution too much. In comparison, the async mode is useful for algorithms that should be parallelized and executed across many subtasks as fast as possible, without performance problems caused by stragglers, at the possible cost of reduced accuracy.

3) An algorithm may have additional requirements on how much data should be consumed each time before a subtask can update variables. There are two categories of choices here:

Per-batch variable update: The algorithm wants to update variables every time an arbitrary subset of the user-provided data streams (either bounded or unbounded) is processed.

Per-round variable update: The algorithm wants to update variables every time all data of the user-provided bounded data streams is processed.