...

  • DataSet::iterate(...) only supports iteration on bounded data streams. Moreover, as described in FLIP-131, the DataSet API will be deprecated in favor of the Table API/SQL in the future.
  • DataStream::iterate(...) has a few design issues that prevent it from being used reliably in production jobs. Many of these issues, such as the possibility of deadlock, are described in FLIP-15.

To address the issues described above and provide a long-term solution for iteration on both bounded and unbounded data streams, this FLIP proposes to add a couple of APIs in the flink-ml repository to achieve the following goals:

  • Provide a solution for all the iteration use-cases (see the use-case section below for more detail) supported by the existing APIs, without the issues described above.
  • Provide a solution for a few additional use-cases (e.g. bounded streams + async mode + per-round variable update) not supported by the existing APIs.


Note that we have chosen to put the iteration API (and its implementation) in the flink-ml repository instead of the DataStream class in the Flink core repository, because we believe it is important to keep the Flink core runtime as simple and maintainable as possible.

...

The target use-cases (i.e. algorithms) can be characterized by the categories below. We first describe the categories, then the combinations of those categories supported by the existing APIs and by the proposed APIs, respectively.

Categories of Algorithms

Different algorithms might have different requirements for the input datasets (bounded or unbounded), the synchronization between parallel subtasks (sync or async), and the amount of data processed for every variable update (a batch/subset or the entire dataset). We describe each of these requirements below.

...

Combinations of categories supported by the existing DataSet::iterate and DataStream::iterate APIs

The existing DataSet::iterate supports algorithms that process bounded streams of data (i.e. offline training) + sync mode + per-round variable update.

The existing DataStream::iterate has a few bugs that prevent it from being used in production (see FLIP-15). Apart from that, it is intended to support algorithms that process unbounded streams of data (i.e. online training) + async mode + per-batch variable update.

Combinations of categories supported by the APIs proposed in this FLIP

The proposed APIs support 6 out of the 8 combinations of the above categories. The following 2 combinations are not supported because, by definition, "per-round variable update" can only be used with bounded data streams.

...
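
The count of 6 out of 8 can be checked by enumerating the three two-valued categories and applying the single rule stated above, namely that per-round variable update requires a bounded input. The following is a minimal sketch for illustration only: the class, enum, and method names are hypothetical and are not part of the APIs proposed in this FLIP.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical model of the algorithm categories; not part of the proposed API.
public class IterationCombinations {
    enum Input { BOUNDED, UNBOUNDED }
    enum Sync { SYNC, ASYNC }
    enum Update { PER_BATCH, PER_ROUND }

    // One point in the 2 x 2 x 2 space of categories.
    static final class Combination {
        final Input input;
        final Sync sync;
        final Update update;

        Combination(Input input, Sync sync, Update update) {
            this.input = input;
            this.sync = sync;
            this.update = update;
        }

        @Override
        public String toString() {
            return input + " + " + sync + " + " + update;
        }
    }

    // Per-round variable update is only defined for bounded inputs,
    // so every other combination is treated as supported.
    static boolean isSupported(Combination c) {
        return !(c.input == Input.UNBOUNDED && c.update == Update.PER_ROUND);
    }

    // Enumerates all 8 combinations and keeps the supported ones.
    static List<Combination> supported() {
        List<Combination> result = new ArrayList<>();
        for (Input input : Input.values()) {
            for (Sync sync : Sync.values()) {
                for (Update update : Update.values()) {
                    Combination c = new Combination(input, sync, update);
                    if (isSupported(c)) {
                        result.add(c);
                    }
                }
            }
        }
        return result;
    }

    public static void main(String[] args) {
        supported().forEach(System.out::println);
    }
}
```

Under this model, the two excluded combinations are the unbounded input with per-round update, in sync mode and in async mode respectively.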