...

Previously, Flink supported bounded iteration with the DataSet API and unbounded iteration with the DataStream API. However, since Flink aims to deprecate the DataSet API and the iteration support in the DataStream API is rather incomplete, we need to implement a new iteration library in the flink-ml repository to support the algorithms.

The

...

Goals

The Types of the Algorithms

In general, an ML algorithm updates the model iteratively according to the data until the model converges. According to the granularity of the dataset used for each update, ML algorithms can be classified into two types:

...

*Although SGD-based algorithms are also batch-based, they could be implemented with an epoch-based method if intermediate state is allowed: each subtask could sample a batch from all the records, starting from the position of the last batch.
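The note above can be illustrated with a minimal sketch in plain Java (no Flink APIs). It assumes one possible reading of "from the position of the last batch": each subtask keeps a cursor into its cached bounded partition as intermediate state and takes the next batch sequentially, wrapping around at the end of the data. The class name `BatchCursor` and its methods are hypothetical and are not part of the proposed library.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper: one subtask's view of its cached, bounded partition. */
final class BatchCursor<T> {
    private final List<T> records; // all records held by this subtask
    private int position = 0;      // intermediate state: where the last batch ended
    private int epochs = 0;        // number of completed passes over the records

    BatchCursor(List<T> records) {
        if (records.isEmpty()) {
            throw new IllegalArgumentException("records must not be empty");
        }
        this.records = records;
    }

    /** Returns the next batch, continuing from the end of the previous one. */
    List<T> nextBatch(int batchSize) {
        List<T> batch = new ArrayList<>(batchSize);
        for (int i = 0; i < batchSize; i++) {
            batch.add(records.get(position));
            position++;
            if (position == records.size()) {
                // Reached the end of the bounded dataset: wrap around and count one epoch.
                position = 0;
                epochs++;
            }
        }
        return batch;
    }

    int completedEpochs() {
        return epochs;
    }
}
```

With five records and a batch size of two, the third call wraps around the end of the partition, which is the point at which one epoch has been consumed.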

...

Based on the above classification and the replacement implementation for SGD-based algorithms on bounded datasets, we mainly need to support:

  1. The synchronous / asynchronous epoch-based algorithms on bounded datasets.
  2. The synchronous / asynchronous batch-based algorithms on unbounded datasets.

The Goals of the Iteration Library

If we directly copied the current implementation of iteration in the DataSet and DataStream APIs, we would still run into some problems, so we would like to make some improvements to the existing iteration functionality.

The Round Semantics and Synchronization

At the iteration level, we need concepts that correspond to Epoch and Batch. We call processing one epoch a round: users specify a subgraph as the body of the iteration, which defines how to calculate the update, and one round ends after the iteration body has processed the whole dataset once (namely one epoch).
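As a rough illustration of the round concept, the sketch below uses plain Java on a single machine and does not use any Flink or flink-ml API. The "iteration body" is a loop that passes over the whole dataset once to compute an update, and each such pass counts as one round. The class `RoundLoop`, the method `fitMean`, and the toy objective (fitting the mean of the data by gradient descent) are assumptions made only for this example.

```java
import java.util.List;

/** Hypothetical single-process illustration of synchronous rounds. */
final class RoundLoop {

    /**
     * Runs rounds until the update is smaller than the tolerance or maxRounds is reached.
     * Returns a two-element array: {final model, number of rounds executed}.
     */
    static double[] fitMean(List<Double> dataset, double learningRate,
                            double tolerance, int maxRounds) {
        double model = 0.0;
        int round = 0;
        while (round < maxRounds) {
            // Iteration body: one full pass over the dataset (one epoch)
            // computes the gradient of the mean squared distance to the model.
            double gradient = 0.0;
            for (double x : dataset) {
                gradient += model - x;
            }
            gradient /= dataset.size();

            double update = -learningRate * gradient;
            model += update;
            round++; // the round ends once the whole dataset has been processed

            if (Math.abs(update) < tolerance) {
                break; // the model has converged
            }
        }
        return new double[] {model, round};
    }
}
```

Calling `RoundLoop.fitMean(List.of(1.0, 2.0, 3.0), 0.5, 1e-9, 1000)` moves the model toward the mean of the data, with the distance to the mean halving in every round.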


Per-Round vs. All-Rounds Semantics

From the perspective 



Besides, the previous DataStream and DataSet iteration APIs also have some caveats when it comes to implementing algorithms:

...