Versions Compared

Key

  • This line was added.
  • This line was removed.
  • Formatting was changed.

...

Previously Flink supported bounded iteration with DataSet API and supported the unbounded iteration with DataStream API. However, since Flink aims to deprecate the DataSet API and the iteration in the DataStream API is rather incomplete, thus we would require to re-implement a new iteration library in the Flink-ml repository to support the algorithms. 

The Goal

In general a ML algorithm would update the model according to the data in iteration until the model is converged. According to the granularity of the dataset used to update the model, in general ML algorithms could be classified into two types:

  1. Epoch-based: each epoch means the algorithm goes through all the training dataset. The epoch-based algorithm must work with the bounded dataset. 
  2. Batch-based: each batch means the algorithm samples a subset from all the records and used the sampled records to update the model. The batch-based algorithm could be work with the bounded or unbounded dataset.

In a distributed settings, the dataset might be partitioned onto multiple subtasks, then for each epoch we refer to each subtask goes through all the assigned records, and for batch we refer to each subtask sample from its assigned records. When update the model with the referred data in a distributed settings, there are further two styles:

  1. Synchronous: In synchronous pattern the model must wait for the updates from all the subtasks before it could be used in the computation of the next update in each subtask. 
  2. Asynchronous: In asynchronous pattern the model could directly apply the updates from some subtasks, and uses the updated value in the following computation immediately. 

In general the synchronous pattern would have higher accuracy and the asynchronous pattern would convergent faster. 

Based on the above dimensions, the algorithms could be classified into the following types:

TypeData GranularitySynchronization Pattern Bounded / Unbounded













According to how the model is updated, the algorithms could be classified into the following types:

  1. Offline non-SGD-based algorithm: Algorithms like K-Means, Apriori do not depends on the SGD algorithm. It would calculate the update to the model according to the whole dataset and in a distributed implementation the model would be updated synchronously. 
  2. Offline SGD-based algorithm: 


Besides, the previous DataStream and DataSet iteration APIs also have some caveats to support algorithm implementation:

...