
...

Based on the above dimensions, the algorithms could be classified into the following types:

Type	Data Granularity	Synchronization Pattern	Bounded / Unbounded	Examples
Non-SGD-based	Epoch	Mostly Synchronous	Bounded	K-Means, Apriori, Decision Tree, Random Walk
SGD-Based Synchronous Offline algorithm	Batch → Epoch*	Synchronous	Bounded	Linear Regression, Logistic Regression, Deep Learning algorithms
SGD-Based Asynchronous Offline algorithm	Batch → Epoch*	Asynchronous	Bounded	Same as above
SGD-Based Synchronous Online algorithm	Batch	Synchronous	Unbounded	Online version of the above algorithm
SGD-Based Asynchronous Online algorithm	Batch	Asynchronous	Unbounded	Online version of the above algorithm

*Although SGD-based algorithms are inherently batch-based, they can be implemented with an epoch-based method if intermediate state is allowed: each subtask can sample its next batch from the full set of records, starting from the position where the last batch ended.
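The footnote above can be illustrated with a minimal sketch. It is not Flink code: the class `EpochBatchSampler`, its fields, and the wrap-around at the end of the epoch are all assumptions made for illustration. The only intermediate state a subtask keeps is the position where the last batch ended.

```python
from typing import List, Sequence


class EpochBatchSampler:
    """Hypothetical subtask-side helper: serves batches from the records of one epoch."""

    def __init__(self, records: Sequence[float], batch_size: int) -> None:
        if batch_size <= 0:
            raise ValueError("batch_size must be positive")
        self.records = list(records)  # all the records held by this subtask
        self.batch_size = batch_size
        self.position = 0             # intermediate state: where the last batch ended

    def next_batch(self) -> List[float]:
        """Return the next batch; wrapping around at the end is an assumption."""
        n = len(self.records)
        if n == 0:
            return []
        batch = [self.records[(self.position + i) % n] for i in range(self.batch_size)]
        self.position = (self.position + self.batch_size) % n
        return batch


sampler = EpochBatchSampler(records=[1.0, 2.0, 3.0, 4.0, 5.0], batch_size=2)
first = sampler.next_batch()   # [1.0, 2.0]
second = sampler.next_batch()  # [3.0, 4.0]
third = sampler.next_batch()   # [5.0, 1.0], wraps around into the next pass
```

Each call continues from the stored position, so a batch-based algorithm can be driven by an operator that holds the whole epoch.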

...

If we directly reused the current iteration implementation of the DataSet and DataStream APIs, we would still run into some problems, so we would like to make some optimizations to the existing iteration functionality.

The Iteration Body, Round Semantics and Synchronization

At the iteration level, we need concepts that correspond to Epoch and Batch. We call the processing of one epoch a round: users specify a subgraph as the body of the iteration to describe how to calculate the update, and a round finishes after the iteration body has processed the whole dataset once (namely one epoch). Apparently, the round is meaningful only for the bounded cases.

Per-Round vs. All-Rounds Semantics

How could users specify the iteration body? If we first consider the bounded cases, there are two options:

Per-round: Users specify a subgraph, and for each round, the framework would recreate the operators and do the same computation.
All-rounds: Users specify a subgraph, and the operators inside the subgraph would process the epochs of all the rounds.
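The difference between the two options can be sketched in a few lines. This is an illustrative model only, not the Flink API: `SumOperator`, `run_per_round`, and `run_all_rounds` are hypothetical names, and the operator is a deliberately trivial stateful sum.

```python
from typing import List


class SumOperator:
    """Hypothetical stateful operator: accumulates every record it sees."""

    instances_created = 0  # counts how often the "framework" creates the operator

    def __init__(self) -> None:
        SumOperator.instances_created += 1
        self.total = 0

    def process_epoch(self, epoch: List[int]) -> int:
        for record in epoch:
            self.total += record
        return self.total


def run_per_round(epochs: List[List[int]]) -> List[int]:
    """Per-round: a fresh operator is created for every round."""
    return [SumOperator().process_epoch(epoch) for epoch in epochs]


def run_all_rounds(epochs: List[List[int]]) -> List[int]:
    """All-rounds: one operator instance processes the epochs of all rounds."""
    operator = SumOperator()
    return [operator.process_epoch(epoch) for epoch in epochs]


epochs = [[1, 2, 3], [1, 2, 3], [1, 2, 3]]
per_round = run_per_round(epochs)    # operator state is reset each round
all_rounds = run_all_rounds(epochs)  # operator state carries over between rounds
```

With per-round semantics the operator state starts empty in every round, so each round yields 6. With all-rounds semantics the same instance sees all three epochs and yields 6, 12, and 18.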

The DataSet iteration chooses the per-round semantics. To support these semantics, in addition to re-creating the operators for each round, the framework also needs:

For the inputs outside the iteration,

The benefit of this method is that writing an iteration body is no different from constructing a DAG outside of the iteration.

Synchronization

For bounded datasets, all the algorithms can, to the best of our knowledge, be converted into epoch-based algorithms, so we only need to support synchronization between epochs, namely between rounds.
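Synchronization between rounds can be pictured as a barrier at each epoch boundary. The sketch below is an assumption-laden illustration using Python's standard `threading.Barrier`, not the proposed Flink mechanism: the names `subtask`, `merge_round`, `partials`, and `model_history` are hypothetical, and the "epoch" is replaced by a trivial arithmetic stand-in.

```python
import threading
from typing import List

NUM_SUBTASKS = 3
NUM_ROUNDS = 4

partials: List[int] = [0] * NUM_SUBTASKS  # per-subtask result of the current round
model_history: List[int] = [0]            # model value after each round, starting at 0


def merge_round() -> None:
    """Runs once per round, after every subtask has reached the barrier."""
    model_history.append(model_history[-1] + sum(partials))


barrier = threading.Barrier(NUM_SUBTASKS, action=merge_round)


def subtask(index: int) -> None:
    for _ in range(NUM_ROUNDS):
        model = model_history[-1]            # every subtask reads the same model
        partials[index] = model + index + 1  # stand-in for processing one epoch
        barrier.wait()                       # round boundary: wait for all subtasks


threads = [threading.Thread(target=subtask, args=(i,)) for i in range(NUM_SUBTASKS)]
for thread in threads:
    thread.start()
for thread in threads:
    thread.join()
```

Because no subtask can start round n+1 before the merge of round n has run, the result is deterministic: `model_history` ends as `[0, 6, 30, 126, 510]` regardless of thread scheduling.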

How to From the perspective

Besides, the previous DataStream and DataSet iteration APIs also have some caveats when it comes to implementing algorithms:

...
