...

Based on the above dimensions, the algorithms could be classified into the following types:

| Type | Data Granularity | Synchronization Pattern | Bounded / Unbounded | Examples |
|---|---|---|---|---|
| Non-SGD-based | Epoch | Mostly Synchronous | Bounded | K-Means, Apriori, Decision Tree, Random Walk |
| SGD-Based Synchronous Offline algorithm | Batch → Epoch* | Synchronous | Bounded | Linear Regression, Logistic Regression, Deep Learning algorithms |
| SGD-Based Asynchronous Offline algorithm | Batch → Epoch* | Asynchronous | Bounded | Same as the above |
| SGD-Based Synchronous Online algorithm | Batch | Synchronous | Unbounded | Online version of the above algorithm |
| SGD-Based Asynchronous Online algorithm | Batch | Asynchronous | Unbounded | Online version of the above algorithm |

*Although SGD-based algorithms are batch-based, they could also be implemented with an Epoch-based method if intermediate state is allowed: the subtasks could sample each batch from all the records, starting from the position of the last batch.
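
The footnote above can be illustrated with a minimal, framework-independent sketch. It assumes a single subtask fitting y = w * x with mini-batch SGD. The names `next_batch` and `run_rounds` are hypothetical and not part of any Flink API. The only point is that the read position is kept as intermediate state, so each round continues from where the last batch ended.

```python
from typing import List, Tuple

Record = Tuple[float, float]  # (x, y) pairs; a hypothetical record layout


def next_batch(records: List[Record], position: int, batch_size: int):
    """Return the batch starting at `position` (wrapping around) and the new position."""
    n = len(records)
    batch = [records[(position + i) % n] for i in range(batch_size)]
    return batch, (position + batch_size) % n


def run_rounds(records: List[Record], batch_size: int, rounds: int, lr: float):
    """Run `rounds` SGD steps; `position` is the intermediate state kept across rounds."""
    w = 0.0
    position = 0
    for _ in range(rounds):
        batch, position = next_batch(records, position, batch_size)
        # Gradient of the mean squared error of y = w * x over the batch.
        grad = sum(2.0 * (w * x - y) * x for x, y in batch) / len(batch)
        w -= lr * grad
    return w, position
```

Without the stored position, every round would start from the first record again and would keep seeing the same batch.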

...

If we directly copied the current implementation of iteration in the DataSet and DataStream APIs, we would still run into some problems; thus we would like to make some optimizations to the existing iteration functionality.

The Iteration Body, Round Semantics and Synchronization

At the iteration level, we would need concepts corresponding to Epoch and Batch. We would call the processing of one epoch a round: users would specify a subgraph as the body of the iteration to define how the update is calculated, and a round ends after the iteration body has processed the whole dataset once (namely one Epoch). The round is therefore meaningful only for the bounded cases.

Per-Round vs. All-Rounds Semantics

How could users specify the iteration body? If we first consider the bounded cases, there are two options:

  1. Per-round: Users specify a subgraph, and for each round, the framework would recreate the operators and do the same computation.
  2. All-rounds: Users specify a subgraph, and the operators inside the subgraph would process the epochs of all the rounds. 
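
The difference between the two options can be sketched outside of any framework. The sketch below is only an illustration under stated assumptions: `CountingOperator`, `run_per_round` and `run_all_rounds` are hypothetical names, and a stateful operator is modelled as a plain object that counts the records it has seen.

```python
class CountingOperator:
    """A hypothetical stateful operator inside the iteration body."""

    def __init__(self):
        self.seen = 0  # operator state

    def process_epoch(self, epoch):
        self.seen += len(epoch)
        return self.seen


def run_per_round(epoch, rounds):
    """Per-round: the operator is recreated for every round, so its state starts fresh."""
    outputs = []
    for _ in range(rounds):
        op = CountingOperator()
        outputs.append(op.process_epoch(epoch))
    return outputs


def run_all_rounds(epoch, rounds):
    """All-rounds: one operator instance processes the epochs of all the rounds."""
    op = CountingOperator()
    return [op.process_epoch(epoch) for _ in range(rounds)]
```

With an epoch of three records and three rounds, the per-round variant reports 3 in every round, while the all-rounds variant accumulates 3, 6 and 9, because its state survives across rounds.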

The DataSet iteration chooses the per-round semantics. To support these semantics, in addition to re-creating the operators for each round, the framework also needs:

  1. For the inputs outside the iteration, 


The benefit of this method is that writing an iteration body is no different from constructing a DAG outside of the iteration.

Synchronization

Since, to the best of our knowledge, all algorithms on bounded datasets can be converted into epoch-based algorithms, we could support only synchronization between epochs, namely between rounds.
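
Synchronization between rounds can be pictured as a barrier at the end of each epoch. The following is a minimal sketch using Python threads rather than the proposed framework. The function name `run_synchronous_rounds`, the mean-estimation update and the learning rate are all assumptions made for illustration. Each subtask processes its whole partition, then waits until all subtasks have finished the round before the shared model is updated.

```python
import threading


def run_synchronous_rounds(partitions, rounds, lr=0.5):
    """Estimate the mean of all records, synchronizing the subtasks after every round."""
    total_records = sum(len(p) for p in partitions)
    model = {"value": 0.0}
    partials = [0.0] * len(partitions)
    history = []

    def aggregate():
        # Runs once per round, in one thread, after every subtask has reached
        # the barrier and before any of them is released into the next round.
        model["value"] += lr * sum(partials) / total_records
        history.append(model["value"])

    barrier = threading.Barrier(len(partitions), action=aggregate)

    def subtask(index, partition):
        for _ in range(rounds):
            current = model["value"]  # the same value for all subtasks in this round
            partials[index] = sum(x - current for x in partition)
            barrier.wait()  # end of the epoch for this subtask

    threads = [
        threading.Thread(target=subtask, args=(i, p))
        for i, p in enumerate(partitions)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return history
```

Because the model is only written while all subtasks are waiting at the barrier, every subtask reads the same model value within a round, which is the property that synchronous algorithms rely on.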

How to From the perspective 


Besides, the previous DataStream and DataSet iteration APIs also have some caveats in supporting algorithm implementations:

...