Versions Compared

Key

  • This line was added.
  • This line was removed.
  • Formatting was changed.

...

Code Block
languagejava
linenumberstrue
org.apache.flink.ml.iteration

/**
 * The callbacks defined below will be invoked only if the operator instance which implements this interface is used
 * within an iteration body.
 */
@PublicEvolving
public interface IterationListener<T> {
    /**
     * This callback is invoked every time the epoch watermark of this operator increments. The initial epoch watermark
     * of an operator is -10.
     *
     * The epochWatermark is the maximum integer that meets this requirement: every record that arrives at the operator
     * going forward should have an epoch larger than the epochWatermark. See Java docs in IterationUtils for how epoch
     * is determined for records ingested into the iteration body and for records emitted by operators within the
     * iteration body.
     *
     * If all inputs are bounded, the maximum epoch of all records ingested into this operator is used as the
     * epochWatermark parameter byfor the last invocation of this callback.
     *
     * @param epochWatermark The incremented epoch watermark.
     * @param context A context that allows emitting side output. The context is only valid during the invocation of
     *                this method.
     * @param collector The collector for returning result values.
     */
    void onEpochWatermarkIncremented(int epochWatermark, Context context, Collector<T> collector);

    /**
     * This callback is invoked after the execution of the iteration body has terminated.
     *
     * See Java doc of methods in IterationUtils for the termination conditions.
     *
     * @param context A context that allows emitting side output. The context is only valid during the invocation of
     *                this method.
     * @param collector The collector for returning result values.
     */
    void onIterationTermination(Context context, Collector<T> collector);

    /**
     * Information available in an invocation of the callbacks defined in the IterationProgressListener.
     */
    interface Context {
        /**
         * Emits a record to the side output identified by the {@link OutputTag}.
         *
         * @param outputTag the {@code OutputTag} that identifies the side output to emit to.
         * @param value The record to emit.
         */
        <X> void output(OutputTag<X> outputTag, X value);
    }
}

...

Code Block
languagejava
linenumberstrue
package org.apache.flink.ml.iteration;

/**
 * A helper class to apply {@link IterationBody} to data streams.
 */
@PublicEvolving
public class IterationUtils {
    /**
     * This method can use an iteration body to process records in unbounded data streams.
     *
     * This method invokes the iteration body with the following parameters:
     * 1) The 1st parameter is a list of input variable streams, which are created as the union of the initial variable
     * streams and the corresponding feedback variable streams (returned by the iteration body).
     * 2) The 2nd parameter is the data streams given to this method.
     *
     * The epoch values are determined as described below. See IterationListener for how the epoch values are used.
     * 1) All records in the initial variable streams has epoch=01.
     * 2) All records in the data streams has epoch=MAX_LONG. In this case, records in the data stream won't affect
     * any operator's lowepoch watermark.
     * 3) For any record emitted by this operator into a non-feedback stream, the epoch of this emitted record = the
     * epoch of the input record that triggers this emission. If this record is emitted by
 onEpochLWIncremented(), then
   * onEpochWatermarkIncremented(), *then the epoch of this record = incrementedEpochLW - 1epochWatermark.
     * 4) For any record emitted by this operator into a feedback variable stream, the epoch of the emitted record =
     * min(the epoch of the input record that triggers this emission, MAX_LONG - 1) + 1. If this record is emitted by
     * onEpochLWIncrementedonEpochWatermarkIncremented(), then the epoch of this record = epochWatermark incrementedEpochLW+ 1.
     *
     * The execution of the graph created by the iteration body will not terminate by itself. This is because at least
     * one of its data streams is unbounded.
     *
     * Required:
     * 1) All the init variable streams must be bounded.
     * 2) There is at least one unbounded stream in the data streams list.
     * 3) The parallelism of any stream in the initial variable streams must equal the parallelism of the stream at the
     * same index of the feedback variable streams returned by the IterationBody.
     *
     * @param initVariableStreams The initial variable streams. These streams will be merged with the feedback variable
     *                            streams before being used as the 1st parameter to invoke the iteration body.
     * @param dataStreams The data streams. These streams will be used as the 2nd parameter to invoke the iteration
     *                    body.
     * @param body The computation logic which takes variable/data streams and returns variable/output streams.
     * @return The list of output streams returned by the iteration boy.
     */
    static DataStreamList iterateUnboundedStreams(DataStreamList initVariableStreams, DataStreamList dataStreams, IterationBody body) {...}

    /**
     * This method can use an iteration body to process records in some bounded data streams iteratively until a
     * termination criteria is reached (e.g. the given number of rounds is completed or no further variable update is
     * needed). Because this method does not replay records in the data streams, the iteration body needs to cache those
     * records in order to visit those records repeatedly.
     *
     * This method invokes the iteration body with the following parameters:
     * 1) The 1st parameter is a list of input variable streams, which are created as the union of the initial variable
     * streams and the corresponding feedback variable streams (returned by the iteration body).
     * 2) The 2nd parameter is the data streams given to this method.
     *
     * The epoch values are determined as described below. See IterationListener for how the epoch values are used.
     * 1) All records in the initial variable streams has epoch=01.
     * 2) All records in the data streams has epoch=01.
     * 3) For any record emitted by this operator into a non-feedback stream, the epoch of this emitted record = the
     * epoch of the input record that triggers this emission. If this record is emitted by
 onEpochLWIncremented(), then
   * onEpochWatermarkIncremented(), *then the epoch of this record = incrementedEpochLW - 1epochWatermark.
     * 4) For any record emitted by this operator into a feedback variable stream, the epoch of the emitted record = the
     * epoch of the input record that triggers this emission + 1. If this record is emitted by onEpochLWIncremented(),
     * onEpochWatermarkIncremented(), then the epoch of this record = epochWatermark + incrementedEpochLW1.
     *
     * Suppose there is a coordinator operator which takes all feedback variable streams (emitted by the iteration body)
     * and the termination criteria stream (if not null) as inputs. The execution of the graph created by the
     * iteration body will terminate when all input streams have been fully AND any of the following conditions is met:
     * consumed:
     * 1) The termination criteria stream is not null. And the coordinator operator has not observed any new value from
     * the termination criteria stream between two consecutive onEpochLWIncrementedonEpochWatermarkIncremented invocations.
     * 2) The coordinator operator has not observed any new value from any feedback variable stream between two
     * consecutive onEpochLWIncrementedonEpochWatermarkIncremented invocations.
     *
     * Required:
     * 1) All the init variable streams and the data streams must be bounded.
     * 2) The parallelism of any stream in the initial variable streams must equal the parallelism of the stream at the
     * same index of the feedback variable streams returned by the IterationBody.
     *
     * @param initVariableStreams The initial variable streams. These streams will be merged with the feedback variable
     *                            streams before being used as the 1st parameter to invoke the iteration body.
     * @param dataStreams The data streams. These streams will be used as the 2nd parameter to invoke the iteration
     *                    body.
     * @param body The computation logic which takes variable/data streams and returns variable/output streams.
     * @return The list of output streams returned by the iteration boy.
     */
    static DataStreamList iterateBoundedStreamsUntilTermination(DataStreamList initVariableStreams, DataStreamList dataStreams, IterationBody body) {...}

    /**
     * This method can use an iteration body to process records in some bounded data streams iteratively until a
     * termination criteria is reached (e.g. the given number of rounds is completed or no further variable update is
     * needed). Because this method replays records in the data streams, the iteration body does not need to cache those
     * records to visit those records repeatedly.
     *
     * This method invokes the iteration body with the following parameters:
     * 1) The 1st parameter is a list of input variable streams, which are created as the union of the initial variable
     * streams and the corresponding feedback variable streams (returned by the iteration body).
     * 2) The 2nd parameter is a list of replayed data streams, which are created by replaying the initial data streams
     * round by round until the iteration terminates. The records in the Nth round will be emitted into the iteration
     * body only if the low watermark of the first operator in the iteration body >= N - 1.
     *
     * The epoch values are determined as described below. See IterationListener for how the epoch values are used.
     * 1) All records in the initial variable streams has epoch=01.
     * 2) The records from the initial data streams will be replayed round by round into the iteration body. The records
     * in the first round have epoch=01. And records in the Nth round have epoch = N - 1.
     * 3) For any record emitted by this operator into a non-feedback stream, the epoch of this emitted record = the
     * epoch of the input record that triggers this emission. If this record is emitted by
 onEpochLWIncremented(), then
   * onEpochWatermarkIncremented(), *then the epoch of this record = incrementedEpochLW - 1epochWatermark.
     * 4) For any record emitted by this operator into a feedback stream, the epoch of the emitted record = the epoch
     * of the input record that triggers this emission + 1. If this record is emitted by onEpochLWIncrementedonEpochWatermarkIncremented(), then
     * then the epoch of this record = epochWatermark + incrementedEpochLW1.
     *
     * Suppose there is a coordinator operator which takes all feedback variable streams (emitted by the iteration body)
     * and the termination criteria stream (if not null) as inputs. The execution of the graph created by the
     * iteration body will terminate when all input streams have been fully consumed AND any of the following conditions
     * is met:
     * 1) The termination criteria stream is not null. And the coordinator operator has not observed any new value from
     * the termination criteria stream between two consecutive onEpochLWIncrementedonEpochWatermarkIncremented invocations.
     * 2) The coordinator operator has not observed any new value from any feedback variable stream between two
     * consecutive onEpochLWIncrementedonEpochWatermarkIncremented invocations.
     *
     * Required:
     * 1) All the init variable streams and the data streams must be bounded.
     * 2) The parallelism of any stream in the initial variable streams must equal the parallelism of the stream at the
     * same index of the feedback variable streams returned by the IterationBody.
     *
     * @param initVariableStreams The initial variable streams. These streams will be merged with the feedback variable
     *                            streams before being used as the 1st parameter to invoke the iteration body.
     * @param initDataStreams The initial data streams. Records from these streams will be repeatedly replayed and used
     *                        as the 2nd parameter to invoke the iteration body.
     * @param body The computation logic which takes variable/data streams and returns variable/output streams.
     * @return The list of output streams returned by the iteration boy.
     */
    static DataStreamList iterateAndReplayBoundedStreamsUntilTermination(DataStreamList initVariableStreams, DataStreamList initDataStreams, IterationBody body) {...}

}


5) Add the DataStreamList class.

...

7) Support for synchronous iteration.

...

Definition of sync-mode

An iterative algorithm is run in sync-mode if there exists

...

global epoch, such that

...

at the time a given operator computes its output for the Nth epoch, this operator has received exactly the following records from its input edges:

  • We define a feedback loop as a circle composed of exactly 1 feedback edge and arbitrary number of non-feedback edges.
  • If an edge is a non-feedback input edge and this edge is part of a feedback loop, then this operator has received all records emitted on this edge for the Nth epoch, without receiving any record for the (N+1)th epoch.
  • If an edge is a feedback input edge and this edge is part of a feedback loop, then this operator has received all records emitted on this edge for the (N-1)th epoch, without receiving any record for the Nth epoch.


Solution to run an iterative algorithm in the sync mode

An iterative algorithm will be run in the sync mode if its IterationBody meets the following requirements:

  • When any operator within the IterationBody receives values from its input edges, this operator does not immediately emit records to its output.
  • Operators inside the IterationBody only compute and emit records to their outputs in the onEpochWatermarkIncremented(...) callback. The emitted records should be computed based on the values received from the input edges up to the invocation of this callback.

Proof

In the following, we will prove that the solution described above could enforce the sync-mode execution. Note that the calculation of the record's epoch and the semantics of onEpochWatermarkIncremented(...) are described in the Java doc of the corresponding APIs.

Lemma-1: For any operator OpB defined in the IterationBody, at the time its Nth invocation of onEpochWatermarkIncremented(...) starts, it is guaranteed that:

  • If an input edge is a non-feedback edge from OpA, then OpA's Nth invocation of onEpochWatermarkIncremented(...) has been completed.
  • If an input edge is a feedback edge from OpA, then OpA's (N-1)th invocation of onEpochWatermarkIncremented(...) has been completed.

Let's prove the lemma-1 by contradiction:

  • At the time the OpB's Nth invocation starts, its epoch watermark has incremented to N, which means OpB will no longer receive any record with epoch <= N.
  • Suppose there is a non-feedback edge from OpA AND OpA's Nth invocation has not been completed. Then when OpA's Nth invocation completes, OpA can generate a record with epoch=N and send it to OpB via this non-feedback edge, which contradicts the guarantee described above.
  • Suppose there is a feedback edge from OpA AND OpA's (N-1)th invocation has not been completed. Then when OpA's (N-1)th invocation completes, OpA can generate a record with epoch=N and send it to OpB via this feedback edge, which contradicts the guarantee described above.


Lemma-2: For any operator OpB defined in the IterationBody, at the time its Nth invocation of onEpochWatermarkIncremented(...) starts, it is guaranteed that:

  • If an edge is a non-feedback input edge from OpA and this edge is part of a feedback loop, then OpA's (N+1)th invocation of onEpochWatermarkIncremented(...) has not started.
  • If an edge is a feedback input edge from OpA and this edge is part of a feedback loop, then OpA's Nth invocation of onEpochWatermarkIncremented(...) has not started.

Let's prove this lemma by contradiction:

  • Suppose there is a non-feedback edge from OpA, this edge is part of a feedback loop, and OpA's (N+1)th invocation has started. Since this non-feedback edge is part of a feedback loop, there is a backward path from OpA to OpB with exactly 1 feedback edge on this path. By applying the lemma-1 recursively for operators on this path, we can tell that OpB's Nth invocation has been completed. This contradicts the assumption that OpB's Nth invocation just started.
  • Suppose there is a feedback edge from OpA, this edge is part of a feedback loop, and OpA's Nth invocation has started. Since this feedback edge is part of feedback loop, there is a backward path from OpA to OpB with no feedback edge on this path. By applying lemma-1 recursively for operators on this path, we can tell that OpB's Nth invocation has been completed. This contradicts the assumption that OpB's Nth invocation just started.

Let's now prove that the sync-mode is achieved.

  • For any operator in the IterationBody, we define its output for the Nth epoch as the output emitted by the Nth invocation of onEpochWatermarkIncremented(). This definition is well-defined because operators only emit records in onEpochWatermarkIncremented().
  • At the time an operator OpB computes its output for the Nth epoch, this operator must have received exactly the following records from its input edges:
    • Suppose an edge is a non-feedback input edge from OpA and this edge is part of a feedback loop. It follows that OpA has emitted records for its Nth epoch (by lemma-1) and has not started to emit records for its (N+1)th epoch (by lemma-2).
    • Suppose an edge is a feedback input edge from OpA and this edge is part of a feedback loop. It follows that OpA has emitted records for its (N-1)th epoch (by lemma-1) and has not started to emit records for its Nth epoch (by lemma-2).


8) Run iterative algorithm without dumping all user-provided data streams to disk.

As mentioned in the motivation section, the existing DataSet::iterate() always dump the user-provided data streams to disk so that it can replay the data streams regardless of the size of those data streams. Since this is the only available API to do iteration on bounded data streams, there is no way for algorithm developer to get rid of this performance overhead.

In comparison, this FLIP provides the iterateBoundedStreamsUntilTermination(...) for users to run an iteration body without having this performance overhead. Developers have the freedom to optimize the performance based on its algorithm and data size, e.g. cache data in memory in a more compact format.


Example Usages

Offline Training with Bounded Iteration

We would like to first show the usage of the bounded iteration with the linear regression case: the model is Y = XA, and we would like to acquire the best estimation of A with the SGD algorithm. To simplify we assume the parameters could be held in the memory of one task.

The job graph of the algorithm could be shown in the Figure 3: in each round, we use the latest parameters to calculate the update to the parameters: ΔA = ∑(Y - XA)X. To achieve this, the Parameters vertex would broadcast the latest parameters to the Train vertex. Each subtask of the Train vertex holds a part of dataset. Follow the

The proposed API can be used to implement both synchronous and asynchronous iterative algorithms. Here are the guidelines of how to achieve these two modes respectively.

Users can do the following to implement an algorithm in the async mode:

  • Emit records to the feedback variable streams in the callback that is invoked when the IterationBody receives new records from the input variable streams.
  • Provide an empty implementation for the onEpochWatermarkIncremented(...).

Users can do the following to implement an algorithm in the sync mode:

  • In the callback that is invoked when the IterationBody receives new records from the input variable streams, do not immediately emit records to the feedback variable streams. Just buffer the inputs or update internal states.
  • Emit records to the feedback variable streams in the onEpochWatermarkIncremented(...) callback.
  • Make sure every feedback edge is part of a circle that does not involve other feedback edge.

Here is the reason why the iteration will be synchronous:

To be explained. 

8) Replay data for iteration without requiring the runtime to dump the user-provided data streams to disk.

To be explained.

Example Usages

Offline Training with Bounded Iteration

We would like to first show the usage of the bounded iteration with the linear regression case: the model is Y = XA, and we would like to acquire the best estimation of A with the SGD algorithm. To simplify we assume the parameters could be held in the memory of one task.

The job graph of the algorithm could be shown in the Figure 3: in each round, we use the latest parameters to calculate the update to the parameters: ΔA = ∑(Y - XA)X. To achieve this, the Parameters vertex would broadcast the latest parameters to the Train vertex. Each subtask of the Train vertex holds a part of dataset. Follow the sprite of SGD, it would sample a small batch of training records, and calculate the update with the above equation. Then the Train vertex emit ΔA to the Parameters node to update the parameters.

...

Compatibility, Deprecation, and Migration Plan

TBD

Rejected Alternatives

Naiad has proposed a unified model for watermark mechanism (namely progress tracking outside of the iteration) and the progress tracking inside the iteration. It extends the event time and watermark to be a vector (long timestamp, int[] rounds) and implements a vectorized alignment algorithm. Although Naiad provides an elegant model, the direct implementation on Flink would requires a large amount of modification to the flink runtime, which would cause a lot of complexity and maintenance overhead.  Thus we would choose to implement a simplified version on top of FLINK, as a part of the flink-ml library.

The following APIs will be deprecated and removed in the future Flink release:

  • The entire DataSet class. See FLIP-131 for its motivation and the migration plan. The deprecation of DataSet::iterate(...) proposed by this FLIP is covered by FLIP-131.
  • The DataStream::iterate(...) and DataStream::iterate(long).

The proposed removal of DataStream::iterate(...) and DataStream::iterate(long) is a backward incompatible change. However, we believe that there is not wide-spread usage of these two APIs due to the issues described in FLIP-15.

Users will need to re-write their application code in order to migrate from the existing iterative APIs to the proposed APIs. We expect that the APIs proposed in this FLIP can support all use-cases supported by the existing iterative APIs.For the iteration DAG build graph, it would be more simpler if we could directly refer to the data stream variables outside of the closure of iteration body. However, since we need to make the iteration DAG creation first happen in the mock execution environment, we could not use these variables directly, otherwise we would directly modify the real environment and won't have chance to add wrappers to the operators.