...

It is important to make the following observation: if we don't provide the Pipeline class, users can still accomplish the same use-cases targeted by Pipeline by explicitly writing the training logic and inference logic separately using Estimator/Transformer APIs. But users would have to construct this chain of Estimator/Transformer twice (for training and inference respectively).
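
To make this duplication concrete, here is a minimal sketch (with hypothetical stages estimatorA/estimatorB and hypothetical table names, assuming the multi-table signatures proposed later in this doc) of what users would have to write without Pipeline:

Code Block
languagejava
// Training: fit each stage and chain the fitted transformers by hand.
Transformer transformerA = estimatorA.fit(train_stream);
Table intermediate_stream = transformerA.transform(train_stream)[0];
Transformer transformerB = estimatorB.fit(intermediate_stream);

// Inference: the same chain of Transformers has to be constructed a second time.
Table output_stream = transformerB.transform(
    transformerA.transform(input_stream)[0])[0];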

Design Principles

Multiple choices exist to address the use-cases targeted by this design doc. In the following, we explain the design principles followed by the proposed design, to make the design choices easier to follow.

1) When the new use-case can be supported by just extending the arity of an existing API, we prefer to extend the arity of this API instead of adding a new class.

As a result of this philosophy, in order to support algorithms with multiple inputs and multiple outputs, we choose to extend Transformer::transform and Estimator::fit to take multiple Tables as inputs, and to extend Transformer::transform to return multiple Tables as outputs.
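
For example, under the extended signatures a two-input, two-output algorithm could be invoked as follows (the estimator and table names are hypothetical):

Code Block
languagejava
// fit(...) accepts multiple input tables.
Transformer transformer = estimator.fit(input_table1, input_table2);
// transform(...) accepts multiple input tables and returns multiple output tables.
Table[] outputs = transformer.transform(input_table1, input_table2);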

An alternative solution is to add new classes, e.g. a MultiInputTransformer whose transform(...) method takes multiple input Tables and returns multiple output Tables. In comparison to the proposed approach, this approach increases the number of classes that users have to deal with.

2) As much as possible, the API design should allow users to address the new use-case while still enjoying the existing benefits.

As described in the Background Section, the existing Pipeline class allows users to compose an Estimator from a linear chain of Estimator/Transformer, without requiring users to specify this linear chain twice. We consider this to be one of the most important features provided by the existing Scikit-learn/Spark/Flink ML APIs.

As a result of this philosophy, we believe it is important to provide a similar benefit to that of the existing Pipeline class, while allowing users to compose an Estimator from a DAG of Estimator/Transformer.

Therefore, this design doc proposes to add the Graph/GraphTransformer/GraphBuilder classes to provide the following capability:

  • Allow users to compose an Estimator from a DAG of Estimator/Transformer, without requiring users to specify this DAG twice

Public Interfaces

This FLIP proposes quite a few changes and additions to the existing Flink ML APIs. We first describe the proposed API additions and changes, followed by the API code of interfaces and classes after making the proposed changes.

API additions and changes

Here we list the additions and the changes to the Flink ML API.

The following are the most important changes proposed by this doc:

1) Added the AlgoOperator class. The AlgoOperator class has the same interface as the existing Transformer (i.e. it has the transform method).

This change addresses the need to encode a generic multi-input multi-output machine learning function.

2) Updated Transformer/Estimator to take a list of tables as input and return a list of tables as output.

This change addresses the use-cases described in the motivation section, e.g. a graph embedding Estimator needs to take 2 tables as inputs.
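
For instance, under the proposed signatures such an Estimator could be invoked as follows (the class and table names are hypothetical):

Code Block
languagejava
// A graph embedding algorithm is trained on a node table and an edge table.
Estimator estimator = new GraphEmbeddingEstimator(...);
Transformer transformer = estimator.fit(node_table, edge_table);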

3) Added setStateStreams and getStateStreams to the Transformer interface.

This change addresses the use-cases described in the motivation section, where a running Transformer needs to ingest the model state streams emitted by an Estimator, which could be running on a different machine.

4) Removed the methods PipelineStage::toJson and PipelineStage::loadJson, and added methods save(...) and load(...) to the Stage interface.
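
A sketch of the intended save/load round trip (the stage class StageA and the path are hypothetical):

Code Block
languagejava
// Saves the stage's state/metadata to the given path.
stageA.save("hdfs:///flink-ml/stageA");

// Restores the stage; this relies on the stage's public empty constructor.
StageA restored = new StageA();
restored.load("hdfs:///flink-ml/stageA");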


The following changes are relatively minor:

5) Removed TableEnvironment from the parameter list of fit/transform APIs.

This change simplifies the usage of fit/transform APIs.
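
For comparison, a sketch of a fit(...) call site before and after this change (assuming the previous signature took a TableEnvironment as its first parameter):

Code Block
languagejava
// Before: the TableEnvironment is passed explicitly.
//   Transformer transformer = estimator.fit(tEnv, input_table);
// After: only the input tables are passed.
Transformer transformer = estimator.fit(input_table);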

6) Added pipelineTransformer and made Pipeline implement only Estimator. Pipeline is no longer a Transformer.

This change makes the experience of using Pipeline consistent with the experience of using Estimator/Transformer, where a class is either an Estimator or a Transformer.
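
A usage sketch of the resulting hierarchy (the stage and table names are hypothetical):

Code Block
languagejava
// Pipeline is only an Estimator; fitting it yields a pipelineTransformer.
Pipeline pipeline = new Pipeline(Arrays.asList(estimatorA, transformerB));
pipelineTransformer fitted = pipeline.fit(train_stream);
// The fitted pipelineTransformer is the Transformer used for inference.
Table[] results = fitted.transform(input_stream);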

7) Removed Pipeline::appendStage from the Pipeline class.

This change makes the concept of Pipeline consistent with that of Graph/GraphBuilder. Neither Graph nor Pipeline provides an API to construct itself.

8) Removed the Model interface and renamed PipelineModel to pipelineTransformer.

This change simplifies the class hierarchy by removing a redundant class. It follows the philosophy of only adding complexity when we have an explicit use-case for it.

9) Renamed PipelineStage to Stage and added the PublicEvolving tag to the Stage interface.

This change is reasonable because we will now compose Graph (not just Pipeline) using this class.

Interfaces and classes after the proposed API changes

The following code block shows the interfaces of Stage, AlgoOperator, Transformer, Estimator, Pipeline and pipelineTransformer after making the changes listed above.

Code Block
languagejava
/**
 * Base class for a stage in a Pipeline or Graph. The interface is only a concept, and does not have any actual
 * functionality. Its subclasses could be Estimator, Transformer or AlgoOperator. No other classes should inherit this
 * interface directly.
 *
 * <p>Each stage carries parameters, and requires a public empty constructor for restoration.
 *
 * @param <T> The class type of the Stage implementation itself.
 * @see WithParams
 */
@PublicEvolving
interface Stage<T extends Stage<T>> extends WithParams<T>, Serializable {
    /**
     * Saves this stage to the given path.
     */
    void save(String path);

    /**
     * Loads this stage from the given path.
     */
    void load(String path);
}

/**
 * An AlgoOperator is a Stage that takes a list of tables as inputs and produces a list of
 * tables as results. It can be used to encode a generic multi-input multi-output machine learning function.
 *
 * @param <T> The class type of the AlgoOperator implementation itself.
 */
@PublicEvolving
public interface AlgoOperator<T extends AlgoOperator<T>> extends Stage<T> {

    /**
     * Applies the AlgoOperator on the given input tables, and returns the result tables.
     *
     * @param inputs a list of tables
     * @return a list of tables
     */
    Table[] transform(Table... inputs);
}

/**
 * A Transformer is an AlgoOperator with additional support for state streams, which could be set by the Estimator that
 * fitted this Transformer. Unlike AlgoOperator, a Transformer is typically associated with an Estimator.
 *
 * @param <T> The class type of the Transformer implementation itself.
 */
@PublicEvolving
public interface Transformer<T extends Transformer<T>> extends AlgoOperator<T> {
    /**
     * Uses the given list of tables to update internal states. This can be useful for e.g. online
     * learning where an Estimator fits an infinite stream of training samples and streams the model
     * diff data to this Transformer.
     *
     * <p>This method may be called at most once.
     *
     * @param inputs a list of tables
     */
    default void setStateStreams(Table... inputs) {
        throw new UnsupportedOperationException("this method is not implemented");
    }

    /**
     * Gets a list of tables representing changes of internal states of this Transformer. These
     * tables might come from the Estimator that instantiated this Transformer.
     *
     * @return a list of tables
     */
    default Table[] getStateStreams() {
        throw new UnsupportedOperationException("this method is not implemented");
    }
}

/**
 * An Estimator is a Stage that takes a list of tables as inputs and produces a Transformer.
 *
 * @param <E> class type of the Estimator implementation itself.
 * @param <M> class type of the Transformer this Estimator produces.
 */
@PublicEvolving
public interface Estimator<E extends Estimator<E, M>, M extends Transformer<M>> extends Stage<E> {
    /**
     * Trains on the given inputs and produces a Transformer.
     *
     * @param inputs a list of tables
     * @return a Transformer
     */
    M fit(Table... inputs);
}

/**
 * A Pipeline acts as an Estimator. It consists of an ordered list of stages, each of which could be
 * an Estimator, Transformer or AlgoOperator.
 */
@PublicEvolving
public final class Pipeline implements Estimator<Pipeline, pipelineTransformer> {

    public Pipeline(List<Stage<?>> stages) {...}

    @Override
    public pipelineTransformer fit(Table... inputs) {...}

    /** Skipped a few methods, including the implementations of the Estimator APIs. */
}

/**
 * A pipelineTransformer acts as a Transformer. It consists of an ordered list of Transformers or AlgoOperators.
 */
@PublicEvolving
public final class pipelineTransformer implements Transformer<pipelineTransformer> {

    public pipelineTransformer(List<Transformer<?>> transformers) {...}

    /** Skipped a few methods, including the implementations of the Transformer APIs. */
}

Example Usage

In this section we provide example code snippets to demonstrate how the APIs proposed in this FLIP can be used to address the use-cases in the motivation section.

Online learning by running Transformer and Estimator concurrently on different machines

Here is an online learning scenario:

  • We have an infinite stream of tagged data that can be used for training.
  • We have an algorithm that can be trained using this infinite stream of data. This algorithm (with its latest states/parameters) can be used to do inference, and its accuracy increases with the amount of training data it has seen.
  • We would like to train this algorithm on clusterA using the given data stream, and use this algorithm with the up-to-date states/parameters to do inference on 10 different web servers.

In order to address this use-case, we can write the training and inference logic of this algorithm into an EstimatorA class and a TransformerA class with the following API behaviors (a sketch of these classes follows the list):

  • EstimatorA::fit takes a table as input and returns an instance of TransformerA. Before this method returns this transformerA, it calls transformerA.setStateStreams(state_table), where state_table represents the stream of algorithm parameter changes produced by EstimatorA.
  • TransformerA::setStateStreams(...) takes a table as input. Its implementation reads the data from this table to continuously update its algorithm parameters.
  • TransformerA::getStateStreams(...) returns the same table instance that has been provided via TransformerA::setStateStreams(...).
  • TransformerA::transform takes a table as input and returns a table. The returned table represents the inference results.
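
Here is a minimal sketch of how EstimatorA and TransformerA could implement these behaviors under the proposed interfaces; the field name and the elided bodies are hypothetical:

Code Block
languagejava
public class TransformerA implements Transformer<TransformerA> {
    private Table stateTable;

    @Override
    public void setStateStreams(Table... inputs) {
        // Remembers the state stream. transform(...) reads this table to
        // continuously update the algorithm parameters.
        this.stateTable = inputs[0];
    }

    @Override
    public Table[] getStateStreams() {
        return new Table[] {stateTable};
    }

    @Override
    public Table[] transform(Table... inputs) {
        // Does inference on inputs[0] using the continuously-updated parameters.
        Table results = ...; // inference logic elided
        return new Table[] {results};
    }

    /** Skipped the save/load and parameter methods. */
}

public class EstimatorA implements Estimator<EstimatorA, TransformerA> {
    @Override
    public TransformerA fit(Table... inputs) {
        // Trains on inputs[0] and emits the stream of parameter changes.
        Table state_table = ...; // training logic elided
        TransformerA transformerA = new TransformerA();
        transformerA.setStateStreams(state_table);
        return transformerA;
    }

    /** Skipped the save/load and parameter methods. */
}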

Here are the code snippets that address this use-case by using the proposed APIs.

First run the following code on clusterA:

Code Block
languagejava
void runTrainingOnClusterA(...) {
  // Creates the training stream from a Kafka topic.
  Table training_stream = ...;

  Estimator estimator = new EstimatorA(...);
  Transformer transformer = estimator.fit(training_stream);
  Table state_stream = transformer.getStateStreams()[0];

  // Writes the state_stream to Kafka topicA.
  state_stream.sinkTable(...);
  // Saves transformer's state/metadata to a remote path.
  transformer.save(remote_path);

  // Executes the operators generated by Estimator::fit(...), which read from training_stream and write to state_stream.
  env.execute();
}


Then run the following code on each web server:

Code Block
languagejava
void runInferenceOnWebServer(...) {
  // Creates the state stream from Kafka topicA which is written by the above code snippet. 
  Table state_stream = ...;
  // Creates the input stream that needs inference.
  Table input_stream = ...;

  Transformer transformer = new TransformerA(...);
  transformer.load(remote_path);
  transformer.setStateStreams(new Table[]{state_stream});
  Table output_stream = transformer.transform(input_stream);

  // Do something with the output_stream.

  // Executes the operators generated by Transformer::transform(...), which read from state_stream to update the model parameters.
  // They also do inference on input_stream and write results to output_stream.
  env.execute();
}


Compose an Estimator from a chain of Estimator/Transformer whose input schemas differ from those of its fitted Transformer

Suppose we have the following Estimator and Transformer classes where an Estimator's input schemas could be different from the input schema of its fitted Transformer:

  • TransformerA whose transform(...) takes 1 input table and has 1 output table.
  • EstimatorA whose fit(...) takes 2 input tables and returns an instance of TransformerA.
  • TransformerB whose transform(...) takes 1 input table and has 1 output table.

We want to compose an Estimator (i.e. a Graph) from the following DAG of Transformer/Estimator.

[Figure omitted: a DAG in which 2 input tables feed EstimatorA, whose fitted TransformerA is chained with TransformerB]

The resulting Graph::fit is expected to have the following behavior:

  • The method takes 2 input tables. Both tables are given to EstimatorA::fit.
  • EstimatorA fits the input tables and generates a TransformerA instance. The TransformerA instance takes 1 input table, which is different from the 2 tables given to EstimatorA.
  • Returns a GraphTransformer instance that contains the TransformerA instance and a TransformerB instance, connected as a chain.

The fitted GraphTransformer is represented by the following DAG:

[Figure omitted: the fitted GraphTransformer, i.e. TransformerA chained with TransformerB over 1 input table]

Notes:

  • The fitted GraphTransformer takes only 1 table as input whereas the Graph takes 2 tables as inputs.
  • The proposed APIs also support composing an Estimator from a DAG of Estimator/Transformer whose input schemas differ from those of its fitted Transformer.

Here is the code snippet that addresses this use-case by using the proposed APIs:

Code Block
languagejava
GraphBuilder builder = new GraphBuilder();

// Creates nodes
Stage<?> stage1 = new EstimatorA();
Stage<?> stage2 = new TransformerB();
// Creates inputs
TableId estimatorInput1 = builder.createTableId();
TableId estimatorInput2 = builder.createTableId();
TableId transformerInput1 = builder.createTableId();

// Feeds inputs to nodes and gets outputs.
TableId output1 = builder.getOutputs(stage1, new TableId[] {estimatorInput1, estimatorInput2}, new TableId[] {transformerInput1})[0];
TableId output2 = builder.getOutputs(stage2, output1)[0];

// Specifies the ordered lists of estimator inputs, transformer inputs, outputs, input states and output states
// that will be used as the inputs/outputs of the corresponding Graph and GraphTransformer APIs.
TableId[] estimatorInputs = new TableId[] {estimatorInput1, estimatorInput2};
TableId[] transformerInputs = new TableId[] {transformerInput1};
TableId[] outputs = new TableId[] {output2};
TableId[] inputStates = new TableId[] {};
TableId[] outputStates = new TableId[] {};

// Generates the Graph instance.
Graph graph = builder.build(estimatorInputs, transformerInputs, outputs, inputStates, outputStates);
// The fit method takes 2 tables which are mapped to estimatorInput1 and estimatorInput2.
GraphTransformer transformer = graph.fit(...);
// The transform method takes 1 table which is mapped to transformerInput1.
Table[] results = transformer.transform(...);


Compatibility, Deprecation, and Migration Plan

...