...

It is important to make the following observation: if we don't provide the Pipeline class, users can still accomplish the same use-cases targeted by Pipeline by explicitly writing the training logic and inference logic separately using Estimator/Transformer APIs. But users would have to construct this chain of Estimator/Transformer twice (for training and inference respectively).
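
To make this duplication concrete, here is a minimal sketch (with hypothetical stages estimatorA/estimatorB and hypothetical table names, assuming the multi-table signatures proposed later in this doc) of what users would have to write without Pipeline:

Code Block
languagejava
// Training: fit each stage and chain the fitted transformers by hand.
Transformer transformerA = estimatorA.fit(train_stream);
Table intermediate_stream = transformerA.transform(train_stream)[0];
Transformer transformerB = estimatorB.fit(intermediate_stream);

// Inference: the same chain of Transformers has to be constructed a second time.
Table output_stream = transformerB.transform(
    transformerA.transform(input_stream)[0])[0];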

Design Principles

Multiple choices exist to address the use-cases targeted by this design doc. In the following, we explain the design principles followed by the proposed design, to make the design choices easier to follow.

1) When the new use-case can be supported by just extending the arity of an existing API, we prefer to extend the arity of this API instead of adding a new class.

As a result of this philosophy, in order to support algorithms with multiple inputs and multiple outputs, we choose to extend Transformer::transform and Estimator::fit to take multiple Tables as inputs, and to extend Transformer::transform to return multiple Tables as outputs.
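
For example, under the extended signatures a two-input, two-output algorithm could be invoked as follows (the estimator and table names are hypothetical):

Code Block
languagejava
// fit(...) accepts multiple input tables.
Transformer transformer = estimator.fit(input_table1, input_table2);
// transform(...) accepts multiple input tables and returns multiple output tables.
Table[] outputs = transformer.transform(input_table1, input_table2);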

An alternative solution is to add new classes, e.g. a MultiInputTransformer whose transform(...) method takes multiple input Tables and returns multiple output Tables. In comparison to the proposed approach, this approach increases the number of classes that users have to deal with.

2) As much as possible, the API design should allow users to address the new use-case while still enjoying the existing benefits.

As described in the Background Section, the existing Pipeline class allows users to compose an Estimator from a linear chain of Estimator/Transformer, without requiring users to specify this linear chain twice. We consider this to be one of the most important features provided by the existing Scikit-learn/Spark/Flink ML APIs.

As a result of this philosophy, we believe it is important to provide a similar benefit to that of the existing Pipeline class, while allowing users to compose an Estimator from a DAG of Estimator/Transformer.

Therefore, this design doc proposes to add the Graph/GraphTransformer/GraphBuilder classes to provide the following capability:

  • Allow users to compose an Estimator from a DAG of Estimator/Transformer, without requiring users to specify this DAG twice

Public Interfaces

This FLIP proposes quite a few changes and additions to the existing Flink ML APIs. We first describe the proposed API additions and changes, followed by the API code of interfaces and classes after making the proposed changes.

API additions and changes

Here we list the additions and the changes to the Flink ML API.

The following are the most important changes proposed by this doc:

1) Added the AlgoOperator class. The AlgoOperator class has the same interface as the existing Transformer (i.e. it has the transform method).

This change addresses the need to encode a generic multi-input multi-output machine learning function.

2) Updated Transformer/Estimator to take a list of tables as input and return a list of tables as output.

This change addresses the use-cases described in the motivation section, e.g. a graph embedding Estimator needs to take 2 tables as inputs.
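
For instance, under the proposed signatures such an Estimator could be invoked as follows (the class and table names are hypothetical):

Code Block
languagejava
// A graph embedding algorithm is trained on a node table and an edge table.
Estimator estimator = new GraphEmbeddingEstimator(...);
Transformer transformer = estimator.fit(node_table, edge_table);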

3) Added setStateStreams and getStateStreams to the Transformer interface.

This change addresses the use-cases described in the motivation section, where a running Transformer needs to ingest the model state streams emitted by an Estimator, which could be running on a different machine.

4) Removed the methods PipelineStage::toJson and PipelineStage::loadJson, and added methods save(...) and load(...) to the Stage interface.
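
A sketch of the intended save/load round trip (the stage class StageA and the path are hypothetical):

Code Block
languagejava
// Saves the stage's state/metadata to the given path.
stageA.save("hdfs:///flink-ml/stageA");

// Restores the stage; this relies on the stage's public empty constructor.
StageA restored = new StageA();
restored.load("hdfs:///flink-ml/stageA");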


The following changes are relatively minor:

5) Removed TableEnvironment from the parameter list of fit/transform APIs.

This change simplifies the usage of fit/transform APIs.
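
For comparison, a sketch of a fit(...) call site before and after this change (assuming the previous signature took a TableEnvironment as its first parameter):

Code Block
languagejava
// Before: the TableEnvironment is passed explicitly.
//   Transformer transformer = estimator.fit(tEnv, input_table);
// After: only the input tables are passed.
Transformer transformer = estimator.fit(input_table);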

6) Added pipelineTransformer and made Pipeline implement only Estimator. Pipeline is no longer a Transformer.

This change makes the experience of using Pipeline consistent with the experience of using Estimator/Transformer, where a class is either an Estimator or a Transformer.
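
A usage sketch of the resulting hierarchy (the stage and table names are hypothetical):

Code Block
languagejava
// Pipeline is only an Estimator; fitting it yields a pipelineTransformer.
Pipeline pipeline = new Pipeline(Arrays.asList(estimatorA, transformerB));
pipelineTransformer fitted = pipeline.fit(train_stream);
// The fitted pipelineTransformer is the Transformer used for inference.
Table[] results = fitted.transform(input_stream);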

7) Removed Pipeline::appendStage from the Pipeline class.

This change makes the concept of Pipeline consistent with that of Graph/GraphBuilder. Neither Graph nor Pipeline provides an API to construct itself.

8) Removed the Model interface and renamed PipelineModel to pipelineTransformer.

This change simplifies the class hierarchy by removing a redundant class. It follows the philosophy of only adding complexity when we have an explicit use-case for it.

9) Renamed PipelineStage to Stage and added the PublicEvolving tag to the Stage interface.

This change is reasonable because we will now compose Graph (not just Pipeline) using this class.

Interfaces and classes after the proposed API changes

The following code block shows the interfaces of Stage, AlgoOperator, Transformer, Estimator, Pipeline and pipelineTransformer after making the changes listed above.

Code Block
languagejava
/**
 * Base class for a stage in a Pipeline or Graph. The interface is only a concept, and does not have any actual
 * functionality. Its subclasses could be Estimator, Transformer or AlgoOperator. No other classes should inherit this
 * interface directly.
 *
 * <p>Each stage carries parameters, and requires a public empty constructor for restoration.
 *
 * @param <T> The class type of the Stage implementation itself.
 * @see WithParams
 */
@PublicEvolving
interface Stage<T extends Stage<T>> extends WithParams<T>, Serializable {
    /**
     * Saves this stage to the given path.
     */
    void save(String path);

    /**
     * Loads this stage from the given path.
     */
    void load(String path);
}

/**
 * An AlgoOperator is a Stage that takes a list of tables as inputs and produces a list of
 * tables as results. It can be used to encode a generic multi-input multi-output machine learning function.
 *
 * @param <T> The class type of the AlgoOperator implementation itself.
 */
@PublicEvolving
public interface AlgoOperator<T extends AlgoOperator<T>> extends Stage<T> {

    /**
     * Applies the AlgoOperator on the given input tables, and returns the result tables.
     *
     * @param inputs a list of tables
     * @return a list of tables
     */
    Table[] transform(Table... inputs);
}

/**
 * A Transformer is an AlgoOperator with additional support for state streams, which could be set by the Estimator that
 * fitted this Transformer. Unlike AlgoOperator, a Transformer is typically associated with an Estimator.
 *
 * @param <T> The class type of the Transformer implementation itself.
 */
@PublicEvolving
public interface Transformer<T extends Transformer<T>> extends AlgoOperator<T> {
    /**
     * Uses the given list of tables to update internal states. This can be useful for e.g. online
     * learning where an Estimator fits an infinite stream of training samples and streams the model
     * diff data to this Transformer.
     *
     * <p>This method may be called at most once.
     *
     * @param inputs a list of tables
     */
    default void setStateStreams(Table... inputs) {
        throw new UnsupportedOperationException("this method is not implemented");
    }

    /**
     * Gets a list of tables representing changes of internal states of this Transformer. These
     * tables might come from the Estimator that instantiated this Transformer.
     *
     * @return a list of tables
     */
    default Table[] getStateStreams() {
        throw new UnsupportedOperationException("this method is not implemented");
    }
}

/**
 * An Estimator is a Stage that takes a list of tables as inputs and produces a Transformer.
 *
 * @param <E> class type of the Estimator implementation itself.
 * @param <M> class type of the Transformer this Estimator produces.
 */
@PublicEvolving
public interface Estimator<E extends Estimator<E, M>, M extends Transformer<M>> extends Stage<E> {
    /**
     * Trains on the given inputs and produces a Transformer.
     *
     * @param inputs a list of tables
     * @return a Transformer
     */
    M fit(Table... inputs);
}

/**
 * A Pipeline acts as an Estimator. It consists of an ordered list of stages, each of which could be
 * an Estimator, Transformer or AlgoOperator.
 */
@PublicEvolving
public final class Pipeline implements Estimator<Pipeline, pipelineTransformer> {

    public Pipeline(List<Stage<?>> stages) {...}

    @Override
    public pipelineTransformer fit(Table... inputs) {...}

    /** Skipped a few methods, including the implementations of the Estimator APIs. */
}

/**
 * A pipelineTransformer acts as a Transformer. It consists of an ordered list of Transformers or AlgoOperators.
 */
@PublicEvolving
public final class pipelineTransformer implements Transformer<pipelineTransformer> {

    public pipelineTransformer(List<Transformer<?>> transformers) {...}

    /** Skipped a few methods, including the implementations of the Transformer APIs. */
}

Example Usage

In this section we provide example code snippets to demonstrate how the APIs proposed in this FLIP can be used to address the use-cases in the motivation section.

Online learning by running Transformer and Estimator concurrently on different machines

Here is an online learning scenario:

  • We have an infinite stream of tagged data that can be used for training.
  • We have an algorithm that can be trained using this infinite stream of data. This algorithm (with its latest states/parameters) can be used to do inference, and its accuracy increases with the amount of training data it has seen.
  • We would like to train this algorithm on clusterA using the given data stream, and use this algorithm with the up-to-date states/parameters to do inference on 10 different web servers.

In order to address this use-case, we can write the training and inference logic of this algorithm into an EstimatorA class and a TransformerA class with the following API behaviors (a sketch of these classes follows the list):

  • EstimatorA::fit takes a table as input and returns an instance of TransformerA. Before this method returns this transformerA, it calls transformerA.setStateStreams(state_table), where state_table represents the stream of algorithm parameter changes produced by EstimatorA.
  • TransformerA::setStateStreams(...) takes a table as input. Its implementation reads the data from this table to continuously update its algorithm parameters.
  • TransformerA::getStateStreams(...) returns the same table instance that has been provided via TransformerA::setStateStreams(...).
  • TransformerA::transform takes a table as input and returns a table. The returned table represents the inference results.
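
Here is a minimal sketch of how EstimatorA and TransformerA could implement these behaviors under the proposed interfaces; the field name and the elided bodies are hypothetical:

Code Block
languagejava
public class TransformerA implements Transformer<TransformerA> {
    private Table stateTable;

    @Override
    public void setStateStreams(Table... inputs) {
        // Remembers the state stream. transform(...) reads this table to
        // continuously update the algorithm parameters.
        this.stateTable = inputs[0];
    }

    @Override
    public Table[] getStateStreams() {
        return new Table[] {stateTable};
    }

    @Override
    public Table[] transform(Table... inputs) {
        // Does inference on inputs[0] using the continuously-updated parameters.
        Table results = ...; // inference logic elided
        return new Table[] {results};
    }

    /** Skipped the save/load and parameter methods. */
}

public class EstimatorA implements Estimator<EstimatorA, TransformerA> {
    @Override
    public TransformerA fit(Table... inputs) {
        // Trains on inputs[0] and emits the stream of parameter changes.
        Table state_table = ...; // training logic elided
        TransformerA transformerA = new TransformerA();
        transformerA.setStateStreams(state_table);
        return transformerA;
    }

    /** Skipped the save/load and parameter methods. */
}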

Here are the code snippets that address this use-case by using the proposed APIs.

First run the following code on clusterA:

Code Block
languagejava
void runTrainingOnClusterA(...) {
  // Creates the training stream from a Kafka topic.
  Table training_stream = ...;

  Estimator estimator = new EstimatorA(...);
  Transformer transformer = estimator.fit(training_stream);
  Table state_stream = transformer.getStateStreams()[0];

  // Writes the state_stream to Kafka topicA.
  state_stream.sinkTable(...);
  // Saves transformer's state/metadata to a remote path.
  transformer.save(remote_path);

  // Executes the operators generated by Estimator::fit(...), which read from training_stream and write to state_stream.
  env.execute();
}


Then run the following code on each web server:

Code Block
languagejava
void runInferenceOnWebServer(...) {
  // Creates the state stream from Kafka topicA which is written by the above code snippet. 
  Table state_stream = ...;
  // Creates the input stream that needs inference.
  Table input_stream = ...;

  Transformer transformer = new TransformerA(...);
  transformer.load(remote_path);
  transformer.setStateStreams(new Table[]{state_stream});
  Table output_stream = transformer.transform(input_stream);

  // Do something with the output_stream.

  // Executes the operators generated by Transformer::transform(...), which read from state_stream to update the model parameters.
  // They also do inference on input_stream and write results to output_stream.
  env.execute();
}


Compose an Estimator from a chain of Estimator/Transformer whose input schemas differ from those of its fitted Transformer

Suppose we have the following Estimator and Transformer classes where an Estimator's input schemas could be different from the input schema of its fitted Transformer:

  • TransformerA whose transform(...) takes 1 input table and has 1 output table.
  • EstimatorA whose fit(...) takes 2 input tables and returns an instance of TransformerA.
  • TransformerB whose transform(...) takes 1 input table and has 1 output table.

We want to compose an Estimator (i.e. a Graph) from the following DAG of Transformer/Estimator.

[Figure omitted: a DAG in which 2 input tables feed EstimatorA, whose fitted TransformerA is chained with TransformerB]

The resulting Graph::fit is expected to have the following behavior:

  • The method takes 2 input tables. Both tables are given to EstimatorA::fit.
  • EstimatorA fits the input tables and generates a TransformerA instance. The TransformerA instance takes 1 input table, which is different from the 2 tables given to EstimatorA.
  • Returns a GraphTransformer instance that contains the TransformerA instance and a TransformerB instance, connected as a chain.

The fitted GraphTransformer is represented by the following DAG:

[Figure omitted: the fitted GraphTransformer, i.e. TransformerA chained with TransformerB over 1 input table]

Notes:

  • The fitted GraphTransformer takes only 1 table as input whereas the Graph takes 2 tables as inputs.
  • The proposed APIs also support composing an Estimator from a DAG of Estimator/Transformer whose input schemas differ from those of its fitted Transformer.

Here is the code snippet that addresses this use-case by using the proposed APIs:

Code Block
languagejava
GraphBuilder builder = new GraphBuilder();

// Creates nodes
Stage<?> stage1 = new EstimatorA();
Stage<?> stage2 = new TransformerB();
// Creates inputs
TableId estimatorInput1 = builder.createTableId();
TableId estimatorInput2 = builder.createTableId();
TableId transformerInput1 = builder.createTableId();

// Feeds inputs to nodes and gets outputs.
TableId output1 = builder.getOutputs(stage1, new TableId[] {estimatorInput1, estimatorInput2}, new TableId[] {transformerInput1})[0];
TableId output2 = builder.getOutputs(stage2, output1)[0];

// Specifies the ordered lists of estimator inputs, transformer inputs, outputs, input states and output states
// that will be used as the inputs/outputs of the corresponding Graph and GraphTransformer APIs.
TableId[] estimatorInputs = new TableId[] {estimatorInput1, estimatorInput2};
TableId[] transformerInputs = new TableId[] {transformerInput1};
TableId[] outputs = new TableId[] {output2};
TableId[] inputStates = new TableId[] {};
TableId[] outputStates = new TableId[] {};

// Generates the Graph instance.
Graph graph = builder.build(estimatorInputs, transformerInputs, outputs, inputStates, outputStates);
// The fit method takes 2 tables which are mapped to estimatorInput1 and estimatorInput2.
GraphTransformer transformer = graph.fit(...);
// The transform method takes 1 table which is mapped to transformerInput1.
Table[] results = transformer.transform(...);


Compatibility, Deprecation, and Migration Plan

...