...

It is important to make the following observation: if we don't provide the Pipeline class, users can still accomplish the same use-cases targeted by Pipeline by explicitly writing the training logic and inference logic separately using Estimator/Transformer APIs. But users would have to construct this chain of Estimator/Transformer twice (for training and inference respectively).
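To make the observation above concrete, here is a minimal sketch in plain Python. The classes Scaler, MeanCenterer, MeanCentererModel, Pipeline and PipelineModel are hypothetical toy stand-ins, not the Flink ML, Spark or Scikit-learn classes, and a "table" is assumed to be a plain list of numbers. Without a Pipeline, the chain Scaler -> MeanCenterer is spelled out once in train() and again in infer(); with a Pipeline, the chain is declared once.

```python
# Hypothetical toy API for illustration only; not the Flink ML or scikit-learn classes.
# A "table" is assumed to be a plain list of floats.

class Scaler:
    """Transformer: multiplies every value by a fixed factor."""
    def __init__(self, factor):
        self.factor = factor

    def transform(self, table):
        return [x * self.factor for x in table]


class MeanCenterer:
    """Estimator: learns the mean of the training data."""
    def fit(self, table):
        return MeanCentererModel(sum(table) / len(table))


class MeanCentererModel:
    """Transformer produced by MeanCenterer.fit."""
    def __init__(self, mean):
        self.mean = mean

    def transform(self, table):
        return [x - self.mean for x in table]


# Without Pipeline: the chain Scaler -> MeanCenterer is written once for training ...
def train(train_table):
    scaler = Scaler(2.0)
    model = MeanCenterer().fit(scaler.transform(train_table))
    return scaler, model


# ... and the same chain has to be written again for inference.
def infer(scaler, model, table):
    return model.transform(scaler.transform(table))


class Pipeline:
    """Estimator composed from a linear chain of stages, declared once."""
    def __init__(self, stages):
        self.stages = stages

    def fit(self, table):
        fitted = []
        for stage in self.stages:
            # Estimators are fitted into models; Transformers are used as-is.
            model = stage.fit(table) if hasattr(stage, "fit") else stage
            table = model.transform(table)
            fitted.append(model)
        return PipelineModel(fitted)


class PipelineModel:
    """Transformer that applies the fitted stages in order."""
    def __init__(self, stages):
        self.stages = stages

    def transform(self, table):
        for stage in self.stages:
            table = stage.transform(table)
        return table


train_table = [1.0, 2.0, 3.0]

scaler, model = train(train_table)
manual = infer(scaler, model, [4.0])

pipeline_model = Pipeline([Scaler(2.0), MeanCenterer()]).fit(train_table)
piped = pipeline_model.transform([4.0])
```

Both paths produce the same result; the difference is only in how many times the user has to describe the chain.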

Design

...

Principles

Multiple choices exist to address the use-cases targeted by this design doc. Below, we explain the design principles followed by the proposed design, which should make the design choices easier to understand.

1) When the new use-case can be supported by just extending the arity of an existing API, we prefer to extend the arity of this API instead of adding a new class.

As a result of this philosophy, to support algorithms that can have multiple inputs and multiple outputs, we choose to extend Transformer::transform and Estimator::fit to take multiple Tables as inputs. We also extend Transformer::transform to return multiple Tables as outputs.

An alternative solution is to add new classes, e.g. MultiInputTransformer, which has a transform(...) method that takes multiple input Tables and returns multiple output Tables. In comparison to the proposed approach, this approach increases the number of classes that users have to deal with.
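The arity-extension idea can be sketched as follows. This is an illustrative toy in plain Python, not the proposed Flink ML interface: the Transformer base class and the Union, Splitter and Doubler subclasses are hypothetical, and a "table" is assumed to be a plain list of numbers. The point is that a single transform method taking any number of input tables and returning a list of output tables covers the single-input, multi-input and multi-output cases without a separate MultiInputTransformer-style class.

```python
# Hypothetical toy API for illustration only; a "table" is a plain list of numbers.

class Transformer:
    def transform(self, *inputs):
        """Takes any number of input tables and returns a list of output tables."""
        raise NotImplementedError


class Union(Transformer):
    """Multiple inputs, one output: concatenates all input tables."""
    def transform(self, *inputs):
        return [[row for table in inputs for row in table]]


class Splitter(Transformer):
    """One input, two outputs: rows below the threshold, and the rest."""
    def __init__(self, threshold):
        self.threshold = threshold

    def transform(self, *inputs):
        (table,) = inputs  # this transformer expects exactly one input table
        low = [row for row in table if row < self.threshold]
        high = [row for row in table if row >= self.threshold]
        return [low, high]


class Doubler(Transformer):
    """The ordinary single-input, single-output case uses the same method."""
    def transform(self, *inputs):
        (table,) = inputs
        return [[2 * row for row in table]]


(merged,) = Union().transform([1, 5], [3, 7])
low, high = Splitter(4).transform(merged)
(doubled,) = Doubler().transform(low)
```

Under this sketch, users only ever learn one class and one method signature; the cost is that a single-output caller has to unpack a one-element list.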


2) As much as possible, the API design should allow users to address the new use-case while still enjoying the existing benefits.

As described in the Background Section, the existing Pipeline class allows users to compose an Estimator from a linear chain of Estimator/Transformer, without requiring users to specify this linear chain twice. We consider this to be one of the most important features provided by the existing Scikit-learn/Spark/Flink ML API.

As a result of this philosophy, we believe it is important and intuitive to provide benefits similar to those of the existing Pipeline class, while allowing users to compose an Estimator from a DAG of Estimator/Transformer. Therefore, we propose to add the Graph/GraphModel/GraphBuilder classes to provide the following capabilities:

  • Allow users to compose an Estimator from a DAG of Estimator/Transformer, without requiring users to specify this DAG twice

...