Versions Compared

Key

  • This line was added.
  • This line was removed.
  • Formatting was changed.

...

8) There is no API provided by the Estimator/Transformer interface to validate the schema consistency of a Pipeline. Users would have to instantiate Tables (with I/O logics) and run fit/transform to know whether the stages in the Pipeline are compatible with each other.

Background

Note: Readers who are familiar with the existing Estimator/Transformer/Pipeline APIs can skip this section.

The design proposed in this doc is built on top of the design proposed in FLIP-39: Flink ML pipeline and ML libs, which in turn is motivated by the Estimator/Transformer/Pipeline design proposed in Scikit-learn. Please feel free to read FLIP-39 and Scikit-learn paper for more detail. In this section, we explain the key background information to help better understand the motivation and benefits of the proposed design, as well as the definition of some terminology.


1) What is the motivation of defining the Estimator/Transformer interfaces? How are these interfaces related to machine learning?

We expect most classic machine learning algorithms to have a training logic and inference logic. The training logic of the algorithm reads training data (e.g. labeled data) and updates its variables. The inference logic of the algorithm uses the updated variables to make prediction (e.g. label) for the unseen data.

The inference logic of this algorithm could be represented as a subclass of Transformer (say TransformerA). The Transformer subclass has a transform() method, which reads data (as Table) and outputs the prediction result (as Table).

The training logic of this algorithm could be represented as a subclass of Estimator (say EstimatorA). The Estimator subclass has a fit() method, which reads data (as Table) and outputs an instance of the TransformerA (defined above). The TransformerA instance contains the updated variables of this algorithm and can be used to make prediction for the unseen data.

Then, in order to do training and prediction using this algorithm, a user could write the following code:

Code Block
languagejava
Estimator estimator = new EstimatorA(...)

Table prediction_result = estimator.fit(training_data).transform(inference_data)


2) How does Pipeline work? What is the motivation of the Pipeline class?

The Pipeline is essentially created and used as a linear chain of Estimator/Transformer (referred to as stage below). And it is used as an Estimator (since it implements Estimator). 

The execution of Pipeline::fit is a bit more complicated than it looks, since the output of an Estimator stage is not used directly as the input of the next stage in the Pipeline.

We highlight the key execution detail of Pipeline::fit below:

  • The input of Pipeline::fit is given to the first stage in the Pipeline.
  • If a stage is Transformer, the given input is used to invoke Transformer::transform(...), whose output is used as the input of the next stage.
  • If a stage is Estimator, the given input is used to invoke Estimator::fit(...), which produces a Transformer instance. The same input is then given to the Transformer::transform of the fitted instance, whose output is used as the input of the next stage.
  • Eventually, any Estimator stage of this Pipeline will produce a Transformer, which are combined with the original Transformer stage of the Pipeline into a linear chain of Transformers. This linear chain of Transformers are composed into an instance of Transformer.

The motivation of the Pipeline class is to enable user to do the following:

  • Compose an Estimator from a linear chain of Estimator/Transformer
  • Use the Transformer instance fitted by this linear chain of Estimator/Transformer without additionally defining this Transformer instance (which should itself be a linear chain of Transformer) again.

It is important to make the following observation: if we don't provide the Pipeline class, users can still accomplish the same use-cases targeted by Pipeline by explicitly writing the training logic and inference logic separately using Estimator/Transformer APIs. But users would have to write this chain of Estimator/Transformer twice.

Public Interfaces

This FLIP proposes quite a few changes and addition to the existing Flink ML APIs. We first describe the proposed API additions and changes, followed by the API code of interfaces and classes after making the proposed changes.

...