...

Such a simple architecture makes it possible to implement many important model serving patterns, including the following:

Guaranteed execution time. Assuming that we have several models, the fastest of which has a fixed execution time, it is possible to provide a model serving implementation with a fixed upper limit on execution time, as long as that limit is larger than the execution time of that fastest (simplest) model.
Consensus-based model serving. Assuming that we have several models, we can implement model serving where the prediction is the one returned by the majority of the models.
Quality-based model serving. Assuming that we have a metric that allows us to evaluate the quality of model serving results, this approach allows us to pick the result with the best quality. It is, of course, possible to combine multiple features, for example, consensus-based model serving with guaranteed execution time, where the consensus result is used when it completes within a given time interval.
"Canary" deployment, where some of the requests are routed to the "new" executors.

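The first three patterns differ only in how the results collected from the individual models are reduced to a single answer. The following is a minimal Java sketch of such a decision step, not code from the repositories listed on this page. The names SpeculativeDecider, Scored, collectWithin, majority, best and decide are hypothetical, the quality score is assumed to be "higher is better", and falling back to the best-quality result when no majority is reached in time is an assumption made for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Optional;
import java.util.concurrent.CancellationException;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical helper for illustration; not taken from Flink or the FLIP repositories.
final class SpeculativeDecider {

    // One model's answer. `quality` is an assumed metric where higher is better.
    static final class Scored {
        final String prediction;
        final double quality;

        Scored(String prediction, double quality) {
            this.prediction = prediction;
            this.quality = quality;
        }
    }

    // Guaranteed execution time: keep only results that arrive before a shared deadline.
    static List<Scored> collectWithin(List<CompletableFuture<Scored>> pending, long timeoutMillis) {
        long deadline = System.nanoTime() + TimeUnit.MILLISECONDS.toNanos(timeoutMillis);
        List<Scored> done = new ArrayList<>();
        for (CompletableFuture<Scored> future : pending) {
            long remaining = Math.max(0L, deadline - System.nanoTime());
            try {
                // An already completed future returns at once, even when remaining is 0.
                done.add(future.get(remaining, TimeUnit.NANOSECONDS));
            } catch (TimeoutException | ExecutionException | CancellationException e) {
                // Too slow, failed, or cancelled: this model takes no part in the decision.
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                break;
            }
        }
        return done;
    }

    // Consensus: a prediction wins only if more than half of ALL models returned it.
    static Optional<String> majority(List<Scored> results, int totalModels) {
        Map<String, Integer> votes = new HashMap<>();
        for (Scored r : results) {
            votes.merge(r.prediction, 1, Integer::sum);
        }
        for (Map.Entry<String, Integer> e : votes.entrySet()) {
            if (e.getValue() * 2 > totalModels) {
                return Optional.of(e.getKey());
            }
        }
        return Optional.empty();
    }

    // Quality based: pick the result with the highest quality score.
    static Optional<Scored> best(List<Scored> results) {
        return results.stream().max(Comparator.comparingDouble((Scored r) -> r.quality));
    }

    // Combination: use the consensus if it is reached in time, otherwise (assumption)
    // fall back to the best-quality result among the models that did finish.
    static Optional<String> decide(List<CompletableFuture<Scored>> pending, long timeoutMillis) {
        List<Scored> done = collectWithin(pending, timeoutMillis);
        Optional<String> consensus = majority(done, pending.size());
        if (consensus.isPresent()) {
            return consensus;
        }
        return best(done).map(s -> s.prediction);
    }

    public static void main(String[] args) {
        List<CompletableFuture<Scored>> pending = Arrays.asList(
                CompletableFuture.completedFuture(new Scored("A", 0.6)),
                CompletableFuture.completedFuture(new Scored("A", 0.7)),
                new CompletableFuture<Scored>()); // a model that never answers
        System.out.println(decide(pending, 50));
    }
}
```

In this sketch the deadline bounds the total waiting time regardless of how many models are slow, which is what makes the upper limit on execution time fixed. In the main method two of the three models agree on "A" before the 50 ms deadline, so the consensus result is returned without waiting for the third model beyond that deadline.
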
Combining speculative execution with the "real time updatable" model serving described earlier leads to the following overall architecture:

...

Initial Implementation

Initial example implementations are provided for this FLIP:

Flink Model Server https://github.com/FlinkML/flink-modelServer is an implementation of basic model serving, provided in both Scala and Java. It implements both key-based and partition-based joins and supports both PMML and TensorFlow models.
Flink Speculative Model Server https://github.com/FlinkML/flink-speculative-modelServer is an implementation of a speculative model server in both Scala and Java. It is somewhat more limited: the speculative model server does not support the partition-based approach, and it uses only TensorFlow.
