

Status

Current state: Under Discussion

Discussion thread: here (<- link to https://mail-archives.apache.org/mod_mbox/flink-dev/)

JIRA: here (<- link to https://issues.apache.org/jira/browse/FLINK-XXXX)

Released: <Flink Version>

Please keep the discussion on the mailing list rather than commenting on the wiki (wiki discussions get unwieldy fast).

Motivation

Kubernetes (k8s) has become the predominant deployment platform for Flink. Over the past several years, a number of integrations have been developed that aim to help operationalise mission-critical Flink applications on k8s.

Flink comes with built-in k8s support, also referred to as Flink native k8s [2, 3, 4], which provides an alternative to the traditional standalone deployment mode. Independently, a number of k8s operators have been developed outside of Apache Flink, including [5, 6]. These implementations are not under the umbrella of a neutral entity like the ASF; as a result, they tend to lack wider community participation, and the projects go stale once their maintainers shift focus.

However, the operator concept is central to a Kubernetes-native deployment experience. It makes it possible to manage Flink applications and their lifecycle through standard k8s tooling such as kubectl. A key feature of an operator is the automation of application upgrades, which cannot be achieved through the Flink native (embedded) integration alone.

Public Interfaces

N/A

Proposed Changes

We are proposing to provide a Flink k8s operator implementation as part of Flink, maintained by the community and closely integrated with the Flink ecosystem. This implementation will benefit from the extensive experience of Flink community members with large-scale, mission-critical Flink deployments and from the learnings of existing operator implementations. As part of Flink, the operator will be better positioned to follow the development of Flink core, to influence changes to Flink core, and to benefit from the established collaboration processes of the project.

Initial Feature Set

For the initial version of the operator we aim to target core aspects of job lifecycle management.

  • CRD to express a Flink application (for details see the CRD section below; an illustrative sketch follows after this list)
    • External jar artifact fetcher support (S3, HTTPS, etc.) via init container
    • Supports all Flink configuration properties
    • Docker image
    • Upgrade policy (savepoint, stateless)
    • Restore policy (savepoint, latest externalized checkpoint, stateless)
    • JobManager and TaskManager pod templates (unrestricted k8s pod configuration)
    • Supports explicit session cluster (no job management) and application mode
      • session mode creates an empty session cluster, with no application/job management
      • the session cluster can be used to control jobs externally (e.g. job submission via the REST API)
  • Create & deploy new Flink application
    • Empty state
    • From savepoint
  • Upgrade Flink application (with or without savepoint) on any CR change, including:
    • Flink configuration change
    • Job jar change
    • Docker image change
  • Pause/Resume Flink application
    • the job stops processing data
    • the job is not deleted from the cluster
    • the job releases its resources back to the cluster (so they can be used by other jobs)
    • pause stops the job with a savepoint; the savepoint / latest checkpoint is tracked in the CR status for resume
  • Delete Flink application
  • Integrate with Flink Kubernetes HA module [4]
    • When enabled, the operator can obtain the latest checkpoint from the HA config map and does not depend on a potentially unavailable Flink job REST API (see the sketch after this list)
    • This should be the default, but not a hard dependency
  • Support Flink UI ingress
  • CI/CD producing the operator Docker image artifact; publish the image on Docker Hub
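
To make the CRD item above more concrete, below is a minimal, hypothetical sketch of how the custom resource could be modeled in Java with the fabric8 Kubernetes client, which is a common choice for writing k8s operators on the JVM. All names (FlinkApplication, the field names, the API group and version) are illustrative assumptions only; the actual schema is the subject of the CRD section referenced above.

```java
import java.util.Map;

import io.fabric8.kubernetes.api.model.Namespaced;
import io.fabric8.kubernetes.api.model.PodTemplateSpec;
import io.fabric8.kubernetes.client.CustomResource;
import io.fabric8.kubernetes.model.annotation.Group;
import io.fabric8.kubernetes.model.annotation.Version;

/** Hypothetical custom resource mirroring the feature list above; all names are illustrative. */
@Group("flink.apache.org")
@Version("v1alpha1")
public class FlinkApplication
        extends CustomResource<FlinkApplication.Spec, FlinkApplication.Status>
        implements Namespaced {

    /** Desired state; getters/setters omitted for brevity. */
    public static class Spec {
        public String image;                           // Flink Docker image
        public String jarURI;                          // s3:// or https:// artifact fetched by an init container
        public Map<String, String> flinkConfiguration; // arbitrary Flink configuration properties
        public PodTemplateSpec jobManagerPodTemplate;  // unrestricted k8s pod configuration
        public PodTemplateSpec taskManagerPodTemplate;
        public String mode;                            // "application" or "session" (empty session cluster)
        public String upgradePolicy;                   // "savepoint" or "stateless"
        public String restorePolicy;                   // "savepoint", "latest-externalized-checkpoint" or "stateless"
    }

    /** Observed state maintained by the operator. */
    public static class Status {
        public String jobStatus;                       // e.g. RUNNING, SUSPENDED
        public String lastSavepointLocation;           // tracked for pause/resume and upgrades
    }
}
```

With such a resource, users would manage applications through standard k8s tooling (e.g. kubectl apply on a FlinkApplication manifest), and the operator's reconcile loop would translate spec changes (image, jar, configuration) into the create, upgrade, pause/resume and delete actions listed above.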
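
Similarly, the following is a rough sketch of how the operator could look up the most recent checkpoint pointer from the Flink Kubernetes HA metadata instead of querying the job's REST API. The config map name and the "checkpointID-" key prefix are assumptions about how Flink's Kubernetes HA services store state handles; the exact layout is owned by Flink and would need to be confirmed against the HA module [4].

```java
import java.util.Comparator;
import java.util.Map;
import java.util.Optional;

import io.fabric8.kubernetes.api.model.ConfigMap;
import io.fabric8.kubernetes.client.KubernetesClient;
import io.fabric8.kubernetes.client.KubernetesClientBuilder;

/** Illustrative lookup of the most recent checkpoint entry in a Flink HA config map. */
public class LatestCheckpointLookup {

    // Assumption: checkpoint pointers are stored under keys with this prefix.
    static final String CHECKPOINT_KEY_PREFIX = "checkpointID-";

    static Optional<String> latestCheckpointKey(
            KubernetesClient client, String namespace, String configMapName) {
        ConfigMap cm = client.configMaps().inNamespace(namespace).withName(configMapName).get();
        if (cm == null || cm.getData() == null) {
            return Optional.empty();
        }
        Map<String, String> data = cm.getData();
        return data.keySet().stream()
                .filter(k -> k.startsWith(CHECKPOINT_KEY_PREFIX))
                // checkpoint IDs are zero-padded, so lexicographic order matches checkpoint order
                .max(Comparator.naturalOrder());
    }

    public static void main(String[] args) {
        try (KubernetesClient client = new KubernetesClientBuilder().build()) {
            // The config map name below is a placeholder; the real name depends on the cluster-id and job-id.
            latestCheckpointKey(client, "default", "my-flink-cluster-job-config-map")
                    .ifPresent(key -> System.out.println("Latest checkpoint entry: " + key));
        }
    }
}
```

The entry's value references a serialized state handle rather than a plain path, so the operator would still need to resolve it the same way Flink's HA services do; the point of the sketch is only that the lookup does not require a running JobManager REST endpoint.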

Compatibility, Deprecation, and Migration Plan

  • What impact (if any) will there be on existing users? 
  • If we are changing behavior how will we phase out the older behavior? 
  • If we need special migration tools, describe them here.
  • When will we remove the existing behavior?

Test Plan

Describe in a few sentences how the FLIP will be tested. We are mostly interested in system tests (since unit tests are specific to implementation details). How will we know that the implementation works as expected? How will we know nothing broke?

Rejected Alternatives

If there are alternative ways of accomplishing the same thing, what were they? The purpose of this section is to motivate why the design is the way it is and not some other way.
