Status

Current state: Drafting

Discussion thread: here (<- link to https://mail-archives.apache.org/mod_mbox/flink-dev/)

JIRA: here (<- link to https://issues.apache.org/jira/browse/FLINK-XXXX)

Released: <Flink Version>

Please keep the discussion on the mailing list rather than commenting on the wiki (wiki discussions get unwieldy fast).

Fine-Grained Resource Management is one of the Apache Flink roadmap features that the community has been working to deliver. While FLIP-56 delivers the ability to allocate slots according to fine-grained resource requirements, the question of how to obtain those resource requirements remains unanswered. In this FLIP, we discuss what the runtime interfaces for fine-grained resource requirements should look like, with respect to usability, flexibility, and how resources are used at runtime.

Note: This FLIP mainly focuses on discussing and reasoning about the design choices. The changes needed for the proposed design are straightforward.

Background on Fine-Grained Resource Management

Motivation

Flink currently adopts a coarse-grained resource management approach, where tasks are deployed into predefined, usually identical slots without the notion of how many resources each slot contains. With slot sharing, tasks in the same Slot Sharing Group (SSG) can be deployed into one slot regardless of how many resources each task/operator needs. In FLIP-56, we proposed fine-grained resource management, which leverages slots with different resources for task execution, with respect to the workload’s resource requirements.

For many jobs, using coarse-grained resource management and simply putting all tasks into one SSG works well enough, in terms of both resource utilization and usability.

For many streaming jobs in which all tasks have the same parallelism, each slot will contain an entire pipeline. Ideally, all pipelines should use roughly the same resources, which can be satisfied easily by tuning the resources of the identical slots.
Resource consumption of tasks varies over time. When the consumption of one task decreases, the freed resources can be used by another task whose consumption is increasing. This, known as the peak shaving and valley filling effect, reduces the overall resources needed.
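
The peak shaving and valley filling effect can be illustrated with a small worked example. The sketch below is not taken from Flink; the two usage series are made-up numbers, chosen only to show that the peak of the combined usage can be lower than the sum of the individual peaks when tasks share a slot.

```python
# Hypothetical CPU usage (in cores) of two tasks over four time intervals.
# All numbers are invented for illustration.
task_a = [1.0, 2.0, 3.0, 2.0]
task_b = [3.0, 2.0, 1.0, 2.0]

# Sizing each task for its own peak and adding the results.
sum_of_peaks = max(task_a) + max(task_b)

# Sizing one shared slot for the peak of the combined usage.
combined = [a + b for a, b in zip(task_a, task_b)]
peak_of_sum = max(combined)

print(f"sum of individual peaks: {sum_of_peaks} cores")
print(f"peak of combined usage:  {peak_of_sum} cores")
```

With these numbers, sizing the tasks separately would reserve 6 cores, while a shared slot only needs 4. The saving depends entirely on how the usage patterns overlap, so it should not be read as a general figure.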

However, there are cases where coarse-grained resource management does not work well.

Tasks may have different parallelisms. Sometimes, such different parallelisms cannot be avoided. E.g., the parallelism of source/sink/lookup tasks might be constrained by the partitions and IO load of the external upstream/downstream system. In such cases, slots with fewer tasks would need fewer resources than those with the entire pipeline of tasks.
Sometimes the resources needed for the entire pipeline might be too much to fit into a single slot/taskmanager. In such cases, the pipeline needs to be split into multiple SSGs, which may not all have the same resource requirements.
For batch jobs, not all the tasks can be executed at the same time. Thus, the instantaneous resource requirement of the pipeline changes over time.

Trying to execute all tasks with identical slots can result in non-optimal resource utilization. Each of the identical slots must be large enough to fulfill the highest resource requirement, which is wasteful for tasks with lower requirements. When expensive external resources like GPUs are involved, such waste becomes even harder to afford.
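
To make the waste concrete, here is a minimal sketch of the different-parallelism case. It assumes a simplified placement in which slot i hosts subtask i of every operator whose parallelism is greater than i; the operator names, parallelisms, and CPU figures are hypothetical and do not come from Flink.

```python
# Hypothetical operators: parallelism and CPU cores needed per subtask.
operators = {
    "source": {"parallelism": 2, "cpu": 1.0},
    "map": {"parallelism": 4, "cpu": 2.0},
}

# With a single slot sharing group, the job needs as many slots as the
# highest operator parallelism.
num_slots = max(op["parallelism"] for op in operators.values())

# Simplifying assumption: slot i hosts subtask i of every operator whose
# parallelism is greater than i.
slot_needs = [
    sum(op["cpu"] for op in operators.values() if i < op["parallelism"])
    for i in range(num_slots)
]

# Coarse-grained: every slot is sized for the most demanding slot.
coarse_total = num_slots * max(slot_needs)
# Fine-grained: every slot is sized for what it actually hosts.
fine_total = sum(slot_needs)
wasted = coarse_total - fine_total

print(slot_needs, coarse_total, fine_total, wasted)
```

In this example, two slots host the full pipeline and two host only the map subtask, so identical slots reserve 12 cores where 10 would be enough.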

Therefore, fine-grained resource management is needed, which leverages slots of different resources to improve resource utilization in such scenarios.

Current Condition

Currently, most of the slot allocation and scheduling logic proposed in FLIP-56 has already been implemented, except for a slot manager plugin which is still in progress (FLINK-20835). The major missing part is user interfaces for specifying resource requirements for a job.

There is some legacy code for setting operator resources on Transformation and aggregating them to generate slot requests. However, this code has never really been used, and no APIs are exposed to users. Most importantly, we are not sure that letting users specify operator-level resource requirements and aggregating them at runtime is the right approach, which will be discussed in subsequent sections.
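
For readers unfamiliar with the operator-level approach, the following sketch shows the general idea of aggregating per-operator requirements into one slot request per SSG. It is an illustration only, assuming plain summation of CPU and memory; it is not the existing Flink code, and every name and number in it is hypothetical.

```python
from collections import defaultdict

# Hypothetical operator-level requirements:
# operator name -> (SSG name, CPU cores, memory in MB).
operator_specs = {
    "source": ("default", 0.5, 256),
    "map": ("default", 1.0, 512),
    "sink": ("default", 0.5, 256),
}


def aggregate_by_ssg(specs):
    """Sum the requirements of all operators that share an SSG (assumed rule)."""
    totals = defaultdict(lambda: [0.0, 0])
    for ssg, cpu, memory_mb in specs.values():
        totals[ssg][0] += cpu
        totals[ssg][1] += memory_mb
    return {ssg: (cpu, memory_mb) for ssg, (cpu, memory_mb) in totals.items()}


slot_requests = aggregate_by_ssg(operator_specs)
print(slot_requests)
```

Note that plain summation sizes the slot for the sum of the operators' individual requirements, which ignores any sharing of resources between operators in the same slot.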

Scope

This FLIP proposes Slot Sharing Group (SSG) based runtime interfaces for specifying fine-grained resource requirements. To be specific, we discuss how resource requirements are specified at the Transformation layer and leveraged afterward, which covers the common path of both Table/SQL API and DataStream API workloads.

The end-user interfaces for specifying resource requirements are excluded from the scope of this FLIP, for the following reasons.

Fine-grained resource management is not yet end-to-end ready. We believe exposing the user APIs should be the last step in activating the feature.
Different development APIs may expose the interfaces for specifying resource requirements differently. Deciding how each development API should integrate this feature requires more in-depth discussion with the component experts.

The following examples are only preliminary ideas that demonstrate what the user interfaces may look like in different development APIs.

For DataStream API, there are already interfaces for setting SSGs for operators. Based on this, we can introduce new interfaces to specify resource requirements for the SSGs directly.
For Table API & SQL, since neither the concept of operator nor SSG is exposed, the planner probably should generate the SSG resource requirements, exposing only a few configuration knobs to users.
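
As a rough illustration of the SSG-based idea, the sketch below models resource requirements that are declared once per SSG and then turned into slot requests. It does not use any Flink API: `SsgResource`, `slot_requests`, and all names and values are hypothetical, and it assumes that an SSG needs as many slots as the highest parallelism among its operators.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SsgResource:
    """Hypothetical resource requirement attached to one slot sharing group."""
    cpu_cores: float
    memory_mb: int


# Operator name -> (SSG name, parallelism). All values are invented.
operators = {
    "source": ("io", 2),
    "sink": ("io", 2),
    "aggregate": ("compute", 8),
}

# Requirements are declared once per SSG, not per operator.
ssg_resources = {
    "io": SsgResource(cpu_cores=0.5, memory_mb=512),
    "compute": SsgResource(cpu_cores=2.0, memory_mb=4096),
}


def slot_requests(operators, ssg_resources):
    """Return {SSG name: (number of slots, resource per slot)}."""
    slots_per_ssg = {}
    for ssg, parallelism in operators.values():
        # Assumption: an SSG needs as many slots as its highest parallelism.
        slots_per_ssg[ssg] = max(slots_per_ssg.get(ssg, 0), parallelism)
    return {ssg: (n, ssg_resources[ssg]) for ssg, n in slots_per_ssg.items()}


requests = slot_requests(operators, ssg_resources)
for ssg, (count, resource) in requests.items():
    print(ssg, count, resource)
```

In this model the user (or the planner, for Table API & SQL) only has to reason about two groups rather than three operators, which is the kind of usability trade-off the granularity discussion is concerned with.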

Granularity of Fine-Grained Resource Requirements

In this section, we discuss the granularity at which fine-grained resource requirements should be specified. This is the most fundamental question that must be answered before designing the runtime interfaces.

To be specific, we discuss the pros and cons of three design options: specifying resource requirements for each operator, each task, or each slot sharing group.

User Story

Before diving into the design options, we’d like to clarify the user story of fine-grained resource management, which helps in understanding how the pros and cons of each design option affect our target use cases.

We believe fine-grained resource management is not a replacement for existing approaches, but an extension of the range of user involvement in controlling Flink’s resource usage. Users may choose how involved they’d like to be, according to their expertise and requirements.

The least involved option is to rely on the out-of-the-box coarse-grained resource configurations. This should work in most simple use cases, especially for beginners trying out Flink. However, resource utilization is usually not optimal.
Production use usually requires more user involvement: specifying the operator parallelisms, configuring coarse-grained slot/taskmanager resources, and splitting slot sharing groups.
For cases where coarse-grained resource management does not work well (as discussed above), fine-grained resource management gives expert users a way to further optimize resource utilization by controlling how many resources each part of the pipeline should use, at the price of more user involvement.

Public Interfaces

Briefly list any new interfaces that will be introduced as part of this proposal or any existing interfaces that will be removed or changed. The purpose of this section is to concisely call out the public contract that will come along with this feature.

A public interface is any change to the following:

DataStream and DataSet API, including classes related to that, such as StreamExecutionEnvironment

Classes marked with the @Public annotation

On-disk binary formats, such as checkpoints/savepoints

User-facing scripts/command-line tools, i.e. bin/flink, Yarn scripts, Mesos scripts

Configuration settings

Exposed monitoring information

Proposed Changes

Describe the new thing you want to do in appropriate detail. This may be fairly extensive and have large subsections of its own. Or it may be a few sentences. Use judgement based on the scope of the change.

Compatibility, Deprecation, and Migration Plan

What impact (if any) will there be on existing users?
If we are changing behavior how will we phase out the older behavior?
If we need special migration tools, describe them here.
When will we remove the existing behavior?

Test Plan

Describe in a few sentences how the FLIP will be tested. We are mostly interested in system tests (since unit tests are specific to implementation details). How will we know that the implementation works as expected? How will we know nothing broke?

Rejected Alternatives

If there are alternative ways of accomplishing the same thing, what were they? The purpose of this section is to motivate why the design is the way it is and not some other way.


FLIP-156: Runtime Interfaces for Fine-Grained Resource Requirements