Discussion thread	To be created
Vote thread	To be created
Release	1.19

Motivation

Currently, users have to click on every operator and check how much data each sub-task is processing to see whether there is data skew. This is particularly cumbersome and error-prone for jobs with large job graphs. Data skew is an important metric and should be more visible.

Public Interfaces

This FLIP only changes exposed monitoring information. It is not expected to change the programming interface; the change is confined to the Flink Dashboard UI. The monitoring data to be exposed is already available via the Flink REST API.
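As a rough illustration of where the per-sub-task record counts could come from, the sketch below reads them out of an already-parsed job vertex details response. It assumes the response (for example from `GET /jobs/:jobid/vertices/:vertexid`) lists each sub-task with a `subtask` index and a `metrics` object containing `read-records`. The payload is a trimmed, hand-written sample, and the function name is hypothetical.

```python
from typing import Any


def records_received_per_subtask(vertex_details: dict[str, Any]) -> list[int]:
    """Return the number of records received by each sub-task, ordered by sub-task index.

    Assumption: every entry in "subtasks" carries a "subtask" index and a
    "metrics" object with a "read-records" counter.
    """
    subtasks = sorted(vertex_details["subtasks"], key=lambda s: s["subtask"])
    return [int(s["metrics"]["read-records"]) for s in subtasks]


# Hand-written sample for illustration only, not output from a real job.
sample_response = {
    "id": "hypothetical-vertex-id",
    "name": "hypothetical-operator",
    "subtasks": [
        {"subtask": 1, "metrics": {"read-records": 10}},
        {"subtask": 0, "metrics": {"read-records": 1}},
    ],
}

print(records_received_per_subtask(sample_response))
```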

Proposed Changes

Data Skew Score

This section is in progress

Metric 1:

To measure data skew, the following formula could be used for each operator:

(max(number_of_records_received_by_subtasks) - min(number_of_records_received_by_subtasks)) / sum(number_of_records_received_by_subtasks)

Example scenarios:

An operator has 10 sub-tasks, 1 of which has received only 1 record. The other 9 sub-tasks have each received 10 records.
Data skew score: (10 - 1) / (9*10 + 1*1) = 9/91 ≈ 10%
An operator has 10 sub-tasks. 5 sub-tasks have each received only 1 record. Each of the other 5 sub-tasks has received 10 records.
Data skew score: (10 - 1) / (5*1 + 5*10) = 9/55 ≈ 16%
An operator has 10 sub-tasks. 5 sub-tasks have each received only 1 record. Each of the other 5 sub-tasks has received 50 records.
Data skew score: (50 - 1) / (5*1 + 5*50) = 49/255 ≈ 19%
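The Metric 1 formula can be sketched in a few lines to make the scenarios above reproducible. The function name is hypothetical, and returning 0 for an operator that has not received any records yet is an assumption, since the formula itself is undefined when the sum is 0.

```python
def skew_score_metric_1(records_per_subtask: list[int]) -> float:
    """(max - min) / sum over the records received by each sub-task."""
    total = sum(records_per_subtask)
    if total == 0:
        # Assumption: no records received yet is reported as "no skew".
        return 0.0
    return (max(records_per_subtask) - min(records_per_subtask)) / total


scenarios = {
    "1 sub-task with 1 record, 9 with 10 records": [1] + [10] * 9,
    "5 sub-tasks with 1 record, 5 with 10 records": [1] * 5 + [10] * 5,
    "5 sub-tasks with 1 record, 5 with 50 records": [1] * 5 + [50] * 5,
}

for description, records in scenarios.items():
    print(f"{description}: {skew_score_metric_1(records):.0%}")
```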

Metric 2:

min(number_of_records_received_by_subtasks)/max(number_of_records_received_by_subtasks)
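A minimal sketch of Metric 2 follows, with a hypothetical function name. Note that this ratio is 1 when all sub-tasks have received the same number of records and approaches 0 as skew grows, so it points in the opposite direction to Metric 1. Treating an operator with no records as balanced is an assumption, and the list is assumed to be non-empty.

```python
def skew_score_metric_2(records_per_subtask: list[int]) -> float:
    """min / max over the records received by each sub-task."""
    largest = max(records_per_subtask)
    if largest == 0:
        # Assumption: no records received yet is reported as perfectly balanced.
        return 1.0
    return min(records_per_subtask) / largest


print(skew_score_metric_2([1] + [10] * 9))       # 1 of 10 sub-tasks lags behind
print(skew_score_metric_2([1] * 5 + [50] * 5))   # half of the sub-tasks lag behind
print(skew_score_metric_2([7, 7, 7]))            # perfectly balanced
```

If a "higher means more skew" reading is preferred for the UI, one option would be to display 1 minus this ratio.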

Metric 3:

The above metrics have their shortcomings. Metric 1 is too "conservative", i.e. it returns rather low scores for what I'd consider significant skew. Statisticians have already defined skewness metrics that take the median and deviation into account; I will look into those.
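One existing statistic that fits the description "median and deviation" is Pearson's second (median) skewness coefficient, 3 * (mean - median) / standard deviation. This FLIP does not commit to it; the sketch below is only meant to show how such a candidate behaves on per-sub-task record counts. The function name is hypothetical, and returning 0 when the deviation is 0 is an assumption.

```python
import statistics


def pearson_median_skewness(records_per_subtask: list[int]) -> float:
    """Pearson's second skewness coefficient: 3 * (mean - median) / stddev."""
    mean = statistics.mean(records_per_subtask)
    median = statistics.median(records_per_subtask)
    deviation = statistics.pstdev(records_per_subtask)
    if deviation == 0:
        # Assumption: identical counts on all sub-tasks are reported as 0.
        return 0.0
    return 3 * (mean - median) / deviation


print(pearson_median_skewness([1] + [10] * 9))      # one lagging sub-task
print(pearson_median_skewness([1] * 5 + [10] * 5))  # symmetric 50/50 split
```

A caveat worth keeping in mind when evaluating candidates of this kind: statistical skewness measures asymmetry of the distribution, so the symmetric 50/50 split above scores 0 even though half of the sub-tasks receive ten times more records than the other half.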

UI Changes

Additional "data skew" metric on the Flink job graph

As shown in the screenshot below, each operator on the Flink job graph UI would show an additional data skew score.

Additional tab to list all operators and their data skew score in descending order of their data skew score

The proposed tab would sit next to the Exceptions tab, since its purpose seems closer to that of the Exceptions tab than to that of the other tabs. It is highlighted in red in the screenshot below.

This FLIP does not describe in detail how the UI of the new Data Skew tab should look. Its look should be consistent with the rest of the UI. The list of checkpoints under the Checkpoints tab could be used for inspiration.
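The ordering of the proposed list (operators in descending order of their data skew score) can be sketched as follows. The row type, field names, and operator names are hypothetical and only illustrate the sort; the actual tab would be built in the Dashboard's own UI code.

```python
from typing import NamedTuple


class OperatorSkew(NamedTuple):
    operator_name: str   # hypothetical field: display name of the operator
    skew_score: float    # hypothetical field: data skew score as a fraction


def sort_by_skew_descending(rows: list[OperatorSkew]) -> list[OperatorSkew]:
    """Return the rows with the most skewed operator first."""
    return sorted(rows, key=lambda row: row.skew_score, reverse=True)


rows = [
    OperatorSkew("hypothetical-source", 0.02),
    OperatorSkew("hypothetical-keyed-aggregation", 0.19),
    OperatorSkew("hypothetical-sink", 0.10),
]

for row in sort_by_skew_descending(rows):
    print(f"{row.operator_name}: {row.skew_score:.0%}")
```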

Compatibility, Deprecation, and Migration Plan

What impact (if any) will there be on existing users?

No negative impact. Existing users will get a better overview of the data skew state of their jobs. They will be able to see at a glance whether a job is suffering from data skew.

If we are changing behavior how will we phase out the older behavior?

N/A

If we need special migration tools, describe them here.

N/A

When will we remove the existing behavior?

N/A

Test Plan

UI Tests

The following end-to-end test scenarios will be carried out for the UI:

Given a job with no or close to 0 data skew, all operators show a data skew score of 0% or a figure close to 0%
Given a job with an operator that is suffering from data skew of about 50%, the figure is accurately reflected on the operator on the Flink job graph
The above scenarios are also tested under the new Data Skew tab
- Operators are sorted according to their data skew score

Data Skew Score Tests

Rejected Alternatives

If there are alternative ways of accomplishing the same thing, what were they? The purpose of this section is to motivate why the design is the way it is and not some other way.

FLIP-417: Show data skew score on Flink Dashboard