...

Document the state by adding a label to the FLIP page with one of "discussion", "accepted", "released", "rejected".

Discussion thread: To be created
Vote thread: To be created
JIRA: FLINK-34025
Release: 1.19


Motivation

Currently, users have to click on every operator and check how much data each sub-task is processing to see whether there is data skew. This is particularly cumbersome and error-prone for jobs with large job graphs. Data skew is an important metric that should be more visible.

Public Interfaces

Exposed monitoring information. This is not expected to be a change in the programming interface, but rather a change in the Flink Dashboard UI. The monitoring data to be exposed is already available via the Flink REST API.
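To illustrate what "already available" could mean in practice, here is a minimal sketch that pulls per-sub-task record counts out of a vertex details response. The payload below is hand-written for illustration, and its shape (a `subtasks` array whose entries carry `metrics` with a `read-records` counter) is an assumption about the REST response rather than something this FLIP specifies. The helper name `records_per_subtask` is hypothetical.

```python
from __future__ import annotations

import json

# Hand-written, illustrative payload. The field names are an ASSUMPTION about
# the shape of a vertex details response; the ids and numbers are made up.
SAMPLE_VERTEX_RESPONSE = """
{
  "id": "hypothetical-vertex-id",
  "name": "hypothetical-operator",
  "subtasks": [
    {"subtask": 1, "metrics": {"read-records": 1200}},
    {"subtask": 0, "metrics": {"read-records": 1000}},
    {"subtask": 2, "metrics": {"read-records": 9000}}
  ]
}
"""


def records_per_subtask(vertex_json: str) -> list[int]:
    """Return the received-record count of each sub-task, ordered by sub-task index."""
    vertex = json.loads(vertex_json)
    subtasks = sorted(vertex["subtasks"], key=lambda s: s["subtask"])
    return [int(s["metrics"]["read-records"]) for s in subtasks]


counts = records_per_subtask(SAMPLE_VERTEX_RESPONSE)
print(counts)
```

A list like `counts` is the only input a data skew score would need per operator, which is why no new programming interface is expected.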

Proposed Changes

Data Skew Score

This section is in progress

...

The metrics above have their shortcomings. Metric 1 is too "conservative", i.e. it returns rather low scores for what I would consider significant skew. Statisticians have already defined skewness metrics that take the median and deviation into account; I will look into those.
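As one example of such a statistical metric, the sketch below computes Pearson's second skewness coefficient, 3 * (mean - median) / standard deviation, over the record counts of an operator's sub-tasks. This is not a metric the FLIP has settled on; it is shown only to make the idea concrete. The function name and the sample inputs are hypothetical, and how the result would be mapped to a percentage for the UI is left open.

```python
import statistics


def pearson_median_skewness(records_per_subtask):
    """Pearson's second skewness coefficient: 3 * (mean - median) / stddev.

    Returns 0.0 when there are fewer than two sub-tasks or when all
    sub-tasks processed the same amount of data (zero deviation).
    """
    if len(records_per_subtask) < 2:
        return 0.0
    mean = statistics.mean(records_per_subtask)
    median = statistics.median(records_per_subtask)
    stdev = statistics.pstdev(records_per_subtask)  # population std deviation
    if stdev == 0:
        return 0.0
    return 3 * (mean - median) / stdev


# Hypothetical inputs: one balanced operator, one with a single hot sub-task.
balanced = pearson_median_skewness([100, 100, 100, 100])
skewed = pearson_median_skewness([100, 100, 100, 900])
print(balanced, skewed)
```

The coefficient is positive when a few sub-tasks receive far more data than the typical sub-task, which is the case the Flink UI would want to surface.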

UI Changes

Additional "data skew" metric on the Flink job graph

As shown in the screenshot below, each operator on the Flink job graph UI would show an additional data skew score.

Additional tab to list all operators and their data skew score in descending order of their data skew score

The proposed tab would sit next to the Exceptions tab, since its purpose seems closer to that of the Exceptions tab than to that of the other tabs. It is highlighted in red in the screenshot below.

...

This FLIP does not talk in detail about how the UI of this new Data Skew tab should look. The look should be compatible with the rest of the UI. The list of checkpoints under the Checkpoints tab could be used for inspiration.

Compatibility, Deprecation, and Migration Plan

  • What impact (if any) will there be on existing users?

...

  • When will we remove the existing behavior?

N/A

Test Plan

UI Tests

The following end-to-end test scenarios will be carried out for the UI:

  • Given a job with no data skew or close to none, all operators show a data skew score of 0% or a figure close to 0%
  • Given a job with an operator that is suffering from data skew of about 50%, the figure is accurately reflected on the operator on the Flink job graph
  • The above scenarios are also tested under the new Data Skew tab
    • Operators are sorted in descending order of their data skew score
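The ordering check in the last scenario could be expressed along these lines. This is a minimal sketch with made-up operator names and scores; `OperatorSkew`, `sort_by_skew_descending` and `is_sorted_descending` are hypothetical helpers, not part of the Flink code base, and the real UI tests would run against the dashboard rather than plain data.

```python
from typing import List, NamedTuple


class OperatorSkew(NamedTuple):
    name: str            # operator name as shown in the job graph
    skew_percent: float  # data skew score, expressed as a percentage


def sort_by_skew_descending(rows: List[OperatorSkew]) -> List[OperatorSkew]:
    """Order rows as the Data Skew tab would list them: worst skew first."""
    return sorted(rows, key=lambda row: row.skew_percent, reverse=True)


def is_sorted_descending(rows: List[OperatorSkew]) -> bool:
    """Check that each row's score is at least as large as the next row's."""
    return all(a.skew_percent >= b.skew_percent for a, b in zip(rows, rows[1:]))


# Made-up sample rows for illustration only.
rows = [
    OperatorSkew("source", 0.0),
    OperatorSkew("keyed-aggregate", 50.0),
    OperatorSkew("sink", 3.5),
]
listed = sort_by_skew_descending(rows)
print([row.name for row in listed])
```

Python's `sorted` is stable even with `reverse=True`, so operators with equal scores keep their original relative order, which keeps the listing deterministic between refreshes.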

Data Skew Score Tests


Rejected Alternatives

If there are alternative ways of accomplishing the same thing, what were they? The purpose of this section is to motivate why the design is the way it is and not some other way.