...

Number of records received by subtasks within an operator | Skewness
1 1 1 5 10 (i.e. five subtasks: the first three each receive 1 record, and the last two receive 5 and 10 records respectively) | 88%
5 5 5 5 5 | 0%
1 1 5 5 5 | 54%
4 5 5 5 5 | 7%
0 0 0 0 0 | 0 (idle operator)

The accumulation of "received number of records" over a long period of time can hide a recent data skew event. The same can also hide a recent fix to an existing data skew problem. Therefore the proposed metric will need to look at the change in the received number of records within a period, similar to the existing "Backpressure" or "Busy" metrics on the Flink Job Graph.
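
The effect described above can be illustrated with a small sketch. It assumes each subtask exposes a cumulative "received number of records" counter and that two snapshots are taken one sampling period apart. The function names, the sample numbers, and the use of the busiest subtask's share of the total as a stand-in for skew are all illustrative assumptions, not the data skew score proposed by this FLIP.

```python
from typing import List


def window_deltas(previous: List[int], current: List[int]) -> List[int]:
    """Records received per subtask between two cumulative snapshots."""
    if len(previous) != len(current):
        raise ValueError("snapshots must cover the same subtasks")
    return [curr - prev for prev, curr in zip(previous, current)]


def max_share(counts: List[int]) -> float:
    """Fraction of all records handled by the busiest subtask (0.0 if idle).

    Hypothetical stand-in used only to make the skew visible; it is NOT the
    data skew score proposed by this FLIP.
    """
    total = sum(counts)
    return max(counts) / total if total else 0.0


# Hypothetical cumulative counters for three subtasks, one period apart.
snapshot_before = [1_000_000, 1_000_000, 1_000_000]
snapshot_after = [1_000_010, 1_000_010, 1_000_500]

recent = window_deltas(snapshot_before, snapshot_after)

print(recent)                                # [10, 10, 500]
print(round(max_share(snapshot_after), 3))   # 0.333 (accumulated view looks balanced)
print(round(max_share(recent), 3))           # 0.962 (windowed view exposes the skew)
```

With these made-up numbers, the accumulated counters suggest an even split across the three subtasks, while the per-period deltas show that almost all recent records went to one subtask.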


See the "rejected alternatives" section for other metrics that were considered.

Note that this FLIP is designed to address an immediate gap in the monitoring of Flink jobs in the Flink UI. If the proposed data skew score is found to be insufficient, or if users prefer a different metric, it can be improved in future FLIPs, and can indeed be made configurable (e.g. by using the Strategy Pattern).
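
As a rough sketch of what such configurability might look like, the snippet below selects a scoring function by name. Everything in it is an assumption made for illustration: the registry, the strategy names, and the two example measures (relative mean absolute deviation and coefficient of variation) are not part of this FLIP, and neither measure is claimed to be the proposed data skew score.

```python
from statistics import mean, pstdev
from typing import Callable, Dict, Sequence

# A strategy maps per-subtask record counts to a score (0.0 means no skew).
SkewStrategy = Callable[[Sequence[int]], float]


def relative_mean_abs_deviation(counts: Sequence[int]) -> float:
    """Mean absolute deviation from the mean, divided by the mean."""
    avg = mean(counts)
    if avg == 0:
        return 0.0  # idle operator
    return mean(abs(c - avg) for c in counts) / avg


def coefficient_of_variation(counts: Sequence[int]) -> float:
    """Population standard deviation divided by the mean."""
    avg = mean(counts)
    if avg == 0:
        return 0.0  # idle operator
    return pstdev(counts) / avg


# Hypothetical registry; the keys are made-up configuration values.
STRATEGIES: Dict[str, SkewStrategy] = {
    "relative-mad": relative_mean_abs_deviation,
    "cv": coefficient_of_variation,
}


def skew_score(counts: Sequence[int], strategy_name: str = "relative-mad") -> float:
    """Score non-empty per-subtask counts with the configured strategy."""
    return STRATEGIES[strategy_name](counts)


print(skew_score([5, 5, 5, 5, 5]))        # 0.0
print(skew_score([5, 5, 5, 5, 5], "cv"))  # 0.0
```

Swapping the metric then only requires registering another function; callers of `skew_score` stay unchanged.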

UI Changes

Additional "Data Skew" Metric on the Flink Job Graph

As shown in the screenshot below, each operator on the Flink job graph UI would show an additional data skew score.

...

Additional Tab to List All Operators and Their Data Skew Score in Descending Order of Their Data Skew Score

The proposed tab would sit next to the Exceptions tab, since its purpose is closer to that of the Exceptions tab than to the other tabs. It is highlighted in red in the screenshot below.

...

This FLIP does not specify in detail how the UI of the new Data Skew tab should look. Its look should be consistent with the rest of the UI. The list of checkpoints under the Checkpoints tab could be used for inspiration.

This new Data Skew tab will show the overall accumulated data skew score of each operator, as opposed to the current (live) view proposed in the Additional "Data Skew" Metric on the Flink Job Graph section. The page will also contain a definition of data skew and of the metric used to calculate it.
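
The ordering used by this tab could be sketched as follows. The operator names and scores are made up for illustration, and breaking ties by operator name is an assumption rather than something this FLIP specifies.

```python
from typing import Dict, List, Tuple


def rank_operators_by_skew(scores: Dict[str, float]) -> List[Tuple[str, float]]:
    """Return (operator, score) pairs, highest data skew score first.

    Ties are broken alphabetically by operator name so the listing is stable.
    """
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


# Hypothetical accumulated scores, expressed as fractions (0.88 means 88%).
accumulated_scores = {
    "Source": 0.07,
    "KeyedAggregate": 0.88,
    "Sink": 0.0,
    "Join": 0.54,
}

for operator, score in rank_operators_by_skew(accumulated_scores):
    print(f"{operator}: {score:.0%}")
```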

Compatibility, Deprecation, and Migration Plan

...