...

Exposed monitoring information. This is not expected to be a change to the programming interface, but rather a change in the Flink Dashboard UI and an addition to the list of available metrics. The monitoring data to be exposed is already available via the Flink REST API.

Proposed Changes

Data Skew Score

Using a variant of the metric definition in https://en.wikipedia.org/wiki/Coefficient_of_variation

Code Block
min(average_absolute_deviation(list_of_number_of_records_received_by_each_subtask) / mean(list_of_number_of_records_received_by_each_subtask), 100)

Instead of using standard deviation, I chose to use the "average absolute deviation", which avoids the multiplication and square-root operations and simplifies the metric calculation while serving the same purpose and yielding similar results.

We need to cap the metric at 100% because the Coefficient of Variation calculation can exceed 100% when there is large variance in the dataset, hence the min operation.
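The score calculation above can be sketched as follows. This is a minimal illustration, not the proposed implementation; the class and method names are hypothetical, and a negative return value is used here to stand in for the "undefined" case of an idle operator.

```java
// Sketch of the proposed data skew score using the average-absolute-deviation
// variant of the Coefficient of Variation described above.
// Class and method names are illustrative only.
public final class DataSkewScore {

    /**
     * Returns the skew score in percent (0..100), or -1 when the mean is
     * zero (idle operator, score undefined).
     */
    public static double score(long[] recordsPerSubtask) {
        double sum = 0;
        for (long r : recordsPerSubtask) {
            sum += r;
        }
        double mean = sum / recordsPerSubtask.length;
        if (mean == 0) {
            return -1; // undefined for an idle operator
        }
        double absDevSum = 0;
        for (long r : recordsPerSubtask) {
            absDevSum += Math.abs(r - mean);
        }
        double avgAbsDev = absDevSum / recordsPerSubtask.length;
        // Cap at 100% since the ratio can exceed 1 for highly skewed data.
        return Math.min(100.0 * avgAbsDev / mean, 100.0);
    }
}
```

Exact rounding of the percentages may differ slightly depending on how the average absolute deviation is computed, but the relative ordering of operators by skew is the same.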

The above metric seems to yield reasonable results. Examples:

Number of records received by sub-tasks within an operator | Skewness
1 1 1 5 10 (i.e. five subtasks; the first three receive 1 record each, the last two receive 5 and 10 records) | 88%
5 5 5 5 5 | 0%
1 1 5 5 5 | 54%
4 5 5 5 5 | 7%
0 0 0 0 0 | Undefined (idle operator)


The accumulation of the "received number of records" over a long period of time can hide a recent data skew event, and can equally hide a recent fix to an existing data skew problem. Therefore the proposed metric will need to look at the change in the received number of records within a period, similar to the existing "Backpressure" and "Busy" metrics on the Flink Job Graph.
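The windowing described above can be sketched by scoring the per-period delta of the cumulative counters rather than the lifetime totals. This is an illustrative sketch only; the class name, the sampling mechanism, and the use of -1 for the undefined case are assumptions, not part of the proposal.

```java
import java.util.Arrays;

// Sketch: score the change in "records received" within a sampling period,
// so a recent skew event (or a recent fix) is not masked by old totals.
// Names are hypothetical.
public final class WindowedSkewTracker {

    private long[] previousTotals;

    /** Called once per sampling period with the current cumulative counts. */
    public double update(long[] currentTotals) {
        long[] deltas;
        if (previousTotals == null) {
            // First sample: no previous snapshot, use the totals as-is.
            deltas = currentTotals.clone();
        } else {
            deltas = new long[currentTotals.length];
            for (int i = 0; i < currentTotals.length; i++) {
                deltas[i] = currentTotals[i] - previousTotals[i];
            }
        }
        previousTotals = currentTotals.clone();
        return scoreOf(deltas);
    }

    // Same average-absolute-deviation score as above, applied to the deltas.
    private static double scoreOf(long[] counts) {
        double mean = Arrays.stream(counts).average().orElse(0);
        if (mean == 0) {
            return -1; // undefined: no records received in this period
        }
        double avgAbsDev = Arrays.stream(counts)
                .mapToDouble(c -> Math.abs(c - mean))
                .average().orElse(0);
        return Math.min(100.0 * avgAbsDev / mean, 100.0);
    }
}
```

With this sketch, an operator whose subtasks were balanced for a long time but recently started receiving records on only one subtask shows a high score immediately, even though its lifetime totals are still fairly even.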

See the "rejected alternatives" section for other metrics that were considered.

...

  • Given a job with no or close to 0 data skew, all operators show a data skew score of 0% or a figure close to 0%
  • Given a job with an operator that is suffering from data skew of about 50%, the figure is accurately reflected on the operator in the Flink job graph
  • Above scenarios are tested under the new Data skew tab
    • Operators are sorted according to their data skew score
  • Impact on UI
    • Page load speed comparison (there should not be extra delay to the job graph page load time). If delay turns out to be unavoidable, this should be documented as a side-effect
    • Monitor page memory usage and ensure there is no memory leak or memory usage is not noticeably higher than before. If noticeable increase in memory is unavoidable, document this as a side-effect

Data Skew Score Tests

The following behaviour will be unit and end-to-end tested:

...