...
The above metric seems to yield reasonable results. Examples:
Number of records received by each subtask of an operator | Skewness |
---|---|
1 1 1 5 10 (i.e. five subtasks: the first three each receive 1 record and the last two receive 5 and 10 records respectively) | 88% |
5 5 5 5 5 | 0% |
1 1 5 5 5 | 54% |
4 5 5 5 5 | 7% |
0 0 0 0 0 | 0% (idle operator) |
Accumulating the "received number of records" over a long period of time can hide a recent data skew event. It can equally hide a recent fix to an existing data skew problem. The proposed metric therefore needs to look at the change in the received number of records within a recent period, similar to the existing "Backpressure" and "Busy" metrics on the Flink Job Graph.
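
To illustrate why the metric should work on per-period changes rather than lifetime totals, here is a minimal Python sketch. It is not Flink code: the function name `records_in_window`, the snapshot lists, and the handling of counter resets are assumptions made for this example only. It takes two snapshots of the cumulative per-subtask record counters and returns the number of records each subtask received in between.

```python
from typing import List, Sequence


def records_in_window(previous: Sequence[int], current: Sequence[int]) -> List[int]:
    """Per-subtask records received between two snapshots of cumulative counters.

    Hypothetical helper for illustration; not part of any Flink API.
    """
    if len(previous) != len(current):
        raise ValueError("both snapshots must cover the same subtasks")
    # Assumption: a counter that went backwards (e.g. after a restart) is
    # treated as having received 0 records in this window.
    return [max(0, cur - prev) for prev, cur in zip(previous, current)]


# Lifetime totals look almost perfectly balanced across the five subtasks...
snapshot_before = [1_000_000, 1_000_000, 1_000_000, 1_000_000, 1_000_000]
snapshot_after = [1_000_001, 1_000_001, 1_000_001, 1_000_005, 1_000_010]

# ...but the records received within the window are clearly uneven.
window_counts = records_in_window(snapshot_before, snapshot_after)
print(window_counts)
```

In this made-up example the lifetime totals differ by only a handful of records out of a million, so a score computed on them would report almost no skew. The windowed counts, `1 1 1 5 10`, match the first row of the table above, which is the input the skew score would need in order to surface the recent imbalance.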
...