...
Code Block |
---|
min(average_absolute_deviation(list_of_number_of_records_received_by_each_subtask) / mean(list_of_number_of_records_received_by_each_subtask) * 100, 100) |
Instead of standard deviation, I chose "average absolute deviation", which avoids the multiplication (squaring) and square-root operations. This simplifies the metric calculation while serving the same purpose and yielding similar results.
...
Number of records received by each subtask within an operator | Data Skew Score |
---|---|
1 1 1 5 10 (i.e. five subtasks: the first three each receive 1 record, and the last two receive 5 and 10 records respectively) | 88% |
5 5 5 5 5 | 0% |
1 1 5 5 5 | 54% |
4 5 5 5 5 | 7% |
0 0 0 0 0 | 0% (idle operator) |
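The formula above can be sketched in a few lines of Python. This is only an illustration, not the proposed implementation: the function name `data_skew_percentage` and the parameter `records_per_subtask` are hypothetical, and returning 0 for an idle operator (mean of zero) is an assumption taken from the last row of the table.

```python
from typing import Sequence


def data_skew_percentage(records_per_subtask: Sequence[int]) -> float:
    """Return a data skew score between 0 and 100 for one operator.

    Hypothetical helper: average absolute deviation of the per-subtask
    record counts, divided by their mean, as a percentage capped at 100.
    """
    if not records_per_subtask:
        return 0.0

    count = len(records_per_subtask)
    mean = sum(records_per_subtask) / count

    # Assumption: an idle operator (no records anywhere) scores 0,
    # which also avoids dividing by zero.
    if mean == 0:
        return 0.0

    # Average absolute deviation: no squaring and no square root needed.
    avg_abs_dev = sum(abs(n - mean) for n in records_per_subtask) / count

    return min(avg_abs_dev / mean * 100, 100.0)
```

The cap matters when a single subtask receives nearly everything: for the counts `0 0 0 0 100` the uncapped ratio is 160%, which the `min` reduces to 100. Applied literally, the formula gives about 86.7 for `1 1 1 5 10` and about 56.5 for `1 1 5 5 5`, so the table's figures for those two rows appear to be approximate.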
Proposed metric names:
- dataSkewPercentage: This will be used to show an overall or historical data skew score under the proposed Data Skew tab (see the UI Changes section).
- dataSkewPercentagePerSecond: This will be used to show a "live" score on the Job Graph (see the UI Changes section).
See the "rejected alternatives" section for other metrics that were considered.
...