...
Rejected data skew score metrics
Metric 1:
Code Block |
---|
(max(list_of_number_of_records_received_by_each_subtask) - min(list_of_number_of_records_received_by_each_subtask)) / sum(list_of_number_of_records_received_by_each_subtask) |
Example scenarios:
- Given 10 sub-tasks, 1 of which has received only 1 record. Each of the other 9 sub-tasks has received 10 records.
Data skew score: (10 - 1) / (9*10 + 1*1) = 10%
- Given 10 sub-tasks. 5 sub-tasks have each received only 1 record. Each of the other 5 sub-tasks has received 10 records.
Data skew score: (10 - 1) / (5*1 + 5*10) = 16%
- Given 10 sub-tasks. 5 sub-tasks have each received only 1 record. Each of the other 5 sub-tasks has received 50 records.
Data skew score: (50 - 1) / (5*1 + 5*50) = 19%
This metric gives somewhat low scores for what I'd consider significant skew. For instance, in the last example, half of the sub-tasks received 50 times fewer records than the other half, yet the score is only 19%.
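To make it easier to re-check the Metric 1 scenarios, here is a minimal Python sketch that evaluates the formula exactly as written above. The function name `skew_score_range_over_sum` and the zero-record guard are assumptions made for illustration; they are not part of the proposal.

```python
def skew_score_range_over_sum(records_per_subtask):
    """Metric 1: (max - min) / sum over the per-subtask record counts."""
    total = sum(records_per_subtask)
    if total == 0:
        # Assumption: with no records received anywhere, report no skew.
        return 0.0
    return (max(records_per_subtask) - min(records_per_subtask)) / total


# The three example scenarios, as lists of records received per sub-task.
scenarios = [
    [1] + [10] * 9,      # 1 sub-task with 1 record, 9 with 10 records
    [1] * 5 + [10] * 5,  # 5 sub-tasks with 1 record, 5 with 10 records
    [1] * 5 + [50] * 5,  # 5 sub-tasks with 1 record, 5 with 50 records
]

for counts in scenarios:
    print(f"{counts} -> {skew_score_range_over_sum(counts):.1%}")
```

Because the denominator is the total record count, the score shrinks as more sub-tasks contribute records, which is one way to see why it stays low even for heavily skewed inputs.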
Metric 2:
Code Block |
---|
min(list_of_number_of_records_received_by_each_subtask) / max(list_of_number_of_records_received_by_each_subtask) |
- Given 10 sub-tasks, 1 of which has received only 1 record. Each of the other 9 sub-tasks has received 10 records.
Data skew score: 10%
- Given 10 sub-tasks. 5 sub-tasks have each received only 1 record. Each of the other 5 sub-tasks has received 10 records.
Data skew score: 10%
- Given 10 sub-tasks. 5 sub-tasks have each received only 1 record. Each of the other 5 sub-tasks has received 50 records.
Data skew score: 2%
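The Metric 2 scenarios can be reproduced with a similarly small Python sketch. The function name `skew_score_min_over_max` and the handling of an all-zero input are assumptions for illustration only.

```python
def skew_score_min_over_max(records_per_subtask):
    """Metric 2: min / max over the per-subtask record counts."""
    largest = max(records_per_subtask)
    if largest == 0:
        # Assumption: no sub-task received anything, so treat the
        # distribution as perfectly balanced (ratio of 1).
        return 1.0
    return min(records_per_subtask) / largest


scenarios = [
    [1] + [10] * 9,      # 1 sub-task with 1 record, 9 with 10 records
    [1] * 5 + [10] * 5,  # 5 sub-tasks with 1 record, 5 with 10 records
    [1] * 5 + [50] * 5,  # 5 sub-tasks with 1 record, 5 with 50 records
]

for counts in scenarios:
    print(f"{skew_score_min_over_max(counts):.0%}")  # prints 10%, 10%, 2%
```

Two properties follow directly from the formula: a perfectly balanced distribution yields 100%, so a lower value means more skew, and the first two scenarios get the same score because only the smallest and largest counts enter the ratio.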
...