Rejected data skew score metrics

Metric 1:

Code Block
(max(list_of_number_of_records_received_by_each_subtask) - min(list_of_number_of_records_received_by_each_subtask)) / sum(list_of_number_of_records_received_by_each_subtask)

Example scenarios:

Given 10 sub-tasks, 1 of which has received only 1 record. The other 9 sub-tasks have each received 10 records.
Data skew score: (10 - 1) / (9*10 + 1*1) = 9 / 91 ≈ 10%
Given 10 sub-tasks. 5 sub-tasks have each received only 1 record. Each of the other 5 sub-tasks has received 10 records.
Data skew score: (10 - 1) / (5*1 + 5*10) = 9 / 55 ≈ 16%
Given 10 sub-tasks. 5 sub-tasks have each received only 1 record. Each of the other 5 sub-tasks has received 50 records.
Data skew score: (50 - 1) / (5*1 + 5*50) = 49 / 255 ≈ 19%

This metric gives somewhat low scores for what I'd consider significant skew. For instance, in the last example half of the sub-tasks received 50 times fewer records than the other half, yet the score is only about 19%.
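
To make Metric 1 easy to re-check, here is a minimal Python sketch that evaluates the formula for the three scenarios above. The function name `skew_score_range_over_sum` and the choice to return 0 when no records have been received are assumptions made for illustration; they are not part of the proposal.

```python
def skew_score_range_over_sum(records_per_subtask):
    """Metric 1: (max - min) / sum over the per-sub-task record counts."""
    total = sum(records_per_subtask)
    if total == 0:
        # Assumption: if no sub-task has received any record, report no skew.
        return 0.0
    return (max(records_per_subtask) - min(records_per_subtask)) / total


# The three example scenarios, each with 10 sub-tasks.
scenarios = {
    "1 sub-task with 1 record, 9 with 10 records": [1] + [10] * 9,
    "5 sub-tasks with 1 record, 5 with 10 records": [1] * 5 + [10] * 5,
    "5 sub-tasks with 1 record, 5 with 50 records": [1] * 5 + [50] * 5,
}

for label, counts in scenarios.items():
    print(f"{label}: {skew_score_range_over_sum(counts):.1%}")
```

Because the denominator is the total record count, the score shrinks as more sub-tasks carry a large load, which is one way to see why heavily skewed inputs still produce small values.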

Metric 2:

Code Block
min(list_of_number_of_records_received_by_each_subtask) / max(list_of_number_of_records_received_by_each_subtask)

Given 10 sub-tasks, 1 of which has received only 1 record. The other 9 sub-tasks have each received 10 records.
Data skew score: 1 / 10 = 10%
Given 10 sub-tasks. 5 sub-tasks have each received only 1 record. Each of the other 5 sub-tasks has received 10 records.
Data skew score: 1 / 10 = 10%
Given 10 sub-tasks. 5 sub-tasks have each received only 1 record. Each of the other 5 sub-tasks has received 50 records.
Data skew score: 1 / 50 = 2%
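
The Metric 2 scores can be reproduced with the short Python sketch below. The function name `skew_score_min_over_max` and the handling of the case where the maximum is 0 (treated as perfectly balanced, ratio 1) are assumptions made for illustration.

```python
def skew_score_min_over_max(records_per_subtask):
    """Metric 2: min / max over the per-sub-task record counts."""
    largest = max(records_per_subtask, default=0)
    if largest == 0:
        # Assumption: with no records anywhere, all sub-tasks are equal,
        # so report the same ratio as a perfectly balanced input.
        return 1.0
    return min(records_per_subtask) / largest


print(f"{skew_score_min_over_max([1] + [10] * 9):.0%}")      # 10%
print(f"{skew_score_min_over_max([1] * 5 + [10] * 5):.0%}")  # 10%
print(f"{skew_score_min_over_max([1] * 5 + [50] * 5):.0%}")  # 2%
```

Two properties are visible from these runs: the ratio looks only at the two extreme sub-tasks, so the first two scenarios are indistinguishable, and a lower value corresponds to a larger gap between the least and most loaded sub-task.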
