...

Rejected data skew score metrics

Metric 1:

Code Block
(max(list_of_number_of_records_received_by_each_subtask) - min(list_of_number_of_records_received_by_each_subtask)) / sum(list_of_number_of_records_received_by_each_subtask)

Example scenarios:

  • Given 10 sub-tasks, 1 of which has received only 1 record. The other 9 sub-tasks have each received 10 records.
    Data skew score: (10 - 1) / (9*10 + 1*1) ≈ 10%
  • Given 10 sub-tasks. 5 sub-tasks have each received only 1 record. Each of the other 5 sub-tasks has received 10 records.
    Data skew score: (10 - 1) / (5*1 + 5*10) ≈ 16%
  • Given 10 sub-tasks. 5 sub-tasks have each received only 1 record. Each of the other 5 sub-tasks has received 50 records.
    Data skew score: (50 - 1) / (5*1 + 5*50) ≈ 19%

This metric gives somewhat low scores for what I'd consider significant skew. For instance, in the last example, half of the sub-tasks received 50 times fewer records than the other half, yet the score is only 19%.
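
To make the Metric 1 arithmetic easy to re-check, here is a minimal Python sketch that applies the formula literally to the three scenarios above. The function name `range_over_sum_skew` and the zero-record guard are assumptions made for illustration; they are not part of the proposal.

```python
def range_over_sum_skew(records_per_subtask):
    """Metric 1: (max - min) / sum over the per-subtask record counts."""
    total = sum(records_per_subtask)
    if total == 0:
        # Assumption: with no records received at all, report no skew.
        return 0.0
    return (max(records_per_subtask) - min(records_per_subtask)) / total


# The three example scenarios, as lists of per-subtask record counts.
scenarios = [
    [1] + [10] * 9,       # 1 sub-task with 1 record, 9 with 10 records
    [1] * 5 + [10] * 5,   # 5 sub-tasks with 1 record, 5 with 10 records
    [1] * 5 + [50] * 5,   # 5 sub-tasks with 1 record, 5 with 50 records
]

for counts in scenarios:
    print(f"{counts}: {range_over_sum_skew(counts):.1%}")
```

Because the denominator is the total record count, the score shrinks as the number of sub-tasks grows, which is one way to see why the scores stay low even for heavily skewed inputs.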

Metric 2:

Code Block
min(list_of_number_of_records_received_by_each_subtask) / max(list_of_number_of_records_received_by_each_subtask)

Example scenarios:

  • Given 10 sub-tasks, 1 of which has received only 1 record. The other 9 sub-tasks have each received 10 records.
    Data skew score: 1 / 10 = 10%
  • Given 10 sub-tasks. 5 sub-tasks have each received only 1 record. Each of the other 5 sub-tasks has received 10 records.
    Data skew score: 1 / 10 = 10%
  • Given 10 sub-tasks. 5 sub-tasks have each received only 1 record. Each of the other 5 sub-tasks has received 50 records.
    Data skew score: 1 / 50 = 2%
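
The Metric 2 scores can be reproduced with a similarly small Python sketch. The function name `min_over_max_skew` and the handling of an all-zero input are assumptions for illustration only. Note that, read literally, this ratio equals 1 when every sub-task receives the same number of records and approaches 0 as the imbalance grows, and that it depends only on the two extreme sub-tasks.

```python
def min_over_max_skew(records_per_subtask):
    """Metric 2: min / max over the per-subtask record counts."""
    largest = max(records_per_subtask)
    if largest == 0:
        # Assumption: if no sub-task received anything, treat the load as balanced.
        return 1.0
    return min(records_per_subtask) / largest


# The same three example scenarios as per-subtask record counts.
scenarios = [
    [1] + [10] * 9,       # 1 sub-task with 1 record, 9 with 10 records
    [1] * 5 + [10] * 5,   # 5 sub-tasks with 1 record, 5 with 10 records
    [1] * 5 + [50] * 5,   # 5 sub-tasks with 1 record, 5 with 50 records
]

for counts in scenarios:
    print(f"{counts}: {min_over_max_skew(counts):.0%}")
```

The first two scenarios produce the same score even though one has a single slow sub-task and the other has five, since only the minimum and maximum enter the formula.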

...