...

Balanced, because the binary tree’s height determines the latency of the Reduce and Broadcast operations.
Maximum weight, because we want to maximize use of the highest bandwidth connections.

(a) where it fits in KVStore (b) InitMergeBuffersAndComm

...

Trees are generated sequentially, as described above. To discourage later trees from reusing links that earlier trees already occupy, we apply a multiplicative penalty term MXNET_KVSTORE_TREE_LINK_USAGE_PENALTY (default = 0.7) whenever a link has been used. The penalty is multiplied into that link's entry in the link topology adjacency matrix, whose initial entries are 3 for a double NVLink connection and 2 for a single NVLink connection.
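The penalty step can be sketched as follows. This is a minimal illustration, not the KVStore implementation: the function name `penalize_used_links`, the 4-GPU topology, and the list of links used by the first tree are all hypothetical, and the adjacency matrix is assumed to be symmetric.

```python
import numpy as np

# Mirrors the documented default of MXNET_KVSTORE_TREE_LINK_USAGE_PENALTY.
LINK_USAGE_PENALTY = 0.7


def penalize_used_links(weights, used_links, penalty=LINK_USAGE_PENALTY):
    """Return a copy of `weights` with every used link scaled by `penalty`.

    `weights` is a square adjacency matrix (3 = double NVLink,
    2 = single NVLink, 0 = no link). `used_links` is a list of (i, j)
    GPU pairs that the previously generated tree used. Both directions
    of a link are penalized because the matrix is assumed symmetric.
    """
    out = np.array(weights, dtype=float)
    for i, j in used_links:
        out[i, j] *= penalty
        out[j, i] *= penalty
    return out


# Hypothetical 4-GPU topology, for illustration only.
topology = np.array([
    [0, 3, 2, 0],
    [3, 0, 0, 2],
    [2, 0, 0, 3],
    [0, 2, 3, 0],
])

# Links assumed to have been used by the first tree.
first_tree_links = [(0, 1), (2, 3), (0, 2)]

# Matrix that the next tree would be generated from.
penalized = penalize_used_links(topology, first_tree_links)
```

With the default penalty, a double NVLink that has been used once keeps a weight of about 2.1, which is still slightly above an unused single NVLink at 2. A link therefore has to be reused before it falls below an unused link of the next lower class.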

When to switch between Single and Multiple tree


(a) Parameter sweep of MXNET_KVSTORE_TREE_ARRAY_BOUND (b) 1 Push-Pull before Wait (c) 150 Push-Pulls before Wait
Figure 7. VGG-16 performance as a function of MXNET_KVSTORE_TREE_ARRAY_BOUND using batch size 4 per GPU. These figures show that for arrays larger than roughly 1M to 10M float32 elements, multi-tree begins to outperform a single tree.
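The switch between the two modes can be pictured as a simple size check. The sketch below is an assumption-laden illustration rather than the KVStore code: the function name `choose_reduction_strategy` is hypothetical, the bound is assumed to be compared against the element count of the array, and the example bound of 10M elements is only one point from the range suggested by Figure 7.

```python
def choose_reduction_strategy(num_elements, array_bound):
    """Pick a reduction mode from the array size.

    `array_bound` plays the role of MXNET_KVSTORE_TREE_ARRAY_BOUND.
    Arrays strictly larger than the bound use multiple trees so the
    communication is spread over more links; smaller arrays stay on a
    single tree. Whether the real comparison is strict, or counts
    elements rather than bytes, is an assumption of this sketch.
    """
    if num_elements > array_bound:
        return "multi_tree"
    return "single_tree"


# Example bound chosen for illustration only.
example_bound = 10_000_000

small_layer = choose_reduction_strategy(4_096, example_bound)
large_layer = choose_reduction_strategy(100_000_000, example_bound)
```

Keeping the bound as an explicit parameter makes it easy to replay the kind of parameter sweep shown in panel (a) against a list of layer sizes.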

Alternative Approaches considered

...
