
...

[Figures: nccl.pdf, allreduce.pdf]

(a) Ring algorithm used by NCCL (b) Parameter server algorithm

...

Figure 2 explains the end-to-end performance results (see Figure 8): Parameter server is faster for networks that require many Push-Pulls over relatively small keys (e.g. ResNet-50 and Inception-48 need over 157 Push-Pulls on keys not exceeding 2M floats in size), whereas NCCL ring Reduce is faster for networks that need Push-Pulls over only a few large keys (e.g. VGG-16 and AlexNet need fewer than 32 Push-Pulls on keys that exceed 10M in size).

...

Balanced, because the binary tree’s height determines the latency of the Reduce and Broadcast operations.
Maximum weight, because we want to maximize use of the highest bandwidth connections.


(a) where it fits in KVStore (b) InitMergeBuffersAndComm

(c) Reduce (d) Broadcast
Figure 5. Block diagram of proposed addition. Changes to old initialization (InitMergeBuffersAndComm), Reduce and Broadcast are illustrated.

...

This worked well most of the time. However, when trying to find such a tree for 6 GPUs, we noticed that the search sometimes gets stuck because no edge can be found to link two such clusters. In those cases, we resort to exhaustive search.

Link usage penalty

Trees are generated sequentially, as described above. To discourage later trees from using previously used links, we apply a multiplicative penalty term MXNET_KVSTORE_TREE_LINK_USAGE_PENALTY (default = 0.7) whenever a link has been used. The penalty is multiplied into the initial link topology adjacency matrix, in which 3 represents a double NVLink connection and 2 represents a single NVLink connection.
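The penalty step can be illustrated with a minimal Python sketch. It is not the MXNet implementation: the function name `apply_link_usage_penalty`, the 4-GPU topology, and the choice of tree edges are hypothetical, and the sketch assumes the penalty is applied once per link each time a tree uses it.

```python
def apply_link_usage_penalty(weights, tree_edges, penalty=0.7):
    """Return a copy of the adjacency matrix with used links scaled by `penalty`.

    `weights` is a symmetric list-of-lists adjacency matrix
    (3 = double NVLink, 2 = single NVLink, 0 = no link).
    `tree_edges` lists the (u, v) links used by the tree just generated.
    """
    penalized = [row[:] for row in weights]
    for u, v in tree_edges:
        # Scale both directions so the matrix stays symmetric.
        penalized[u][v] *= penalty
        penalized[v][u] *= penalty
    return penalized


# Hypothetical 4-GPU topology, for illustration only.
topology = [
    [0, 3, 2, 0],
    [3, 0, 0, 2],
    [2, 0, 0, 3],
    [0, 2, 3, 0],
]

# Assume the first tree used links 0-1 and 0-2.
first_tree_edges = [(0, 1), (0, 2)]
after_first_tree = apply_link_usage_penalty(topology, first_tree_edges)
```

In this sketch the used double NVLink drops from 3 to roughly 2.1, which is still slightly above an unused single NVLink (2), so a later tree prefers unused double links but does not avoid the used one entirely.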

When to switch between Single and Multiple tree


(a) Parameter sweep of MXNET_KVSTORE_TREE_ARRAY_BOUND (b) 1 Push-Pull before Wait (c) 150 Push-Pulls before Wait
Figure 7. VGG-16 performance as a function of MXNET_KVSTORE_TREE_ARRAY_BOUND using batch size 4 per GPU. These figures show that beyond 1M-10M float32's, multi-tree begins to do better than a single tree.
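The switch between single and multiple trees can be sketched as a simple size threshold. This is an illustrative Python sketch, not the MXNet code: the function name `use_multiple_trees` is hypothetical, the fallback of 10,000,000 elements is an assumption taken from the upper end of the range in the figure (not a confirmed default), and the strict `>` comparison is also assumed.

```python
import os


def use_multiple_trees(num_elements, bound=None):
    """Decide whether an array is large enough to be reduced over multiple trees.

    `num_elements` is the number of float32 elements in the key.
    `bound` overrides the threshold; when omitted, it is read from the
    MXNET_KVSTORE_TREE_ARRAY_BOUND environment variable.
    """
    if bound is None:
        # Assumed fallback value; the real default may differ.
        bound = int(os.environ.get("MXNET_KVSTORE_TREE_ARRAY_BOUND", 10_000_000))
    return num_elements > bound


# Example: a 20M-element key versus a 1M-element key, with an explicit bound.
large_key_uses_multi = use_multiple_trees(20_000_000, bound=10_000_000)
small_key_uses_multi = use_multiple_trees(1_000_000, bound=10_000_000)
```

With the explicit bound above, the 20M-element key goes to multiple trees and the 1M-element key stays on a single tree, which matches the trend the parameter sweep reports.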