...

[Attached figures: nccl.pdf, allreduce.pdf]

(a) Ring algorithm used by NCCL    (b) Parameter server algorithm

...

Figure 2 explains the end-to-end performance results (see Figure 8): the parameter server is faster for networks that require many Push-Pulls over relatively small keys (e.g. ResNet-50 and Inception-v3 need over 157 Push-Pulls on keys no larger than 2M floats), whereas NCCL ring Reduce is faster for networks that only need Push-Pulls over a few keys (e.g. VGG-16 and AlexNet need fewer than 32 Push-Pulls, on keys exceeding 10M floats in size).

...

  1. Double NVLink connections (50 GB/s)
  2. Single NVLink connections (25 GB/s)
  3. PCI-E connection + CPU (10 GB/s)

(a) 2D view    (b) 3D view (PCI-E not shown)

Figure 3. Link topology of p3.16xlarge instance.
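For illustration, the tree-generation step described later takes the link topology encoded as a weighted adjacency matrix (3 = double NVLink, 2 = single NVLink, as noted in the link usage penalty section below; the PCI-E weight of 1 is our assumption). The wiring below is a hypothetical 4-GPU example, not the actual p3.16xlarge topology:

```python
import numpy as np

# Hypothetical 4-GPU wiring for illustration only (NOT the real
# p3.16xlarge topology). Weights: 3 = double NVLink (50 GB/s),
# 2 = single NVLink (25 GB/s), 1 = PCI-E + CPU (10 GB/s, assumed).
# The matrix is symmetric because every link is bidirectional.
link_topology = np.array([
    [0, 3, 2, 1],
    [3, 0, 1, 2],
    [2, 1, 0, 3],
    [1, 2, 3, 0],
])
```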

...

Both methods are shown in Figure 4.


[Attached figures: tree.pdf, multi-tree.pdf]

(a) Single tree algorithm    (b) Multiple tree algorithm

Figure 4. Proposed Reduce and Broadcast algorithms.

...

  • Balanced, because the binary tree's height determines the latency of the Reduce and Broadcast operations.
  • Maximum weight, because we want to maximize use of the highest-bandwidth connections (a minimal sketch of this objective follows below).
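To make the objective concrete, here is a minimal Python sketch (hypothetical helpers, not the MXNet implementation). A candidate tree, given as its (sender, receiver) edges plus the number of sequential steps it needs, is acceptable only if it is as short as a balanced binary tree; among acceptable candidates we maximize the summed link weight:

```python
import math

def is_balanced(num_gpus, num_steps):
    """Latency equals the tree height, so we require the minimum
    possible number of sequential steps: ceil(log2(num_gpus))."""
    return num_steps == math.ceil(math.log2(num_gpus))

def total_weight(edges, weights):
    """Sum of link weights over the tree's (sender, receiver) edges;
    maximizing this favors double NVLink (3) over single NVLink (2)."""
    return sum(weights[s][r] for s, r in edges)
```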


(a) Where it fits in KVStore    (b) InitMergeBuffersAndComm
(c) Reduce    (d) Broadcast

Figure 5. Block diagram of proposed addition. Changes to the old initialization (InitMergeBuffersAndComm), Reduce and Broadcast are illustrated.

Note: An additional memory copy to a temporary buffer (temp) is necessary following Reduce and preceding Broadcast, because the final destination buffer dst is not yet known at that point. Due to the interface of kvstore::Push() (i.e. the user-exposed method that calls Reduce), this information is not available until kvstore::Pull() (i.e. until Broadcast). This means that unless the API is changed to PushPull (which has both source and destination arguments), we need an extra copy to the temp buffer in Reduce and another in Broadcast. The good news is that these copies happen on the same GPU, so they do not take a significant amount of time (less than 1% of the runtime). Interestingly, this is the same API change that the other All-Reduce related proposal asks for.
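For reference, the constraint is visible in ordinary KVStore usage from Python: the destination buffers first appear as the out argument of pull(), after push() has already performed the Reduce.

```python
import mxnet as mx

kv = mx.kv.create('device')    # single-machine KVStore; 'device' reduces on GPU
shape = (1024,)
kv.init(3, mx.nd.ones(shape))

# Push: Reduce runs here, but only the source buffers are known.
grads = [mx.nd.ones(shape, ctx=mx.gpu(i)) for i in range(4)]
kv.push(3, grads)

# Pull: the destination buffers (dst) are revealed only here, which is
# why the reduced result has to sit in a temp buffer in between.
outs = [mx.nd.zeros(shape, ctx=mx.gpu(i)) for i in range(4)]
kv.pull(3, out=outs)
```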

Technical Challenges

As shown in Figure 5, the design can be broken into two tasks:

  1. Using a binary tree topology to do a Reduce
  2. Generating the binary tree topology

1. Use Binary Tree to do Reduce

First, we show that a given binary tree specifies the precise sequence of GPU-to-GPU sends needed to perform a Reduce. In each step below, the link topology is shown on the left in 3D view; on the right is the portion of the binary tree we are interested in.


(a) Step 1: GPU1 sends to GPU5, where the combined results of GPU1 and GPU5 are reduced on GPU5. (The other GPU pairs perform the same pairwise reduction in parallel, which is how GPU3's result reaches GPU7 by Step 2.)


(b) Step 2: GPU7 sends to GPU5, where the combined results of GPU1, GPU3, GPU5 and GPU7 are reduced on GPU5.


(c) Step 3: GPU4 sends to GPU5, where the combined results of all 8 GPUs are reduced on GPU5.

(d) On the link topology (left), the sequence of sends forms a spanning tree. The operations also correspond to a binary tree (right).

Figure 6. Reduce on a single binary tree, showing connection between reduce and binary tree.
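A minimal Python sketch of this correspondence (illustrative, not the MXNet code): place the GPUs at the leaves of a complete binary tree and pair them level by level. With a suitable leaf order (the exact permutation used in Figure 6 is not spelled out here, so the one below is an assumption), the printed schedule reproduces the three steps above:

```python
def reduce_schedule(leaves):
    """Per-level (sender, receiver) pairs for a tree Reduce over GPUs
    placed at the leaves of a complete binary tree (len(leaves) must be
    a power of two). The first GPU of each pair accumulates the partial
    sum and survives to the next level."""
    schedule, level = [], list(leaves)
    while len(level) > 1:
        schedule.append([(level[i + 1], level[i])
                         for i in range(0, len(level), 2)])
        level = level[::2]   # accumulators move up one level
    return schedule

for step, pairs in enumerate(reduce_schedule([5, 1, 7, 3, 4, 0, 6, 2]), 1):
    print(f"step {step}: " + ", ".join(f"GPU{s}->GPU{r}" for s, r in pairs))
# step 1: GPU1->GPU5, GPU3->GPU7, GPU0->GPU4, GPU2->GPU6
# step 2: GPU7->GPU5, GPU6->GPU4
# step 3: GPU4->GPU5
```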

2. Generate Binary Tree

We considered two methods for generating the binary tree. Since we are looking for a balanced maximum-weight binary tree rather than a maximum spanning tree, we cannot use polynomial-time algorithms such as Kruskal's or Borůvka's MST algorithms. Using the link topology saved in an adjacency matrix as input, we used:

...

This worked well most of the time. However, when generating such a tree for 6 GPUs, we noticed that the procedure sometimes gets stuck: no edge can be found to link two such clusters. In such cases, we resorted to exhaustive search.
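The exhaustive search itself is not spelled out in this section; one hedged sketch is a brute-force scan over leaf orderings of the binary tree, scoring each by total link weight. The odd-level handling below (an unpaired GPU simply passes through) is our assumption for GPU counts that are not powers of two, such as 6:

```python
from itertools import permutations

def schedule_edges(leaves):
    """(sender, receiver) edges of a balanced tree Reduce; an unpaired
    GPU on an odd-sized level passes through to the next level."""
    edges, level = [], list(leaves)
    while len(level) > 1:
        nxt = []
        for i in range(0, len(level) - 1, 2):
            edges.append((level[i + 1], level[i]))
            nxt.append(level[i])
        if len(level) % 2:
            nxt.append(level[-1])
        level = nxt
    return edges

def best_tree_exhaustive(weights):
    """Try every leaf ordering and keep the heaviest tree. Feasible
    only for small GPU counts (8! = 40320 orderings at n = 8)."""
    n = len(weights)
    return max(permutations(range(n)),
               key=lambda leaves: sum(weights[s][r]
                                      for s, r in schedule_edges(leaves)))
```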

Link usage penalty

Trees are generated sequentially, in the fashion described above. To discourage later trees from reusing links, we apply a multiplicative penalty term MXNET_KVSTORE_TREE_LINK_USAGE_PENALTY (default = 0.7) whenever a link has already been used. The penalty is multiplied into the initial link topology adjacency matrix, in which 3 represents a double NVLink connection and 2 a single NVLink connection.
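A minimal sketch of how the penalty might be applied (illustrative; assumes the adjacency-matrix encoding shown earlier and symmetric links):

```python
import os
import numpy as np

PENALTY = float(os.environ.get('MXNET_KVSTORE_TREE_LINK_USAGE_PENALTY', '0.7'))

def penalize_used_links(weights, tree_edges):
    """Down-weight every link the just-generated tree used, so the next
    tree in the sequence is steered toward different links."""
    w = np.asarray(weights, dtype=float).copy()
    for s, r in tree_edges:
        w[s, r] *= PENALTY
        w[r, s] *= PENALTY   # the topology matrix is symmetric
    return w
```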

When to switch between Single and Multiple tree


(a) Parameter sweep of MXNET_KVSTORE_TREE_ARRAY_BOUND    (b) 1 Push-Pull before Wait    (c) 150 Push-Pulls before Wait

Figure 7. VGG-16 performance as a function of MXNET_KVSTORE_TREE_ARRAY_BOUND using batch size 4 per GPU. These figures show that beyond 1M-10M float32's, multiple trees begin to do better than a single tree.
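In usage terms, the bound could drive the switch as in the sketch below (hedged: we assume the variable is read as a float32 element count, and the 10M default here is an assumption consistent with the 1M-10M crossover above):

```python
import os

ARRAY_BOUND = int(os.environ.get('MXNET_KVSTORE_TREE_ARRAY_BOUND', '10000000'))

def use_multiple_trees(num_elements):
    """Single tree for small keys (latency-bound); multiple trees for
    large keys (bandwidth-bound), per the sweep in Figure 7."""
    return num_elements >= ARRAY_BOUND
```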

Alternative Approaches considered

...


Network         Vs. Parameter Server (in comm.h)    Vs. NCCL (in kvstore_nccl.h)
ResNet-50       1.19                                1.33
VGG-16          5.89                                1.06
Inception-v3    1.15                                1.34
AlexNet         6.60                                1.42

Figure 8. End-to-end training results on synthetic data showing speed-up vs. Parameter Server and vs. NCCL on fp32 and fp16.

...