...

In addition, our proposed system should automatically detect how the GPUs in a single machine are connected (topology-awareness) and use this knowledge to generate balanced binary trees of maximum weight. This way, the proposed system will work out of the box for any network topology, not only that of p3.16xlarge. The characteristics we want are:

  • Balanced, because the binary tree’s height determines the latency of the Reduce and Broadcast operations.
  • Maximum weight, because we want to maximize use of the highest bandwidth connections.
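
The two criteria above can be made concrete with a small scoring sketch. It is only an illustration, not the proposal's tree-generation algorithm: the 4-GPU link-weight matrix W, the candidate trees, and the rule of comparing height before weight are all assumptions made for this example.

```python
# Hypothetical symmetric link weights between 4 GPUs (higher = faster link).
# These numbers are made up for illustration; they are not measured values.
W = [
    [0, 2, 1, 1],
    [2, 0, 1, 1],
    [1, 1, 0, 2],
    [1, 1, 2, 0],
]

def tree_height(children, root):
    """Height in edges of the tree rooted at `root` (drives Reduce/Broadcast latency)."""
    kids = children.get(root, [])
    if not kids:
        return 0
    return 1 + max(tree_height(children, k) for k in kids)

def tree_weight(children, link_weight):
    """Sum of link weights over all parent-child edges (use of fast links)."""
    return sum(link_weight[p][c] for p, kids in children.items() for c in kids)

def pick_tree(candidates, root, link_weight):
    """Assumed ranking: prefer the lowest height, then the highest total weight."""
    return min(
        candidates,
        key=lambda name: (
            tree_height(candidates[name], root),
            -tree_weight(candidates[name], link_weight),
        ),
    )

# Candidate trees as parent -> list of children, all rooted at GPU 0.
candidates = {
    "balanced": {0: [1, 2], 2: [3]},     # height 2, uses both weight-2 links
    "chain":    {0: [1], 1: [2], 2: [3]},  # same weight, but height 3
    "weak":     {0: [2, 3], 2: [1]},     # height 2, only weight-1 links
}

best = pick_tree(candidates, 0, W)
print(best, tree_height(candidates[best], 0), tree_weight(candidates[best], W))
```

In this toy setup the chain matches the balanced tree on weight but loses on height, and the "weak" tree matches it on height but loses on weight, which is why both properties are needed.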


[Figure 4 panels: (a) where it fits in KVStore; (b) InitMergeBuffersAndComm; (c) Reduce; (d) Broadcast]
Figure 4. Block diagram of proposed addition. Changes to old initialization (InitMergeBuffersAndComm), Reduce and Broadcast are illustrated.

Note: An additional memory copy to a temporary buffer (temp) is necessary in both Reduce and Broadcast, because the final destination buffer dst is not known at the time of Reduce. Due to the interface of kvstore::Push() (i.e. the user-exposed method that calls Reduce), dst is not known until kvstore::Pull() (i.e. until Broadcast). This means that unless the API is changed to PushPull (which takes both source and destination arguments), we need an extra write to the temp buffer in Reduce and another in Broadcast. Fortunately, these writes stay on the same GPU, so they do not take a significant amount of time (less than 1% of the runtime). Interestingly, this is the same API change that the other All-Reduce related proposal is asking for.
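
The interface issue in the note can be illustrated with a toy model. This is a sketch under stated assumptions, not MXNet's KVStore: the class ToyKVStore, its method signatures, and the temp_writes counter are hypothetical, buffers are plain Python lists, and only the staging write on the Reduce side is modeled.

```python
class ToyKVStore:
    """Hypothetical stand-in for a key-value store; not the MXNet API."""

    def __init__(self):
        self.temp = {}        # staging buffers, one per key
        self.temp_writes = 0  # number of extra writes into temp

    @staticmethod
    def _reduce(srcs):
        # Element-wise sum across all source buffers.
        return [sum(vals) for vals in zip(*srcs)]

    def push(self, key, srcs):
        # Push has no destination argument, so the reduced result
        # has to be staged in temp until a later pull names dst.
        self.temp[key] = self._reduce(srcs)
        self.temp_writes += 1

    def pull(self, key, dsts):
        # Only now are the destination buffers known.
        for dst in dsts:
            dst[:] = self.temp[key]

    def pushpull(self, key, srcs, dsts):
        # Source and destination are both known, so no staging is needed.
        result = self._reduce(srcs)
        for dst in dsts:
            dst[:] = result


srcs = [[1, 2], [3, 4]]

kv_a = ToyKVStore()
dsts_a = [[0, 0], [0, 0]]
kv_a.push("w", srcs)
kv_a.pull("w", dsts_a)

kv_b = ToyKVStore()
dsts_b = [[0, 0], [0, 0]]
kv_b.pushpull("w", srcs, dsts_b)

print(dsts_a, kv_a.temp_writes)
print(dsts_b, kv_b.temp_writes)
```

Both paths leave the same values in the destination buffers; the difference is that the separate push and pull calls go through the staging buffer, while the combined call does not.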

Technical Challenges

Generate Binary Tree

...

Table 1. Peak speed-up on small batch sizes.


Model           Vs. Parameter Server (in comm.h)    Vs. NCCL (in kvstore_nccl.h)
Resnet-50       1.19                                1.33
VGG-16          5.89                                1.06
Inception-v3    1.15                                1.34
AlexNet         6.60                                1.42

Figure 6. End-to-end training results on synthetic data showing speed-up vs. NCCL on fp32 and fp16.

...

Multi-machine Design Proposal

There is another excellent All-Reduce related proposal on this wiki. The problem they are trying to solve differs from ours in two ways:

...