...

In addition, the proposed system should automatically detect how the GPUs in a single machine are connected (topology awareness) and use this knowledge to generate balanced binary trees of maximum weight. This way, the proposed system will work out of the box for any network topology, not only that of the p3.16xlarge. The characteristics we want are:

Balanced, because the binary tree’s height determines the latency of the Reduce and Broadcast operations.
Maximum weight, because we want to maximize use of the highest bandwidth connections.
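The two characteristics above can be illustrated with a small sketch. It is not the algorithm of this proposal: it is a simple greedy heuristic, written under the assumptions that the topology is given as a symmetric matrix of link weights (higher means more bandwidth, 0 means no direct link) and that the number of GPUs is a power of two. The function name greedy_reduction_tree and the weight matrix are hypothetical. At every level the heaviest remaining links are paired first, and one GPU of each pair stays active, so the tree height is log2(n).

```python
from itertools import combinations

def greedy_reduction_tree(weights, root=0):
    """Greedy sketch: return a list of levels, each a list of (receiver, sender).

    `weights` is an assumed symmetric link-weight matrix; greedy pairing is a
    heuristic and is not guaranteed to find the maximum-weight tree.
    """
    n = len(weights)
    assert n > 0 and n & (n - 1) == 0, "sketch assumes a power-of-two GPU count"
    active = list(range(n))
    levels = []
    while len(active) > 1:
        # Candidate links among still-active GPUs, heaviest first.
        edges = sorted(combinations(active, 2),
                       key=lambda e: weights[e[0]][e[1]], reverse=True)
        paired, level = set(), []
        for a, b in edges:
            if a in paired or b in paired:
                continue
            # Keep the root active on every level; otherwise keep the lower id.
            recv, send = (b, a) if b == root else (a, b)
            level.append((recv, send))
            paired.update((a, b))
        levels.append(level)
        # Halving the active set each level keeps the tree balanced.
        active = sorted(recv for recv, _ in level)
    return levels

# Hypothetical 4-GPU topology: links 0-1 and 2-3 are fast, 0-2 and 1-3 slower.
W = [[0, 2, 1, 0],
     [2, 0, 0, 1],
     [1, 0, 0, 2],
     [0, 1, 2, 0]]
tree = greedy_reduction_tree(W)
```

For this example matrix, the first level pairs GPUs over the two weight-2 links and the second level joins the two survivors over a weight-1 link. Reduce would walk the levels in order; Broadcast would walk them in reverse with receiver and sender swapped.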

[Figure 4 panels: (a) where it fits in KVStore; (b) InitMergeBuffersAndComm; (c) Reduce; (d) Broadcast]

Figure 4. Block diagram of proposed addition. Changes to old initialization (InitMergeBuffersAndComm), Reduce and Broadcast are illustrated.

Note: An additional memory copy to a temporary buffer (temp) is necessary in both Reduce and Broadcast, because the final destination buffer dst is not known at the time of Reduce. Due to the interface of kvstore::Push() (i.e. the user-exposed method that calls Reduce), this information is not available until kvstore::Pull() (i.e. until Broadcast). This means that unless the API is changed to PushPull (which takes both source and destination arguments), we need one extra write to the temp buffer in Reduce and another extra write to the temp buffer in Broadcast. Fortunately, these writes stay on the same GPU, so they do not take a significant amount of time (less than 1% of the runtime). Interestingly, this is the same API change that the other All-Reduce related proposal is asking for.
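The difference between the two interfaces can be sketched with a toy model. This is an illustration only, not the KVStore implementation: Python lists stand in for GPU buffers, and the class ToyKVStore, its methods, and the extra_copies counter are all hypothetical names introduced here. The counter tracks only the extra temp writes described in the note.

```python
class ToyKVStore:
    """Toy model of the note above; lists stand in for GPU buffers."""

    def __init__(self):
        self.temp = {}          # per-key temporary buffer
        self.extra_copies = 0   # counts the extra writes to temp

    @staticmethod
    def _reduce(srcs):
        # Element-wise sum across all source buffers.
        return [sum(vals) for vals in zip(*srcs)]

    def push(self, key, srcs):
        # dst is unknown at this point, so the result is parked in temp.
        self.temp[key] = self._reduce(srcs)
        self.extra_copies += 1

    def pull(self, key, dsts):
        # Broadcast side: stage through temp again, then write each dst.
        staged = list(self.temp[key])
        self.extra_copies += 1
        for dst in dsts:
            dst[:] = staged

    def push_pull(self, key, srcs, dsts):
        # Source and destination are both known: no temp staging is needed.
        reduced = self._reduce(srcs)
        for dst in dsts:
            dst[:] = reduced

srcs = [[1, 2], [3, 4]]

kv = ToyKVStore()
dsts = [[0, 0], [0, 0]]
kv.push("w", srcs)
kv.pull("w", dsts)

kv2 = ToyKVStore()
dsts2 = [[0, 0], [0, 0]]
kv2.push_pull("w", srcs, dsts2)
```

In this model the separate push and pull calls produce the same result as push_pull but go through temp twice, which mirrors the two extra same-GPU writes mentioned in the note.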