Versions Compared

Key

  • This line was added.
  • This line was removed.
  • Formatting was changed.

...

  1. Double NVLink connections (50 GB/s)
  2. Single NVLink connections (25 GB/s)
  3. PCI-E connection + CPU (10 GB/s)

Image Added

                                                        (a) 2D view                                                                              (b) 3D view

Figure 3. Link topology of p3.16xlarge instance.

...

Note: Additional memory copies to temporary buffer (temp) is necessary following Reduce and Broadcast, because we do not know the final destination buffer dst at this time. However due to the interface of kvstore::Push() (i.e. the user-exposed method that calls Reduce), this information is not known until kvstore::Pull() (i.e. until Broadcast). This means that unless the API is changed to PushPull (which has both source and destination arguments), we will need to do extra write to temp buffer in Reduce, and an extra write to temp buffer in Broadcast. The good thing is that these writes are on the same GPU, so they do not take significant amount of time (less than 1% of the runtime). Interestingly, this is the same API change that the other All-Reduce related proposal is asking for.

Technical Challenges

As shown in Figure 5, the design can be broken into two tasks:

  1. Using binary tree topology_ to do a reduce
  2. Generating binary tree topology_

1. Use Binary Tree to do Reduce

First, given a binary tree we show that it gives us the precise sequence of GPU sends in order to perform a reduce. On the left will be the link topology in 3D view. On the right is the portion of the binary tree we are interested in.

Image Added

(a) Step 1

Image Added

(b) Step 2

Image Added

(c) Step 3
Image Added

(d) On link topology (left), the sequence of sends forms a spanning tree. The operations also correspond to a binary tree (right).

Figure 6. Reduce on a single binary tree, showing connection between reduce and binary tree.

2. Generate Binary Tree

We looked at using two methods to generate the binary tree. Since we are looking for a balanced binary tree instead of a maximum spanning tree, we cannot use polynomial time algorithms such as Kruskal's or Boruvka's MST algorithm. Using the link topology saved in an adjacency matrix as input, we used:

...