...

  • Balanced, because the binary tree’s height determines the latency of the Reduce and Broadcast operations.
  • Maximum weight, because we want to maximize use of the highest bandwidth connections.


                                    

(a) where it fits in KVStore    (b) InitMergeBuffersAndComm

...

Trees are generated sequentially, as described above. To discourage later trees from reusing links that earlier trees have already used, we apply a multiplicative penalty term MXNET_KVSTORE_TREE_LINK_USAGE_PENALTY (default = 0.7) whenever a link has been used. The penalty is multiplied into that link's entry in the link topology adjacency matrix, whose initial values are 3 for a double NVLink connection and 2 for a single NVLink connection.
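The penalty step can be sketched as follows. This is a minimal illustration, not the KVStore implementation: the function name penalize_used_links, the 4-GPU topology, and the edge list for the first tree are all hypothetical, and the sketch assumes links are undirected, so both symmetric entries of the matrix are scaled.

```python
import os

import numpy as np

# Penalty read from the environment variable named in the text (default 0.7).
PENALTY = float(os.environ.get("MXNET_KVSTORE_TREE_LINK_USAGE_PENALTY", "0.7"))


def penalize_used_links(adjacency, tree_edges, penalty=PENALTY):
    """Return a copy of `adjacency` with every link used by a tree scaled by `penalty`.

    Hypothetical helper for illustration. `tree_edges` is a list of (u, v)
    GPU index pairs that the previously generated tree uses.
    """
    updated = adjacency.astype(float)  # astype returns a new array
    for u, v in tree_edges:
        # Assumption: links are undirected, so scale both directions.
        updated[u, v] *= penalty
        updated[v, u] *= penalty
    return updated


# Hypothetical 4-GPU topology: 3 = double NVLink, 2 = single NVLink, 0 = no link.
topology = np.array([
    [0, 3, 2, 0],
    [3, 0, 0, 2],
    [2, 0, 0, 3],
    [0, 2, 3, 0],
])

# Hypothetical edges chosen by the first tree.
first_tree = [(0, 1), (0, 2), (1, 3)]

after_first_tree = penalize_used_links(topology, first_tree)
print(after_first_tree)
```

After one tree, the used double NVLink between GPUs 0 and 1 carries weight 3 × 0.7 = 2.1, while the unused double NVLink between GPUs 2 and 3 still carries weight 3, so a maximum-weight search for the next tree prefers the unused link.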

When to switch between single and multiple trees


(a) Parameter sweep of MXNET_KVSTORE_TREE_ARRAY_BOUND    (b) 1 Push-Pull before Wait    (c) 150 Push-Pulls before Wait
Figure 7. VGG-16 performance as a function of MXNET_KVSTORE_TREE_ARRAY_BOUND using a batch size of 4 per GPU. These figures show that for arrays beyond 1M-10M float32 elements, multi-tree begins to outperform a single tree.
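The switch implied by Figure 7 amounts to a size threshold on each array. The sketch below is illustrative only: choose_tree_mode is a hypothetical helper, the fallback value of 10M elements is an assumption taken from the upper end of the range in Figure 7, and whether the real comparison is strict or inclusive is not stated here.

```python
import os

# Threshold in number of float32 elements. The 10M fallback is an assumption
# for illustration, taken from the upper end of the range shown in Figure 7.
ARRAY_BOUND = int(os.environ.get("MXNET_KVSTORE_TREE_ARRAY_BOUND", "10000000"))


def choose_tree_mode(num_elements, bound=ARRAY_BOUND):
    """Pick the reduction strategy for one array (hypothetical helper).

    Arrays larger than `bound` elements use multiple trees; smaller arrays
    use a single tree.
    """
    return "multi-tree" if num_elements > bound else "single-tree"


# Hypothetical array sizes, given as element counts.
for size in (1_000, 5_000_000, 50_000_000):
    print(size, choose_tree_mode(size))
```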

Alternative Approaches considered

...