...

  • hvd.allreduce

  • hvd.broadcast_parameters

  • hvd.local_rank

  • hvd.rank

  • hvd.local_size

  • hvd.size

  • hvd.DistributedOptimizer

  • hvd.BroadcastVariablesHook
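To make the four identity calls concrete, here is a minimal sketch in plain Python that does not import Horovod. It assumes every node runs the same number of processes and that ranks are assigned contiguously per node; real launchers may place processes differently. The names `WorkerIds` and `worker_ids` are hypothetical and exist only for this illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class WorkerIds:
    """Hypothetical container mirroring what the four identity calls report."""
    rank: int        # global index of this process (cf. hvd.rank)
    size: int        # total number of processes (cf. hvd.size)
    local_rank: int  # index of this process on its own node (cf. hvd.local_rank)
    local_size: int  # number of processes on this node (cf. hvd.local_size)


def worker_ids(rank, num_nodes, procs_per_node):
    """Derive the identifiers for one process.

    Assumption: uniform process count per node and contiguous rank placement.
    """
    size = num_nodes * procs_per_node
    if not 0 <= rank < size:
        raise ValueError("rank must be in [0, size)")
    return WorkerIds(
        rank=rank,
        size=size,
        local_rank=rank % procs_per_node,
        local_size=procs_per_node,
    )


# Two nodes with four processes each: global rank 5 is the second process on node 1.
example = worker_ids(5, num_nodes=2, procs_per_node=4)
print(example)
```

In practice the local rank is typically used to pick the GPU for a process, while the global rank decides which process acts as the root for a broadcast.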

The following two key calls from the Horovod API warrant additional explanation:

...

(a) hvd.allreduce    (b) hvd.broadcast_parameters

Figure 1. How two key Horovod operations are implemented using Horovod API
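The semantics of the two operations in Figure 1 can be sketched with a single-process simulation. This is not Horovod code: each worker's tensor is modelled as a plain Python list, and the functions `allreduce` and `broadcast` below are stand-ins written for this illustration. The `average` flag is an assumption of the sketch, chosen because averaging is a common choice for gradients.

```python
def allreduce(per_worker, average=True):
    """Combine one tensor per worker and hand every worker the same result.

    per_worker: list of equally sized lists, one per simulated worker.
    """
    size = len(per_worker)
    # Element-wise sum across workers.
    combined = [sum(values) for values in zip(*per_worker)]
    if average:
        combined = [value / size for value in combined]
    # Every worker ends up with its own copy of the combined tensor.
    return [list(combined) for _ in range(size)]


def broadcast(per_worker, root_rank=0):
    """Overwrite every worker's tensor with the root worker's tensor."""
    root_value = per_worker[root_rank]
    return [list(root_value) for _ in per_worker]


# Three simulated workers, each holding a two-element gradient.
grads = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
print(allreduce(grads))                 # each worker sees [3.0, 4.0]
print(allreduce(grads, average=False))  # each worker sees [9.0, 12.0]

# Workers start with different parameters; the root's copy wins.
params = [[0.5, 0.5], [0.0, 0.0], [9.0, 9.0]]
print(broadcast(params, root_rank=0))   # each worker sees [0.5, 0.5]
```

The allreduce keeps gradients consistent across workers at every step, while the broadcast is typically needed only once, so that all workers start from identical parameters.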

...

This behaviour is the same as when doing pip install horovod for TensorFlow and PyTorch support. When those libraries are not present, Horovod installation will fail.



Figure 2. How Horovod interacts with MXNet engine

...

Code Changes to MXNet Repository

We need to:

  1. Introduce the new C API methods MXWaitforHorovodAllreduce and MXWaitforHorovodBroadcast, and modify Module.fit to take a list of hvd.BroadcastVariableHooks

  2. Make a folder called "include" that will be located in the same folder as the MXNet pip package. This will allow Horovod to call into MXNet for both the NDArray operations and the MXWaitForHorovod call that tells the MXNet engine which NDArrays must be locked for the duration of the Allreduce and Broadcast.

...

  • Proof of concept that MXNet can be made to work with Horovod

  • Code reuse (~90% of the code from the prototype will go into the final design)

  • Fast prototyping possible due to sidestepping challenges such as:

    • Compilation of Horovod pip package separate from MXNet, which requires some arcane mechanisms such as CFFI

    • Building a DistributedOptimizer class that wraps the Optimizer class in MXNet

    • Adding a hook to Module.fit to support calling hvd.broadcast at the time of kv.init in MXNet

Our prototype uses the KVStore API to call the Horovod backend. We expose a new KVStore class that the user can select.
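The idea of a user-selectable KVStore class that delegates to a communication backend can be sketched as follows. This is a toy, single-process illustration and not the prototype's code: `FakeBackend`, `ToyHorovodKVStore`, the `register`/`create` helpers, and the `"toy_horovod"` name are all hypothetical, and the backend simply adds fixed peer contributions instead of communicating.

```python
_REGISTRY = {}


def register(name):
    """Hypothetical registry decorator so a store type can be selected by name."""
    def decorator(cls):
        _REGISTRY[name] = cls
        return cls
    return decorator


def create(name, **kwargs):
    """Instantiate the store type registered under `name`."""
    return _REGISTRY[name](**kwargs)


class FakeBackend:
    """Stands in for the communication backend; peers' values are fixed up front."""

    def __init__(self, peer_values):
        self.peer_values = peer_values  # key -> list of tensors from other workers

    def allreduce(self, key, local):
        contributions = [local] + self.peer_values.get(key, [])
        # Element-wise sum of the local tensor and all peer tensors.
        return [sum(values) for values in zip(*contributions)]


@register("toy_horovod")
class ToyHorovodKVStore:
    """KVStore-style facade: push triggers an allreduce, pull reads the result."""

    def __init__(self, backend):
        self._backend = backend
        self._store = {}

    def init(self, key, value):
        self._store[key] = list(value)

    def push(self, key, value):
        self._store[key] = self._backend.allreduce(key, value)

    def pull(self, key):
        return list(self._store[key])


# One simulated peer contributes [10.0, 20.0] for key "w".
kv = create("toy_horovod", backend=FakeBackend({"w": [[10.0, 20.0]]}))
kv.init("w", [0.0, 0.0])
kv.push("w", [1.0, 2.0])
print(kv.pull("w"))  # [11.0, 22.0]
```

Hiding the backend behind the existing push/pull interface means training code that already uses a KVStore needs no changes beyond choosing the new store type.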

...