...

  1. Usability - Users do not have to experiment with the number of workers and number of servers to get the best performance out of the box.

  2. Performance - Horovod + TensorFlow has shown 2x the performance of Distributed TensorFlow [1], so we expect similar gains here.

  3. Cost savings - Parameter servers are not needed when using Horovod.

  4. Simplified architecture - Leverage battle-tested libraries such as MPI and NCCL, as well as network optimizations such as RDMA.

  5. Profiler - Horovod has an excellent profiler for finding bottlenecks.

  6. Online learning - Because of its MPI paradigm, Horovod can save checkpoints, which enables online learning and fine-tuning of your model. With a parameter server, saving the Optimizer state located on the servers takes additional work, but with Horovod this feature comes for free. Note: this feature is not currently supported.

  7. Community - Horovod is a way for MXNet to leverage the Deep Learning community for advancements in distributed training, and for increasing MXNet's visibility.

...

Figure 1. How two key Horovod operations are implemented using the Horovod API


To see how Allreduce works, consider how the DistributedOptimizer is used to wrap an MXNet Optimizer class (a usage sketch follows the list):

  1. In every iteration, the DistributedOptimizer wrapper inserts an Allreduce of the gradients before the weight update is done.
  2. This is done by calling hvd.allreduce.
  3. This calls down into the C API horovod_mxnet_allreduce_async.
  4. This calls a new method, MXWaitForHorovod, on the MXNet side. This is the only information that the MXNet library knows about Horovod.
  5. This calls MXNet's PushAsync, which creates a callback for Horovod to call upon completion of the Allreduce.
  6. After the Allreduce is complete, the Optimizer's weight update is done.
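
The sketch below illustrates the wrapping pattern described above. It is a minimal example, not the final API: the names hvd.init, hvd.size, hvd.local_rank, and hvd.DistributedOptimizer follow the prototype, and the toy network and data iterator are placeholders for illustration only.

```python
# Minimal sketch of the DistributedOptimizer wrapping pattern (prototype API).
import mxnet as mx
import horovod.mxnet as hvd

hvd.init()  # initialize MPI / NCCL communication

# Toy network and data iterator, purely for illustration.
data = mx.sym.Variable('data')
net = mx.sym.FullyConnected(data, num_hidden=10)
net = mx.sym.SoftmaxOutput(net, name='softmax')
train_data = mx.io.NDArrayIter(
    mx.nd.random.uniform(shape=(1000, 20)),
    mx.nd.random.randint(0, 10, shape=(1000,)).astype('float32'),
    batch_size=32)

# Wrap a standard MXNet Optimizer; the wrapper allreduces gradients
# across workers before every weight update (steps 1-6 above).
opt = mx.optimizer.create('sgd', learning_rate=0.01 * hvd.size())
opt = hvd.DistributedOptimizer(opt)

# Each worker trains on its own GPU; Module calls the wrapped update().
model = mx.mod.Module(symbol=net, context=mx.gpu(hvd.local_rank()))
model.fit(train_data, optimizer=opt, num_epoch=5)
```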

Horovod Interaction with MXNet

...

Since we are linking against the MXNet shared library, we need to include the correct headers in the PyPI package. To avoid ABI compatibility issues, we may need to add additional APIs (e.g., mx.config.get_compile_flags and mx.config.get_link_flags) that return the compilation and linker flags, respectively. The Horovod installation can then proceed using exactly the same flags.
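
A hypothetical sketch of how a Horovod extension build might consume these flags is shown below. Note that mx.config.get_compile_flags and mx.config.get_link_flags are the proposed (not yet existing) APIs named above, the assumption that they return whitespace-separated strings is ours, and the source file names are illustrative only.

```python
# Hypothetical sketch: consume the proposed flag-query APIs in a setup.py
# so the Horovod extension is built with the same flags as libmxnet.
import mxnet as mx
from setuptools import Extension

mxnet_ext = Extension(
    'horovod.mxnet.mpi_lib',                   # illustrative extension name
    sources=['horovod/mxnet/mpi_ops.cc'],      # illustrative source list
    extra_compile_args=mx.config.get_compile_flags().split(),  # proposed API
    extra_link_args=mx.config.get_link_flags().split(),        # proposed API
)
```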

Gluon support

Gluon support can be added by (see the sketch after this list):

  1. making sure to pass the DistributedOptimizer object into Trainer instead of into Module
  2. using hvd.broadcast_params on the exposed initialized parameters
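
The following is a minimal Gluon sketch, assuming the prototype names used in this document (hvd.DistributedOptimizer, hvd.broadcast_params); the small network is a placeholder for illustration, and the exact signature of hvd.broadcast_params is an assumption.

```python
# Minimal Gluon sketch for the two steps above (prototype API assumed).
import mxnet as mx
from mxnet import gluon
import horovod.mxnet as hvd

hvd.init()
ctx = mx.gpu(hvd.local_rank())

# Placeholder network with known input shape so parameters are allocated.
net = gluon.nn.Dense(10, in_units=20)
net.initialize(ctx=ctx)
params = net.collect_params()

# Step 2: broadcast the initialized parameters from the root worker so
# every worker starts from identical weights.
hvd.broadcast_params(params)

# Step 1: pass the DistributedOptimizer into Trainer (not Module);
# gradients are allreduced before each trainer.step() applies the update.
opt = hvd.DistributedOptimizer(mx.optimizer.create('sgd', learning_rate=0.01))
trainer = gluon.Trainer(params, opt)
```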

Milestones

Aug. 10, 2018: Prototype API available for testing: https://github.com/ctcyang/horovod/tree/mxnet_fp16_divide_before_sum/examples/mxnet

Oct. 5, 2018: Beta release of final API


References

[1] Sergeev, Alexander, and Mike Del Balso. "Horovod: fast and easy distributed deep learning in TensorFlow." arXiv preprint arXiv:1802.05799 (2018). https://arxiv.org/pdf/1802.05799.pdf

...