...

The existing parameter server approach to distributed MXNet faces limitations in performance and feature completeness: it lacks tensor fusion, gradient compression (including single-bit compression), and the ability to use MPI and NCCL.
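Tensor fusion, one of the features listed above, batches many small gradient arrays into a single buffer so that one large allreduce replaces many small, latency-bound ones. The following is a minimal pure-Python sketch of the idea only; the function names and the summation "allreduce" are illustrative assumptions, not Horovod's actual implementation.

```python
# Illustrative sketch of tensor fusion: pack small gradient arrays into one
# flat buffer, run a single (simulated) allreduce, then unpack the result.

def fuse(tensors):
    """Concatenate small tensors into one flat buffer, recording offsets."""
    buffer, offsets = [], []
    for t in tensors:
        offsets.append((len(buffer), len(t)))
        buffer.extend(t)
    return buffer, offsets

def unfuse(buffer, offsets):
    """Split a fused buffer back into the original tensor layout."""
    return [buffer[start:start + length] for start, length in offsets]

def allreduce_sum(buffers):
    """Simulated allreduce: element-wise sum across all workers' buffers."""
    return [sum(vals) for vals in zip(*buffers)]

# Two workers, each holding two small gradient tensors.
worker_grads = [
    [[1.0, 2.0], [3.0]],
    [[10.0, 20.0], [30.0]],
]
fused = [fuse(grads) for grads in worker_grads]          # one buffer per worker
reduced = allreduce_sum([buf for buf, _ in fused])       # single allreduce call
result = unfuse(reduced, fused[0][1])
# result == [[11.0, 22.0], [33.0]]
```

The key point is that the communication layer sees one contiguous buffer per worker instead of one call per gradient tensor, which is what makes fusion pay off over high-latency interconnects.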

...

  1. Usability - Users do not have to experiment with the number of workers and servers to get the best performance out of the box.

  2. Performance - Horovod + TensorFlow has shown 2x the performance of Distributed TensorFlow [1], so we expect Horovod + MXNet to show similar gains over the parameter server approach.

  3. Cost savings - Dedicated parameter server instances are not needed when using Horovod.

  4. Simplified architecture - Leverage battle-tested libraries such as MPI and NCCL, as well as network optimizations such as RDMA.

  5. Profiler - Horovod has an excellent profiler for finding bottlenecks.

  6. Online learning - Due to its MPI paradigm, Horovod can save checkpoints, which enables online learning and fine-tuning of your model. With a parameter server, it takes additional work to save optimizer state located on the servers, but with Horovod this feature comes for free. Note: this feature is currently not supported.

  7. Community - Horovod is a way for MXNet to leverage the Deep Learning community for advancements in distributed training, and for increasing MXNet's visibility.

Proposed Approach

User Interface

...

Instance types: Horovod+X (16 p3.16xlarge), parameter server (16 p3.16xlarge, 32 r4.16xlarge).

...

...

Figure 4. Preliminary benchmark on synthetic data comparing parameter server co-located (servers on the same nodes as workers), parameter server (2 servers : 1 worker), Intel MPI+MXNet, Horovod+TensorFlow, and Horovod+MXNet.

Addition of New APIs

We are introducing new MXWaitForHorovodAllreduce and MXWaitForHorovodBroadcast functions to the MXNet C API. These functions will take the form of:

...
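The pattern these wait functions enable can be sketched in pure Python: the engine asynchronously hands a tensor to a background communication loop and blocks until the reduction completes. Everything below (the class, method names, and threading model) is an illustrative assumption, not the actual MXNet/Horovod C API.

```python
import threading

# Illustrative sketch (assumed names, not the real MXNet/Horovod API):
# workers enqueue an allreduce that runs on a background thread, and a
# wait() call blocks until the result is ready, mirroring what a
# MXWaitForHorovodAllreduce-style C API would do.

class HorovodLoop:
    """Background reduction loop that sums submitted tensors across workers."""

    def __init__(self, num_workers):
        self.num_workers = num_workers
        self.pending = []
        self.done = threading.Event()
        self.result = None

    def submit(self, tensor):
        """Called once per worker; fires the allreduce when all have arrived."""
        self.pending.append(tensor)
        if len(self.pending) == self.num_workers:
            def reduce():
                # Element-wise sum across workers, off the caller's thread.
                self.result = [sum(vals) for vals in zip(*self.pending)]
                self.done.set()
            threading.Thread(target=reduce).start()

    def wait(self):
        """Block until the allreduce completes (the 'WaitFor' step)."""
        self.done.wait()
        return self.result

loop = HorovodLoop(num_workers=2)
loop.submit([1.0, 2.0])
loop.submit([3.0, 4.0])
reduced = loop.wait()
# reduced == [4.0, 6.0]
```

The separation between submit and wait is what lets the MXNet engine overlap communication with backward-pass computation instead of blocking at every gradient.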

Oct. 5, 2018: Beta release of final API


References

[1] Sergeev, Alexander, and Mike Del Balso. "Horovod: fast and easy distributed deep learning in TensorFlow." arXiv preprint arXiv:1802.05799 (2018). https://arxiv.org/pdf/1802.05799.pdf

...