...

The existing parameter server approach to distributed MXNet faces limitations in performance and feature completeness: it lacks tensor fusion, gradient compression (including single-bit compression), and the ability to use MPI and NCCL.
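Tensor fusion, one of the features listed above, batches many small gradient arrays into a single buffer so that one large allreduce replaces many small, latency-bound ones. The following is a minimal pure-Python sketch of the idea only; the function names and the summation "allreduce" are illustrative assumptions, not Horovod's actual implementation.

```python
# Illustrative sketch of tensor fusion: pack small gradient arrays into one
# flat buffer, run a single (simulated) allreduce, then unpack the result.

def fuse(tensors):
    """Concatenate small tensors into one flat buffer, recording offsets."""
    buffer, offsets = [], []
    for t in tensors:
        offsets.append((len(buffer), len(t)))
        buffer.extend(t)
    return buffer, offsets

def unfuse(buffer, offsets):
    """Split a fused buffer back into the original tensor layout."""
    return [buffer[start:start + length] for start, length in offsets]

def allreduce_sum(buffers):
    """Simulated allreduce: element-wise sum across all workers' buffers."""
    return [sum(vals) for vals in zip(*buffers)]

# Two workers, each holding two small gradient tensors.
worker_grads = [
    [[1.0, 2.0], [3.0]],
    [[10.0, 20.0], [30.0]],
]
fused = [fuse(grads) for grads in worker_grads]          # one buffer per worker
reduced = allreduce_sum([buf for buf, _ in fused])       # single allreduce call
result = unfuse(reduced, fused[0][1])
# result == [[11.0, 22.0], [33.0]]
```

The key point is that the communication layer sees one contiguous buffer per worker instead of one call per gradient tensor, which is what makes fusion pay off over high-latency interconnects.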

...

  1. Usability - Users do not have to experiment with the number of workers and servers to get the best performance out of the box.

  2. Performance - Horovod + TensorFlow has shown 2x the performance of Distributed TensorFlow [1], so we expect Horovod + MXNet to show similar gains over the parameter server approach.

  3. Cost savings - Dedicated parameter server instances are not needed when using Horovod.

  4. Simplified architecture - Leverage battle-tested libraries such as MPI and NCCL, as well as network optimizations such as RDMA.

  5. Profiler - Horovod has an excellent profiler for finding bottlenecks.

  6. Online learning - Due to its MPI paradigm, Horovod can save checkpoints, which enables online learning and fine-tuning of your model. With a parameter server, it takes additional work to save optimizer state located on the servers, but with Horovod this feature comes for free. Note: this feature is currently not supported.

  7. Community - Horovod is a way for MXNet to leverage the Deep Learning community for advancements in distributed training, and for increasing MXNet's visibility.

Proposed Approach

User Interface

...

Instance types: Horovod+X (16 p3.16xlarge), parameter server (16 p3.16xlarge, 32 r4.16xlarge).

...

...

Figure 4. Preliminary benchmark on synthetic data comparing parameter server co-located (servers on the same nodes as workers), parameter server (2 servers : 1 worker), Intel MPI+MXNet, Horovod+TensorFlow, and Horovod+MXNet.

Addition of New APIs

We are introducing new MXWaitForHorovodAllreduce and MXWaitForHorovodBroadcast functions to the MXNet C API. These functions will take the form of:

...
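The pattern these wait functions enable can be sketched in pure Python: the engine asynchronously hands a tensor to a background communication loop and blocks until the reduction completes. Everything below (the class, method names, and threading model) is an illustrative assumption, not the actual MXNet/Horovod C API.

```python
import threading

# Illustrative sketch (assumed names, not the real MXNet/Horovod API):
# workers enqueue an allreduce that runs on a background thread, and a
# wait() call blocks until the result is ready, mirroring what a
# MXWaitForHorovodAllreduce-style C API would do.

class HorovodLoop:
    """Background reduction loop that sums submitted tensors across workers."""

    def __init__(self, num_workers):
        self.num_workers = num_workers
        self.pending = []
        self.done = threading.Event()
        self.result = None

    def submit(self, tensor):
        """Called once per worker; fires the allreduce when all have arrived."""
        self.pending.append(tensor)
        if len(self.pending) == self.num_workers:
            def reduce():
                # Element-wise sum across workers, off the caller's thread.
                self.result = [sum(vals) for vals in zip(*self.pending)]
                self.done.set()
            threading.Thread(target=reduce).start()

    def wait(self):
        """Block until the allreduce completes (the 'WaitFor' step)."""
        self.done.wait()
        return self.result

loop = HorovodLoop(num_workers=2)
loop.submit([1.0, 2.0])
loop.submit([3.0, 4.0])
reduced = loop.wait()
# reduced == [4.0, 6.0]
```

The separation between submit and wait is what lets the MXNet engine overlap communication with backward-pass computation instead of blocking at every gradient.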

Oct. 5, 2018: Beta release of final API


References

[1] Sergeev, Alexander, and Mike Del Balso. "Horovod: fast and easy distributed deep learning in TensorFlow." arXiv preprint arXiv:1802.05799 (2018). https://arxiv.org/pdf/1802.05799.pdf

...