...

The existing parameter server approach to distributed MXNet faces limitations in performance and feature completeness (e.g., tensor fusion, single-bit gradient compression, and the ability to use MPI and NCCL).

Horovod is an open-source distributed training framework that has shown a 2x speedup compared to distributed TensorFlow, using innovative techniques [1, 2].

We propose adding Horovod support to MXNet. This will help our users achieve the goal of linear scalability to 256 GPUs and beyond. Naturally, we will support multi-machine CPU training too.

Value Proposition

This project is seeking to provide an alternative distributed training solution for MXNet customers. It offers customers the following value proposition:

...

$ mpirun -np 8 --hostfile ~/hosts --bind-to none --map-by slot -x NCCL_DEBUG=INFO -x NCCL_MIN_NRINGS=4 -x LD_LIBRARY_PATH -x PATH -x MXNET_USE_OPERATOR_TUNING=0 -mca pml ob1 -mca btl ^openib python3 /home/ubuntu/master/3rdparty/horovod/examples/mxnet_imagenet_resnet50.py --benchmark 1 --batch-size=256 --network resnet-v1 --num-layers=50 --num-epochs 1 --kv-store horovod --dtype float16 --gpus 0

Note: this command is valid when using OpenMPI. To improve the user experience, the training script may be wrapped in a Bash script in the future.

...

Figure 1. (a) hvd.allreduce; (b) hvd.broadcast_parameters

...

  1. In every iteration, the DistributedOptimizer wrapper inserts an Allreduce of the gradients before the weight update is performed (see the sketch after this list).
  2. This is done by calling hvd.allreduce.
  3. This calls down into the C API function horovod_mxnet_allreduce_async, which in turn calls a new method, MXWaitForHorovod, on the MXNet side. This is the only information that the MXNet library knows about Horovod.
  4. This calls MXNet's PushAsync, which creates a callback for Horovod to call upon completion of the Allreduce.
  5. After the Allreduce is complete, the Optimizer's weight update is performed.
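
A minimal Python-level sketch of this flow for a Gluon model. The model and hyperparameters are illustrative; hvd.init, hvd.local_rank, hvd.broadcast_parameters, and hvd.DistributedOptimizer are the APIs described in this document:

import mxnet as mx
import horovod.mxnet as hvd

hvd.init()
# One process per GPU: pin this process to its local GPU.
ctx = mx.gpu(hvd.local_rank())

# Illustrative model; any Gluon block works the same way.
net = mx.gluon.model_zoo.vision.resnet50_v1()
net.initialize(mx.init.Xavier(), ctx=ctx)

# Broadcast initial parameters from rank 0 so all workers start
# from identical weights.
params = net.collect_params()
hvd.broadcast_parameters(params, root_rank=0)

# Wrap the base optimizer: the wrapper performs an Allreduce of the
# gradients (hvd.allreduce) before every weight update.
opt = mx.optimizer.create('sgd', learning_rate=0.01)
opt = hvd.DistributedOptimizer(opt)
trainer = mx.gluon.Trainer(params, opt)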

...

This behaviour is the same as when running pip install horovod for TensorFlow and PyTorch support: when those libraries are not present, the Horovod installation will fail.



Figure 2. How Horovod interacts with MXNet engine

...

  • Easier integration with other MXNet bindings, because those bindings already support KVStore

  • User does not have to install another dependency in order to do distributed training, because the MXNet build includes the Horovod source code as a third-party dependency.

    • However, there is a trade-off: the Horovod source code would then need to be maintained to ensure there are no regressions.

  • Language bindings for languages other than Python are available without additional work
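
For concreteness, a hypothetical sketch of how the KVStore-based approach could surface in Python. The 'horovod' KVStore type below is an assumption made for illustration (suggested by the --kv-store horovod flag above), not a shipped API:

import mxnet as mx

# Hypothetical: assumes the integration registers a 'horovod' KVStore type.
kv = mx.kv.create('horovod')

# Illustrative model; existing KVStore-based training code would keep
# its structure, with only the kvstore argument changing.
net = mx.gluon.nn.Dense(10)
net.initialize()
trainer = mx.gluon.Trainer(net.collect_params(), 'sgd',
                           {'learning_rate': 0.01}, kvstore=kv)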

Performance Benchmarks

Final API Benchmarks

Model: resnet-v1, 50 layers

Dataset: synthetic

Dtype: float32

Instance types: Horovod+X (32 p3.16xlarge), parameter server (32 p3.16xlarge)



Figure 4. Preliminary benchmark on synthetic data comparing parameter server co-located (servers on same node as workers) and Horovod+MXNet

Prototype Benchmarks

Model: resnet-v1, 50 layers

Dataset: synthetic

Dtype: float32

Instance types: Horovod+X (16 p3.16xlarge), parameter server (16 p3.16xlarge, 32 r4.16xlarge).

Figure 5. Preliminary benchmark on synthetic data comparing parameter server co-located (servers on same node as workers), parameter server 2 servers:1 worker, Intel MPI+MXNet, Horovod+TensorFlow, and Horovod+MXNet.

Addition of New APIs

We are introducing two new functions, MXWaitForHorovodAllreduce and MXWaitForHorovodBroadcast, in the MXNet C API. These functions take the form:

  • void MXWaitForHorovodAllreduce(NDArray* input, NDArray* output, bool average, char* name, void (*func)(NDArray*, NDArray*, bool, char*, void (*cb)()))
  • void MXWaitForHorovodBroadcast(NDArray* input, NDArray* output, bool average, char* name, void (*func)(NDArray*, NDArray*, bool, char*, void (*cb)()))

The parameters are:

...
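
For orientation, a minimal Python-level sketch of the call that ultimately reaches these C API functions; the tensor shape and name are illustrative:

import mxnet as mx
import horovod.mxnet as hvd

hvd.init()

x = mx.nd.ones((2, 2)) * hvd.rank()
# hvd.allreduce calls horovod_mxnet_allreduce_async, which in turn
# calls MXWaitForHorovodAllreduce inside the MXNet library.
y = hvd.allreduce(x, average=True, name="example_tensor")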

CPU support and GPU fp16 support

Since CPU support and GPU fp16 support are listed as experimental at the moment, we do not have performance numbers for them.


Test Plan

Functionality Tests

We will be introducing unit tests for most public Horovod functions at the Python level:


  • hvd.allreduce

  • hvd.broadcast_parameters

  • hvd.local_rank

  • hvd.rank

  • hvd.local_size

  • hvd.size

These will be contained under the path "horovod/test/test_mxnet.py", next to "test_tensorflow.py" and "test_torch.py". To run the test:

$ mpirun -np 8 --hostfile ~/hosts --bind-to none --map-by slot -x NCCL_DEBUG=INFO -x NCCL_MIN_NRINGS=4 -x LD_LIBRARY_PATH -x PATH -x MXNET_USE_OPERATOR_TUNING=0 -mca pml ob1 -mca btl ^openib python3 test_mxnet.py
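
As an illustration, the kind of check such a test could contain (a sketch, not the actual contents of test_mxnet.py):

import mxnet as mx
import horovod.mxnet as hvd

def test_horovod_allreduce_sum():
    hvd.init()
    # Each rank contributes a tensor filled with its rank id.
    x = mx.nd.ones((4, 4)) * hvd.rank()
    summed = hvd.allreduce(x, average=False)
    # The summed result equals 0 + 1 + ... + (size - 1) in every element.
    expected = mx.nd.ones((4, 4)) * sum(range(hvd.size()))
    assert mx.nd.abs(summed - expected).sum().asscalar() == 0.0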

Performance Tests

...

Automated performance tests will be outside the scope of this project. For example, the Horovod repository itself does not have any performance tests; instead, it points users to https://github.com/tensorflow/benchmarks and notes that they should be able to replicate the performance numbers shown in the paper [1].


Technical Challenges

MXNet not intended to be used in 1 process/1 GPU mode

...

Aug. 10, 2018: Prototype API available for testing: https://github.com/ctcyang/horovod/tree/mxnet_fp16_divide_before_sum/examples/mxnet

  • GPU fp32 support has been tested
  • GPU fp16 support is still experimental
  • CPU support is experimental

Oct. 5, 2018: Beta release of Final API

...