...

The existing parameter server approach to distributed MXNet faces limitations in performance and feature completeness: it lacks tensor fusion, single-bit gradient compression, and the ability to use MPI and NCCL.

Horovod is an open-source distributed training framework that has shown a 2x speedup over standard distributed TensorFlow using innovative techniques [1, 2].

We propose to add Horovod support to MXNet. This will help our users achieve the goal of linear scaling to 256 GPUs and beyond. Naturally, we will also support multi-machine CPU training.

Value Proposition

This project seeks to provide an alternative distributed training solution for MXNet customers. It offers the following value proposition:

...

Figure 4. Preliminary benchmark on synthetic data comparing: parameter server co-located (servers on the same nodes as workers), parameter server with a 2 servers : 1 worker ratio, Intel MPI+MXNet, Horovod+TensorFlow, and Horovod+MXNet.

CPU support and GPU fp16 support

Because CPU support and GPU fp16 support are still experimental, we do not yet have performance numbers for them.

Addition of New APIs

We are introducing two new functions to the MXNet C API: MXWaitforHorovodAllreduce and MXWaitforHorovodBroadcast. They take the form:

  • void MXWaitforHorovodAllreduce(NDArray* input, NDArray* output, bool average, char* name, void (*func)(NDArray*, NDArray*, bool, char*, void (*cb)(Engine*, void*)))
  • void MXWaitforHorovodBroadcast(NDArray* input, NDArray* output, bool average, char* name, void (*func)(NDArray*, NDArray*, bool, char*, void (*cb)(Engine*, void*)))

The parameters are:

  • input: the NDArray that MXNet must lock, which is also passed back to Horovod
  • output: the NDArray that MXNet must lock, which is also passed back to Horovod
  • average: a flag passed back to Horovod indicating whether the reduced result should be averaged
  • name: the tensor name passed back to Horovod
  • func: the function that MXNet calls inside mxnet::Engine::PushAsync(), whose arguments must be passed back to Horovod
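For context, here is a minimal Python-level sketch of how these hooks are expected to surface to users through the prototype horovod.mxnet bindings (linked in the milestones below); the exact Python signatures may still change:

# Usage sketch based on the prototype horovod.mxnet bindings; hvd.allreduce
# and hvd.broadcast_parameters wrap the engine hooks described above.
import mxnet as mx
import horovod.mxnet as hvd

hvd.init()

# Each worker contributes a tensor filled with its rank; with average=True,
# every worker receives the mean across all ranks.
x = mx.nd.ones((2, 2)) * hvd.rank()
y = hvd.allreduce(x, average=True, name="example")
print(y.asnumpy())

# Broadcast initial parameters from rank 0 so all workers start identically.
params = {'weight': mx.nd.random.uniform(shape=(2, 2))}
hvd.broadcast_parameters(params, root_rank=0)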

Test Plan

Functionality Tests

We will be introducing unit tests for most public Horovod functions at the Python level:

  • hvd.allreduce
  • hvd.broadcast_parameters
  • hvd.local_rank
  • hvd.rank
  • hvd.local_size
  • hvd.size

These tests will be located at "horovod/test/test_mxnet.py", next to "test_tensorflow.py" and "test_torch.py". To run the tests:

$ mpirun -np 8 --hostfile ~/hosts --bind-to none --map-by slot -x NCCL_DEBUG=INFO -x NCCL_MIN_NRINGS=4 -x LD_LIBRARY_PATH -x PATH -x MXNET_USE_OPERATOR_TUNING=0 -mca pml ob1 -mca btl ^openib python test_mxnet.py
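For illustration, below is a sketch of the style of unit test that test_mxnet.py will contain, mirroring test_torch.py; the test name and tolerance here are placeholders:

import unittest
import mxnet as mx
import horovod.mxnet as hvd

class MXTests(unittest.TestCase):
    # Placeholder test in the style of test_torch.py: summing a tensor that
    # holds each worker's rank should yield 0 + 1 + ... + (size - 1).
    def test_horovod_allreduce_sum(self):
        hvd.init()
        size = hvd.size()
        tensor = mx.nd.ones((17, 17)) * hvd.rank()
        summed = hvd.allreduce(tensor, average=False, name="test_sum")
        expected = mx.nd.ones((17, 17)) * sum(range(size))
        assert mx.nd.abs(summed - expected).sum().asscalar() < 1e-5

if __name__ == '__main__':
    unittest.main()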

Performance Tests

Automated performance tests are outside the scope of this project. For comparison, the Horovod repository itself does not contain any performance tests; it instead points to https://github.com/tensorflow/benchmarks and states that users can replicate the performance numbers reported in the paper [1].

Technical Challenges

MXNet not intended to be used in 1 process/1 GPU mode

...

Aug. 10, 2018: Prototype API available for testing: https://github.com/ctcyang/horovod/tree/mxnet_fp16_divide_before_sum/examples/mxnet

  • GPU fp32 support has been tested
  • GPU fp16 support is still experimental
  • CPU support is experimental

Oct. 5, 2018: Beta release of the final API

...