The PR is closed, and the Horovod solution will be used for distributed training.

The current implementation has been partially merged into the Horovod solution.

Therefore, the new solution takes the advantages of both this PR and Horovod, which benefits the community.

A design doc is available below; comments and edits are welcome:

https://docs.google.com/document/d/1e4anwDiS18cWP49FAghU6tqqdtnRKUcbNJJxvhIfvIA/edit#heading=h.t762l56r1094

...


https://github.com/apache/incubator-mxnet/pull/10696

 

Problem Statement

MXNet's inherent distributed training mechanism, the parameter server, provides efficient communication for ASGD and fault tolerance, especially in cloud environments. However, we found that at small node counts (8-64), MPI allreduce can achieve near-linear scaling efficiency without deploying any extra server nodes. We therefore suggest adding mpi-allreduce as an alternative choice for customers in MXNet multi-node distributed training.

...

Machine: SKX6148, Network: 10GbE, Topology: VGG16, Local Batch Size: 64, KVStore Type: dist_sync

Parameter Server:

worker num | server num                       | Per Node FPS (pic/s) | Scaling Efficiency
8          | 8 (worker and server share node) | 19.87                | 67.81%
8          | 8                                | 27.3                 | 93.17%
8          | 4                                | 22.7                 | 77.47%
8          | 2                                | 11.11                | 37.90%

Command line: python tools/launch.py -n 8 -s <server_num> --launcher ssh -H hosts python example/image-classification/train_vgg16.py --kv-store dist_sync


Following is the result of MXNet multi-node training with MPI allreduce support, from our proof of concept (ready):

Node Num | Per Node FPS (pic/s) | Scaling Efficiency
8        | 27.76                | 94.74%

Command line: mpirun -n 8 -ppn 1 -machinefile hosts python example/image-classification/train_vgg16.py --kv-store dist_sync_mpi

...

Following is an allreduce benchmark (400 MB payload, 8 workers) for reference:

Method                      | server num | worker num | Time (s)
ParameterServer (push+pull) | 1          | 8          | 6.5
ParameterServer (push+pull) | 2          | 8          | 3.4
ParameterServer (push+pull) | 4          | 8          | 2.0
ParameterServer (push+pull) | 8          | 8          | 1.2
MPI.Allreduce               | N/A        | 8          | 1.0

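For reference, the MPI.Allreduce row above could be reproduced with a minimal mpi4py timing sketch like the one below. This is an illustration only, under the assumption that the 400 MB payload is a single float32 buffer; it is not the actual benchmark harness used for the table.

import time
import numpy as np
from mpi4py import MPI   # assumption: mpi4py is available on every node

comm = MPI.COMM_WORLD

# 100M float32 elements ~= 400 MB, matching the benchmark payload above
payload = np.ones(100 * 1000 * 1000, dtype=np.float32)
result = np.empty_like(payload)

comm.Barrier()                                # align all ranks before timing
start = time.time()
comm.Allreduce(payload, result, op=MPI.SUM)   # sum-reduce across all ranks
comm.Barrier()                                # wait until every rank is done
elapsed = time.time() - start

if comm.rank == 0:
    print("Allreduce of %d MB across %d ranks: %.2f s"
          % (payload.nbytes // 1000000, comm.size, elapsed))

Command line: mpirun -n 8 -ppn 1 -machinefile hosts python allreduce_bench.py (the script name is hypothetical).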

From the performance data, we can draw the following conclusions:

...

Moreover, we noticed that most mainstream deep learning frameworks have an mpi-allreduce-based distributed training mechanism:

Framework       | Distributed Communication Mechanism
TensorFlow      | PS + mpi-allreduce (baidu allreduce, uber horovod)
MXNet           | PS
Caffe           | mpi-allreduce
Torch + PyTorch | mpi-allreduce
Chainer         | mpi-allreduce (mpi4py)

 

Goal

Besides the existing parameter server mechanism in MXNet, we suggest adding mpi-allreduce as an alternative distributed training mechanism, which can significantly enhance multi-node scaling efficiency for synchronous SGD distributed training at minimal cost.

...

  1. init(self, key, value):
    Initializes a single or a sequence of key-value pairs into the store.
    Not supported in kvstore with type dist_sync_mpi

  2. pushpull(self, key, ins, outs, priority=0):
    Use this call to replace the KVStore push and pull operations.
    The pushpull API is a new interface for the "dist_sync_mpi" KVStore and MPI-based distributed training. It fuses the original push and pull APIs of KVStore into one call and offers a convenient way to aggregate tensors with MPI allreduce APIs (see the usage sketch after this list).

  3. broadcast(self, key, value, root_rank, priority=0):
    Use this call to broadcast tensors from root_rank to all other nodes.
    The broadcast API is a new interface for the "dist_sync_mpi" KVStore and MPI-based distributed training. It broadcasts the value of the tensor on root_rank to all other nodes with MPI broadcast APIs.

  4. push(self, key, value, priority=0):
    Not supported in kvstore with type dist_sync_mpi

  5. pull(self, key, out=None, priority=0):
    Not supported in kvstore with type dist_sync_mpi

  6. row_sparse_pull(self, key, out=None, priority=0, row_ids=None):
    Not supported in kvstore with type dist_sync_mpi

  7. set_gradient_compression(self, compression_params):
    Specifies the type of low-bit quantization for gradient compression, and additional arguments depending on the type of compression being used. Currently not supported in kvstore with type dist_sync_mpi

  8. set_optimizer(self, optimizer):
    Not supported in kvstore with type dist_sync_mpi

  9. type(self):
    Returns the type of this kvstore.

  10. rank(self):
    Returns the index of the current process in the MPI group.

  11. num_workers(self):
    Returns the number of running MPI processes.

  12. save_optimizer_states(self, fname, dump_optimizer=False):
    Not supported in kvstore with type dist_sync_mpi

  13. def load_optimizer_states(self, fname):
    Not supported in kvstore with type dist_sync_mpi

    
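For illustration, here is a minimal usage sketch of the proposed dist_sync_mpi KVStore built only from the API list above. The kvstore type and the pushpull/broadcast calls are the proposal in this document, not an already-released MXNet API; the tensor values are made up for the example, and rank/num_workers are used as properties, as in the existing KVStore.

import mxnet as mx

kv = mx.kvstore.create('dist_sync_mpi')      # proposed kvstore type

# One weight tensor and its local gradient on every worker.
weight = mx.nd.ones((1000,))
grad = mx.nd.ones((1000,)) * (kv.rank + 1)   # each rank computes a different gradient

# Sync the initial weights: the value on root_rank 0 is broadcast to all
# other workers (init/push/pull are not supported for this kvstore type).
kv.broadcast(0, weight, root_rank=0)

# pushpull replaces the former push(gradient) + pull(aggregated gradient)
# pair: the gradients of all workers are allreduced and the result is
# written into the output tensor.
kv.pushpull(0, grad, grad)

# Plain SGD step with the aggregated gradient, averaged over the workers.
weight -= 0.1 * grad / kv.num_workers

Such a script would be launched with mpirun exactly as in the proof-of-concept command line above.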

...