...

https://docs.google.com/document/d/1e4anwDiS18cWP49FAghU6tqqdtnRKUcbNJJxvhIfvIA/edit#heading=h.t762l56r1094

The PR is under review.

https://github.com/apache/incubator-mxnet/pull/10696

 

Problem Statement

MXNet's built-in distributed training mechanism, the parameter server, provides efficient communication for asynchronous SGD (ASGD) and fault tolerance, which is especially valuable in cloud environments. However, we found that at small node counts (8-64), MPI allreduce achieves close-to-linear scaling efficiency while requiring no extra server nodes. We therefore propose adding MPI allreduce as an alternative option for MXNet multi-node distributed training.

...

                                                 PS-based distributed training (ref here)

It is no coincidence that in 2016-2017 both Baidu and Uber proposed MPI-allreduce-based distributed training frameworks for TensorFlow (tensorflow-allreduce and Horovod, respectively), motivated by the same drawbacks of TensorFlow's built-in parameter-server-based distributed training mentioned above.

...

  1. init(self, key, value):
    Initializes a single or a sequence of key-value pairs into the store.
    Not supported in kvstore with type dist_sync_mpi

  2. pushpull(self, key, ins, outs, priority=0):
    Use this API in place of the KVStore push and pull operations.
    pushpull is a new interface for the "dist_sync_mpi" KVStore and MPI-based distributed training. It fuses the original push and pull APIs of KVStore into one call and offers a convenient way to aggregate tensors with MPI allreduce APIs (see the usage sketch after this list).

  3. broadcast(self, key, value, root_rank, priority=0):
    Use this command to broadcast tensors in root_rank to all other nodes
    broadcast API is a new interface for "dist_sync_mpi" KVStore and MPI-based distributed training. It will broadcast the value of tensor in root_rank to all other nodes with MPI broadcast APIs.

  4. push(self, key, value, priority=0):
    Not supported in kvstore with type dist_sync_mpi

  5. pull(self, key, out=None, priority=0):
    Not supported in kvstore with type dist_sync_mpi

  6. row_sparse_pull(self, key, out=None, priority=0, row_ids=None):
    Not supported in kvstore with type dist_sync_mpi

  7. set_gradient_compression(self, compression_params):
    Specifies the type of low-bit quantization for gradient compression, plus additional arguments depending on the type of compression being used. Currently not supported in kvstore with type dist_sync_mpi.

  8. set_optimizer(self, optimizer):
    Not supported in kvstore with type dist_sync_mpi

  9. type(self):
    Returns the type of this kvstore.

  12. rank(self):
    Returns the index of the current process in the MPI group.

  11. num_workers(self):
    Returns the number of running MPI processes.

  12. save_optimizer_states(self, fname, dump_optimizer=False):
    Not supported in kvstore with type dist_sync_mpi

  13. load_optimizer_states(self, fname):
    Not supported in kvstore with type dist_sync_mpi
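
Taken together, these APIs replace the classic push/pull training loop. Below is a minimal sketch of one synchronous training step. It assumes the proposed "dist_sync_mpi" KVStore type and the pushpull/broadcast signatures listed above; rank and num_workers are accessed as properties, as in the mainline KVStore, and pushpull is assumed to sum gradients across processes.

import mxnet as mx

# Sketch only: "dist_sync_mpi" is the KVStore type proposed above,
# not part of a released MXNet version.
kv = mx.kvstore.create('dist_sync_mpi')

shape = (2, 3)
weight = mx.nd.ones(shape)

# Broadcast the initial weights from rank 0 so that every MPI
# process starts from identical parameters (item 3 above).
kv.broadcast('weight', weight, root_rank=0)

# Each process computes a local gradient (a dummy value here)...
grad = mx.nd.ones(shape) * (kv.rank + 1)

# ...and a single pushpull call (item 2 above) allreduces it,
# replacing the separate push and pull of the classic
# parameter-server KVStore. `grad` is the input; `agg_grad`
# receives the sum over all processes.
agg_grad = mx.nd.zeros(shape)
kv.pushpull('weight', grad, agg_grad)

# set_optimizer is not supported for dist_sync_mpi (item 8 above),
# so every worker applies the same local update; averaging the
# summed gradient keeps the step size independent of scale.
lr = 0.1
weight -= lr * agg_grad / kv.num_workers

print('rank %d of %d finished one step' % (kv.rank, kv.num_workers))

A script like this would be started with an MPI launcher (for example, mpirun -np 4 python train.py) so that one process runs per node.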

    

https://static.googleusercontent.com/media/research.google.com/zh-CN//archive/large_deep_networks_nips2012.pdf