...

https://docs.google.com/document/d/1e4anwDiS18cWP49FAghU6tqqdtnRKUcbNJJxvhIfvIA/edit#heading=h.t762l56r1094

The PR is under review.

https://github.com/apache/incubator-mxnet/pull/10696

 

Problem Statement

MXNet's built-in distributed training mechanism, the parameter server, provides efficient communication for asynchronous SGD (ASGD) and fault tolerance, which is especially valuable in cloud environments. However, we found that at small node counts (8-64), MPI allreduce achieves close-to-linear scaling efficiency while requiring no extra server nodes. We therefore propose adding MPI allreduce as an alternative option for MXNet multi-node distributed training.

...

                                                 PS-based distributed training (ref here)

It is no coincidence that in 2016-2017 both Baidu and Uber proposed MPI-allreduce-based distributed training frameworks for TensorFlow (tensorflow-allreduce and Horovod, respectively), motivated by the same drawbacks of TensorFlow's built-in parameter-server-based distributed training mentioned above.

...

  1. init(self, key, value):
    Initializes a single or a sequence of key-value pairs into the store.
    Not supported in kvstore with type dist_sync_mpi

  2. pushpull(self, key, ins, outs, priority=0):
    Use this API in place of the KVStore push and pull operations.
    pushpull is a new interface for the "dist_sync_mpi" KVStore and MPI-based distributed training. It fuses the original push and pull APIs of KVStore into one call and offers a convenient way to aggregate tensors with MPI allreduce APIs (see the usage sketch after this list).

  3. broadcast(self, key, value, root_rank, priority=0):
    Use this command to broadcast tensors in root_rank to all other nodes
    broadcast API is a new interface for "dist_sync_mpi" KVStore and MPI-based distributed training. It will broadcast the value of tensor in root_rank to all other nodes with MPI broadcast APIs.

  4. push(self, key, value, priority=0):
    Not supported in kvstore with type dist_sync_mpi

  5. pull(self, key, out=None, priority=0):
    Not supported in kvstore with type dist_sync_mpi

  6. row_sparse_pull(self, key, out=None, priority=0, row_ids=None):
    Not supported in kvstore with type dist_sync_mpi

  7. set_gradient_compression(self, compression_params):
    Specifies the type of low-bit quantization for gradient compression, plus additional arguments depending on the type of compression being used. Currently not supported in kvstore with type dist_sync_mpi.

  8. set_optimizer(self, optimizer):
    Not supported in kvstore with type dist_sync_mpi

  9. type(self):
    Returns the type of this kvstore.

  12. rank(self):
    Returns the index of the current process in the MPI group.

  11. num_workers(self):
    Returns the number of running MPI processes.

  12. save_optimizer_states(self, fname, dump_optimizer=False):
    Not supported in kvstore with type dist_sync_mpi

  13. load_optimizer_states(self, fname):
    Not supported in kvstore with type dist_sync_mpi
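
Taken together, these APIs replace the classic push/pull training loop. Below is a minimal sketch of one synchronous training step. It assumes the proposed "dist_sync_mpi" KVStore type and the pushpull/broadcast signatures listed above; rank and num_workers are accessed as properties, as in the mainline KVStore, and pushpull is assumed to sum gradients across processes.

import mxnet as mx

# Sketch only: "dist_sync_mpi" is the KVStore type proposed above,
# not part of a released MXNet version.
kv = mx.kvstore.create('dist_sync_mpi')

shape = (2, 3)
weight = mx.nd.ones(shape)

# Broadcast the initial weights from rank 0 so that every MPI
# process starts from identical parameters (item 3 above).
kv.broadcast('weight', weight, root_rank=0)

# Each process computes a local gradient (a dummy value here)...
grad = mx.nd.ones(shape) * (kv.rank + 1)

# ...and a single pushpull call (item 2 above) allreduces it,
# replacing the separate push and pull of the classic
# parameter-server KVStore. `grad` is the input; `agg_grad`
# receives the sum over all processes.
agg_grad = mx.nd.zeros(shape)
kv.pushpull('weight', grad, agg_grad)

# set_optimizer is not supported for dist_sync_mpi (item 8 above),
# so every worker applies the same local update; averaging the
# summed gradient keeps the step size independent of scale.
lr = 0.1
weight -= lr * agg_grad / kv.num_workers

print('rank %d of %d finished one step' % (kv.rank, kv.num_workers))

A script like this would be started with an MPI launcher (for example, mpirun -np 4 python train.py) so that one process runs per node.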

    

https://static.googleusercontent.com/media/research.google.com/zh-CN//archive/large_deep_networks_nips2012.pdf