Comment: close the proposal and PR

The PR is closed, and the Horovod solution will be used for distributed training.

The current implementation has been partially merged into the Horovod solution.

The new solution will therefore combine the advantages of both this PR and Horovod, which should benefit the community.

The Google doc is linked below; comments and edits are welcome:

https://docs.google.com/document/d/1e4anwDiS18cWP49FAghU6tqqdtnRKUcbNJJxvhIfvIA/edit#heading=h.t762l56r1094

...


https://github.com/apache/incubator-mxnet/pull/10696

 

Problem Statement

MXNet's inherent distributed training mechanism, the parameter server, provides efficient communication for asynchronous SGD (ASGD) and fault tolerance, which is especially valuable in cloud environments. However, we found that at small node counts (8-64), MPI allreduce can achieve close-to-linear scaling efficiency while requiring no extra server nodes to deploy. We therefore propose adding MPI allreduce as an alternative option for MXNet multi-node distributed training.
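
For illustration, below is a minimal sketch of the allreduce pattern described above, written with mpi4py and NumPy as assumptions for a standalone example; it is not the PR's actual MXNet integration. Every worker participates in a collective sum of its local gradient and then averages, so no dedicated server process is involved.

# Minimal sketch (assumed mpi4py + NumPy; not the PR's actual MXNet code):
# each rank contributes its local gradient to a collective elementwise
# sum via MPI allreduce, then divides by the number of workers.
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD

def allreduce_average(local_grad):
    # Collective call: every rank passes in local_grad and receives
    # the sum across all ranks in global_grad.
    global_grad = np.empty_like(local_grad)
    comm.Allreduce(local_grad, global_grad, op=MPI.SUM)
    return global_grad / comm.Get_size()

# Stand-in for a gradient computed locally on each worker.
local_grad = np.random.rand(4).astype(np.float32)
avg_grad = allreduce_average(local_grad)
print("rank %d averaged gradient: %s" % (comm.Get_rank(), avg_grad))

Launched with, for example, mpirun -np 4 python allreduce_sketch.py, all four workers end up with the same averaged gradient, which is the property the parameter server otherwise provides through a separate server tier.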

...