Problem Statement

MXNet's built-in distributed training mechanism, the parameter server, provides efficient communication for ASGD and fault tolerance, especially in cloud environments. However, we found that with a small number of nodes (8-64), MPI allreduce can achieve close-to-linear scaling efficiency without deploying any extra server nodes. We therefore propose adding MPI allreduce as an alternative option for users of MXNet multi-node distributed training.

...

Good doc as below and welcome to comment and modify:
