This PR is closed and the Horovod solution will be used for distributed training.

The current implementation has been partially merged into the Horovod solution.

The new solution therefore combines the advantages of both this PR and Horovod, which will benefit the community.

...

https://docs.google.com/document/d/1e4anwDiS18cWP49FAghU6tqqdtnRKUcbNJJxvhIfvIA/edit#heading=h.t762l56r1094

https://github.com/apache/incubator-mxnet/pull/10696

 

Problem Statement

MXNet's inherent distributed training mechanism, the parameter server, provides efficient communication for asynchronous SGD (ASGD) and fault tolerance, especially in cloud environments. However, we found that at a small scale of nodes (8-64), mpi allreduce can achieve scaling efficiency close to linear, with no extra server nodes to deploy. We therefore suggest adding mpi-allreduce as an alternative choice for customers in MXNet multi-node distributed training.
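
To make the comparison concrete, below is a minimal sketch (our own illustration using mpi4py, not code from this PR) of how synchronous SGD gradient aggregation works with MPI allreduce: every worker contributes its local gradient and receives the global sum, with no dedicated server process.

    # Minimal sketch of synchronous-SGD gradient averaging via MPI allreduce.
    # Illustrative only (mpi4py); not the actual MXNet dist_sync_mpi implementation.
    import numpy as np
    from mpi4py import MPI

    comm = MPI.COMM_WORLD
    num_workers = comm.Get_size()

    def allreduce_average(local_grad):
        """Sum gradients across all workers, then divide by the worker count."""
        global_grad = np.empty_like(local_grad)
        comm.Allreduce(local_grad, global_grad, op=MPI.SUM)
        return global_grad / num_workers

    # Every worker holds a local gradient of the same shape.
    local_grad = np.random.rand(1024).astype(np.float32)
    avg_grad = allreduce_average(local_grad)
    # Each worker now applies the identical averaged gradient to its model replica.

Launched with, e.g., mpirun -n 8 python sketch.py, this averages gradients across 8 workers each step without any server nodes.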

...

Machine: SKX6148, Network: 10GbE, Topology: VGG16, Local Batch Size: 64, KVStore Type: dist_sync

Parameter Server

worker num | server num                       | Per Node FPS (pic/s) | Scaling Efficiency
8          | 8 (worker and server share node) | 19.87                | 67.81%
8          | 8                                | 27.3                 | 93.17%
8          | 4                                | 22.7                 | 77.47%
8          | 2                                | 11.11                | 37.90%

(Scaling efficiency is per-node FPS relative to the single-node baseline, about 29.3 pic/s in this setup.)

Command line: python tools/launch.py -n 8 -s <server_num> --launcher ssh -H hosts python example/image-classification/train_vgg16.py --kv-store dist_sync


Following are the results for MXNet multi-node training with mpi allreduce support from our proof of concept (ready):

Node Num | Per Node FPS (pic/s) | Scaling Efficiency
8        | 27.76                | 94.74%

Command line: mpirun -n 8 -ppn 1 -machinefile hosts python example/image-classification/train_vgg16.py --kv-store dist_sync_mpi
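
On the training-script side, the only user-visible difference is the kv-store type. A minimal sketch follows (the dist_sync_mpi name comes from this proof of concept; the surrounding calls follow MXNet's standard kvstore API and are our assumption about the intended usage):

    # Sketch of the intended user-facing change (assumption based on the
    # dist_sync_mpi kvstore type used in the command line above).
    import mxnet as mx

    # Parameter server:        kv = mx.kvstore.create('dist_sync')
    # Proposed MPI allreduce:  kv = mx.kvstore.create('dist_sync_mpi')
    kv = mx.kvstore.create('dist_sync_mpi')

    # The rest of the training script stays unchanged, e.g.:
    # model.fit(train_data, eval_data, kvstore=kv, ...)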

...

Following is an allreduce benchmark (400M payload, 8 workers) for reference:

Method                      | server num | worker num | Time (s)
ParameterServer (push+pull) | 1          | 8          | 6.5
ParameterServer (push+pull) | 2          | 8          | 3.4
ParameterServer (push+pull) | 4          | 8          | 2.0
ParameterServer (push+pull) | 8          | 8          | 1.2
MPI.Allreduce               | N/A        | 8          | 1.0
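
For reference, such an allreduce micro-benchmark can be reproduced along the following lines (a hedged mpi4py sketch of our own; the payload size and timing method are assumptions, not the exact benchmark used above):

    # Rough allreduce micro-benchmark: 100M float32 elements (~400 MB payload),
    # timed between MPI barriers. Illustrative only.
    import time
    import numpy as np
    from mpi4py import MPI

    comm = MPI.COMM_WORLD
    payload = np.ones(100 * 1024 * 1024, dtype=np.float32)  # ~400 MB
    result = np.empty_like(payload)

    comm.Barrier()                      # align workers before timing
    start = time.time()
    comm.Allreduce(payload, result, op=MPI.SUM)
    comm.Barrier()                      # wait until every worker has finished
    elapsed = time.time() - start

    if comm.Get_rank() == 0:
        print("Allreduce of %d MB took %.3f s" % (payload.nbytes // 2**20, elapsed))

Run with mpirun -n 8 -ppn 1 -machinefile hosts python allreduce_bench.py to match the 8-worker setup above.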


From the performance data, we can draw the following conclusions:

...

Figure: PS-based distributed training (ref here)

It is no coincidence that in 2016-2017 both Baidu and Uber proposed mpi-allreduce-based distributed training frameworks for TensorFlow (tensorflow-allreduce and Horovod, respectively), given the drawbacks we mentioned in TensorFlow's inherent parameter-server-based distributed training.

...

Moreover, we noticed that most mainstream deep learning frameworks already have an mpi-allreduce-based distributed training mechanism:

Framework       | Distributed Communication Mechanism
Tensorflow      | PS + mpi-allreduce (Baidu tensorflow-allreduce, Uber Horovod)
MXNet           | PS
Caffe           | mpi-allreduce
Torch + PyTorch | mpi-allreduce
Chainer         | mpi-allreduce (mpi4py)

 

Goal

Besides the existing parameter-server mechanism in MXNet, we suggest adding mpi-allreduce as an alternative distributed training mechanism, which can significantly enhance multi-node scaling efficiency for synchronous SGD distributed training at minimal cost.

...