...

The Google doc is available below; comments and edits are welcome:

https://docs.google.com/document/d/1e4anwDiS18cWP49FAghU6tqqdtnRKUcbNJJxvhIfvIA/edit#heading=h.t762l56r1094

...

Machine: SKX6148, Network: 10GbE, Topology: VGG16, Local Batch Size: 64, KVStore Type: dist_sync

Parameter Server results:

worker num | server num                       | Per Node FPS (pic/s) | Scaling Efficiency
8          | 8 (worker and server share node) | 19.87                | 67.81%
8          | 8                                | 27.3                 | 93.17%
8          | 4                                | 22.7                 | 77.47%
8          | 2                                | 11.11                | 37.90%

Command line: python tools/launch.py -n 8 -s <server_num> --launcher ssh -H hosts python example/image-classification/train_vgg16.py --kv-store dist_sync
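
For reference, the hosts file passed to tools/launch.py with the ssh launcher is assumed here to simply list the participating nodes, one hostname or IP address per line (the node names below are placeholders):

    # hosts (example content; node names are hypothetical)
    node01
    node02
    node03
    node04
    node05
    node06
    node07
    node08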


The following is the result for MXNet multi-node training with MPI allreduce support from our proof-of-concept implementation (ready):

Node Num | Per Node FPS (pic/s) | Scaling Efficiency
8        | 27.76                | 94.74%

Command line: mpirun -n 8 -ppn 1 -machinefile hosts python example/image-classification/train_vgg16.py --kv-store dist_sync_mpi
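
As a sketch of the intended user-facing change, a training script would select the MPI-based store the same way it selects the existing parameter-server stores; note that the kvstore type dist_sync_mpi is the one proposed in this document, not part of the current MXNet release:

    import mxnet as mx

    # Proposed usage sketch: 'dist_sync_mpi' is the kvstore type suggested in
    # this proposal and does not exist in current MXNet releases.
    kv = mx.kvstore.create('dist_sync_mpi')

    # The rest of the training script is unchanged; gradient aggregation behind
    # this kvstore would go through MPI allreduce instead of the parameter server.
    print('worker %d of %d' % (kv.rank, kv.num_workers))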

...

The following allreduce benchmark (400M payload, 8 workers) is provided for reference:

Method                      | server num | worker num | Time (s)
ParameterServer (push+pull) | 1          | 8          | 6.5
ParameterServer (push+pull) | 2          | 8          | 3.4
ParameterServer (push+pull) | 4          | 8          | 2.0
ParameterServer (push+pull) | 8          | 8          | 1.2
MPI.Allreduce               | N/A        | 8          | 1.0
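
For context, an allreduce measurement of this kind can be reproduced with mpi4py (the same MPI binding Chainer builds on, see the framework table below). This is an illustrative sketch rather than the exact benchmark script used above, and it assumes the 400M payload means roughly 400 MB, modeled as 100M float32 values:

    # Illustrative allreduce microbenchmark (not the exact script used for the
    # numbers above). Run e.g.: mpirun -n 8 -ppn 1 -machinefile hosts python bench_allreduce.py
    import time
    import numpy as np
    from mpi4py import MPI

    comm = MPI.COMM_WORLD
    buf = np.ones(100 * 1024 * 1024, dtype=np.float32)  # ~400 MB payload

    comm.Barrier()
    start = time.time()
    comm.Allreduce(MPI.IN_PLACE, buf, op=MPI.SUM)
    comm.Barrier()
    if comm.rank == 0:
        print('allreduce of %.0f MB took %.2f s' % (buf.nbytes / 1e6, time.time() - start))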


From the performance data, we can draw the following conclusions:

...

Moreover, we noticed that most mainstream deep learning frameworks already have an mpi-allreduce-based distributed training mechanism:

Framework       | Distributed Communication Mechanism
TensorFlow      | PS + mpi-allreduce (baidu allreduce, uber horovod)
MXNet           | PS
Caffe           | mpi-allreduce
Torch + PyTorch | mpi-allreduce
Chainer         | mpi-allreduce (mpi4py)
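
The common pattern behind the mpi-allreduce entries in the table above is synchronous SGD in which every worker computes gradients on its local batch and the workers average them with a single allreduce. A minimal sketch of that update step with mpi4py and NumPy (placeholder arrays, not any specific framework's implementation):

    # Minimal sketch of allreduce-based synchronous SGD.
    import numpy as np
    from mpi4py import MPI

    comm = MPI.COMM_WORLD

    weights = np.zeros(1000, dtype=np.float32)
    lr = 0.01

    # Stand-in for the local gradient computed by backpropagation on this
    # worker's batch.
    local_grad = np.random.rand(1000).astype(np.float32)

    # Sum gradients across all workers with one allreduce, then average.
    comm.Allreduce(MPI.IN_PLACE, local_grad, op=MPI.SUM)
    local_grad /= comm.size

    # Every worker applies the same averaged update, so model replicas stay in sync.
    weights -= lr * local_grad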

 

Goal

In addition to the existing parameter-server mechanism for distributed training in MXNet, we propose adding mpi-allreduce as an alternative mechanism, which can significantly improve multi-node scaling efficiency for synchronous SGD distributed training at minimal cost.

...