...

The Google doc is available below; comments and edits are welcome:

https://docs.google.com/document/d/1e4anwDiS18cWP49FAghU6tqqdtnRKUcbNJJxvhIfvIA/edit#heading=h.t762l56r1094

...

Machine: SKX6148, Network: 10GbE, Topology: VGG16, Local Batch Size: 64, KVStore Type: dist_sync

Parameter Server results:

worker num | server num                       | Per Node FPS (pic/s) | Scaling Efficiency
8          | 8 (worker and server share node) | 19.87                | 67.81%
8          | 8                                | 27.3                 | 93.17%
8          | 4                                | 22.7                 | 77.47%
8          | 2                                | 11.11                | 37.90%

Command line: python tools/launch.py -n 8 -s <server_num> --launcher ssh -H hosts python example/image-classification/train_vgg16.py --kv-store dist_sync
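
For reference, the hosts file passed to tools/launch.py with the ssh launcher is assumed here to simply list the participating nodes, one hostname or IP address per line (the node names below are placeholders):

    # hosts (example content; node names are hypothetical)
    node01
    node02
    node03
    node04
    node05
    node06
    node07
    node08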


The following is the result for MXNet multi-node training with MPI allreduce support from our proof-of-concept implementation (ready):

Node Num | Per Node FPS (pic/s) | Scaling Efficiency
8        | 27.76                | 94.74%

Command line: mpirun -n 8 -ppn 1 -machinefile hosts python example/image-classification/train_vgg16.py --kv-store dist_sync_mpi
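
As a sketch of the intended user-facing change, a training script would select the MPI-based store the same way it selects the existing parameter-server stores; note that the kvstore type dist_sync_mpi is the one proposed in this document, not part of the current MXNet release:

    import mxnet as mx

    # Proposed usage sketch: 'dist_sync_mpi' is the kvstore type suggested in
    # this proposal and does not exist in current MXNet releases.
    kv = mx.kvstore.create('dist_sync_mpi')

    # The rest of the training script is unchanged; gradient aggregation behind
    # this kvstore would go through MPI allreduce instead of the parameter server.
    print('worker %d of %d' % (kv.rank, kv.num_workers))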

...

The following allreduce benchmark (400M payload, 8 workers) is provided for reference:

Method                      | server num | worker num | Time (s)
ParameterServer (push+pull) | 1          | 8          | 6.5
ParameterServer (push+pull) | 2          | 8          | 3.4
ParameterServer (push+pull) | 4          | 8          | 2.0
ParameterServer (push+pull) | 8          | 8          | 1.2
MPI.Allreduce               | N/A        | 8          | 1.0
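
For context, an allreduce measurement of this kind can be reproduced with mpi4py (the same MPI binding Chainer builds on, see the framework table below). This is an illustrative sketch rather than the exact benchmark script used above, and it assumes the 400M payload means roughly 400 MB, modeled as 100M float32 values:

    # Illustrative allreduce microbenchmark (not the exact script used for the
    # numbers above). Run e.g.: mpirun -n 8 -ppn 1 -machinefile hosts python bench_allreduce.py
    import time
    import numpy as np
    from mpi4py import MPI

    comm = MPI.COMM_WORLD
    buf = np.ones(100 * 1024 * 1024, dtype=np.float32)  # ~400 MB payload

    comm.Barrier()
    start = time.time()
    comm.Allreduce(MPI.IN_PLACE, buf, op=MPI.SUM)
    comm.Barrier()
    if comm.rank == 0:
        print('allreduce of %.0f MB took %.2f s' % (buf.nbytes / 1e6, time.time() - start))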


From the performance data, we can draw the following conclusions:

...

Moreover, we noticed that most mainstream deep learning frameworks already have an mpi-allreduce-based distributed training mechanism:

Framework       | Distributed Communication Mechanism
TensorFlow      | PS + mpi-allreduce (baidu allreduce, uber horovod)
MXNet           | PS
Caffe           | mpi-allreduce
Torch + PyTorch | mpi-allreduce
Chainer         | mpi-allreduce (mpi4py)
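
The common pattern behind the mpi-allreduce entries in the table above is synchronous SGD in which every worker computes gradients on its local batch and the workers average them with a single allreduce. A minimal sketch of that update step with mpi4py and NumPy (placeholder arrays, not any specific framework's implementation):

    # Minimal sketch of allreduce-based synchronous SGD.
    import numpy as np
    from mpi4py import MPI

    comm = MPI.COMM_WORLD

    weights = np.zeros(1000, dtype=np.float32)
    lr = 0.01

    # Stand-in for the local gradient computed by backpropagation on this
    # worker's batch.
    local_grad = np.random.rand(1000).astype(np.float32)

    # Sum gradients across all workers with one allreduce, then average.
    comm.Allreduce(MPI.IN_PLACE, local_grad, op=MPI.SUM)
    local_grad /= comm.size

    # Every worker applies the same averaged update, so model replicas stay in sync.
    weights -= lr * local_grad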

 

Goal

In addition to the existing parameter-server mechanism for distributed training in MXNet, we propose adding mpi-allreduce as an alternative mechanism, which can significantly improve multi-node scaling efficiency for synchronous SGD distributed training at minimal cost.

...