[Figure files: nccl.pdf, allreduce.pdf] (a) Ring algorithm used by NCCL; (b) Parameter server algorithm

...

Figure 2 explains the end-to-end performance results (see Figure 8): Parameter server is faster for networks that require many Push-Pulls over relatively small keys (e.g. ResNet-50 and Inception-48 need over 157 Push-Pulls on keys not exceeding 2M floats in size), whereas NCCL ring Reduce is faster for networks that need Push-Pulls over only a few large keys (e.g. VGG-16 and AlexNet need fewer than 32 Push-Pulls on keys that exceed 10M in size).

...

When to switch between Single and Multiple tree



Figure 7. VGG-16 performance as a function of MXNET_KVSTORE_BIGARRAY_BOUND using batch size 4 per GPU.
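
To make the role of the bound concrete, the following is a minimal Python sketch of a size-based switch. It is not MXNet's implementation. The helper names (read_bound, choose_tree_mode, sweep), the fallback value of 1,000,000 elements, and the direction of the rule (keys at or above the bound use multiple trees, smaller keys use a single tree) are all assumptions made for illustration.

```python
import os

# Fallback used by this sketch when the variable is unset (an assumption).
DEFAULT_BOUND = 1_000_000


def read_bound(env=None):
    """Read MXNET_KVSTORE_BIGARRAY_BOUND from a mapping (os.environ by default)."""
    env = os.environ if env is None else env
    return int(env.get("MXNET_KVSTORE_BIGARRAY_BOUND", DEFAULT_BOUND))


def choose_tree_mode(num_elements, bound):
    """Hypothetical rule: keys with at least `bound` elements use multiple trees."""
    return "multiple" if num_elements >= bound else "single"


def sweep(key_sizes, bounds):
    """For each candidate bound, list the mode chosen for every key size."""
    return {b: [choose_tree_mode(n, b) for n in key_sizes] for b in bounds}


if __name__ == "__main__":
    # Illustrative key sizes in elements, not measured from any real network.
    key_sizes = [64_000, 2_000_000, 100_000_000]
    for bound, modes in sweep(key_sizes, [1_000, 1_000_000, 10_000_000]).items():
        print(bound, modes)
```

Sweeping the bound as in the last lines mirrors the kind of experiment summarized in the figure: each candidate value changes which keys fall on either side of the switch.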

...


Network         Vs. Parameter Server (in comm.h)    Vs. NCCL (in kvstore_nccl.h)
Resnet-50       1.19                                1.33
VGG-16          5.89                                1.06
Inception-v3    1.15                                1.34
AlexNet         6.60                                1.42

Figure 8. End-to-end training results on synthetic data showing speed-up vs. NCCL on fp32 and fp16.

...