...

No additional third-party dependencies are required.

Performance Benchmark

TCP/IP Network Benchmark

To demonstrate MXNet+BytePS performance, we test two models: VGG16 (communication-intensive) and ResNet50 (computation-intensive). Both models are trained using fp32.

...

BytePS outperforms Horovod (NCCL) by 44% for ResNet50 and by 100% for VGG16.
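As a reading aid, the sketch below shows one common way such "outperforms by X%" figures are derived, assuming they are relative training throughput (for example, images per second). The metric is an assumption here, since this section does not state it. The function name and the throughput values are hypothetical and are not measurements from this benchmark.

```python
def relative_speedup_percent(candidate_throughput, baseline_throughput):
    """Return how much faster the candidate is than the baseline, in percent.

    Both arguments are throughputs in the same unit (e.g. images/sec).
    """
    if baseline_throughput <= 0:
        raise ValueError("baseline_throughput must be positive")
    return (candidate_throughput / baseline_throughput - 1.0) * 100.0


# Hypothetical throughputs chosen only to illustrate the arithmetic;
# they are NOT the measured values behind the numbers quoted above.
baseline = 100.0  # hypothetical baseline throughput
print(round(relative_speedup_percent(144.0, baseline), 1))  # a 1.44x ratio
print(round(relative_speedup_percent(200.0, baseline), 1))  # a 2x ratio
```

Under this reading, a 100% improvement means the candidate's throughput is twice the baseline's.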

RDMA Network Benchmark

We also test the BERT-large model using fp16 on an RDMA network. The model is implemented using the gluon-nlp toolkit.

We use Tesla V100 32GB GPUs and set the batch size to 64 per GPU. Each machine has 8 NVLink-connected V100 GPUs. The machines are interconnected by a 100 Gbps InfiniBand network.

With RDMA enabled for both, BytePS outperforms a carefully tuned Horovod by 16% in this case.

Limitations

BytePS currently has the following limitations:

...