Introduction

This page details benchmark results comparing MXNet 1.3.0 with and without MKL-DNN (integration proposal). The results show that MKL-DNN increases inference throughput by 6x to 37x and reduces latency by 2x to 41x, while accuracy is equivalent within an epsilon of 1e-8.

Inference Performance

These performance results were gathered on an AWS EC2 C5.18xlarge instance, using 1 socket (1 processor).

...

Using 2 sockets gives about a 2x throughput speedup, while latency remains constant.

Performance on Intel CPU with Intel MKL-DNN backend in release 1.3

    • w/o MKL-DNN, pip install mxnet==1.3.0

...

The c5.18xlarge instance offers a 2-socket Intel Xeon Platinum processor with 72 vCPUs.

$ export KMP_AFFINITY=granularity=fine,compact,1,0

$ export OMP_NUM_THREADS=18

$ numactl --physcpubind=0-17 --membind=0 python …
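Latency and throughput figures of the kind reported on this page can be collected with a simple timing loop. The following is a minimal sketch, not the script used for these results: `benchmark` and `dummy_forward` are hypothetical names, and the dummy workload stands in for a model's forward pass. With MXNet, the callable would also need to block until the output is ready (for example by calling `wait_to_read()` on the result), because execution is asynchronous.

```python
import time


def benchmark(forward, batch_size, warmup=5, runs=20):
    """Time a callable and return (latency_ms, throughput_fps).

    `forward` is assumed to process one batch of `batch_size` samples
    and to return only after the result is fully computed.
    """
    # Warm-up iterations are excluded from the measurement.
    for _ in range(warmup):
        forward()

    start = time.perf_counter()
    for _ in range(runs):
        forward()
    elapsed = time.perf_counter() - start

    latency_ms = elapsed / runs * 1000.0          # average time per batch
    throughput_fps = batch_size * runs / elapsed  # samples per second
    return latency_ms, throughput_fps


def dummy_forward():
    # Hypothetical stand-in for a model forward pass.
    return sum(i * i for i in range(10000))


if __name__ == "__main__":
    latency, _ = benchmark(dummy_forward, batch_size=1)
    _, fps = benchmark(dummy_forward, batch_size=128)
    print("latency (ms): %.3f, throughput (fps): %.1f" % (latency, fps))
```

Latency is read from a run with batch size 1 and throughput from a run with a large batch size, matching the two column groups in the results table.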


Latency is measured at batchsize=1 (ms, smaller is better). Throughput is measured at batchsize=128 unless noted in the cell (fps, bigger is better). Columns marked w/o and w/ are without and with MKL-DNN.

Category              Model                Latency w/o  Latency w/  Latency speedup  Throughput w/o         Throughput w/          Throughput speedup
CNN/classification    ResNet-50 v1         97.19        13.04       7.45             10.29                  163.52                 15.90
CNN/classification    ResNet-50 v2         98.69        13.02       7.58             9.94                   154.17                 15.51
CNN/classification    Inception v3         175.17       16.77       10.44            5.74                   135.33                 23.57
CNN/classification    Inception v4         330.93       31.40       10.54            3.04                   69.60                  22.87
CNN/classification    DenseNet             111.66       18.90       5.91             8.52                   149.88                 17.60
CNN/classification    MobileNet            38.56        4.42        8.73             24.87                  512.25                 20.60
CNN/classification    VGG16                406.50       20.07       20.25            2.91                   70.84                  24.31
CNN/classification    AlexNet              64.60        3.80        17.00            26.58                  965.20                 36.32
CNN/classification    inception-resnet v2  181.10       49.40       3.67             5.48                   82.97                  15.14
CNN/object detection  Faster R-CNN         1175.74      118.62      9.91             0.85                   8.57                   10.08
CNN/object detection  SSD-VGG16            721.03       47.62       15.14            1.43 (batchsize=224)   28.90 (batchsize=224)  19.13
CNN/object detection  SSD-MobileNet        239.40       28.33       8.45             4.07 (batchsize=256)   69.97 (batchsize=256)  14.18
RNN                   GNMT                 683.43       94.00       7.27             1.46 (batchsize=64)    10.63 (batchsize=64)   6.83
GAN                   DCGAN                8.94         0.24        37.85            109.13                 4249.36                38.94

Performance on AMD CPU with Intel MKL-DNN backend in release 1.3

The m5a.24xlarge instance offers 96 vCPUs using AMD EPYC processors (AVX2).

Throughput is measured at batchsize=32 (fps, bigger is better).

Category              Model                w/o MKL-DNN  w/ MKL-DNN  speedup
CNN/classification    ResNet-50 v1         2.44         38.57       15.8x
CNN/classification    MobileNet            5.03         194.7       38.7x

  • Performance gain from operator fusion by subgraph

Columns: latency at batchsize=1 (ms, smaller is better) and throughput at batchsize=128 (fps, bigger is better), each given for R1.3 w/ MKL-DNN, master w/ subgraph, and speedup.

Category              Model
CNN/classification    ResNet-50 v1, ResNet-50 v2, Inception v3, Inception v4, DenseNet, MobileNet, VGG16, AlexNet, inception-resnet v2
CNN/object detection  Faster R-CNN, SSD-VGG16, SSD-MobileNet
RNN                   GNMT
GAN                   DCGAN

...
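The speedup columns in the performance tables are ratios of the two measurements. As a small sketch (the function names are illustrative and not taken from the benchmark scripts), latency speedup divides the baseline by the MKL-DNN value because lower is better, while throughput speedup divides the MKL-DNN value by the baseline because higher is better:

```python
def latency_speedup(baseline_ms, mkldnn_ms):
    """Lower latency is better, so the baseline goes in the numerator."""
    return baseline_ms / mkldnn_ms


def throughput_speedup(baseline_fps, mkldnn_fps):
    """Higher throughput is better, so the MKL-DNN value goes in the numerator."""
    return mkldnn_fps / baseline_fps


if __name__ == "__main__":
    # ResNet-50 v1 values from the Intel CPU results.
    print(round(latency_speedup(97.19, 13.04), 2))
    print(round(throughput_speedup(10.29, 163.52), 2))
```

Ratios recomputed from the rounded table entries can differ from the published speedup in the last digit, presumably because the published values were derived from unrounded measurements.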

Inference Accuracy

The c5.18xlarge instance offers a 2-socket Intel Xeon Platinum processor with 72 vCPUs.

The models are taken from the Gluon model zoo with pre-trained parameters. Top-1 and top-5 accuracy are verified with the MKL-DNN backend.

As the table below shows, MXNet 1.3 produces exactly the same accuracy without and with MKL-DNN, within a tolerance of 1e-8.

Note: The dataset used is ImageNet1k; valdata/ is generated by imagenet1k-val.sh.

Inference Accuracy Comparison

Columns marked w/o and w/ give CPU accuracy without and with the MKL-DNN backend. Delta is the difference between the two.

Alias             Network                          # Parameters  w/o top1    w/o top5    w/ top1     w/ top5     Delta top1  Delta top5
alexnet           AlexNet                          61,100,840    0.56312500  0.78992188  0.56312500  0.78992188  0.00000000  0.00000000
densenet121       DenseNet-121                     8,062,504     0.74203125  0.91929688  0.74203125  0.91929688  0.00000000  0.00000000
densenet161       DenseNet-161                     28,900,936    0.77195313  0.93390625  0.77195313  0.93390625  0.00000000  0.00000000
densenet169       DenseNet-169                     14,307,880    0.75710938  0.92828125  0.75710938  0.92828125  0.00000000  0.00000000
densenet201       DenseNet-201                     20,242,984    0.76906250  0.93093750  0.76906250  0.93093750  0.00000000  0.00000000
inceptionv3       Inception V3 299x299             23,869,000    0.77609375  0.93664063  0.77609375  0.93664063  0.00000000  0.00000000
mobilenet0.25     MobileNet 0.25                   475,544       0.51039063  0.75687500  0.51039063  0.75687500  0.00000000  0.00000000
mobilenet0.5      MobileNet 0.5                    1,342,536     0.61851563  0.83789063  0.61851563  0.83789063  0.00000000  0.00000000
mobilenet0.75     MobileNet 0.75                   2,601,976     0.66546875  0.87070313  0.66546875  0.87070313  0.00000000  0.00000000
mobilenet1.0      MobileNet 1.0                    4,253,864     0.70093750  0.89109375  0.70093750  0.89109375  0.00000000  0.00000000
mobilenetv2_1.0   MobileNetV2 1.0                  3,539,136     0.69976563  0.89281250  0.69976563  0.89281250  0.00000000  0.00000000
mobilenetv2_0.75  MobileNetV2 0.75                 2,653,864     0.68210938  0.88007813  0.68210938  0.88007813  0.00000000  0.00000000
mobilenetv2_0.5   MobileNetV2 0.5                  1,983,104     0.64453125  0.84929688  0.64453125  0.84929688  0.00000000  0.00000000
mobilenetv2_0.25  MobileNetV2 0.25                 1,526,856     0.50890625  0.74546875  0.50890625  0.74546875  0.00000000  0.00000000
resnet18_v1       ResNet-18 V1                     11,699,112    0.70812500  0.89453125  0.70812500  0.89453125  0.00000000  0.00000000
resnet34_v1       ResNet-34 V1                     21,814,696    0.73960938  0.91609375  0.73960938  0.91609375  0.00000000  0.00000000
resnet50_v1       ResNet-50 V1                     25,629,032    0.76062500  0.93046875  0.76062500  0.93046875  0.00000000  0.00000000
resnet101_v1      ResNet-101 V1                    44,695,144    0.77937500  0.93617188  0.77937500  0.93617188  0.00000000  0.00000000
resnet152_v1      ResNet-152 V1                    60,404,072    0.78320313  0.93867188  0.78320313  0.93867188  0.00000000  0.00000000
resnet18_v2       ResNet-18 V2                     11,695,796    0.71046875  0.89671875  0.71046875  0.89671875  0.00000000  0.00000000
resnet34_v2       ResNet-34 V2                     21,811,380    0.74085938  0.91578125  0.74085938  0.91578125  0.00000000  0.00000000
resnet50_v2       ResNet-50 V2                     25,595,060    0.76750000  0.93187500  0.76750000  0.93187500  0.00000000  0.00000000
resnet101_v2      ResNet-101 V2                    44,639,412    0.78125000  0.94015625  0.78125000  0.94015625  0.00000000  0.00000000
resnet152_v2      ResNet-152 V2                    60,329,140    0.78554688  0.94140625  0.78554688  0.94140625  0.00000000  0.00000000
squeezenet1.0     SqueezeNet 1.0                   1,248,424     0.57273438  0.79554688  0.57273438  0.79554688  0.00000000  0.00000000
squeezenet1.1     SqueezeNet 1.1                   1,235,496     0.57023438  0.79601563  0.57023438  0.79601563  0.00000000  0.00000000
vgg11             VGG-11                           132,863,336   0.67062500  0.87531250  0.67062500  0.87531250  0.00000000  0.00000000
vgg13             VGG-13                           133,047,848   0.68132813  0.87984375  0.68132813  0.87984375  0.00000000  0.00000000
vgg16             VGG-16                           138,357,544   0.72062500  0.90585938  0.72062500  0.90585938  0.00000000  0.00000000
vgg19             VGG-19                           143,667,240   0.73468750  0.91000000  0.73468750  0.91000000  0.00000000  0.00000000
vgg11_bn          VGG-11 with batch normalization  132,874,344   0.68953125  0.88882813  0.68953125  0.88882813  0.00000000  0.00000000
vgg13_bn          VGG-13 with batch normalization  133,059,624   0.69835938  0.88953125  0.69835938  0.88953125  0.00000000  0.00000000
vgg16_bn          VGG-16 with batch normalization  138,374,440   0.72226563  0.90390625  0.72226563  0.90390625  0.00000000  0.00000000
vgg19_bn          VGG-19 with batch normalization  143,689,256   0.72992188  0.90992188  0.72992188  0.90992188  0.00000000  0.00000000
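The accuracy check amounts to computing top-1 and top-5 accuracy for each backend and confirming that the two values agree within the stated tolerance. The sketch below illustrates that comparison in plain Python on toy data. The function names and the sample scores are hypothetical; the real evaluation runs the Gluon models over the ImageNet1k validation set.

```python
def topk_accuracy(scores, labels, k):
    """Fraction of samples whose true label is among the k highest scores."""
    hits = 0
    for row, label in zip(scores, labels):
        # Indices of the k largest scores for this sample.
        topk = sorted(range(len(row)), key=lambda i: row[i], reverse=True)[:k]
        if label in topk:
            hits += 1
    return hits / len(labels)


def same_accuracy(acc_a, acc_b, eps=1e-8):
    """True when two accuracy values agree within the tolerance eps."""
    return abs(acc_a - acc_b) <= eps


if __name__ == "__main__":
    # Toy class scores for three samples; a stand-in for real model outputs.
    scores = [[0.1, 0.7, 0.2], [0.5, 0.3, 0.2], [0.2, 0.3, 0.5]]
    labels = [1, 1, 2]
    top1 = topk_accuracy(scores, labels, k=1)
    top2 = topk_accuracy(scores, labels, k=2)
    print(top1, top2, same_accuracy(top1, top1))
```

Running the same evaluation once per backend and passing the two results to `same_accuracy` yields the kind of zero delta listed in the table.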


Commands for Reproducing the Results

Please access the script and model from the link below.

https://drive.google.com/open?id=17JenLnZKsmPoZIIyktINFfMjZtDY2Ehc 

(Note: select the parent folder and click download in the drop-down menu)

Refer to launch_benchmark_aws.sh to reproduce the results.