Introduction

This page details benchmark results comparing MXNet 1.3.0 with and without MKL-DNN (integration proposal). The results show that MKL-DNN boosts inference throughput by 6x to 37x and reduces latency by 2x to 41x, while accuracy is equivalent up to an epsilon of 1e-8.

Inference Performance

The performance data in this section was gathered on AWS EC2 c5.18xlarge instances, using 1 socket and 1 processor.

For throughput, using 2 sockets provides about a 2x speedup, while latency stays constant.

Performance

...

on Intel CPU with Intel MKL-DNN backend in release 1.3

    • w/o MKL-DNN, pip install mxnet==1.3.0

...

The c5.18xlarge instance offers a 2-socket Intel Xeon Platinum processor with 72 vCPUs.

$ export KMP_AFFINITY=granularity=fine,compact,1,0

$ export OMP_NUM_THREADS=18

$ numactl --physcpubind=0-17 --membind=0 python …
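The latency and throughput figures reported in this section can be thought of as two views of the same timing loop. The following is a minimal sketch of that idea, not the benchmark script itself: the helper names (`time_forward`, `latency_ms`, `throughput_fps`), the warm-up and iteration counts, and the `fake_forward` stand-in are all assumptions made for illustration.

```python
import time


def time_forward(forward, batch_size, warmup=5, iters=20):
    """Return the mean wall-clock seconds per call of forward(batch_size)."""
    for _ in range(warmup):  # warm-up runs are executed but not timed
        forward(batch_size)
    start = time.perf_counter()
    for _ in range(iters):
        forward(batch_size)
    return (time.perf_counter() - start) / iters


def latency_ms(seconds_per_batch):
    """Latency in milliseconds for one batch (the tables use batchsize=1)."""
    return seconds_per_batch * 1000.0


def throughput_fps(seconds_per_batch, batch_size):
    """Images (frames) processed per second at the given batch size."""
    return batch_size / seconds_per_batch


if __name__ == "__main__":
    # Hypothetical stand-in for a real model: sleeps 1 ms per image.
    def fake_forward(batch_size):
        time.sleep(0.001 * batch_size)

    t1 = time_forward(fake_forward, batch_size=1)
    t8 = time_forward(fake_forward, batch_size=8)
    print("latency (ms): %.2f" % latency_ms(t1))
    print("throughput (fps): %.2f" % throughput_fps(t8, 8))
```

With MXNet, the callable passed as `forward` would need to block until the asynchronous engine has finished (for example by calling `mx.nd.waitall()`) before returning; otherwise the loop would time only the enqueueing of work.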


Latency: batchsize=1 (ms, smaller is better). Throughput: batchsize=128 unless noted otherwise (fps, bigger is better).

| Category | Model | Latency w/o MKL-DNN | Latency w/ MKL-DNN | Latency speedup | Throughput w/o MKL-DNN | Throughput w/ MKL-DNN | Throughput speedup |
|---|---|---|---|---|---|---|---|
| CNN/classification | ResNet-50 v1 | 97.19 | 13.04 | 7.45 | 10.29 | 163.52 | 15.90 |
| | ResNet-50 v2 | 98.69 | 13.02 | 7.58 | 9.94 | 154.17 | 15.51 |
| | Inception v3 | 175.17 | 16.77 | 10.44 | 5.74 | 135.33 | 23.57 |
| | Inception v4 | 330.93 | 31.40 | 10.54 | 3.04 | 69.60 | 22.87 |
| | DenseNet | 111.66 | 18.90 | 5.91 | 8.52 | 149.88 | 17.60 |
| | MobileNet | 38.56 | 4.42 | 8.73 | 24.87 | 512.25 | 20.60 |
| | VGG16 | 406.50 | 20.07 | 20.25 | 2.91 | 70.84 | 24.31 |
| | AlexNet | 64.60 | 3.80 | 17.00 | 26.58 | 965.20 | 36.32 |
| | inception-resnet v2 | 181.10 | 49.40 | 3.67 | 5.48 | 82.97 | 15.14 |
| CNN/object detection | Faster R-CNN | 1175.74 | 118.62 | 9.91 | 0.85 | 8.57 | 10.08 |
| | SSD-VGG16 | 721.03 | 47.62 | 15.14 | 1.43 (batchsize=224) | 28.90 (batchsize=224) | 19.13 |
| | SSD-MobileNet | 239.40 | 28.33 | 8.45 | 4.07 (batchsize=256) | 57.73 (batchsize=256) | 14.18 |
| RNN | GNMT | 683.43 | 94.00 | 7.27 | 1.46 (batchsize=64) | 10.63 (batchsize=64) | 6.83 |
| GAN | DCGAN | 8.94 | 0.24 | 37.85 | 109.13 | 4249.36 | 38.94 |

  • Performance gain from operator fusion by subgraph
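The speedup columns are ratios of the two measurements in each group. As a small worked sketch (the function names are illustrative assumptions, not part of the benchmark scripts), latency speedup divides the time without MKL-DNN by the time with it, while throughput speedup divides the other way round, since a higher fps is better:

```python
def latency_speedup(ms_without, ms_with):
    """How many times lower the latency is with MKL-DNN enabled."""
    return ms_without / ms_with


def throughput_speedup(fps_without, fps_with):
    """How many times higher the throughput is with MKL-DNN enabled."""
    return fps_with / fps_without


if __name__ == "__main__":
    # ResNet-50 v1 figures from the Intel CPU results above.
    print(round(latency_speedup(97.19, 13.04), 2))      # 7.45
    print(round(throughput_speedup(10.29, 163.52), 2))  # 15.89
```

Recomputing from the rounded values in the table can differ from the published speedup in the last digit (15.89 here versus 15.90 in the table), presumably because the published ratios were derived from unrounded measurements.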

Performance on AMD CPU with Intel MKL-DNN backend in release 1.3

The m5a.24xlarge instance offers 96 vCPUs using AMD EPYC processors (AVX2).


Throughput batchsize=32 (fps, bigger is better)

| Category | Model | w/o MKL-DNN | w/ MKL-DNN | speedup |
|---|---|---|---|---|
| CNN/classification | ResNet-50 v1 | ... | ... | ... |
| ... | ... | 2.44 | 38.57 | x15.8 |
| | MobileNet | 5.03 | 194.7 | x38.7 |

Inference Accuracy

The c5.18xlarge instance offers a 2-socket Intel Xeon Platinum processor with 72 vCPUs.

The models are taken from the Gluon model zoo with pre-trained parameters. The top-1 and top-5 accuracy are verified with the MKL-DNN backend.

As the table below shows, MXNet 1.3 produces exactly the same accuracy with and without MKL-DNN, to within 10e-8.

Note: The dataset used is ImageNet1k; the contents of valdata/ are generated by imagenet1k-val.sh.

Inference Accuracy Comparison

| Alias | Network | # Parameters | CPU (without MKL-DNN) top1 | CPU (without MKL-DNN) top5 | CPU (with MKL-DNN) top1 | CPU (with MKL-DNN) top5 | Delta top1 | Delta top5 |
|---|---|---|---|---|---|---|---|---|
| alexnet | AlexNet | 61,100,840 | 0.56312500 | 0.78992188 | 0.56312500 | 0.78992188 | 0.00000000 | 0.00000000 |
| densenet121 | DenseNet-121 | 8,062,504 | 0.74203125 | 0.91929688 | 0.74203125 | 0.91929688 | 0.00000000 | 0.00000000 |
| densenet161 | DenseNet-161 | 28,900,936 | 0.77195313 | 0.93390625 | 0.77195313 | 0.93390625 | 0.00000000 | 0.00000000 |
| densenet169 | DenseNet-169 | 14,307,880 | 0.75710938 | 0.92828125 | 0.75710938 | 0.92828125 | 0.00000000 | 0.00000000 |
| densenet201 | DenseNet-201 | 20,242,984 | 0.76906250 | 0.93093750 | 0.76906250 | 0.93093750 | 0.00000000 | 0.00000000 |
| inceptionv3 | Inception V3 299x299 | 23,869,000 | 0.77609375 | 0.93664063 | 0.77609375 | 0.93664063 | 0.00000000 | 0.00000000 |
| mobilenet0.25 | MobileNet 0.25 | 475,544 | 0.51039063 | 0.75687500 | 0.51039063 | 0.75687500 | 0.00000000 | 0.00000000 |
| mobilenet0.5 | MobileNet 0.5 | 1,342,536 | 0.61851563 | 0.83789063 | 0.61851563 | 0.83789063 | 0.00000000 | 0.00000000 |
| mobilenet0.75 | MobileNet 0.75 | 2,601,976 | 0.66546875 | 0.87070313 | 0.66546875 | 0.87070313 | 0.00000000 | 0.00000000 |
| mobilenet1.0 | MobileNet 1.0 | 4,253,864 | 0.70093750 | 0.89109375 | 0.70093750 | 0.89109375 | 0.00000000 | 0.00000000 |
| mobilenetv2_1.0 | MobileNetV2 1.0 | 3,539,136 | 0.69976563 | 0.89281250 | 0.69976563 | 0.89281250 | 0.00000000 | 0.00000000 |
| mobilenetv2_0.75 | MobileNetV2 0.75 | 2,653,864 | 0.68210938 | 0.88007813 | 0.68210938 | 0.88007813 | 0.00000000 | 0.00000000 |
| mobilenetv2_0.5 | MobileNetV2 0.5 | 1,983,104 | 0.64453125 | 0.84929688 | 0.64453125 | 0.84929688 | 0.00000000 | 0.00000000 |
| mobilenetv2_0.25 | MobileNetV2 0.25 | 1,526,856 | 0.50890625 | 0.74546875 | 0.50890625 | 0.74546875 | 0.00000000 | 0.00000000 |
| resnet18_v1 | ResNet-18 V1 | 11,699,112 | 0.70812500 | 0.89453125 | 0.70812500 | 0.89453125 | 0.00000000 | 0.00000000 |
| resnet34_v1 | ResNet-34 V1 | 21,814,696 | 0.73960938 | 0.91609375 | 0.73960938 | 0.91609375 | 0.00000000 | 0.00000000 |
| resnet50_v1 | ResNet-50 V1 | 25,629,032 | 0.76062500 | 0.93046875 | 0.76062500 | 0.93046875 | 0.00000000 | 0.00000000 |
| resnet101_v1 | ResNet-101 V1 | 44,695,144 | 0.77937500 | 0.93617188 | 0.77937500 | 0.93617188 | 0.00000000 | 0.00000000 |
| resnet152_v1 | ResNet-152 V1 | 60,404,072 | 0.78320313 | 0.93867188 | 0.78320313 | 0.93867188 | 0.00000000 | 0.00000000 |
| resnet18_v2 | ResNet-18 V2 | 11,695,796 | 0.71046875 | 0.89671875 | 0.71046875 | 0.89671875 | 0.00000000 | 0.00000000 |
| resnet34_v2 | ResNet-34 V2 | 21,811,380 | 0.74085938 | 0.91578125 | 0.74085938 | 0.91578125 | 0.00000000 | 0.00000000 |
| resnet50_v2 | ResNet-50 V2 | 25,595,060 | 0.76750000 | 0.93187500 | 0.76750000 | 0.93187500 | 0.00000000 | 0.00000000 |
| resnet101_v2 | ResNet-101 V2 | 44,639,412 | 0.78125000 | 0.94015625 | 0.78125000 | 0.94015625 | 0.00000000 | 0.00000000 |
| resnet152_v2 | ResNet-152 V2 | 60,329,140 | 0.78554688 | 0.94140625 | 0.78554688 | 0.94140625 | 0.00000000 | 0.00000000 |
| squeezenet1.0 | SqueezeNet 1.0 | 1,248,424 | 0.57273438 | 0.79554688 | 0.57273438 | 0.79554688 | 0.00000000 | 0.00000000 |
| squeezenet1.1 | SqueezeNet 1.1 | 1,235,496 | 0.57023438 | 0.79601563 | 0.57023438 | 0.79601563 | 0.00000000 | 0.00000000 |
| vgg11 | VGG-11 | 132,863,336 | 0.67062500 | 0.87531250 | 0.67062500 | 0.87531250 | 0.00000000 | 0.00000000 |
| vgg13 | VGG-13 | 133,047,848 | 0.68132813 | 0.87984375 | 0.68132813 | 0.87984375 | 0.00000000 | 0.00000000 |
| vgg16 | VGG-16 | 138,357,544 | 0.72062500 | 0.90585938 | 0.72062500 | 0.90585938 | 0.00000000 | 0.00000000 |
| vgg19 | VGG-19 | 143,667,240 | 0.73468750 | 0.91000000 | 0.73468750 | 0.91000000 | 0.00000000 | 0.00000000 |
| vgg11_bn | VGG-11 with batch normalization | 132,874,344 | 0.68953125 | 0.88882813 | 0.68953125 | 0.88882813 | 0.00000000 | 0.00000000 |
| vgg13_bn | VGG-13 with batch normalization | 133,059,624 | 0.69835938 | 0.88953125 | 0.69835938 | 0.88953125 | 0.00000000 | 0.00000000 |
| vgg16_bn | VGG-16 with batch normalization | 138,374,440 | 0.72226563 | 0.90390625 | 0.72226563 | 0.90390625 | 0.00000000 | 0.00000000 |
| vgg19_bn | VGG-19 with batch normalization | 143,689,256 | 0.72992188 | 0.90992188 | 0.72992188 | 0.90992188 | 0.00000000 | 0.00000000 |
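For readers unfamiliar with the metrics, the sketch below shows one way top-k accuracy and the epsilon comparison between two backends could be computed. It is an illustration under stated assumptions, not the validation script used for this page: the helper names `topk_accuracy` and `same_accuracy` are hypothetical, the sample scores are made up, and the default tolerance of 1e-8 is the epsilon quoted in the introduction.

```python
def topk_accuracy(scores, labels, k):
    """Fraction of samples whose true label is among the k highest scores."""
    hits = 0
    for row, label in zip(scores, labels):
        # Class indices ordered from highest to lowest score, truncated to k.
        topk = sorted(range(len(row)), key=lambda i: row[i], reverse=True)[:k]
        if label in topk:
            hits += 1
    return hits / len(labels)


def same_accuracy(acc_a, acc_b, eps=1e-8):
    """True if two accuracy values agree to within eps."""
    return abs(acc_a - acc_b) <= eps


if __name__ == "__main__":
    # Made-up example: 3 samples, 4 classes (not real benchmark data).
    scores = [
        [0.1, 0.6, 0.2, 0.1],
        [0.5, 0.1, 0.3, 0.1],
        [0.3, 0.1, 0.1, 0.5],
    ]
    labels = [1, 2, 0]
    top1 = topk_accuracy(scores, labels, k=1)
    top2 = topk_accuracy(scores, labels, k=2)
    print(top1, top2)
    # Compare the resnet50_v1 top-1 values reported for both backends.
    print(same_accuracy(0.76062500, 0.76062500))
```

In a real comparison, `scores` would be the model outputs from each backend on the same validation images, and `same_accuracy` would be applied to the resulting top-1 and top-5 values.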


Commands for Reproducing the Results

Please access the script and model from the link below.

https://drive.google.com/open?id=17JenLnZKsmPoZIIyktINFfMjZtDY2Ehc 

(Note: select the parent folder and click download in the drop-down menu)

Refer to launch_benchmark_aws.sh to reproduce the results.