Introduction

This page details benchmark results comparing MXNet 1.3.0 with and without MKL-DNN (integration proposal). The results show that MKL-DNN boosts inference throughput by between 6x and 37x and reduces latency by between 2x and 41x, while accuracy is equivalent up to an epsilon of 1e-8.

Inference Performance

These performance results were gathered on AWS EC2 c5.18xlarge instances, using 1 socket (1 processor).

For throughput, using 2 sockets provides about a 2x speedup, while latency stays constant.
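The two metrics reported on this page, latency at batch size 1 and throughput at a larger batch size, can be measured with a simple timing loop. The following is a minimal sketch, not the script used for these results: `measure` and `dummy_model` are hypothetical names, and `dummy_model` is only a stand-in workload for a real forward pass.

```python
import time

def measure(run_batch, batch_size, iterations=20, warmup=5):
    """Time run_batch(batch_size) and return (latency_ms, throughput_fps)."""
    for _ in range(warmup):
        run_batch(batch_size)          # warm-up runs are excluded from timing
    start = time.perf_counter()
    for _ in range(iterations):
        run_batch(batch_size)
    elapsed = time.perf_counter() - start
    latency_ms = elapsed / iterations * 1000.0        # time per batch
    throughput_fps = batch_size * iterations / elapsed  # samples per second
    return latency_ms, throughput_fps

def dummy_model(batch_size):
    # Hypothetical stand-in for a forward pass; cost grows with batch size.
    return sum(i * i for i in range(2000 * batch_size))

latency_ms, _ = measure(dummy_model, batch_size=1)
_, throughput_fps = measure(dummy_model, batch_size=128, iterations=5, warmup=1)
print(f"latency: {latency_ms:.3f} ms, throughput: {throughput_fps:.1f} fps")
```

MXNet executes operators asynchronously, so a real benchmark would also need to wait for the results (for example with `mx.nd.waitall()`) before stopping the clock.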

Performance

...

on Intel CPU with Intel MKL-DNN backend in release 1.3

- w/o MKL-DNN, pip install mxnet==1.3.0

...

The c5.18xlarge instance offers a 2-socket Intel Xeon Platinum processor with 72 vCPUs.

$ export KMP_AFFINITY=granularity=fine,compact,1,0

$ export OMP_NUM_THREADS=18

$ numactl --physcpubind=0-17 --membind=0 python …
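The same settings can be applied from Python when launching a benchmark as a child process. This is a sketch only: `pinned_run` is a hypothetical helper, and `benchmark.py` is a placeholder because the script name is elided in the command above. It assumes the cores of the first socket are numbered from 0, as the `--physcpubind=0-17` argument suggests.

```python
import os

def pinned_run(script, cores=18, numa_node=0, base_env=None):
    """Return (argv, env) reproducing the exports and numactl call above."""
    env = dict(os.environ if base_env is None else base_env)
    env["KMP_AFFINITY"] = "granularity=fine,compact,1,0"
    env["OMP_NUM_THREADS"] = str(cores)
    argv = [
        "numactl",
        f"--physcpubind=0-{cores - 1}",   # pin to the first `cores` CPUs
        f"--membind={numa_node}",         # allocate memory on the same node
        "python",
        script,
    ]
    return argv, env

argv, env = pinned_run("benchmark.py")  # hypothetical script name
print(" ".join(argv))
# To launch: subprocess.run(argv, env=env, check=True)
```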

Category	Model	Latency batchsize=1 (ms, lower is better)			Throughput batchsize=128 (fps, higher is better)		
Category	Model	w/o MKL-DNN	w/ MKL-DNN	speedup	w/o MKL-DNN	w/ MKL-DNN	speedup
CNN/classification	ResNet-50 v1	97.19	13.04	7.45	10.29	163.52	15.90
CNN/classification	ResNet-50 v2	98.69	13.02	7.58	9.94	154.17	15.51
CNN/classification	Inception v3	175.17	16.77	10.44	5.74	135.33	23.57
CNN/classification	Inception v4	330.93	31.40	10.54	3.04	69.60	22.87
CNN/classification	DenseNet	111.66	18.90	5.91	8.52	149.88	17.60
CNN/classification	MobileNet	38.56	4.42	8.73	24.87	512.25	20.60
CNN/classification	VGG16	406.50	20.07	20.25	2.91	70.84	24.31
CNN/classification	AlexNet	64.60	3.80	17.00	26.58	965.20	36.32
CNN/classification	inception-resnet v2	181.10	49.40	3.67	5.48	82.97	15.14
CNN/object detection	Faster R-CNN	1175.74	118.62	9.91	0.85	8.57	10.08
CNN/object detection	SSD-VGG16	721.03	47.62	15.14	1.43 (batchsize=224)	28.90 (batchsize=224)	19.13
CNN/object detection	SSD-MobileNet	239.40	28.33	8.45	4.07 (batchsize=256)	57.73 (batchsize=256)	14.18
RNN	GNMT	683.43	94.00	7.27	1.46 (batchsize=64)	10.63 (batchsize=64)	6.83
GAN	DCGAN	8.94	0.24	37.85	109.13	4249.36	38.20
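The speedup columns follow directly from the measured values: for latency, where lower is better, the baseline is divided by the MKL-DNN figure; for throughput, where higher is better, the ratio is inverted. A small sketch using the ResNet-50 v1 figures (the function names are illustrative only):

```python
def latency_speedup(baseline_ms, optimized_ms):
    """Lower latency is better, so the baseline goes in the numerator."""
    return baseline_ms / optimized_ms

def throughput_speedup(baseline_fps, optimized_fps):
    """Higher throughput is better, so the optimized value goes in the numerator."""
    return optimized_fps / baseline_fps

# ResNet-50 v1: latency 97.19 ms -> 13.04 ms, throughput 10.29 fps -> 163.52 fps
print(round(latency_speedup(97.19, 13.04), 2))      # 7.45
print(round(throughput_speedup(10.29, 163.52), 1))  # 15.9
```

Ratios recomputed from the rounded table entries can differ from the published speedup in the last digit.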

Performance gain from operator fusion by subgraph

- R1.3 w/ MKL-DNN, pip install mxnet-mkl==1.3.0
- master w/ subgraph, CI https://github.com/apache/incubator-mxnet/commit/213ab09e7a2924da436c0d0526d62fefeeea6aa7
  build: make USE_OPENCV=1 USE_MKLDNN=1 USE_BLAS=mkl USE_INTEL_PATH=/opt/intel/ -j
  runtime env: export MXNET_SUBGRAPH_BACKEND=MKLDNN
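To compare runs with and without subgraph fusion from a single driver script, the environment variable can be toggled per child process instead of being exported in the shell. A minimal sketch, with `subgraph_env` as a hypothetical helper:

```python
import os

def subgraph_env(enabled, base_env=None):
    """Return a copy of the environment with the subgraph backend on or off."""
    env = dict(os.environ if base_env is None else base_env)
    if enabled:
        env["MXNET_SUBGRAPH_BACKEND"] = "MKLDNN"   # value taken from the export above
    else:
        env.pop("MXNET_SUBGRAPH_BACKEND", None)    # make sure fusion is not requested
    return env

fused = subgraph_env(True)
plain = subgraph_env(False)
print(fused["MXNET_SUBGRAPH_BACKEND"], "MXNET_SUBGRAPH_BACKEND" in plain)
# Each dict can be passed as env= to subprocess.run() when launching a benchmark.
```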


Performance on AMD CPU with Intel MKL-DNN backend in release 1.3

The m5a.24xlarge instance offers 96 vCPUs using AMD EPYC processors (AVX2).

Inference Accuracy

The c5.18xlarge instance offers a 2-socket Intel Xeon Platinum processor with 72 vCPUs.

The models are taken from the Gluon model zoo with pre-trained parameters. The top1 and top5 accuracy are verified with the MKL-DNN backend.
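For reference, top-k accuracy is the fraction of samples whose true label appears among the k highest-scoring classes. The sketch below shows the generic definition on a toy 3-class example with k=1 and k=2; it is not the evaluation script used for this page, and `topk_accuracy` is an illustrative name.

```python
def topk_accuracy(scores, labels, k):
    """Fraction of samples whose true label is among the k highest scores."""
    hits = 0
    for row, label in zip(scores, labels):
        # Class indices sorted from highest to lowest score.
        ranked = sorted(range(len(row)), key=lambda c: row[c], reverse=True)
        if label in ranked[:k]:
            hits += 1
    return hits / len(labels)

scores = [
    [0.1, 0.7, 0.2],   # highest score: class 1
    [0.5, 0.3, 0.2],   # highest score: class 0
    [0.2, 0.3, 0.5],   # highest score: class 2
    [0.6, 0.3, 0.1],   # highest score: class 0
]
labels = [1, 1, 0, 0]
print(topk_accuracy(scores, labels, 1))  # 0.5
print(topk_accuracy(scores, labels, 2))  # 0.75
```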

As the table below shows, MXNet 1.3 produces exactly the same accuracy with and without MKL-DNN, to within 10e-8.

Note: The ImageNet1k validation data used here (valdata/) is generated by imagenet1k-val.sh.


Inference Accuracy Comparison
Alias	Network	# Parameters	CPU (without MKL-DNN)		CPU (with MKL-DNN)		Delta	
Alias	Network	# Parameters	top1	top5	top1	top5	top1	top5
alexnet	AlexNet	61,100,840	0.56312500	0.78992188	0.56312500	0.78992188	0.00000000	0.00000000
densenet121	DenseNet-121	8,062,504	0.74203125	0.91929688	0.74203125	0.91929688	0.00000000	0.00000000
densenet161	DenseNet-161	28,900,936	0.77195313	0.93390625	0.77195313	0.93390625	0.00000000	0.00000000
densenet169	DenseNet-169	14,307,880	0.75710938	0.92828125	0.75710938	0.92828125	0.00000000	0.00000000
densenet201	DenseNet-201	20,242,984	0.76906250	0.93093750	0.76906250	0.93093750	0.00000000	0.00000000
inceptionv3	Inception V3 299x299	23,869,000	0.77609375	0.93664063	0.77609375	0.93664063	0.00000000	0.00000000
mobilenet0.25	MobileNet 0.25	475,544	0.51039063	0.75687500	0.51039063	0.75687500	0.00000000	0.00000000
mobilenet0.5	MobileNet 0.5	1,342,536	0.61851563	0.83789063	0.61851563	0.83789063	0.00000000	0.00000000
mobilenet0.75	MobileNet 0.75	2,601,976	0.66546875	0.87070313	0.66546875	0.87070313	0.00000000	0.00000000
mobilenet1.0	MobileNet 1.0	4,253,864	0.70093750	0.89109375	0.70093750	0.89109375	0.00000000	0.00000000
mobilenetv2_1.0	MobileNetV2 1.0	3,539,136	0.69976563	0.89281250	0.69976563	0.89281250	0.00000000	0.00000000
mobilenetv2_0.75	MobileNetV2 0.75	2,653,864	0.68210938	0.88007813	0.68210938	0.88007813	0.00000000	0.00000000
mobilenetv2_0.5	MobileNetV2 0.5	1,983,104	0.64453125	0.84929688	0.64453125	0.84929688	0.00000000	0.00000000
mobilenetv2_0.25	MobileNetV2 0.25	1,526,856	0.50890625	0.74546875	0.50890625	0.74546875	0.00000000	0.00000000
resnet18_v1	ResNet-18 V1	11,699,112	0.70812500	0.89453125	0.70812500	0.89453125	0.00000000	0.00000000
resnet34_v1	ResNet-34 V1	21,814,696	0.73960938	0.91609375	0.73960938	0.91609375	0.00000000	0.00000000
resnet50_v1	ResNet-50 V1	25,629,032	0.76062500	0.93046875	0.76062500	0.93046875	0.00000000	0.00000000
resnet101_v1	ResNet-101 V1	44,695,144	0.77937500	0.93617188	0.77937500	0.93617188	0.00000000	0.00000000
resnet152_v1	ResNet-152 V1	60,404,072	0.78320313	0.93867188	0.78320313	0.93867188	0.00000000	0.00000000
resnet18_v2	ResNet-18 V2	11,695,796	0.71046875	0.89671875	0.71046875	0.89671875	0.00000000	0.00000000
resnet34_v2	ResNet-34 V2	21,811,380	0.74085938	0.91578125	0.74085938	0.91578125	0.00000000	0.00000000
resnet50_v2	ResNet-50 V2	25,595,060	0.76750000	0.93187500	0.76750000	0.93187500	0.00000000	0.00000000
resnet101_v2	ResNet-101 V2	44,639,412	0.78125000	0.94015625	0.78125000	0.94015625	0.00000000	0.00000000
resnet152_v2	ResNet-152 V2	60,329,140	0.78554688	0.94140625	0.78554688	0.94140625	0.00000000	0.00000000
squeezenet1.0	SqueezeNet 1.0	1,248,424	0.57273438	0.79554688	0.57273438	0.79554688	0.00000000	0.00000000
squeezenet1.1	SqueezeNet 1.1	1,235,496	0.57023438	0.79601563	0.57023438	0.79601563	0.00000000	0.00000000
vgg11	VGG-11	132,863,336	0.67062500	0.87531250	0.67062500	0.87531250	0.00000000	0.00000000
vgg13	VGG-13	133,047,848	0.68132813	0.87984375	0.68132813	0.87984375	0.00000000	0.00000000
vgg16	VGG-16	138,357,544	0.72062500	0.90585938	0.72062500	0.90585938	0.00000000	0.00000000
vgg19	VGG-19	143,667,240	0.73468750	0.91000000	0.73468750	0.91000000	0.00000000	0.00000000
vgg11_bn	VGG-11 with batch normalization	132,874,344	0.68953125	0.88882813	0.68953125	0.88882813	0.00000000	0.00000000
vgg13_bn	VGG-13 with batch normalization	133,059,624	0.69835938	0.88953125	0.69835938	0.88953125	0.00000000	0.00000000
vgg16_bn	VGG-16 with batch normalization	138,374,440	0.72226563	0.90390625	0.72226563	0.90390625	0.00000000	0.00000000
vgg19_bn	VGG-19 with batch normalization	143,689,256	0.72992188	0.90992188	0.72992188	0.90992188	0.00000000	0.00000000
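The equivalence check behind the Delta columns can be expressed as a comparison within an absolute tolerance. The sketch below uses three rows from the table above; the tolerance of 1e-8 is an assumption taken from the epsilon quoted in the introduction, and `matches` is an illustrative helper, not part of the benchmark scripts.

```python
import math

TOLERANCE = 1e-8  # assumed: the epsilon quoted in the introduction

# alias -> ((top1, top5) without MKL-DNN, (top1, top5) with MKL-DNN)
results = {
    "alexnet": ((0.56312500, 0.78992188), (0.56312500, 0.78992188)),
    "resnet50_v1": ((0.76062500, 0.93046875), (0.76062500, 0.93046875)),
    "vgg16": ((0.72062500, 0.90585938), (0.72062500, 0.90585938)),
}

def matches(reference, candidate, tol=TOLERANCE):
    """True when every metric pair agrees within the absolute tolerance."""
    return all(
        math.isclose(r, c, rel_tol=0.0, abs_tol=tol)
        for r, c in zip(reference, candidate)
    )

for alias, (without_mkldnn, with_mkldnn) in results.items():
    print(alias, matches(without_mkldnn, with_mkldnn))
```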

CMD for Reproducing Result

Please access the script and model from the link below.

https://drive.google.com/open?id=17JenLnZKsmPoZIIyktINFfMjZtDY2Ehc

(Note: select the parent folder and click download in the drop-down menu)

Refer to launch_benchmark_aws.sh to reproduce the results.
