
Introduction

This page details benchmark results comparing MXNet 1.3.0 with and without MKL-DNN (integration proposal). The results show that MKL-DNN boosts inference throughput by 6x to 37x and reduces latency by 2x to 41x, while accuracy is equivalent up to an epsilon of 1e-8.

Inference Performance

The performance results in this section were gathered on AWS EC2 instances.

...

The measurements use a c5.18xlarge instance with 1 socket and 1 processor.

For throughput, 2 sockets provide about a 2x speedup, while latency stays constant.

Performance on Intel CPU with Intel MKL-DNN backend in release 1.3

    • w/o MKL-DNN, pip install mxnet==1.3.0

...

The c5.18xlarge instance offers a 2-socket Intel Xeon Platinum processor with 72 vCPUs.

$ export KMP_AFFINITY=granularity=fine,compact,1,0

$ export OMP_NUM_THREADS=18

$ numactl --physcpubind=0-17 --membind=0 python …
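The two metrics reported in this section, latency at batchsize=1 and throughput at a larger batch size, can be measured with a small timing harness. The following is a minimal sketch and not the script used for these results: `run_batch` and `fake_forward` are hypothetical names, and the warm-up and iteration counts are assumptions. With MXNet, whose engine executes asynchronously, the callable would need to block until the output is ready (for example by calling `wait_to_read()` on the result) before the timer stops.

```python
import time
from statistics import mean


def measure_latency_ms(run_batch, warmup=5, iterations=20):
    """Mean wall-clock time of run_batch(1) in milliseconds (smaller is better)."""
    for _ in range(warmup):  # warm-up calls are not timed
        run_batch(1)
    samples = []
    for _ in range(iterations):
        start = time.perf_counter()
        run_batch(1)
        samples.append((time.perf_counter() - start) * 1000.0)
    return mean(samples)


def measure_throughput_fps(run_batch, batch_size=128, warmup=5, iterations=20):
    """Samples processed per second at the given batch size (bigger is better)."""
    for _ in range(warmup):
        run_batch(batch_size)
    start = time.perf_counter()
    for _ in range(iterations):
        run_batch(batch_size)
    elapsed = time.perf_counter() - start
    return batch_size * iterations / elapsed


def fake_forward(batch_size):
    """Hypothetical stand-in for a model forward pass; replace with real inference."""
    time.sleep(0.0005 + 0.00001 * batch_size)


if __name__ == "__main__":
    print(f"latency: {measure_latency_ms(fake_forward):.2f} ms")
    print(f"throughput: {measure_throughput_fps(fake_forward):.1f} fps")
```

Launching such a script under the `KMP_AFFINITY`, `OMP_NUM_THREADS`, and `numactl` settings above keeps the threads pinned to one socket, which is what makes repeated runs comparable.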


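Latency is measured at batchsize=1 (ms, smaller is better). Throughput is measured at batchsize=128 unless noted otherwise (fps, bigger is better).

| Category | Model | Latency w/o MKL-DNN | Latency w/ MKL-DNN | Latency speedup | Throughput w/o MKL-DNN | Throughput w/ MKL-DNN | Throughput speedup |
|---|---|---|---|---|---|---|---|
| CNN/classification | ResNet-50 v1 | 97.19 | 13.04 | 7.45 | 10.29 | 163.52 | 15.90 |
| | ResNet-50 v2 | 98.69 | 13.02 | 7.58 | 9.94 | 154.17 | 15.51 |
| | Inception v3 | 175.17 | 16.77 | 10.44 | 5.74 | 135.33 | 23.57 |
| | Inception v4 | 330.93 | 31.40 | 10.54 | 3.04 | 69.60 | 22.87 |
| | DenseNet | 111.66 | 18.90 | 5.91 | 8.52 | 149.88 | 17.60 |
| | MobileNet | 38.56 | 4.42 | 8.73 | 24.87 | 512.25 | 20.60 |
| | VGG16 | 406.50 | 20.07 | 20.25 | 2.91 | 70.84 | 24.31 |
| | AlexNet | 64.60 | 3.80 | 17.00 | 26.58 | 965.20 | 36.32 |
| | inception-resnet v2 | 181.10 | 49.40 | 3.67 | 5.48 | 82.97 | 15.14 |
| CNN/object detection | Faster R-CNN | 1175.74 | 118.62 | 9.91 | 0.85 | 8.57 | 10.08 |
| | SSD-VGG16 | 721.03 | 47.62 | 15.14 | 1.43 (batchsize=224) | 28.90 (batchsize=224) | 19.13 |
| | SSD-MobileNet | 239.40 | 28.33 | 8.45 | 4.07 (batchsize=256) | 69.97 (batchsize=256) | 14.18 |
| RNN | GNMT | 683.43 | 94.00 | 7.27 | 1.46 (batchsize=64) | 10.63 (batchsize=64) | 6.83 |
| GAN | DCGAN | 8.94 | 0.24 | 37.85 | 109.13 | 4249.36 | 38.94 |

  • Performance gain from operator fusion by subgraph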
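The speedup columns in the Intel CPU results follow from the raw measurements: for latency the baseline is divided by the MKL-DNN value (lower is better), and for throughput the MKL-DNN value is divided by the baseline (higher is better). A minimal sketch, using the ResNet-50 v1 figures; the function names are illustrative only:

```python
def latency_speedup(ms_without, ms_with):
    """Latency is lower-is-better, so speedup = baseline / optimized."""
    return ms_without / ms_with


def throughput_speedup(fps_without, fps_with):
    """Throughput is higher-is-better, so speedup = optimized / baseline."""
    return fps_with / fps_without


# ResNet-50 v1: latency 97.19 ms -> 13.04 ms, throughput 10.29 fps -> 163.52 fps
print(round(latency_speedup(97.19, 13.04), 2))      # 7.45
print(round(throughput_speedup(10.29, 163.52), 2))  # 15.89
```

Because the published figures are already rounded to two decimals, recomputing a speedup from them can differ from the published speedup in the last digit (15.89 here versus 15.90 in the results).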

Performance on AMD CPU with Intel MKL-DNN backend in release 1.3

The m5a.24xlarge instance offers 96 vCPUs on AMD EPYC processors (AVX2).


Category

Model

Latency batchsize=1 (ms, small is better)Throughput batchsize=128 (fps, big is better)R1.3 w/ MKL-DNNmaster w/ subgraphspeedupR1.3
CategoryModelThroughput batchsize=32 (fps, bigger is better)
w/o MKL-DNN
w/ MKL-DNN
master w/ subgraph
speedup
CNN/classificationResNet-50 v1
ResNet-50 v2Inception v3Inception v4DenseNet
2.4438.57x15.8

MobileNet
VGG16AlexNetinception-resnet v2

CNN/object detection

Faster R-CNNSSD-VGG16SSD-MobileNet

RNN

GNMTGANDCGAN

Inference Accuracy

5.03194.7x38.7

Inference Accuracy

The c5.18xlarge instance offers a 2-socket Intel Xeon Platinum processor with 72 vCPUs.

The models are taken from the Gluon model zoo with pre-trained parameters. The top-1 and top-5 accuracy are verified with the MKL-DNN backend.
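For reference, top-k accuracy is the fraction of samples whose true label appears among the k highest-scoring classes. The following is a minimal pure-Python sketch with made-up scores for three classes (so k=2 stands in for top-5); it is not the evaluation script used for this page, and the function name is illustrative.

```python
def top_k_accuracy(scores, labels, k):
    """Fraction of samples whose true label is among the k highest scores."""
    hits = 0
    for row, label in zip(scores, labels):
        # Class indices sorted by descending score, truncated to the best k
        top_k = sorted(range(len(row)), key=lambda i: row[i], reverse=True)[:k]
        hits += label in top_k
    return hits / len(labels)


# Made-up scores for 4 samples over 3 classes, with their true labels
scores = [
    [0.1, 0.7, 0.2],
    [0.5, 0.3, 0.2],
    [0.2, 0.3, 0.5],
    [0.6, 0.1, 0.3],
]
labels = [1, 1, 0, 2]

print(top_k_accuracy(scores, labels, k=1))  # 0.25
print(top_k_accuracy(scores, labels, k=2))  # 0.75
```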

As the table below shows, MXNet 1.3 without and with MKL-DNN produces the same accuracy to within 10e-8.

Note: The dataset is ImageNet1k; the files under valdata/ are generated by imagenet1k-val.sh.

Inference Accuracy Comparison

| Alias | Network | # Parameters | Official top1 | Official top5 | CPU (without MKL-DNN) top1 | CPU (without MKL-DNN) top5 | CPU (with MKL-DNN) top1 | CPU (with MKL-DNN) top5 | Delta top1 | Delta top5 |
|---|---|---|---|---|---|---|---|---|---|---|
| alexnet | AlexNet | 61,100,840 | 0.5492 | 0.7803 | 0.56312500 | 0.78992188 | 0.56312500 | 0.78992188 | 0.00000000 | 0.00000000 |
| densenet121 | DenseNet-121 | 8,062,504 | 0.7497 | 0.9225 | 0.74203125 | 0.91929688 | 0.74203125 | 0.91929688 | 0.00000000 | 0.00000000 |
| densenet161 | DenseNet-161 | 28,900,936 | 0.777 | 0.938 | 0.77195313 | 0.93390625 | 0.77195313 | 0.93390625 | 0.00000000 | 0.00000000 |
| densenet169 | DenseNet-169 | 14,307,880 | 0.7617 | 0.9317 | 0.75710938 | 0.92828125 | 0.75710938 | 0.92828125 | 0.00000000 | 0.00000000 |
| densenet201 | DenseNet-201 | 20,242,984 | 0.7732 | 0.9362 | 0.76906250 | 0.93093750 | 0.76906250 | 0.93093750 | 0.00000000 | 0.00000000 |
| inceptionv3 | Inception V3 299x299 | 23,869,000 | 0.7755 | 0.9364 | 0.77609375 | 0.93664063 | 0.77609375 | 0.93664063 | 0.00000000 | 0.00000000 |
| mobilenet0.25 | MobileNet 0.25 | 475,544 | 0.5185 | 0.7608 | 0.51039063 | 0.75687500 | 0.51039063 | 0.75687500 | 0.00000000 | 0.00000000 |
| mobilenet0.5 | MobileNet 0.5 | 1,342,536 | 0.6307 | 0.8475 | 0.61851563 | 0.83789063 | 0.61851563 | 0.83789063 | 0.00000000 | 0.00000000 |
| mobilenet0.75 | MobileNet 0.75 | 2,601,976 | 0.6738 | 0.8782 | 0.66546875 | 0.87070313 | 0.66546875 | 0.87070313 | 0.00000000 | 0.00000000 |
| mobilenet1.0 | MobileNet 1.0 | 4,253,864 | 0.7105 | 0.9006 | 0.70093750 | 0.89109375 | 0.70093750 | 0.89109375 | 0.00000000 | 0.00000000 |
| mobilenetv2_1.0 | MobileNetV2 1.0 | 3,539,136 | 0.7192 | 0.9056 | 0.69976563 | 0.89281250 | 0.69976563 | 0.89281250 | 0.00000000 | 0.00000000 |
| mobilenetv2_0.75 | MobileNetV2 0.75 | 2,653,864 | 0.6961 | 0.8895 | 0.68210938 | 0.88007813 | 0.68210938 | 0.88007813 | 0.00000000 | 0.00000000 |
| mobilenetv2_0.5 | MobileNetV2 0.5 | 1,983,104 | 0.6449 | 0.8547 | 0.64453125 | 0.84929688 | 0.64453125 | 0.84929688 | 0.00000000 | 0.00000000 |
| mobilenetv2_0.25 | MobileNetV2 0.25 | 1,526,856 | 0.5074 | 0.7456 | 0.50890625 | 0.74546875 | 0.50890625 | 0.74546875 | 0.00000000 | 0.00000000 |
| resnet18_v1 | ResNet-18 V1 | 11,699,112 | 0.7093 | 0.8992 | 0.70812500 | 0.89453125 | 0.70812500 | 0.89453125 | 0.00000000 | 0.00000000 |
| resnet34_v1 | ResNet-34 V1 | 21,814,696 | 0.7437 | 0.9187 | 0.73960938 | 0.91609375 | 0.73960938 | 0.91609375 | 0.00000000 | 0.00000000 |
| resnet50_v1 | ResNet-50 V1 | 25,629,032 | 0.7647 | 0.9313 | 0.76062500 | 0.93046875 | 0.76062500 | 0.93046875 | 0.00000000 | 0.00000000 |
| resnet101_v1 | ResNet-101 V1 | 44,695,144 | 0.7834 | 0.9401 | 0.77937500 | 0.93617188 | 0.77937500 | 0.93617188 | 0.00000000 | 0.00000000 |
| resnet152_v1 | ResNet-152 V1 | 60,404,072 | 0.79 | 0.9438 | 0.78320313 | 0.93867188 | 0.78320313 | 0.93867188 | 0.00000000 | 0.00000000 |
| resnet18_v2 | ResNet-18 V2 | 11,695,796 | 0.71 | 0.8992 | 0.71046875 | 0.89671875 | 0.71046875 | 0.89671875 | 0.00000000 | 0.00000000 |
| resnet34_v2 | ResNet-34 V2 | 21,811,380 | 0.744 | 0.9208 | 0.74085938 | 0.91578125 | 0.74085938 | 0.91578125 | 0.00000000 | 0.00000000 |
| resnet50_v2 | ResNet-50 V2 | 25,595,060 | 0.7711 | 0.9343 | 0.76750000 | 0.93187500 | 0.76750000 | 0.93187500 | 0.00000000 | 0.00000000 |
| resnet101_v2 | ResNet-101 V2 | 44,639,412 | 0.7853 | 0.9417 | 0.78125000 | 0.94015625 | 0.78125000 | 0.94015625 | 0.00000000 | 0.00000000 |
| resnet152_v2 | ResNet-152 V2 | 60,329,140 | 0.7921 | 0.9431 | 0.78554688 | 0.94140625 | 0.78554688 | 0.94140625 | 0.00000000 | 0.00000000 |
| squeezenet1.0 | SqueezeNet 1.0 | 1,248,424 | 0.5611 | 0.7909 | 0.57273438 | 0.79554688 | 0.57273438 | 0.79554688 | 0.00000000 | 0.00000000 |
| squeezenet1.1 | SqueezeNet 1.1 | 1,235,496 | 0.5496 | 0.7817 | 0.57023438 | 0.79601563 | 0.57023438 | 0.79601563 | 0.00000000 | 0.00000000 |
| vgg11 | VGG-11 | 132,863,336 | 0.6662 | 0.8734 | 0.67062500 | 0.87531250 | 0.67062500 | 0.87531250 | 0.00000000 | 0.00000000 |
| vgg13 | VGG-13 | 133,047,848 | 0.6774 | 0.8811 | 0.68132813 | 0.87984375 | 0.68132813 | 0.87984375 | 0.00000000 | 0.00000000 |
| vgg16 | VGG-16 | 138,357,544 | 0.7323 | 0.9132 | 0.72062500 | 0.90585938 | 0.72062500 | 0.90585938 | 0.00000000 | 0.00000000 |
| vgg19 | VGG-19 | 143,667,240 | 0.7411 | 0.9135 | 0.73468750 | 0.91000000 | 0.73468750 | 0.91000000 | 0.00000000 | 0.00000000 |
| vgg11_bn | VGG-11 with batch normalization | 132,874,344 | 0.6859 | 0.8872 | 0.68953125 | 0.88882813 | 0.68953125 | 0.88882813 | 0.00000000 | 0.00000000 |
| vgg13_bn | VGG-13 with batch normalization | 133,059,624 | 0.6884 | 0.8882 | 0.69835938 | 0.88953125 | 0.69835938 | 0.88953125 | 0.00000000 | 0.00000000 |
| vgg16_bn | VGG-16 with batch normalization | 138,374,440 | 0.731 | 0.9176 | 0.72226563 | 0.90390625 | 0.72226563 | 0.90390625 | 0.00000000 | 0.00000000 |
| vgg19_bn | VGG-19 with batch normalization | 143,689,256 | 0.7433 | 0.9185 | 0.72992188 | 0.90992188 | 0.72992188 | 0.90992188 | 0.00000000 | 0.00000000 |
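The delta columns can be checked mechanically by comparing the two sets of accuracies against a tolerance. The sketch below copies the values for two models from the comparison above and assumes the epsilon of 1e-8 quoted in the introduction; the function names are illustrative and not part of the benchmark scripts.

```python
EPSILON = 1e-8  # assumed tolerance, taken from the introduction

# (top1, top5) pairs for two sample models from the accuracy comparison
without_mkldnn = {
    "alexnet": (0.56312500, 0.78992188),
    "resnet50_v1": (0.76062500, 0.93046875),
}
with_mkldnn = {
    "alexnet": (0.56312500, 0.78992188),
    "resnet50_v1": (0.76062500, 0.93046875),
}


def accuracy_deltas(baseline, candidate):
    """Return {model: (|top1 delta|, |top5 delta|)} for models present in both."""
    return {
        name: tuple(abs(c - b) for b, c in zip(baseline[name], candidate[name]))
        for name in baseline.keys() & candidate.keys()
    }


def all_within(deltas, eps=EPSILON):
    """True if every top1 and top5 delta is at most eps."""
    return all(d <= eps for pair in deltas.values() for d in pair)


deltas = accuracy_deltas(without_mkldnn, with_mkldnn)
for name in sorted(deltas):
    top1, top5 = deltas[name]
    print(f"{name}: top1 delta={top1:.8f}, top5 delta={top5:.8f}")
print("all within epsilon:", all_within(deltas))
```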


Commands for Reproducing the Results

Please access the script and model from the link below.

https://drive.google.com/open?id=17JenLnZKsmPoZIIyktINFfMjZtDY2Ehc 

(Note: select the parent folder and click download in the drop-down menu)

Refer to launch_benchmark_aws.sh to reproduce the results.