...

At the time of writing, MXNet uses a version of OpenMP bundled as a git submodule, dating from November 2017. It is pulled at a specific revision, built, and linked explicitly, while the OpenMP library provided by the compiler is not removed. When building with MKLML, the Intel version is explicitly removed from the linked libraries.

...

Having more than one OpenMP runtime initialised may lead to undefined behaviour including incorrect results or crashes.
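One quick way to check a build for this problem is to count the distinct OpenMP runtimes among a binary's shared-library dependencies. A minimal sketch (the count_omp_runtimes helper and the simulated ldd output are illustrative, not part of the MXNet tooling):

```shell
# Count distinct OpenMP runtimes (libgomp, libomp, libiomp5) linked into a
# binary. More than one means the undefined-behaviour risk described above.
count_omp_runtimes() {
    grep -oE 'lib(gomp|omp|iomp5)\.so' | sort -u | wc -l
}

# Self-contained demo on simulated ldd output; on a real build you would run:
#   ldd libmxnet.so | count_omp_runtimes
printf 'libgomp.so.1 => /usr/lib/libgomp.so.1\nlibiomp5.so => /opt/intel/lib/libiomp5.so\n' \
    | count_omp_runtimes
```

A result greater than 1 indicates that two runtimes may end up initialised in the same process.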

 

A discussion has been started on the dev list to review a possible solution to the problem.

...

cmake \
    -DUSE_CUDA=OFF \
    -DWITH_TESTS=OFF \
    -DWITH_EXAMPLES=OFF \
    -DCMAKE_CXX_COMPILER=$CXXCOMP \
    -DCMAKE_C_COMPILER=$CCOMP \
    -DMKLDNN_THREADING=$THREADING \
    $LD_ARG ..

...

Compilers and OpenMP implementations

Treatment groups


 

 # | ID            | Compiler              | OpenMP           | MKL
---+---------------+-----------------------+------------------+------------
 1 | clang3_gnu    | Clang 3.8.0           | Native OMP       | mklml_gnu
 2 | clang3_intel  | Clang 3.8.0           | Intel OMP        | mklml_intel
 3 | gcc5_gnu      | GCC 5.4.0             | Native GOMP      | mklml_gnu
 4 | gcc5_intel    | GCC 5.4.0             | Intel OMP        | mklml_intel
 5 | clang7_gnu    | Clang 7.0.1           | Native OMP       | mklml_gnu
 6 | clang7_intel  | Clang 7.0.1           | Intel OMP        | mklml_intel
 7 | gcc8_gnu      | GCC 8.1.0             | Native GOMP      | mklml_gnu
 8 | gcc8_intel    | GCC 8.1.0             | Intel OMP        | mklml_intel
 9 | intel19_intel | Intel Compiler 19.0.1 | Native Intel OMP | mklml_intel

Control groups

 # | ID          | Compiler              | OpenMP           | MKL
---+-------------+-----------------------+------------------+-----------
 1 | clang3_omp  | Clang 3.8.0           | Provided OMP     | mklml_gnu
 2 | gcc5_omp    | GCC 5.4.0             | Provided OMP     | mklml_gnu
 3 | clang7_omp  | Clang 7.0.1           | Provided OMP     | mklml_gnu
 4 | gcc8_omp    | GCC 8.1.0             | Provided OMP     | mklml_gnu
 5 | intel19_omp | Intel Compiler 19.0.1 | Native Intel OMP | mklml_gnu

...

Contrary to the second source, we have not limited usage to a subset of the sockets.

Environment

 

 

 # | Variable          | Value
---+-------------------+-------------------------------------------
 1 | KMP_AFFINITY      | granularity=fine,noduplicates,compact,1,0
 2 | OMP_NUM_THREADS   | 36
 3 | GOMP_CPU_AFFINITY | 0-71
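Equivalently, as a shell snippet; the per-value comments are my interpretation, assuming a machine with 72 hardware threads:

```shell
# Benchmark environment; values taken from the table above.
export KMP_AFFINITY="granularity=fine,noduplicates,compact,1,0"  # Intel OMP thread pinning
export OMP_NUM_THREADS=36          # worker threads (presumably one per physical core)
export GOMP_CPU_AFFINITY="0-71"    # GNU OMP pinning across all 72 hardware threads
```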

General score

...

With increasing models/batch sizes we expect it to be dominated by the actual matrix operations.

Convolutional benchmark

AlexNet

Let's take a look at the smaller AlexNet, since it's expected to show the most differences.

The control group shows, as expected, almost no difference between the setups – recall that we use the same OpenMP runtime and the same precompiled MKL.

...

We see the same behaviour in the treatment group, no matter which OpenMP implementation is used.



Control group


The treatment group shows no difference other than that "GCC-swing". Normalizing the data gives average scores within ~1% of each other, which is close to the standard error.
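The normalization here is simply each setup's mean throughput divided by the baseline's mean. A minimal sketch with made-up sample values (the input format and the normalize helper are assumptions, not the original benchmark tooling):

```shell
# Normalize mean throughput per setup against a chosen baseline setup.
# Input lines: "<setup>,<throughput sample>"; output order is unspecified.
normalize() {
    awk -F, -v base="$1" '
        { sum[$1] += $2; n[$1]++ }
        END {
            b = sum[base] / n[base]                     # baseline mean
            for (s in sum)
                printf "%s %.5f\n", s, (sum[s] / n[s]) / b
        }'
}

# Made-up samples: baseline mean is 101, the other mean is 102.
printf 'clang3_omp,100\nclang3_omp,102\nclang7_omp,101\nclang7_omp,103\n' \
    | normalize clang3_omp
```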



Treatment group


ResNet152

[Images: throughput vs. batch size]

Now we can observe a beautiful saturation of the throughput. Optimal batch size is between 16 and 32.

...

The control group shows the following, closely clustered numbers:

...


 # | ID          | Score   | Std. err
---+-------------+---------+---------
 1 | clang3_omp  | 1       | 0
 2 | clang7_omp  | 1.01157 | 0.02027
 3 | gcc5_omp    | 1.00581 | 0.01914
 4 | gcc8_omp    | 1.00795 | 0.0192
 5 | intel19_omp | 1.0093  | 0.0192

Combining the treatment group with clang7_omp, the best performer of the control group (again, with a devastating margin of 1%), we have the following data.

 

  # | ID            | Score   | Std. err
----+---------------+---------+---------
  1 | clang3_gnu    | 1       | 0
  2 | clang3_intel  | 1.00051 | 0.01739
  3 | clang7_gnu    | 1.014   | 0.02055
  4 | clang7_intel  | 1.01186 | 0.01899
  5 | gcc5_gnu      | 0.98937 | 0.01913
  6 | gcc5_intel    | 1.0083  | 0.01696
  7 | gcc8_gnu      | 0.98195 | 0.01961
  8 | gcc8_intel    | 1.00822 | 0.01723
  9 | intel19_intel | 1.00486 | 0.01756
 10 | clang7_omp    | 1.01215 | 0.01777

We can see pretty obvious patterns.

 

  • Newer compilers perform better than older ones.
  • GOMP is slower than IOMP.

But the overall differences are pretty close to standard error and don't even reach 2%.
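As a quick sanity check on that claim, one can compare each setup's deviation from the baseline score of 1 against its standard error. The within_stderr helper is illustrative; the two sample rows are taken from the table above:

```shell
# Flag whether a normalized score's deviation from 1.0 exceeds its std. err.
# Input columns: "<id> <score> <std err>".
within_stderr() {
    awk '{ d = ($2 > 1 ? $2 - 1 : 1 - $2)
           printf "%s %s\n", $1, (d <= $3 ? "within" : "outside") }'
}

# Two rows from the combined table above; both deviations stay within error.
printf 'clang7_gnu 1.014 0.02055\ngcc8_gnu 0.98195 0.01961\n' | within_stderr
```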

faster-rcnn Benchmark

...

[Image: faster-rcnn benchmark results]

As we can see, GOMP delivers ~3-5% worse performance than OMP. 

...