...

Thus, an application can include multiple OpenMP implementations: the one explicitly built and linked, the one linked implicitly by the compiler, and the one provided with mklml_intel.

As stated here:

Having more than one OpenMP runtime initialised may lead to undefined behaviour including incorrect results or crashes.

 

A discussion has been started on the dev list to review a possible solution to the problem.

Currently, we assume these issues might be related:

...

As of now (1/2019) we have two build systems: Make and CMake. Current production binaries are delivered by Make, whose compiler optimization flags are more aggressive. CMake support is under development, and some of its settings lag behind Make's (e.g. SSE2 vs. SSE3). Moreover, the current CMake build produces critically slower binaries. See:

One of the reasons, for the CPU (i.e. non-CUDA) version, is that OpenBLAS precedes MKL ML in the linker command. See:

...


Currently, there are several problems with compiling MXNet with ICC (the Intel C++ Compiler). See:

...

We use the current CMake, considering most of the deviating flags (like SSE or explicit loop unrolling) to be insignificant for our experiments.

Code Block
languagebash
themeMidnight
cmake \
    ...
    -DUSE_CUDA=OFF \
    ...
    -DWITH_TESTS=OFF \
    ...
    -DWITH_EXAMPLES=OFF \
    ...
    -DCMAKE_CXX_COMPILER=$CXXCOMP \
    ...
    -DCMAKE_C_COMPILER=$CCOMP \
    ...
    -DMKLDNN_THREADING=$THREADING \
    ...
    $LD_ARG ..

See details in the attached benchmark.sh file.

...

Obviously, two factors contribute to the performance values: 

  1. OpenMP implementation
  2. Quality of generated machine code

...

We see the same behaviour in the treatment group, no matter which OpenMP implementation is used.



Control group


The treatment group shows no difference other than that "GCC swing". Normalizing the data gives average scores within ~1% of each other, which is close to the standard error.



Treatment group


ResNet152

Now we can observe a beautiful saturation of the throughput. The optimal batch size is between 16 and 32.

...


We get very similar data for the other models.

 

 

Total scores

The control group shows the following closely clustered numbers:

 

| # | ID          | Score   | Std. err |
|---|-------------|---------|----------|
| 1 | clang3_omp  | 1       | 0        |
| 2 | clang7_omp  | 1.01157 | 0.02027  |
| 3 | gcc5_omp    | 1.00581 | 0.01914  |
| 4 | gcc8_omp    | 1.00795 | 0.0192   |
| 5 | intel19_omp | 1.0093  | 0.0192   |

Combining the treatment group with clang7_omp, the best performer of the control group (again, by a "devastating" margin of 1%), we get the following data.

 

| #  | ID            | Score   | Std. err |
|----|---------------|---------|----------|
| 1  | clang3_gnu    | 1       | 0        |
| 2  | clang3_intel  | 1.00051 | 0.01739  |
| 3  | clang7_gnu    | 1.014   | 0.02055  |
| 4  | clang7_intel  | 1.01186 | 0.01899  |
| 5  | gcc5_gnu      | 0.98937 | 0.01913  |
| 6  | gcc5_intel    | 1.0083  | 0.01696  |
| 7  | gcc8_gnu      | 0.98195 | 0.01961  |
| 8  | gcc8_intel    | 1.00822 | 0.01723  |
| 9  | intel19_intel | 1.00486 | 0.01756  |
| 10 | clang7_omp    | 1.01215 | 0.01777  |

...

But the overall differences are close to the standard error and don't even reach 2%.

faster-rcnn Benchmark


 


As we can see, GOMP delivers ~3-5% worse performance than OMP. 

...