...

Thus, an application can include multiple OpenMP implementations: the one explicitly built and linked, the one linked implicitly by the compiler, and the one provided with mklml_intel.

As stated here:

Having more than one OpenMP runtime initialised may lead to undefined behaviour including incorrect results or crashes.

 

A discussion has been started on the dev list to review a possible solution to the problem.

Currently, we assume these issues might be related:

...

As of now (1/2019) we have two build systems: Make and CMake. Current production binaries are delivered by Make, whose compiler optimization flags are more aggressive. CMake support is under development, and some of its settings lag behind Make's (e.g. SSE2 vs. SSE3). Moreover, the current CMake build produces critically slower binaries. See:

One of the reasons, for the CPU (i.e. non-CUDA) version, is that OpenBLAS precedes MKL ML in the linker command. See:

...


Currently, there are several problems with compiling MXNet with ICC (the Intel C++ Compiler). See:

...

We use the current CMake, considering most of the deviating flags (like SSE or explicit loop unrolling) to be insignificant for our experiments.

Code Block
languagebash
themeMidnight
cmake \
    ...
    -DUSE_CUDA=OFF \
    ...
    -DWITH_TESTS=OFF \
    ...
    -DWITH_EXAMPLES=OFF \
    ...
    -DCMAKE_CXX_COMPILER=$CXXCOMP \
    ...
    -DCMAKE_C_COMPILER=$CCOMP \
    ...
    -DMKLDNN_THREADING=$THREADING \
    ...
    $LD_ARG ..

See details in the attached benchmark.sh file.

...

Obviously, two factors contribute to the performance values: 

  1. OpenMP implementation
  2. Quality of generated machine code

...

We see the same behaviour in the treatment group, no matter which OpenMP implementation is used.



Control group


The treatment group shows no difference other than that "GCC swing". Normalizing the data gives average scores within ~1% of each other, which is close to the standard error.



Treatment group


ResNet152

Now we can observe a beautiful saturation of the throughput. The optimal batch size is between 16 and 32.

...


We get very similar data for the other models.

 

 

Total scores

The control group shows the following closely clustered numbers:

 

| # | ID          | Score   | Std. err |
|---|-------------|---------|----------|
| 1 | clang3_omp  | 1       | 0        |
| 2 | clang7_omp  | 1.01157 | 0.02027  |
| 3 | gcc5_omp    | 1.00581 | 0.01914  |
| 4 | gcc8_omp    | 1.00795 | 0.0192   |
| 5 | intel19_omp | 1.0093  | 0.0192   |

Combining the treatment group with clang7_omp, the best performer of the control group (again, by a "devastating" margin of 1%), we get the following data.

 

| #  | ID            | Score   | Std. err |
|----|---------------|---------|----------|
| 1  | clang3_gnu    | 1       | 0        |
| 2  | clang3_intel  | 1.00051 | 0.01739  |
| 3  | clang7_gnu    | 1.014   | 0.02055  |
| 4  | clang7_intel  | 1.01186 | 0.01899  |
| 5  | gcc5_gnu      | 0.98937 | 0.01913  |
| 6  | gcc5_intel    | 1.0083  | 0.01696  |
| 7  | gcc8_gnu      | 0.98195 | 0.01961  |
| 8  | gcc8_intel    | 1.00822 | 0.01723  |
| 9  | intel19_intel | 1.00486 | 0.01756  |
| 10 | clang7_omp    | 1.01215 | 0.01777  |

...

But the overall differences are close to the standard error and don't even reach 2%.

faster-rcnn Benchmark


 


As we can see, GOMP delivers ~3-5% worse performance than OMP. 

...