...

At the time of writing, MXNet uses a version of OpenMP bundled as a git submodule, dating from November 2017. It is pulled at a specific revision, built, and linked explicitly, while the OpenMP library provided by the compiler is not removed. When building with MKLML, the Intel version is explicitly removed from the linked libraries.

...

Having more than one OpenMP runtime initialised may lead to undefined behaviour including incorrect results or crashes.
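One quick way to check a build for this problem is to count the distinct OpenMP runtimes among a binary's shared-library dependencies. A minimal sketch (the count_omp_runtimes helper and the simulated ldd output are illustrative, not part of the MXNet tooling):

```shell
# Count distinct OpenMP runtimes (libgomp, libomp, libiomp5) linked into a
# binary. More than one means the undefined-behaviour risk described above.
count_omp_runtimes() {
    grep -oE 'lib(gomp|omp|iomp5)\.so' | sort -u | wc -l
}

# Self-contained demo on simulated ldd output; on a real build you would run:
#   ldd libmxnet.so | count_omp_runtimes
printf 'libgomp.so.1 => /usr/lib/libgomp.so.1\nlibiomp5.so => /opt/intel/lib/libiomp5.so\n' \
    | count_omp_runtimes
```

A result greater than 1 indicates that two runtimes may end up initialised in the same process.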

 

A discussion has been started on the dev list to review a possible solution to the problem.

...

cmake \
    -DUSE_CUDA=OFF \
    -DWITH_TESTS=OFF \
    -DWITH_EXAMPLES=OFF \
    -DCMAKE_CXX_COMPILER=$CXXCOMP \
    -DCMAKE_C_COMPILER=$CCOMP \
    -DMKLDNN_THREADING=$THREADING \
    $LD_ARG ..

...

Compilers and OpenMP implementations

Treatment groups


 

 # | ID            | Compiler              | OpenMP           | MKL
---+---------------+-----------------------+------------------+------------
 1 | clang3_gnu    | Clang 3.8.0           | Native OMP       | mklml_gnu
 2 | clang3_intel  | Clang 3.8.0           | Intel OMP        | mklml_intel
 3 | gcc5_gnu      | GCC 5.4.0             | Native GOMP      | mklml_gnu
 4 | gcc5_intel    | GCC 5.4.0             | Intel OMP        | mklml_intel
 5 | clang7_gnu    | Clang 7.0.1           | Native OMP       | mklml_gnu
 6 | clang7_intel  | Clang 7.0.1           | Intel OMP        | mklml_intel
 7 | gcc8_gnu      | GCC 8.1.0             | Native GOMP      | mklml_gnu
 8 | gcc8_intel    | GCC 8.1.0             | Intel OMP        | mklml_intel
 9 | intel19_intel | Intel Compiler 19.0.1 | Native Intel OMP | mklml_intel

Control groups

 # | ID          | Compiler              | OpenMP           | MKL
---+-------------+-----------------------+------------------+-----------
 1 | clang3_omp  | Clang 3.8.0           | Provided OMP     | mklml_gnu
 2 | gcc5_omp    | GCC 5.4.0             | Provided OMP     | mklml_gnu
 3 | clang7_omp  | Clang 7.0.1           | Provided OMP     | mklml_gnu
 4 | gcc8_omp    | GCC 8.1.0             | Provided OMP     | mklml_gnu
 5 | intel19_omp | Intel Compiler 19.0.1 | Native Intel OMP | mklml_gnu

...

Contrary to the second source, we have not limited usage to a subset of the sockets.

Environment

 

 

 # | Variable          | Value
---+-------------------+-------------------------------------------
 1 | KMP_AFFINITY      | granularity=fine,noduplicates,compact,1,0
 2 | OMP_NUM_THREADS   | 36
 3 | GOMP_CPU_AFFINITY | 0-71
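Equivalently, as a shell snippet; the per-value comments are my interpretation, assuming a machine with 72 hardware threads:

```shell
# Benchmark environment; values taken from the table above.
export KMP_AFFINITY="granularity=fine,noduplicates,compact,1,0"  # Intel OMP thread pinning
export OMP_NUM_THREADS=36          # worker threads (presumably one per physical core)
export GOMP_CPU_AFFINITY="0-71"    # GNU OMP pinning across all 72 hardware threads
```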

General score

...

With increasing models/batch sizes we expect it to be dominated by the actual matrix operations.

Convolutional benchmark

AlexNet

Let's take a look at the smaller AlexNet, since it's expected to show the most differences.

The control group shows, as expected, almost no difference between the setups – recall that we use the same OpenMP runtime and the same precompiled MKL.

...

We see the same behaviour in the treatment group, no matter which OpenMP implementation is used.



Control group


The treatment group shows no difference other than that "GCC-swing". Normalizing the data gives average scores within ~1% of each other, which is close to the standard error.
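The normalization here is simply each setup's mean throughput divided by the baseline's mean. A minimal sketch with made-up sample values (the input format and the normalize helper are assumptions, not the original benchmark tooling):

```shell
# Normalize mean throughput per setup against a chosen baseline setup.
# Input lines: "<setup>,<throughput sample>"; output order is unspecified.
normalize() {
    awk -F, -v base="$1" '
        { sum[$1] += $2; n[$1]++ }
        END {
            b = sum[base] / n[base]                     # baseline mean
            for (s in sum)
                printf "%s %.5f\n", s, (sum[s] / n[s]) / b
        }'
}

# Made-up samples: baseline mean is 101, the other mean is 102.
printf 'clang3_omp,100\nclang3_omp,102\nclang7_omp,101\nclang7_omp,103\n' \
    | normalize clang3_omp
```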



Treatment group


ResNet152

[Images: throughput vs. batch size]

Now we can observe a beautiful saturation of the throughput. Optimal batch size is between 16 and 32.

...

The control group shows the following, closely clustered numbers:

...


 # | ID          | Score   | Std. err
---+-------------+---------+---------
 1 | clang3_omp  | 1       | 0
 2 | clang7_omp  | 1.01157 | 0.02027
 3 | gcc5_omp    | 1.00581 | 0.01914
 4 | gcc8_omp    | 1.00795 | 0.0192
 5 | intel19_omp | 1.0093  | 0.0192

Combining the treatment group with clang7_omp, the best performer of the control group (again, with a devastating margin of 1%), we have the following data.

 

  # | ID            | Score   | Std. err
----+---------------+---------+---------
  1 | clang3_gnu    | 1       | 0
  2 | clang3_intel  | 1.00051 | 0.01739
  3 | clang7_gnu    | 1.014   | 0.02055
  4 | clang7_intel  | 1.01186 | 0.01899
  5 | gcc5_gnu      | 0.98937 | 0.01913
  6 | gcc5_intel    | 1.0083  | 0.01696
  7 | gcc8_gnu      | 0.98195 | 0.01961
  8 | gcc8_intel    | 1.00822 | 0.01723
  9 | intel19_intel | 1.00486 | 0.01756
 10 | clang7_omp    | 1.01215 | 0.01777

We can see pretty obvious patterns.

 

  • Newer compilers perform better than older ones.
  • GOMP is slower than IOMP.

But the overall differences are pretty close to standard error and don't even reach 2%.
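As a quick sanity check on that claim, one can compare each setup's deviation from the baseline score of 1 against its standard error. The within_stderr helper is illustrative; the two sample rows are taken from the table above:

```shell
# Flag whether a normalized score's deviation from 1.0 exceeds its std. err.
# Input columns: "<id> <score> <std err>".
within_stderr() {
    awk '{ d = ($2 > 1 ? $2 - 1 : 1 - $2)
           printf "%s %s\n", $1, (d <= $3 ? "within" : "outside") }'
}

# Two rows from the combined table above; both deviations stay within error.
printf 'clang7_gnu 1.014 0.02055\ngcc8_gnu 0.98195 0.01961\n' | within_stderr
```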

faster-rcnn Benchmark

...

[Image: faster-rcnn benchmark results]

As we can see, GOMP delivers ~3-5% worse performance than OMP. 

...