Summary


We have gathered data on how MXNet performance changes when linked against different implementations of OpenMP and compiled with different compilers.

Primary goal: Provide performance data points on different OpenMP implementations.

Secondary goal: Compare performance when compiled with different compilers.

TL;DR

The differences between compilers are insignificant. Reasonably recent native OpenMP implementations perform equally well (<5% difference). Small batch sizes during inference can indeed make a difference for "narrow" networks like AlexNet, but it remains minor (~5%).

Current state

LLVM OpenMP explicitly used in all builds

At the time of writing, MXNet uses a version of LLVM OpenMP (from 11/2017) that is bundled as a submodule. It is pulled at a specific revision, built, and linked explicitly. The library proposed by the compiler is not removed. When built with MKLML, the Intel version is explicitly removed from the linked libraries.

Thus, an application can end up including multiple OpenMP implementations: the one built and linked explicitly, the one linked implicitly by the compiler, and the one provided with mklml_intel.
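A quick way to check which OpenMP runtimes a given libmxnet.so will actually load is to inspect its shared-library dependencies (a minimal sketch; the library path is an example and depends on your install):

# List the OpenMP runtimes among the shared-library dependencies:
# libomp = LLVM, libgomp = GNU, libiomp5 = Intel.
ldd /path/to/libmxnet.so | grep -iE 'libomp|libgomp|libiomp'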

As stated here:

Having more than one OpenMP runtime initialised may lead to undefined behaviour including incorrect results or crashes.

 A discussion has been started on the dev list to review a possible solution to the problem.

Currently, we assume these issues might be related:

Although this setup is reproducible, we lose the benefits of newer OpenMP versions that ship with the latest compiler releases.

Make vs CMake

As of now (1/2019) we have two build systems: Make and CMake. Current production binaries are delivered by Make, whose compiler optimization flags are more aggressive. The CMake build is under development and some of its settings lag behind Make's (e.g. SSE2 vs SSE3). On top of that, the current CMake build produces critically slower binaries. See: 

One of the reasons, for the CPU (i.e. non-CUDA) version, is that OpenBLAS precedes MKL ML in the linker commands. See:

The pip-distributed version contains mklml_intel.so and is built with Make.

Intel Compiler Issues

Currently, there are several problems compiling MXNet with ICC (Intel C++ Compiler). See:

Experiment setup

We have measured the performance of the code under the following conditions:

Hardware

The c5.18xlarge instance offers two Intel Xeon Platinum sockets with 72 vCPUs in total. We have not limited the usage of the cores/sockets.
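For reference, the topology can be confirmed on the instance itself (a simple check, not part of the benchmark scripts):

# Show the socket/core/vCPU counts reported by lscpu on c5.18xlarge
# (expected: 2 sockets, 36 physical cores, 72 CPUs).
lscpu | grep -E 'Socket|Core|^CPU\(s\)'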

Build

We use the current CMake build, considering most of the deviating flags (such as the SSE level or explicit loop unrolling) to be insignificant for our experiments.

cmake \
    -DUSE_CUDA=OFF \
    -DWITH_TESTS=OFF \
    -DWITH_EXAMPLES=OFF \
    -DCMAKE_CXX_COMPILER=$CXXCOMP \
    -DCMAKE_C_COMPILER=$CCOMP \
    -DMKLDNN_THREADING=$THREADING \
    $LD_ARG ..

See details in the attached benchmark.sh file.

MXNet source code

  1. MXNet 35c33832 was branched.
  2. We have fixed an issue with OpenBLAS hiding MKL (eebc17b0).
  3. We have fixed the issues preventing us from compiling with ICC (52648d42).
  4. We have applied the changes from this pull request to get the treatment group (24a2a8c8).
  5. As the control group, we use binaries with only the two previous changes (i.e. without the changes from the pull request).

Compilers and OpenMP implementations

Treatment groups


#  | ID            | Compiler              | OpenMP           | MKL
1  | clang3_gnu    | Clang 3.8.0           | Native OMP       | mklml_gnu
2  | clang3_intel  | Clang 3.8.0           | Intel OMP        | mklml_intel
3  | gcc5_gnu      | GCC 5.4.0             | Native GOMP      | mklml_gnu
4  | gcc5_intel    | GCC 5.4.0             | Intel OMP        | mklml_intel
5  | clang7_gnu    | Clang 7.0.1           | Native OMP       | mklml_gnu
6  | clang7_intel  | Clang 7.0.1           | Intel OMP        | mklml_intel
7  | gcc8_gnu      | GCC 8.1.0             | Native GOMP      | mklml_gnu
8  | gcc8_intel    | GCC 8.1.0             | Intel OMP        | mklml_intel
9  | intel19_intel | Intel Compiler 19.0.1 | Native Intel OMP | mklml_intel

Control groups

#  | ID          | Compiler              | OpenMP           | MKL
1  | clang3_omp  | Clang 3.8.0           | Provided OMP     | mklml_gnu
2  | gcc5_omp    | GCC 5.4.0             | Provided OMP     | mklml_gnu
3  | clang7_omp  | Clang 7.0.1           | Provided OMP     | mklml_gnu
4  | gcc8_omp    | GCC 8.1.0             | Provided OMP     | mklml_gnu
5  | intel19_omp | Intel Compiler 19.0.1 | Native Intel OMP | mklml_gnu

Please note that the LLVM OpenMP runtime and Intel OpenMP are most likely just different versions of the same Intel OpenMP runtime, therefore we don't expect any significant differences between them.

Benchmark code

We have followed:

As in both of the mentioned documents, we use image-classification/benchmark_score.py (we will call it the convolutional benchmark). Additionally, we use the faster-rcnn benchmark from the second document.

Contrary to the second source, we have not limited the usage of the sockets.
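For reference, a minimal sketch of how the convolutional benchmark can be launched from an MXNet source tree (the script lives under example/image-classification/ in the repository; we assume the default arguments here):

# Run the convolutional benchmark: it loops over the bundled model
# definitions and batch sizes and prints the throughput (images/sec).
cd example/image-classification
python benchmark_score.py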

Environment

 

#  | Variable          | Value
1  | KMP_AFFINITY      | granularity=fine,noduplicates,compact,1,0
2  | OMP_NUM_THREADS   | 36
3  | GOMP_CPU_AFFINITY | 0-71
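For reference, the same settings expressed as shell exports before launching the benchmarks:

# Thread affinity for the Intel/LLVM runtimes (KMP_*) and for GOMP,
# plus the thread count used in all runs (36 = one per physical core).
export KMP_AFFINITY=granularity=fine,noduplicates,compact,1,0
export OMP_NUM_THREADS=36
export GOMP_CPU_AFFINITY=0-71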

General score

To calculate the general score, we measure the improvement ratio vs clang3_gnu in terms of throughput. Each test was repeated 5 times.
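A minimal sketch of this normalization, assuming the raw throughputs have been collected into a hypothetical results.csv with the columns setup,model,batch_size,images_per_sec (the file name and layout are our assumptions, not part of the benchmark scripts):

# Two passes over the same file: first remember the clang3_gnu throughput
# per (model, batch_size) pair, then print every setup's ratio to that baseline.
awk -F, '
    FNR == NR { if ($1 == "clang3_gnu") base[$2 FS $3] = $4; next }
    FNR > 1   { print $1, $2, $3, $4 / base[$2 FS $3] }
' results.csv results.csv

Averaging these ratios over models, batch sizes, and the 5 repetitions gives the scores reported below.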

Results discussion

The ordering of the results matches the numbers from the mentioned source.

Obviously, two factors contribute to the performance values:

  1. OpenMP implementation
  2. Quality of generated machine code

As mentioned, the impact of the latter factor is limited by the precompiled BLAS libraries. We expect the OpenMP overhead to be significant for small models/batch sizes.

With increasing model/batch sizes, we expect the performance to be dominated by the actual matrix operations.

Convolutional benchmark

AlexNet

Let's take a look at the smaller AlexNet first, since it is expected to show the biggest differences.

As expected, the control group shows almost no difference between the setups – recall that they all use the same OpenMP runtime and precompiled MKL.

We are not able to explain the +20%/-10% swing of both GCC compilers.

We see the same behaviour in the treatment group, no matter which OpenMP is used.


[Figure: Control group]


The treatment group shows no difference other than that "GCC swing". Normalizing the data gives us average scores within ~1% of each other, which is close to the standard error.


 

[Figure: Treatment group]


ResNet-152

Now we can observe a clear saturation of the throughput. The optimal batch size is between 16 and 32.

With ResNet-152 we see no further interesting swings. The maximal difference to the baseline for a single batch size is ~7% in both groups, and ~4% when averaged over all batch sizes. For example, we found Clang 7 with native OMP to be 4% faster. Keep in mind the standard error of 2%.

We get very similar data for the other models.

Total scores

The control group shows the following closely clustered numbers:


#  | ID          | Score   | Std. err
1  | clang3_omp  | 1       | 0
2  | clang7_omp  | 1.01157 | 0.02027
3  | gcc5_omp    | 1.00581 | 0.01914
4  | gcc8_omp    | 1.00795 | 0.0192
5  | intel19_omp | 1.0093  | 0.0192

Combining the treatment group with clang7_omp, the best performer of the control group (again, by a margin of only 1%), we have the following data.

 

#  | ID            | Score   | Std. err
1  | clang3_gnu    | 1       | 0
2  | clang3_intel  | 1.00051 | 0.01739
3  | clang7_gnu    | 1.014   | 0.02055
4  | clang7_intel  | 1.01186 | 0.01899
5  | gcc5_gnu      | 0.98937 | 0.01913
6  | gcc5_intel    | 1.0083  | 0.01696
7  | gcc8_gnu      | 0.98195 | 0.01961
8  | gcc8_intel    | 1.00822 | 0.01723
9  | intel19_intel | 1.00486 | 0.01756
10 | clang7_omp    | 1.01215 | 0.01777

We can see fairly obvious patterns:

  • Newer compilers perform better than older ones.
  • GOMP is slower than IOMP.

However, the overall differences are close to the standard error and don't even reach 2%.

faster-rcnn Benchmark

As we can see, GOMP delivers ~3-5% worse performance than OMP. 

Conclusion

We interpret the results as a suggestion that the current state should be simplified. The benchmarking shows that we get at most a 5% improvement over the worst-case setup (e.g. older GCC). On the other hand, the MKL maintainers explicitly discourage this approach, as it can (and does) lead to hard-to-find issues.

Further tasks and open questions

  • Can we achieve better performance with GOMP using other environment variables?
  • Can we get more info about the dominating factors (code quality vs OpenMP) with a profiler?
  • Repeat the benchmarking on instance types other than c5.18xlarge.
  • Include Windows compilers.

Acronyms

OMP - LLVM OpenMP implementation

IOMP - Intel OpenMP implementation

GOMP - GCC OpenMP implementation

ICC - Intel C++ Compiler

GCC - GNU Compiler Collection

 
