...

Builds are currently (as of Oct 4th, 2018) bottlenecked by Linux GPU compilation. Adding ccache support for nvcc should dramatically reduce Linux GPU build times. Note: ccache support is already present for ARM / Linux CPU builds, which is why those build times are as low as 50s. This feature is currently WIP at https://github.com/apache/incubator-mxnet/pull/11520
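
As a rough sketch of how ccache could wrap nvcc (the linked PR may take a different approach; the CMake-based configuration below is an assumption, not MXNet's actual build setup), the standard CMake compiler-launcher options can route both host and CUDA compilation through ccache:

    # Hypothetical helper: configure and build with ccache wrapping both the
    # host compiler and nvcc. Assumes cmake, ccache and the CUDA toolkit are
    # installed; flags and paths are illustrative, not MXNet's real CI script.
    import subprocess

    def build_with_ccache(source_dir="..", jobs=8):
        subprocess.check_call([
            "cmake",
            "-DUSE_CUDA=ON",
            "-DCMAKE_C_COMPILER_LAUNCHER=ccache",
            "-DCMAKE_CXX_COMPILER_LAUNCHER=ccache",
            # Routes every nvcc invocation through ccache (supported by recent CMake).
            "-DCMAKE_CUDA_COMPILER_LAUNCHER=ccache",
            source_dir,
        ])
        subprocess.check_call(["cmake", "--build", ".", "--", f"-j{jobs}"])

    if __name__ == "__main__":
        build_with_ccache()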

Proposals for Speeding up Tests

...

It can be frustrating for developers to make changes that affect one part of the codebase (say, documentation or Python) and then trigger a full regression test of the entire codebase. Ideally we could work backwards from code coverage reports and understand exactly which tests are required to ensure quality for a given code change. This is difficult in MXNet with its wide support of different languages. However, it is likely that some basic heuristic would allow us to cut back on tests in many cases.
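
One possible shape for such a heuristic is sketched below; the directory-to-suite mapping and suite names are illustrative assumptions rather than MXNet's actual CI targets, and the changed files are assumed to be obtainable from git:

    # Illustrative heuristic: map changed file paths to the test suites that
    # need to run. Patterns and suite names are hypothetical.
    import fnmatch
    import subprocess

    SUITE_PATTERNS = {
        "docs":   ["docs/*"],
        "python": ["python/*", "tests/python/*"],
        "scala":  ["scala-package/*"],
        "core":   ["src/*", "include/*", "CMakeLists.txt"],
    }

    def changed_files(base="origin/master"):
        out = subprocess.check_output(["git", "diff", "--name-only", base])
        return out.decode().splitlines()

    def suites_to_run(files):
        selected = set()
        for path in files:
            matched = False
            for suite, patterns in SUITE_PATTERNS.items():
                if any(fnmatch.fnmatch(path, p) for p in patterns):
                    selected.add(suite)
                    matched = True
            if not matched:
                # Unknown file type: be conservative and run everything.
                return set(SUITE_PATTERNS)
        # A change to the native core still implies a full regression run.
        return set(SUITE_PATTERNS) if "core" in selected else selected

    if __name__ == "__main__":
        print(sorted(suites_to_run(changed_files())))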

Execute jobs in the correct stage

At the moment, various jobs mix their main task with the creation of prerequisites instead of separating concerns into different stages. Some examples are:

  • Doc generation: documentation is compiled during the build stage instead of the publish stage; this has increased the duration from 1 min to 9 min (critical path).
  • Scala/Julia: the MXNet native library is compiled during the test stage and dependencies are downloaded every time, adding about 5 minutes to each run.
  • R: dependencies are downloaded every time, adding about 8 minutes to each run.

This can be solved by installing dependencies in the Docker install stage (which we are caching) and precompiling during the build stage, as sketched below. This is especially important because CPU-heavy tasks should not be executed on GPU instances.
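
A minimal sketch of the intended separation, assuming the compiled native library can be archived at the end of the build stage and restored at the start of the Scala/Julia/R test stages instead of being rebuilt there (file names and paths are illustrative only):

    # Hypothetical pack/unpack helpers for handing the compiled native library
    # from the build stage to the language-binding test stages.
    import tarfile

    LIB_FILES = ["lib/libmxnet.so"]  # illustrative; the real artifact list may differ

    def pack_lib(archive="mxnet_lib.tar.gz"):
        """Run at the end of the build stage (on a CPU instance)."""
        with tarfile.open(archive, "w:gz") as tar:
            for path in LIB_FILES:
                tar.add(path)

    def unpack_lib(archive="mxnet_lib.tar.gz"):
        """Run at the start of a test stage instead of recompiling the library."""
        with tarfile.open(archive, "r:gz") as tar:
            tar.extractall()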

Speed up Windows slaves

Windows slaves have a high start-up time (about 30 minutes) and are slower at executing tests: Python 3 GPU, for example, takes 28 minutes on Ubuntu while the raw execution time on Windows is 45 minutes. The former can be resolved by keeping a larger warm pool of slaves; the latter points to a performance bottleneck that still has to be investigated.