This page tracks community proposals for speeding up PR verification in our Jenkins CI system.  Faster test runs provide a lot of value to the community: they (1) lower the cost of the CI system and (2) shorten the time it takes to get feedback on code changes.  This page was created to discuss the pros and cons of different approaches and to capture proposals from the community.  It also serves as a call to action: if any community member is interested in devops or performance and would like to help research or implement one of these improvements, please feel free to do so.  If any member would like to suggest a different approach for speeding up test runs, please add it to this page.  Contributions from the community are welcome.

Current Testing Timings

Date: Oct 4, 2018

Sample Run: http://jenkins.mxnet-ci.amazon-ml.com/blue/organizations/jenkins/incubator-mxnet/detail/master/1705/pipeline/17

...

Builds are currently (as of Oct 4, 2018) bottlenecked by Linux GPU compilation.  Adding ccache support for nvcc should dramatically reduce Linux GPU build times.  Note: ccache support is already present for ARM / Linux CPU builds, which is why those build times are as low as 50s. This feature is currently WIP at https://github.com/apache/incubator-mxnet/pull/11520
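
As a rough illustration of the intent (not the contents of the PR above), a build script can route the host compilers, and eventually nvcc, through ccache before invoking make. This is a minimal sketch; the variable and flag names are assumptions for illustration, not MXNet's actual build interface:

    # Hypothetical sketch: prefix the compilers with ccache so repeated CI
    # builds hit the cache. Wrapping nvcc only helps once ccache (e.g. the
    # patched version from the PR above) understands nvcc's command line.
    import subprocess

    subprocess.run(
        ["make", "-j8", "USE_CUDA=1",
         # make command-line assignments override Makefile defaults
         "CC=ccache gcc", "CXX=ccache g++", "NVCC=ccache nvcc"],
        check=True)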

Proposals for Speeding up Tests

...

Many (but not all) Python tests can be run in parallel.  Such tests could be annotated and then executed with nose's parallel test plugin, which should speed up test runs dramatically. For stability reasons, some tests (for example, non-idempotent ones) will still have to run in sequence. We will have to identify all of these tests, erring on the side of caution, and mark them as not parallelizable, as sketched below.
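
One possible shape for the annotation, using nose's built-in attrib plugin (a minimal sketch; the "serial" attribute name is an assumption, not an existing convention in the codebase):

    # Hypothetical sketch: tag non-parallelizable tests with nose's attrib
    # plugin; the "serial" attribute name is an assumption for this sketch.
    from nose.plugins.attrib import attr

    def test_elementwise_add():
        # Idempotent and free of shared state: safe to run in a worker process.
        assert 1 + 1 == 2

    @attr(serial=True)
    def test_reseeds_global_rng():
        # Mutates process-global state, so it must run in the sequential pass.
        import random
        random.seed(42)
        assert 0 <= random.randint(0, 10) <= 10

CI could then run two passes: nosetests -a '!serial' --processes=8 --process-timeout=1800 for the parallel bulk, followed by nosetests -a serial for the remainder.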

...

It can be frustrating for developers to make changes that affect one part of the codebase (say, documentation or Python) and then trigger a full regression test of the entire codebase.  Ideally we could work backwards from code coverage reports and understand exactly which tests are required to ensure quality for a given code change.  This is difficult in MXNet given its wide support for different languages.  However, it is likely that some basic heuristic would allow us to cut back on tests in many cases; one possible shape for it is sketched below.
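
A sketch of such a heuristic (all path prefixes and suite names below are illustrative, not the pipeline's real job names):

    # Hypothetical sketch: derive the required test suites from the paths
    # touched by a change. Prefixes and suite names are illustrative only.
    import subprocess

    SUITE_RULES = [
        ("docs/", {"website"}),
        ("python/", {"python-unittests"}),
        ("scala-package/", {"scala"}),
        ("src/", {"python-unittests", "cpp-tests", "scala", "r"}),
    ]

    def suites_for_change(base="origin/master"):
        changed = subprocess.check_output(
            ["git", "diff", "--name-only", base], text=True).splitlines()
        suites = set()
        for path in changed:
            matched = [needed for prefix, needed in SUITE_RULES
                       if path.startswith(prefix)]
            if not matched:
                # Unrecognized file: err on the side of caution, run everything.
                return {"all"}
            for needed in matched:
                suites |= needed
        return suites

A Jenkinsfile could then skip any stage whose suite is not in the returned set; any file that no rule recognizes falls back to the full run.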

Execute jobs in the correct stage

At the moment, various jobs mix their main task with the creation of prerequisites instead of separating concerns into different stages. Some examples are:

  • Doc generation: compile the documentation during the build stage instead of the publish stage. Duration has increased from 1 min to 9 min (critical path).
  • Scala/Julia: the MXNet native library is compiled during the test stage, and dependencies are downloaded every time. This adds about 5 minutes to each job.
  • R: dependencies are downloaded every time. This adds about 8 minutes to each job.

This can be solved by installing dependencies in the Docker install stage (which we are caching) and precompiling during the build stage. This is especially important because CPU-heavy tasks should not be executed on GPU instances.
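
A minimal sketch of the precompilation half, assuming the build stage has already produced the native library and handed it to the test stage (in Jenkins the hand-off would typically be stash/unstash; all paths here are illustrative):

    # Hypothetical sketch: the test stage reuses the library produced by the
    # build stage instead of recompiling it. Paths are illustrative only.
    import os
    import shutil
    import subprocess

    PREBUILT = "build/libmxnet.so"             # unstashed from the build stage
    TARGET = "scala-package/native/libmxnet.so"

    if os.path.exists(PREBUILT):
        # Reuse: copying takes seconds instead of the ~5 minute rebuild.
        shutil.copy(PREBUILT, TARGET)
    else:
        # Fallback for local runs; CI should always take the branch above.
        subprocess.run(["make", "-j8"], check=True)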

Speed up Windows slaves

Windows slaves have a high startup time (about 30 minutes) and are slower at executing tests: Python 3 GPU, for example, takes 28 minutes on Ubuntu while the raw execution time on Windows is 45 minutes. The startup time can be addressed by keeping a larger warm pool of slaves; the slower execution might point to a performance bottleneck and has to be investigated.