
This page tracks proposals from the community about how we can speed up test runs in our CI system.  Speeding up test runs provides a lot of value: it (1) lowers the cost of the CI system, and (2) shortens the feedback loop on code changes.  This page was created to discuss the pros and cons of different approaches to improving test speed, and to capture proposals from the community.  It also serves as a call to action: if you are interested in devops or performance and would like to help research or implement one of these improvements, please feel free to get involved.  If you would like to suggest a different approach for speeding up test runs, please add it to this page.  Contributions from the community are welcome.

Current Testing Timings

Sample Run: http://jenkins.mxnet-ci.amazon-ml.com/blue/organizations/jenkins/incubator-mxnet/detail/master/1705/pipeline/17

Time spent in Sanity Check: 1m47s

Time spent in Build: 29m43s

Time spent in Tests: 1h11m29s

Proposals for Speeding up Builds

Use ccache with nvcc

Builds are currently (as of Oct 4th, 2018) bottlenecked by Linux GPU compilation.  Adding ccache support for nvcc should dramatically reduce Linux GPU build times.  Note: ccache is already in place for ARM / Linux CPU builds, which is why those build times are as low as 50s.

Proposals for Speeding up Tests

Run some python tests in parallel

Many (but not all) Python tests can be run in parallel.  Such tests could be annotated and then executed with nose's parallel test (multiprocess) plugin, which should speed up test runs dramatically.  For stability reasons, some tests (for example, non-idempotent ones) will still need to run sequentially.  We will have to identify all of these tests, erring on the side of caution, and mark them as tests that cannot be run in parallel, as in the sketch below.
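
As a rough sketch of how the annotation could work (the `serial` tag is a hypothetical convention, not an existing MXNet one), nose's attrib plugin can be used to mark tests that are not parallel-safe so they are excluded from the parallel run:

```python
# Hypothetical sketch: tests that mutate shared state get a 'serial' tag so
# they can be excluded from the parallel run and executed sequentially.
from nose.plugins.attrib import attr
import mxnet as mx

@attr('serial')  # hypothetical tag: seeding is a process-wide side effect
def test_set_global_seed():
    mx.random.seed(42)

# Untagged tests are assumed to be idempotent and parallel-safe.
def test_elementwise_add():
    a = mx.nd.ones((2, 2))
    assert (a + a).asnumpy().sum() == 8
```

The CI job could then run the two groups separately, e.g. `nosetests -a '!serial' --processes=8 --process-timeout=1200` for the parallel-safe group, followed by a plain `nosetests -a serial` for the remainder.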

Move some tests to nightly

On a case-by-case basis, and with approval from core developers/committers, we could move some of the integration tests to nightly builds. This could include tests such as test_continuous_profile_and_instant_marker that currently take a lot of resources to run and are unlikely to break, but should still be exercised periodically to ensure compatibility. Ideally these tests would be replaced by faster-running tests to maintain coverage.

Statically assert expected results rather than dynamically computing them in numpy/scipy

We currently have several long-running numpy/scipy computations that are recalculated every time the tests run, and then used as the reference for asserting correct behavior of MXNet code. This was a reasonable pattern in the past, but it causes problems for some GPU tests: GPU instances end up spending a lot of time running numpy calculations on their relatively slow CPUs just to produce the reference results. We should instead store the results of these calculations in a readable way and assert against the stored values, as sketched below.
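
A minimal sketch of the stored-reference pattern (the fixture paths and the choice of softmax as the operator under test are illustrative assumptions): the expensive reference computation is run once offline and committed as a fixture, and the test loads it instead of recomputing it on the GPU host's CPU:

```python
import numpy as np
import mxnet as mx

# Offline, one-time step (run on a developer machine, result committed to
# the repo), e.g.:
#   expected = scipy_reference_implementation(x)   # the slow computation
#   np.save('tests/data/softmax_expected.npy', expected)

def test_softmax_against_stored_reference():
    x = np.load('tests/data/softmax_input.npy')          # hypothetical fixture
    expected = np.load('tests/data/softmax_expected.npy')
    actual = mx.nd.softmax(mx.nd.array(x)).asnumpy()
    np.testing.assert_allclose(actual, expected, rtol=1e-5, atol=1e-7)
```

Keeping the fixtures as .npy files next to a small script that regenerates them would keep the expected results auditable while removing the per-run numpy cost.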

Run tests conditionally

It can be frustrating for developers to make a change that affects one part of the codebase (say, documentation or Python) and have it trigger a full regression test of the entire codebase.  Ideally we could work backwards from code coverage reports and understand exactly which tests are required to ensure quality for a given code change.  This is difficult in MXNet given its wide support for different languages.  However, it is likely that some basic heuristic would allow us to cut back on tests in many cases; a sketch follows.
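
For illustration, a basic path-prefix heuristic might look like the following; the directory mapping and suite names are hypothetical and would need to be derived from real coverage data:

```python
import subprocess

# Hypothetical path-prefix -> test-suite mapping; a real version would be
# derived from coverage reports.
SUITE_RULES = [
    ('docs/',   set()),                                   # docs-only: no tests
    ('python/', {'sanity', 'python-unittests'}),
    ('src/',    {'sanity', 'python-unittests', 'cpp-tests'}),
]
ALL_SUITES = {'sanity', 'python-unittests', 'cpp-tests', 'integration'}

def suites_for_diff(base='origin/master'):
    changed = subprocess.check_output(
        ['git', 'diff', '--name-only', base]).decode().splitlines()
    suites = set()
    for path in changed:
        for prefix, needed in SUITE_RULES:
            if path.startswith(prefix):
                suites |= needed
                break
        else:
            return ALL_SUITES   # unrecognized path: err on the side of caution
    return suites
```

The fall-through to ALL_SUITES matches the cautious approach above: anything the heuristic does not recognize still triggers the full regression run.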
