MXNet's Continuous Integration system covers a wide variety of environments with the help of Docker. This ensures consistent test behaviour and reproducibility across multiple runs. This guide explains how to use the available tools to recreate test results on your local machine.
EC2 instances with automated setup
Set up your instance as documented in MXNet Developer setup on AWS EC2.
Then clone the MXNet repository and either use dev_menu.py for common use cases or continue with the instructions below.
Requirements
To run this toolchain, the following packages have to be installed. Please note that CPU tests can be run on Mac OS and Ubuntu, while GPU tests can only be executed on Ubuntu. Unfortunately, Windows builds and tests are done without Docker and are thus not covered by this guide.
- Docker
- Python3
- Optional: Nvidia-Docker (Ubuntu only, for GPU tests)
- Optional: GPU with CUDA Compute Capability ≥ 3.0
- Disk space: at least 100GB (150GB recommended)
- Code and Python dependencies, which are defined in ci/requirements.txt
pip3 install -r ci/requirements.txt
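Before starting a long build, it can help to verify the requirements listed above. The following is a minimal sketch, not part of the MXNet tooling: the function name check_prerequisites is hypothetical, and the executable name nvidia-docker is an assumption about how Nvidia-Docker is installed on your machine.

```python
import shutil


def check_prerequisites(path=".", min_free_gb=100, need_gpu=False):
    """Return a list of human-readable problems; an empty list means all checks passed."""
    problems = []
    # Tools named in the requirements list. The nvidia-docker executable name
    # is an assumption and only checked when GPU tests are requested.
    tools = ["docker", "python3"] + (["nvidia-docker"] if need_gpu else [])
    for tool in tools:
        if shutil.which(tool) is None:
            problems.append(f"{tool} not found on PATH")
    # Compare the free space of the filesystem holding `path` with the minimum.
    free_gb = shutil.disk_usage(path).free / 1024**3
    if free_gb < min_free_gb:
        problems.append(f"only {free_gb:.0f} GB free, need at least {min_free_gb} GB")
    return problems


if __name__ == "__main__":
    for problem in check_prerequisites():
        print("WARNING:", problem)
```

The sketch does not check the GPU's compute capability or the Python packages from ci/requirements.txt.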
This part explains which commands to run to reproduce a failure at each stage.
Build
A build failure like the one shown below can be reproduced by copying the failed command, starting with ci/build.py, and running it on your local machine from the root of your MXNet source directory. This step requires neither a GPU nor CUDA dependencies.
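Since the command has to be run from the repository root, a small wrapper can locate that root for you. This is a sketch under stated assumptions: the helper names find_repo_root, prepare_ci_command and run_from_repo_root are hypothetical, and the root is assumed to be the first parent directory that contains ci/build.py.

```python
import shlex
import subprocess
import sys
from pathlib import Path


def find_repo_root(start="."):
    """Walk upwards from `start` until a directory containing ci/build.py is found."""
    current = Path(start).resolve()
    for candidate in [current, *current.parents]:
        if (candidate / "ci" / "build.py").is_file():
            return candidate
    raise FileNotFoundError(f"no ci/build.py found at or above {current}")


def prepare_ci_command(copied_command):
    """Split a command copied from the CI output and check that it starts with ci/build.py."""
    args = shlex.split(copied_command)
    if not args or args[0] != "ci/build.py":
        raise ValueError("expected a command starting with ci/build.py")
    return args


def run_from_repo_root(copied_command, start="."):
    """Run the copied command with the repository root as the working directory."""
    root = find_repo_root(start)
    args = prepare_ci_command(copied_command)
    # Run the script with the current Python 3 interpreter instead of relying
    # on its executable bit.
    full_command = [sys.executable, str(root / "ci" / "build.py"), *args[1:]]
    return subprocess.run(full_command, cwd=root, check=False).returncode
```

Calling run_from_repo_root with the copied command string from any subdirectory of the checkout then behaves as if the command had been started in the root.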
In this case, you would run ci/build.py --platform ubuntu_build_cuda /work/runtime_functions.sh build_ubuntu_gpu_cuda8_cudnn5, which would produce output like the following image:
Test
Reproducing test failures requires an additional step because the MXNet binaries are not present in your local workspace.
Dependencies
First, these dependencies have to be generated before a test can be executed. They are provided by the stash commands, which are indicated by the message "Restore files previously stashed".
In this case, the stash is labelled mkldnn_gpu. The easiest way to map this to a build step is to open the Jenkinsfile and search for pack_lib('mkldnn_gpu'.
You will find a block like the following:
def compile_unix_mkldnn_gpu() {
  return ['GPU: MKLDNN': {
    node(NODE_LINUX_CPU) {
      ws('workspace/build-mkldnn-gpu') {
        timeout(time: max_time, unit: 'MINUTES') {
          utils.init_git()
          utils.docker_run('ubuntu_build_cuda', 'build_ubuntu_gpu_mkldnn', false)
          utils.pack_lib('mkldnn_gpu', mx_mkldnn_lib, true)
        }
      }
    }
  }]
}
This means that the build step you are looking for is called "GPU: MKLDNN". Now, please execute the steps described in the Build section above before continuing.
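The lookup described above can also be scripted. The following sketch searches Jenkinsfile text for the pack_lib call of a stash label and reads the step name and the docker_run arguments from the enclosing block. It assumes the Jenkinsfile follows the layout of the block shown above; the function name find_build_step is hypothetical, and other layouts may need different patterns.

```python
import re


def find_build_step(jenkinsfile_text, stash_label):
    """Return (step_name, platform, function) for the block that packs `stash_label`, or None."""
    lines = jenkinsfile_text.splitlines()
    needle = "pack_lib('" + stash_label + "'"
    for index, line in enumerate(lines):
        if needle not in line:
            continue
        step_name = platform = function = None
        # Scan upwards from the pack_lib call to the start of the enclosing step.
        for previous in reversed(lines[:index]):
            run = re.search(r"docker_run\('([^']+)',\s*'([^']+)'", previous)
            if run and platform is None:
                platform, function = run.group(1), run.group(2)
            name = re.search(r"return \['([^']+)':", previous)
            if name:
                step_name = name.group(1)
                break
        return step_name, platform, function
    return None
```

For the block above, find_build_step(text, 'mkldnn_gpu') yields the step name "GPU: MKLDNN" together with ubuntu_build_cuda and build_ubuntu_gpu_mkldnn. By analogy with the build command in the Build section, these two values would presumably be the --platform argument and the function name passed to /work/runtime_functions.sh; please verify this against the command printed in the CI output.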
Test execution
After the binaries have been generated successfully, please take the failed command from the screenshot above and execute it in the root of your MXNet workspace. In this case, you would run ci/build.py --nvidiadocker --platform ubuntu_gpu /work/runtime_functions.sh unittest_ubuntu_python2_gpu. Please note the parameter --nvidiadocker in this example. It indicates that this test requires a GPU and is thus only executable on an Ubuntu machine with Nvidia-Docker and a GPU installed. The result of this execution should look as follows:
Tips and Tricks
Repeating test execution
To check a test's robustness against flakiness, you might want to repeat its execution multiple times. This can be achieved with the MXNET_TEST_COUNT environment variable. The execution would look as follows:
MXNET_TEST_COUNT=10000 nosetests --logging-level=DEBUG --verbose -s test_module.py:test_op3
Setting a fixed test seed
To reproduce a test failure caused by random data, you can use the MXNET_TEST_SEED environment variable.
MXNET_TEST_SEED=2096230603 nosetests --logging-level=DEBUG --verbose -s test_module.py:test_op3
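To illustrate why these two variables help, here is a minimal stand-in for a test harness that honours them. It is not MXNet's actual implementation: the function name run_repeatedly is hypothetical, Python's random module stands in for the framework's random number generators, and the reuse of a fixed seed for every trial is an assumption.

```python
import os
import random


def run_repeatedly(test_fn, environ=None):
    """Call test_fn(rng) MXNET_TEST_COUNT times and return the seeds that were used."""
    environ = os.environ if environ is None else environ
    count = int(environ.get("MXNET_TEST_COUNT", "1"))
    fixed_seed = environ.get("MXNET_TEST_SEED")
    seeds = []
    for _ in range(count):
        # A fixed seed is reused for every trial; otherwise each trial draws a fresh one.
        seed = int(fixed_seed) if fixed_seed is not None else random.randrange(2**31)
        seeds.append(seed)
        try:
            test_fn(random.Random(seed))
        except AssertionError:
            # Report the seed so that the failing trial can be replayed later.
            print(f"failed with seed {seed}; rerun with MXNET_TEST_SEED={seed}")
            raise
    return seeds
```

With many trials and random seeds, a rarely failing input eventually shows up. With the reported seed fixed, the same random data is generated again, so the failure becomes repeatable.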
Using the Flakiness Checker
Another way to accomplish the above is the flakiness checker tool, which is currently located in the tools directory. It automatically sets the correct environment variables and infers the path of the test file.
Similar results to the above can be achieved using the following commands:
python tools/flakiness_checker.py test_module.test_op3
python tools/flakiness_checker.py test_module.test_op3 -s 2096230603
Usage documentation:
python tools/flakiness_checker.py [optional_arguments] <test-specifier>
where <test-specifier> is a string specifying which test to run. This can come in two formats:
- <file-name>.<test-name>, as is common in the GitHub repository (e.g. test_example.test_flaky)
- <directory>/<file>:<test-name>, like the input to nosetests (e.g. tests/python/unittest/test_example.py:test_flaky). Note: The directory can be either relative or absolute. Additionally, if the full path is not given, the script will search the given directory for the provided file.
Optional Arguments:
-h, --help                print built-in help message
-n N, --num-trials N      run test for N trials, instead of the default of 10,000
-s SEED, --seed SEED      use SEED as the test seed, rather than a random seed
Note: additional options will be added once the flaky test detector is deployed
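The two specifier formats and the options above can be sketched as follows. This is not the code of tools/flakiness_checker.py: the function names are hypothetical, the directory search for a file that is given without a full path is left out, and the nosetests flags are taken from the manual commands earlier in this guide.

```python
def parse_test_specifier(spec):
    """Split a test specifier into (file_path, test_name)."""
    if ":" in spec:
        # nosetests style: <directory>/<file>:<test-name>
        path, test_name = spec.rsplit(":", 1)
        return path, test_name
    if "." in spec:
        # repository style: <file-name>.<test-name>
        module, test_name = spec.rsplit(".", 1)
        return module + ".py", test_name
    raise ValueError("unrecognised test specifier: " + spec)


def build_invocation(spec, num_trials=10000, seed=None):
    """Return the environment variables and nosetests command for one checker run."""
    path, test_name = parse_test_specifier(spec)
    env = {"MXNET_TEST_COUNT": str(num_trials)}
    if seed is not None:
        env["MXNET_TEST_SEED"] = str(seed)
    command = ["nosetests", "--logging-level=DEBUG", "--verbose", "-s",
               f"{path}:{test_name}"]
    return env, command
```

For example, build_invocation("test_module.test_op3", seed=2096230603) returns an environment with both variables set and a command that targets test_module.py:test_op3.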
Troubleshooting
In case you run into any issues, please try the following steps:
Cleaning the workspace (including subrepos, be careful with data loss)
ci/docker/runtime_functions.sh clean_repo
Using signal handler to get stack traces:
Use -DUSE_SIGNAL_HANDLER=ON and, if needed, -DCMAKE_BUILD_TYPE=Debug as CMake arguments. If these options are not set, you can edit ci/docker/runtime_functions.sh so that it builds with them.
Stepping into the container
It is possible to step into the container to run commands manually. The output of the script includes the full docker command, with all required docker options set. Replace the final script in that command with /bin/bash, or omit it, to get a shell in the container.
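The replacement of the final script can be sketched as a small helper. It assumes that the printed docker command ends with a call to runtime_functions.sh, as in the commands used throughout this guide; the function name to_shell_command and the command in the usage example are purely illustrative.

```python
import shlex


def to_shell_command(printed_command, shell="/bin/bash"):
    """Replace the trailing script call of a printed docker command with a shell."""
    args = shlex.split(printed_command)
    for index, arg in enumerate(args):
        # Assumption: the script to replace is the runtime_functions.sh call.
        if arg.endswith("runtime_functions.sh"):
            return args[:index] + [shell]
    raise ValueError("no runtime_functions.sh call found in the command")


if __name__ == "__main__":
    # Hypothetical printed command; the real one contains more docker options.
    printed = "docker run --rm example_image /work/runtime_functions.sh unittest_ubuntu_python2_gpu"
    print(" ".join(to_shell_command(printed)))
```

An interactive shell generally also needs the -i and -t flags of docker run, so please check whether the printed command already contains them.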