Note: the references to the nose testing tool on this page are outdated, as the community has switched to pytest for testing. See the development guide.


This page provides tips and tricks for fixing flaky tests. Flaky tests are tests that fail intermittently on CI builds; they may indicate stability problems or improper handling of edge cases.

A Test Failure Log

Each and every flaky test occurrence should come with an error log, like this one:

======================================================================
FAIL: test_operator_gpu.test_sparse_dot
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/usr/local/lib/python3.5/dist-packages/nose/case.py", line 198, in runTest
    self.test(*self.arg)
  File "/usr/local/lib/python3.5/dist-packages/nose/util.py", line 620, in newfunc
    return func(*arg, **kw)
  File "/work/mxnet/tests/python/gpu/../unittest/common.py", line 157, in test_new
    orig_test(*args, **kwargs)
  File "/work/mxnet/tests/python/gpu/../unittest/test_sparse_operator.py", line 1343, in test_sparse_dot
    lhs_d, rhs_d, False, True)
  File "/work/mxnet/tests/python/gpu/../unittest/test_sparse_operator.py", line 1237, in test_infer_forward_stype
    assert_almost_equal(out.tostype('default').asnumpy(), out_np, rtol=1e-4, atol=1e-5)
  File "/work/mxnet/python/mxnet/test_utils.py", line 493, in assert_almost_equal
    raise AssertionError(msg)
AssertionError:
Items are not equal:
Error 1.067252 exceeds tolerance rtol=0.000100, atol=0.000010. Location of maximum error:(34, 15), a=0.022217, b=0.022204
a: array([[ -9.000863 , 4.9705057 , -2.7022123 , ..., -10.717851 , 19.614717 , 17.951117 ],
       [-19.35049 , 2.4999516 , -7.9741106 , ..., 15.310856 ,...
b: array([[ -9.000862 , 4.9705014 , -2.7022119 , ..., -10.717848 , 19.614723 , 17.951107 ],
       [-19.350492 , 2.4999518 , -7.9741096 , ..., 15.310851 ,...
-------------------- >> begin captured logging << --------------------
common: INFO: Setting test np/mx/python random seeds, use MXNET_TEST_SEED=578250488 to reproduce.
--------------------- >> end captured logging << ---------------------

The essential information in this log is the test name, the random seed used for the trial, the calculated error, and the tolerance levels used. Other potentially useful information includes the line of code at which the test failed and the position of the maximum error.

How to Reproduce Test Failures

Always make sure you're able to reproduce the test failure using the random seed and environment info before you jump to a fix. This part is covered in Reproducing test results.
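
For example, a minimal sketch of the seeding step, assuming you want to replay the trial from a Python session (the helper name is hypothetical; MXNet's test harness performs the equivalent seeding when MXNET_TEST_SEED is set in the environment):

import random
import numpy as np
import mxnet as mx

# Seed all three RNGs mentioned in the captured logging ("np/mx/python random
# seeds") so a single failing trial can be replayed. Helper name is illustrative.
def set_test_seed(seed):
    random.seed(seed)
    np.random.seed(seed)
    mx.random.seed(seed)

set_test_seed(578250488)  # the MXNET_TEST_SEED value reported in the log above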

How to Find the Root Cause of Flakiness

Possible Causes of Flakiness

Usually, flaky tests in MXNet are caused by one of three reasons:

  • Improper handling of edge cases revealed by random testing
  • Improper settings of tolerance levels
  • Possible race conditions in the code

Some flaky tests may have other causes, but the three reasons above cover nearly all of the cases encountered so far, so checking against them is the recommended first step in root-causing flakiness.

Finding the Root Cause of Flakiness

As noted above, the first thing to do at this stage is to try to attribute the failure to one of the three major causes. Then analyze the essential information from the test log. If the error value is small (close to 1, i.e. it only barely exceeds the tolerance; see Appendix III) and the tolerance levels are also very tight, the problem is usually with the tolerance settings. If the error is very large, it may indicate a problem with the implementation. If you are able to consistently reproduce the same error with the same seed, a race condition is unlikely; if the failure is not reproducible with a fixed seed, race conditions in the code become the prime suspect.

Take the log above as an example: the error is relatively small (close to 1) and the tolerance levels are quite tight, which means the difference between the actual and expected values exceeds the allowed margin by only a very small amount. We can therefore conclude that the root cause is an overly tight tolerance setting, and a quick fix is to bump up the tolerance levels. PR #12527 fixed this particular problem by bumping up the tolerance.
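
A hedged sketch of what such a fix looks like inside a test (the arrays below are placeholders standing in for out and out_np from the traceback, and the bumped values are illustrative rather than the exact ones used in PR #12527):

import numpy as np
from mxnet.test_utils import assert_almost_equal

# Placeholders for the operator output and the NumPy reference computation.
out_np = np.random.rand(64, 32).astype('float32')
out = out_np + np.float32(2e-5)   # small numerical discrepancy

# The traceback shows rtol=1e-4, atol=1e-5; a modest bump keeps the check tight
# while accommodating accumulated floating-point error.
assert_almost_equal(out, out_np, rtol=5e-4, atol=5e-5)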

How to Fix Flakiness in Tests

Before moving on to the actual fix, make sure that you have successfully reproduced the error and identified the root cause as described in the previous sections.

Here are tips for the three major root causes; flakiness with other causes should be handled case by case:

  • For failures caused by unhandled edge cases, make sure you not only fix the case that is causing the test failure, but also check for other possible edge cases and add test coverage for the same component.
  • For failures caused by improper tolerance levels, increase the tolerance to a reasonable value: e.g. if 1e-5 is too small, try 5e-5 or 1e-4 first; do not jump straight to 1e-3.
  • For failures caused by race conditions, add only the synchronization that is actually necessary, so that the performance impact stays small (see the sketch after this list).
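
As a hedged illustration of the synchronization point meant in the last item, assuming the flakiness comes from MXNet's asynchronous execution engine (the computation itself is purely illustrative):

import mxnet as mx

# Operators are enqueued and executed asynchronously by MXNet's engine.
a = mx.nd.ones((1024, 1024))
b = mx.nd.dot(a, a)

mx.nd.waitall()                      # explicit barrier: wait for all pending work
assert b.asnumpy()[0, 0] == 1024.0   # ones · ones over 1024 terms
# Note: copying to NumPy with .asnumpy() also synchronizes on that array implicitly.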

After you have made the necessary changes, please make sure that you:

  • Re-compile MXNet if necessary and verify the fix in the corresponding environment by running the same test for more than 10000 passes (this can now be done more easily with the Automated Flaky Test Detector); a minimal local stress loop is sketched after this list
  • Submit a PR according to Git Setup and Workflow and/or other instructions on how to contribute to MXNet.
  • Give your PR a proper title and remember to reference the tracking issue for the flaky test in the PR
  • Address any code review comments on your PR
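
A minimal local stress loop, assuming the test is an argument-free function you can import (the helper name, seeding scheme, and trial count below are illustrative; the Automated Flaky Test Detector automates the same idea on CI):

import random
import numpy as np
import mxnet as mx

def stress_test(test_fn, trials=10000):
    """Re-run test_fn with a fresh random seed each trial and report the first failure."""
    for i in range(trials):
        seed = random.getrandbits(31)
        random.seed(seed)
        np.random.seed(seed)
        mx.random.seed(seed)
        try:
            test_fn()
        except AssertionError:
            print('Trial %d failed with seed %d' % (i, seed))
            raise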

After the Fix is Merged

After you have addressed all comments on your PR, it should be good to merge. Once it is merged, check that the related GitHub issue is closed so that tracking stays accurate. Flakiness may still exist even after a fix is delivered, since 10000 trials may not be enough to cover every possible random seed, so please stay alert for future occurrences.

Appendix I: Location of Test Code

Logs usually come with a test name like "test_xxx.test_yyy", where "test_xxx" is the name of the test file and "test_yyy" is the name of the actual test. All test files are located under the "tests/python/" folder. Some test files, such as test_operator_gpu.py, import tests from other test files in order to run them in specific environments, so a test may not be defined in the file its name suggests. In that case, search within "tests/python/" to find where the actual test code lives before making changes for debugging or for fixing the flakiness.

Appendix II: Recommended Tolerance Levels

The recommended tolerance levels depend on the precision used in the test:

  • float16: atol 1e-2, rtol 1e-4
  • float32: atol 1e-3, rtol 1e-5
  • float64: atol 1e-3, rtol 1e-5

If you have to use tolerance levels higher than the above to make a test pass for more than 10000 trials, double-check the correctness of the implementation and of the test. That said, these values should be treated as a guideline rather than a standard, as some operators/components are prone to larger numerical errors.
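
For illustration, the guideline above written as the kind of dtype-dependent lookup a test might use (the dictionary name and the placeholder arrays are ours, not an existing MXNet helper):

import numpy as np
from mxnet.test_utils import assert_almost_equal

# Recommended starting tolerances per dtype, copied from the list above.
TOLERANCES = {
    np.float16: dict(rtol=1e-4, atol=1e-2),
    np.float32: dict(rtol=1e-5, atol=1e-3),
    np.float64: dict(rtol=1e-5, atol=1e-3),
}

dtype = np.float32
expected = np.random.rand(8, 8).astype(dtype)
actual = expected + dtype(1e-6)   # tiny numerical discrepancy
assert_almost_equal(actual, expected, **TOLERANCES[dtype])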

Appendix III: How Errors are Calculated

error = |expected - actual| / (rtol*|expected| + atol)
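
An error value greater than 1 therefore means the difference exceeds the combined tolerance. Plugging in the numbers from the example log at the top of this page (a = actual = 0.022217, b = expected = 0.022204, rtol = 1e-4, atol = 1e-5) gives roughly the reported value; the match is only approximate because the printed a and b are rounded:

# Worked example of the error formula above, using values from the example log.
actual, expected = 0.022217, 0.022204
rtol, atol = 1e-4, 1e-5
error = abs(expected - actual) / (rtol * abs(expected) + atol)
print(error)   # ~1.06, in line with the reported "Error 1.067252"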

