Deep learning frameworks are core libraries that are increasingly used in important real-world scenarios.  MXNet is no exception: it is used in a range of production environments, from embedded hardware with very little RAM to multi-million dollar web services handling tens of thousands of requests per second.  MXNet is written in C++ and, because of its high-performance, mathematical nature, performs a lot of raw pointer operations.  This introduces the potential for serious native coding errors to negatively affect important services and devices.  This document describes how we're attempting to avoid these errors by introducing a heavily instrumented build (an ASAN build) that's designed to catch overflows and leaks in our automated testing process.  It also describes how a developer can create an ASAN build and test for memory leaks locally when they're reported by users.



Background

ASAN

ASAN, or the address sanitizer, is one of many C++ sanitizers developed by Google with the primary initial goal of securing Chrome from use-after-free and buffer-overflow errors.  It was originally launched as a feature for clang, but is now available in recent versions of GCC.  ASAN on its own will detect:

  • Use after free errors
  • Heap buffer overflows
  • Stack buffer overflows
  • Global buffer overflows
  • Use after return
  • Use after scope
  • Initialization order bugs
  • Memory leaks

The advantage of ASAN compared to similar tools such as Valgrind is that it's fast and well supported.  This is also why it's the primary mechanism for detecting leaks in the communities that maintain both Chrome and Firefox.  For more information on the basics of and motivations for ASAN (and other sanitizers), check out this talk.

GCC versus Clang

The capabilities of ASAN are similar whether the compiler used is clang or g++, as long as you're using a very recent version of the compiler.  There's a good comparison of clang versus GCC 7, in terms of ASAN capabilities, here.  With MXNet we've tested various compiler versions and found that clang does not work well with our library.  We haven't completely tracked down the issue, but none of the methods of enabling ASAN that we tried worked with clang, including forcing clang to dynamically link the ASAN library.  Luckily GCC's ASAN seems to work correctly, and GCC 8 has the ASAN capabilities that we'd like to use.  Because of this we recommend using GCC 8 with ASAN when attempting to detect leaks or buffer overflows in MXNet, and we use GCC 8 in CI and in the Dockerfiles referred to below.

ASAN versus Valgrind

ASAN is often compared to Valgrind.  Both perform similar actions and test for similar failures.  There are several differences, but a major one is that ASAN works by instrumenting the code during compilation, whereas Valgrind works by instrumenting binaries after compilation.  This design choice allows ASAN to run much faster than Valgrind, which makes it easier to run real-world use cases.  Another major difference in the context of MXNet is that Valgrind does not support the F16C instructions, which are enabled by default in MXNet builds (meaning default builds are incompatible with Valgrind).  Additionally, sanitizers generally have better multithreading support than Valgrind.  ASAN has wide community adoption, and several large technology communities are actively contributing to it (Apple, Chrome, Firefox, Go, LLVM).  One disadvantage of ASAN is that it's a relatively new tool compared to Valgrind.  Because of this, it's best to use as up-to-date a version of ASAN as possible.

ASAN in CI

We're currently experimenting with a variety of methods for incorporating automatic ASAN checks into our CI system.  We will likely run these checks on every pull request while the detection is under active development, and then migrate them to nightly builds to reduce cost once development is stable.  Thus far we've only enabled one variant of MXNet's build types, a basic CPU build.  We hope this will serve as a basis for other developers to integrate ASAN with other MXNet build flavours.

CI Test: CPU ASAN - leak

MNIST C++ Training Test With Leak Detection

This CI task will inform users primarily about memory leaks.  To reduce output we are running a limited test and avoiding the use of Python.  Currently this means we're running training and detection on an MNIST network based on MLPs.  To do this we're using an ASAN-instrumented MXNet library and cpp-package executable.  Failures for this CI task will be displayed in the Jenkins logs, but will not fail the build.  The output is informational only, but could be used to help us bisect in the future (for example if we're investigating which commit may have introduced a new leak).

Example Output

...
+ ./mlp_cpu
[10:48:23] /work/mxnet/src/io/iter_mnist.cc:110: MNISTIter: load 60000 images, shuffle=1, shape=(100,784)
[10:48:24] /work/mxnet/src/io/iter_mnist.cc:110: MNISTIter: load 10000 images, shuffle=1, shape=(100,784)
[10:48:26] /work/mxnet/cpp-package/example/mlp_cpu.cpp:134: Epoch: 0 30241.9 samples/sec Accuracy: 0.1135
[10:48:28] /work/mxnet/cpp-package/example/mlp_cpu.cpp:134: Epoch: 1 32715.4 samples/sec Accuracy: 0.5632
[10:48:30] /work/mxnet/cpp-package/example/mlp_cpu.cpp:134: Epoch: 2 32573.3 samples/sec Accuracy: 0.8454
[10:48:32] /work/mxnet/cpp-package/example/mlp_cpu.cpp:134: Epoch: 3 32397.4 samples/sec Accuracy: 0.8792
[10:48:34] /work/mxnet/cpp-package/example/mlp_cpu.cpp:134: Epoch: 4 32362.5 samples/sec Accuracy: 0.9125
[10:48:36] /work/mxnet/cpp-package/example/mlp_cpu.cpp:134: Epoch: 5 32345 samples/sec Accuracy: 0.9245
[10:48:38] /work/mxnet/cpp-package/example/mlp_cpu.cpp:134: Epoch: 6 32591 samples/sec Accuracy: 0.9309
[10:48:40] /work/mxnet/cpp-package/example/mlp_cpu.cpp:134: Epoch: 7 32485.1 samples/sec Accuracy: 0.9362
[10:48:42] /work/mxnet/cpp-package/example/mlp_cpu.cpp:134: Epoch: 8 32591 samples/sec Accuracy: 0.9392
[10:48:44] /work/mxnet/cpp-package/example/mlp_cpu.cpp:134: Epoch: 9 32573.3 samples/sec Accuracy: 0.9417

=================================================================
==34==ERROR: LeakSanitizer: detected memory leaks

Direct leak of 12560000 byte(s) in 80 object(s) allocated from:
    #0 0x7fedd84c7980 in __interceptor_posix_memalign (/usr/lib/x86_64-linux-gnu/libasan.so.5+0xee980)
    #1 0x159677c in mxnet::storage::CPUDeviceStorage::Alloc(unsigned long) (/work/mxnet/build/cpp-package/example/mlp_cpu+0x159677c)
    #2 0x15a4faa in mxnet::storage::NaiveStorageManager<mxnet::storage::CPUDeviceStorage>::Alloc(mxnet::Storage::Handle*) /work/mxnet/src/storage/./naive_storage_manager.h:61
    #3 0x15913c8 in mxnet::StorageImpl::Alloc(mxnet::Storage::Handle*) /work/mxnet/src/storage/storage.cc:147
    #4 0x14250eb in mxnet::Storage::Alloc(unsigned long, mxnet::Context) /work/mxnet/include/mxnet/./storage.h:70
    #5 0x14f631f in mxnet::NDArray::Chunk::CheckAndAlloc() /work/mxnet/include/mxnet/ndarray.h:882
    #6 0x166fe69 in mxnet::NDArray::Chunk::Chunk(nnvm::TShape, mxnet::Context, bool, int) /work/mxnet/include/mxnet/ndarray.h:784
    #7 0x167fa79 in void __gnu_cxx::new_allocator<mxnet::NDArray::Chunk>::construct<mxnet::NDArray::Chunk, nnvm::TShape const&, mxnet::Context&, bool&, int&>(mxnet::NDArray::Chunk*, nnvm::TShape const&, mxnet::Context&, bool&, int&) /usr/include/c++/8/ext/new_allocator.h:136
    #8 0x167f1cd in void std::allocator_traits<std::allocator<mxnet::NDArray::Chunk> >::construct<mxnet::NDArray::Chunk, nnvm::TShape const&, mxnet::Context&, bool&, int&>(std::allocator<mxnet::NDArray::Chunk>&, mxnet::NDArray::Chunk*, nnvm::TShape const&, mxnet::Context&, bool&, int&) /usr/include/c++/8/bits/alloc_traits.h:475
    #9 0x167e739 in std::_Sp_counted_ptr_inplace<mxnet::NDArray::Chunk, std::allocator<mxnet::NDArray::Chunk>, (__gnu_cxx::_Lock_policy)2>::_Sp_counted_ptr_inplace<nnvm::TShape const&, mxnet::Context&, bool&, int&>(std::allocator<mxnet::NDArray::Chunk>, nnvm::TShape const&, mxnet::Context&, bool&, int&) /usr/include/c++/8/bits/shared_ptr_base.h:549
    #10 0x167cdb8 in std::__shared_count<(__gnu_cxx::_Lock_policy)2>::__shared_count<mxnet::NDArray::Chunk, std::allocator<mxnet::NDArray::Chunk>, nnvm::TShape const&, mxnet::Context&, bool&, int&>(std::_Sp_make_shared_tag, mxnet::NDArray::Chunk*, std::allocator<mxnet::NDArray::Chunk> const&, nnvm::TShape const&, mxnet::Context&, bool&, int&) /usr/include/c++/8/bits/shared_ptr_base.h:662
    #11 0x167ab03 in std::__shared_ptr<mxnet::NDArray::Chunk, (__gnu_cxx::_Lock_policy)2>::__shared_ptr<std::allocator<mxnet::NDArray::Chunk>, nnvm::TShape const&, mxnet::Context&, bool&, int&>(std::_Sp_make_shared_tag, std::allocator<mxnet::NDArray::Chunk> const&, nnvm::TShape const&, mxnet::Context&, bool&, int&) /usr/include/c++/8/bits/shared_ptr_base.h:1328
    #12 0x1677cdc in std::shared_ptr<mxnet::NDArray::Chunk>::shared_ptr<std::allocator<mxnet::NDArray::Chunk>, nnvm::TShape const&, mxnet::Context&, bool&, int&>(std::_Sp_make_shared_tag, std::allocator<mxnet::NDArray::Chunk> const&, nnvm::TShape const&, mxnet::Context&, bool&, int&) /usr/include/c++/8/bits/shared_ptr.h:360
    #13 0x1674016 in std::shared_ptr<mxnet::NDArray::Chunk> std::allocate_shared<mxnet::NDArray::Chunk, std::allocator<mxnet::NDArray::Chunk>, nnvm::TShape const&, mxnet::Context&, bool&, int&>(std::allocator<mxnet::NDArray::Chunk> const&, nnvm::TShape const&, mxnet::Context&, bool&, int&) /usr/include/c++/8/bits/shared_ptr.h:707
    #14 0x16717b1 in std::shared_ptr<mxnet::NDArray::Chunk> std::make_shared<mxnet::NDArray::Chunk, nnvm::TShape const&, mxnet::Context&, bool&, int&>(nnvm::TShape const&, mxnet::Context&, bool&, int&) /usr/include/c++/8/bits/shared_ptr.h:723
    #15 0x166f2ca in mxnet::NDArray::NDArray(nnvm::TShape const&, mxnet::Context, bool, int) /work/mxnet/include/mxnet/ndarray.h:98
    #16 0x68de1e6 in mxnet::io::PrefetcherIter::Init(std::vector<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > > > const&)::{lambda(mxnet::DataBatch**)#1}::operator()(mxnet::DataBatch**) const (/work/mxnet/build/cpp-package/example/mlp_cpu+0x68de1e6)
    #17 0x69102e5 in std::_Function_handler<bool (mxnet::DataBatch**), mxnet::io::PrefetcherIter::Init(std::vector<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > > > const&)::{lambda(mxnet::DataBatch**)#1}>::_M_invoke(std::_Any_data const&, mxnet::DataBatch**&&) /usr/include/c++/8/bits/std_function.h:282
    #18 0x6910bc2 in std::function<bool (mxnet::DataBatch**)>::operator()(mxnet::DataBatch**) const /usr/include/c++/8/bits/std_function.h:687
    #19 0x6907121 in dmlc::ThreadedIter<mxnet::DataBatch>::Init(std::function<bool (mxnet::DataBatch**)>, std::function<void ()>)::{lambda()#1}::operator()() const /work/mxnet/3rdparty/dmlc-core/include/dmlc/threadediter.h:357
    #20 0x6914e3c in void std::__invoke_impl<void, dmlc::ThreadedIter<mxnet::DataBatch>::Init(std::function<bool (mxnet::DataBatch**)>, std::function<void ()>)::{lambda()#1}>(std::__invoke_other, dmlc::ThreadedIter<mxnet::DataBatch>::Init(std::function<bool (mxnet::DataBatch**)>, std::function<void ()>)::{lambda()#1}&&) /usr/include/c++/8/bits/invoke.h:60
    #21 0x6910e12 in std::__invoke_result<dmlc::ThreadedIter<mxnet::DataBatch>::Init(std::function<bool (mxnet::DataBatch**)>, std::function<void ()>)::{lambda()#1}>::type std::__invoke<dmlc::ThreadedIter<mxnet::DataBatch>::Init(std::function<bool (mxnet::DataBatch**)>, std::function<void ()>)::{lambda()#1}>(std::__invoke_result&&, (dmlc::ThreadedIter<mxnet::DataBatch>::Init(std::function<bool (mxnet::DataBatch**)>, std::function<void ()>)::{lambda()#1}&&)...) /usr/include/c++/8/bits/invoke.h:95
    #22 0x691f21f in decltype (__invoke((_S_declval<0ul>)())) std::thread::_Invoker<std::tuple<dmlc::ThreadedIter<mxnet::DataBatch>::Init(std::function<bool (mxnet::DataBatch**)>, std::function<void ()>)::{lambda()#1}> >::_M_invoke<0ul>(std::_Index_tuple<0ul>) /usr/include/c++/8/thread:234
    #23 0x691f1ac in std::thread::_Invoker<std::tuple<dmlc::ThreadedIter<mxnet::DataBatch>::Init(std::function<bool (mxnet::DataBatch**)>, std::function<void ()>)::{lambda()#1}> >::operator()() /usr/include/c++/8/thread:243
    #24 0x691ee15 in std::thread::_State_impl<std::thread::_Invoker<std::tuple<dmlc::ThreadedIter<mxnet::DataBatch>::Init(std::function<bool (mxnet::DataBatch**)>, std::function<void ()>)::{lambda()#1}> > >::_M_run() /usr/include/c++/8/thread:186
    #25 0x7fedd5c5153e  (/usr/lib/x86_64-linux-gnu/libstdc++.so.6+0xbd53e)

...

Direct leak of 3560 byte(s) in 1 object(s) allocated from:
    #0 0x7fedd84c8970 in operator new[](unsigned long) (/usr/lib/x86_64-linux-gnu/libasan.so.5+0xef970)
    #1 0x1792b24 in mxnet::profiler::Profiler::Profiler() /work/mxnet/src/profiler/profiler.cc:70
    #2 0x17b1b20 in void __gnu_cxx::new_allocator<mxnet::profiler::Profiler>::construct<mxnet::profiler::Profiler>(mxnet::profiler::Profiler*) /usr/include/c++/8/ext/new_allocator.h:136
    #3 0x17b0bb4 in void std::allocator_traits<std::allocator<mxnet::profiler::Profiler> >::construct<mxnet::profiler::Profiler>(std::allocator<mxnet::profiler::Profiler>&, mxnet::profiler::Profiler*) /usr/include/c++/8/bits/alloc_traits.h:475
    #4 0x17af6f5 in std::_Sp_counted_ptr_inplace<mxnet::profiler::Profiler, std::allocator<mxnet::profiler::Profiler>, (__gnu_cxx::_Lock_policy)2>::_Sp_counted_ptr_inplace<>(std::allocator<mxnet::profiler::Profiler>) (/work/mxnet/build/cpp-package/example/mlp_cpu+0x17af6f5)
    #5 0x17ad0ca in std::__shared_count<(__gnu_cxx::_Lock_policy)2>::__shared_count<mxnet::profiler::Profiler, std::allocator<mxnet::profiler::Profiler>>(std::_Sp_make_shared_tag, mxnet::profiler::Profiler*, std::allocator<mxnet::profiler::Profiler> const&) (/work/mxnet/build/cpp-package/example/mlp_cpu+0x17ad0ca)
    #6 0x17aa6bb in std::__shared_ptr<mxnet::profiler::Profiler, (__gnu_cxx::_Lock_policy)2>::__shared_ptr<std::allocator<mxnet::profiler::Profiler>>(std::_Sp_make_shared_tag, std::allocator<mxnet::profiler::Profiler> const&) (/work/mxnet/build/cpp-package/example/mlp_cpu+0x17aa6bb)
    #7 0x17a70ab in std::shared_ptr<mxnet::profiler::Profiler>::shared_ptr<std::allocator<mxnet::profiler::Profiler>>(std::_Sp_make_shared_tag, std::allocator<mxnet::profiler::Profiler> const&) /usr/include/c++/8/bits/shared_ptr.h:360
    #8 0x17a4112 in std::shared_ptr<mxnet::profiler::Profiler> std::allocate_shared<mxnet::profiler::Profiler, std::allocator<mxnet::profiler::Profiler>>(std::allocator<mxnet::profiler::Profiler> const&) /usr/include/c++/8/bits/shared_ptr.h:707
    #9 0x17a091d in std::shared_ptr<mxnet::profiler::Profiler> std::make_shared<mxnet::profiler::Profiler>() /usr/include/c++/8/bits/shared_ptr.h:723
    #10 0x1793815 in mxnet::profiler::Profiler::Get(std::shared_ptr<mxnet::profiler::Profiler>*) /work/mxnet/src/profiler/profiler.cc:106
    #11 0x15d7a7c in mxnet::on_enter_api(char const*) /work/mxnet/src/c_api/c_api_profile.cc:123
    #12 0x1686f64 in MXListDataIters /work/mxnet/src/c_api/c_api.cc:683
    #13 0x13d9b28 in mxnet::cpp::MXDataIterMap::MXDataIterMap() (/work/mxnet/build/cpp-package/example/mlp_cpu+0x13d9b28)
    #14 0x13da6f2 in mxnet::cpp::MXDataIter::mxdataiter_map() (/work/mxnet/build/cpp-package/example/mlp_cpu+0x13da6f2)
    #15 0x13da868 in mxnet::cpp::MXDataIter::MXDataIter(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) (/work/mxnet/build/cpp-package/example/mlp_cpu+0x13da868)
    #16 0x13b1e13 in main /work/mxnet/cpp-package/example/mlp_cpu.cpp:65
    #17 0x7fedd52c982f in __libc_start_main (/lib/x86_64-linux-gnu/libc.so.6+0x2082f)

...

SUMMARY: AddressSanitizer: 22682224 byte(s) leaked in 272 allocation(s).

...


CI Test: CPU ASAN - buffers

Python integration tests without leak detection

This CI task runs a large suite of Python tests, but ignores memory leaks to simplify the ASAN output.  The purpose of this CI test is to detect serious memory errors such as buffer overflows, and it will fail when such errors are detected.

Example Output

This test is currently disabled.

Using ASAN builds with MXNet

CI reports are nice, but it's sometimes more useful to create an ASAN build locally and run the specific sections of code you suspect may be leaking.  This is easy to do with MXNet.  We've installed all the prerequisites in our CI build Dockerfiles, so we can use Docker to create ASAN builds without having to install or configure dependencies.  To create a CPU build with ASAN, run the following commands in a new folder:


git clone --recursive https://github.com/apache/incubator-mxnet.git
cd incubator-mxnet/ci
# Build our dockerfile with all required deps for ASAN
docker build -f docker/Dockerfile.build.ubuntu_cpu -t mxnetci/build.ubuntu_cpu docker
cd ..
mkdir -p build
# Build an ASAN instrumented MXNet library
# --privileged is probably not required for all steps, but in general ASAN requires some capabilities in order to inspect process memory.
docker run --privileged -v `pwd`:/work/mxnet -v `pwd`/build:/work/build  -ti mxnetci/build.ubuntu_cpu /work/runtime_functions.sh build_ubuntu_cpu_cmake_asan

# Now choose an example of something you'd like to test with ASAN.  This could be a specific Python test you want to run in a loop, a C++ unit test written to expose leaks, etc.
# In our case we use a small Gluon Python test as an example.
# First we will enter our container, and then we'll run tests within the container.
docker run --privileged -v `pwd`:/work/mxnet -v `pwd`/build:/work/build  -ti mxnetci/build.ubuntu_cpu bash

# Now within the container:
export PYTHONPATH=./python/
export MXNET_MKLDNN_DEBUG=1  # Ignored if not present
export MXNET_STORAGE_FALLBACK_LOG_VERBOSE=0
# Feel free to export any ASAN options via export ASAN_OPTIONS=...

# Importantly we need to make sure ASAN is the first library loaded (before our other libraries have a chance to allocate memory)
# To do this we add the library to the library preload list
export LD_PRELOAD=/usr/lib/x86_64-linux-gnu/libasan.so.5
nosetests-3.4 --verbose tests/python/unittest/test_rnn.py


The output should look similar to:


root@e77a083c4d00:/work/mxnet# nosetests-3.4 --verbose tests/python/unittest/test_rnn.py
test_rnn.test_deprecated ... ok
test_rnn.test_rnn ... ok
test_rnn.test_lstm ... ok
test_rnn.test_lstm_forget_bias ... ok
test_rnn.test_gru ... ok
test_rnn.test_residual ... ok
test_rnn.test_residual_bidirectional ... ok
test_rnn.test_stack ... ok
test_rnn.test_bidirectional ... ok
test_rnn.test_zoneout ... ok
test_rnn.test_unfuse ... ok
test_rnn.test_convrnn ... ok
test_rnn.test_convlstm ... ok
test_rnn.test_convgru ... ok
test_rnn.test_encode_sentences ... ok

----------------------------------------------------------------------
Ran 15 tests in 0.253s

OK

=================================================================
==93==ERROR: LeakSanitizer: detected memory leaks

Direct leak of 2414848 byte(s) in 572 object(s) allocated from:
    #0 0x7f7bacb38b60 in malloc (/usr/lib/x86_64-linux-gnu/libasan.so.5+0xedb60)
    #1 0x59c889  (/usr/bin/python3.5+0x59c889)

...

Direct leak of 1640 byte(s) in 1 object(s) allocated from:
    #0 0x7f7bacb3a970 in operator new[](unsigned long) (/usr/lib/x86_64-linux-gnu/libasan.so.5+0xef970)
    #1 0x7f7b53007bf0 in mxnet::profiler::Profiler::Profiler() /work/mxnet/src/profiler/profiler.cc:70
    #2 0x7f7b53028372 in void __gnu_cxx::new_allocator<mxnet::profiler::Profiler>::construct<mxnet::profiler::Profiler>(mxnet::profiler::Profiler*) (/work/mxnet/python/mxnet/../../build/libmxnet.so+0x68db372)
    #3 0x7f7b53027316 in void std::allocator_traits<std::allocator<mxnet::profiler::Profiler> >::construct<mxnet::profiler::Profiler>(std::allocator<mxnet::profiler::Profiler>&, mxnet::profiler::Profiler*) /usr/include/c++/8/bits/alloc_traits.h:475
    #4 0x7f7b53025b9d in std::_Sp_counted_ptr_inplace<mxnet::profiler::Profiler, std::allocator<mxnet::profiler::Profiler>, (__gnu_cxx::_Lock_policy)2>::_Sp_counted_ptr_inplace<>(std::allocator<mxnet::profiler::Profiler>) /usr/include/c++/8/bits/shared_ptr_base.h:549
    #5 0x7f7b530232fe in std::__shared_count<(__gnu_cxx::_Lock_policy)2>::__shared_count<mxnet::profiler::Profiler, std::allocator<mxnet::profiler::Profiler>>(std::_Sp_make_shared_tag, mxnet::profiler::Profiler*, std::allocator<mxnet::profiler::Profiler> const&) /usr/include/c++/8/bits/shared_ptr_base.h:662
    #6 0x7f7b5302053f in std::__shared_ptr<mxnet::profiler::Profiler, (__gnu_cxx::_Lock_policy)2>::__shared_ptr<std::allocator<mxnet::profiler::Profiler>>(std::_Sp_make_shared_tag, std::allocator<mxnet::profiler::Profiler> const&) /usr/include/c++/8/bits/shared_ptr_base.h:1328
    #7 0x7f7b5301cc03 in std::shared_ptr<mxnet::profiler::Profiler>::shared_ptr<std::allocator<mxnet::profiler::Profiler>>(std::_Sp_make_shared_tag, std::allocator<mxnet::profiler::Profiler> const&) /usr/include/c++/8/bits/shared_ptr.h:360
    #8 0x7f7b53019822 in std::shared_ptr<mxnet::profiler::Profiler> std::allocate_shared<mxnet::profiler::Profiler, std::allocator<mxnet::profiler::Profiler>>(std::allocator<mxnet::profiler::Profiler> const&) /usr/include/c++/8/bits/shared_ptr.h:707
    #9 0x7f7b53015d3b in std::shared_ptr<mxnet::profiler::Profiler> std::make_shared<mxnet::profiler::Profiler>() /usr/include/c++/8/bits/shared_ptr.h:723
    #10 0x7f7b530088e1 in mxnet::profiler::Profiler::Get(std::shared_ptr<mxnet::profiler::Profiler>*) /work/mxnet/src/profiler/profiler.cc:106
    #11 0x7f7b531d9778 in mxnet::engine::ThreadedEngine::ThreadedEngine() /work/mxnet/src/engine/./threaded_engine.h:310
    #12 0x7f7b531dc9ac in mxnet::engine::ThreadedEnginePerDevice::ThreadedEnginePerDevice() /work/mxnet/src/engine/threaded_engine_perdevice.cc:54
    #13 0x7f7b531d73a9 in mxnet::engine::CreateThreadedEnginePerDevice() /work/mxnet/src/engine/threaded_engine_perdevice.cc:342
    #14 0x7f7b5320633e in mxnet::engine::CreateEngine() /work/mxnet/src/engine/engine.cc:45
    #15 0x7f7b53205e4c in mxnet::Engine::_GetSharedRef() /work/mxnet/src/engine/engine.cc:62
    #16 0x7f7b53205fab in mxnet::Engine::Get() /work/mxnet/src/engine/engine.cc:67
    #17 0x7f7b5308422c in mxnet::LibraryInitializer::LibraryInitializer()::{lambda()#1}::operator()() const /work/mxnet/src/initialize.cc:54
    #18 0x7f7b5308428d in mxnet::LibraryInitializer::LibraryInitializer()::{lambda()#1}::_FUN() /work/mxnet/src/initialize.cc:55
    #19 0x7f7bac5303a4 in __fork (/lib/x86_64-linux-gnu/libc.so.6+0xcc3a4)
    #20 0x5e9abc  (/usr/bin/python3.5+0x5e9abc)
...

SUMMARY: AddressSanitizer: 7521184 byte(s) leaked in 3071 allocation(s).


Once the test is run, a developer will have to look at the stacks of memory allocations and determine which are important and which are working as designed.  All of the reports in the sample output display memory leak information, but other bugs will also be output and included in the summary if they are found.  If you are trying to reproduce a memory leak reported by users, the recommended approach is to emulate the user's use case and run it in a loop with ASAN enabled.  You can then reduce the scope (for example to a C++ unit test) while continuing to run in a loop.  Eventually the stack traces should make it clear why the leak occurs.
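
One way to make that triage quicker is to rank the reported leaks by size and show the first MXNet frame in each stack.  The following is a minimal Python sketch, not part of MXNet or ASAN.  The names parse_leaks and rank_leaks and the "mxnet::" marker are assumptions made for illustration, and the parser only understands the "Direct leak of ..." / "Indirect leak of ..." layout shown in the sample output above.

```python
import re
from collections import namedtuple

# Hypothetical record type for one leak entry in a LeakSanitizer report.
Leak = namedtuple("Leak", ["kind", "nbytes", "objects", "frames"])

HEADER = re.compile(r"^(Direct|Indirect) leak of (\d+) byte\(s\) in (\d+) object\(s\)")
# A frame looks like "#3 0x15913c8 in symbol location"; the "in symbol" part may be missing.
FRAME = re.compile(r"^\s*#\d+\s+0x[0-9a-f]+\s+(?:in\s+(.*))?")


def parse_leaks(report):
    """Split the text of a LeakSanitizer report into Leak records."""
    leaks = []
    for line in report.splitlines():
        header = HEADER.match(line)
        if header:
            leaks.append(Leak(header.group(1), int(header.group(2)),
                              int(header.group(3)), []))
            continue
        frame = FRAME.match(line)
        if frame and leaks:
            leaks[-1].frames.append(frame.group(1) or "<unknown>")
    return leaks


def rank_leaks(report, marker="mxnet::"):
    """Return (bytes, objects, first frame containing marker), largest leak first."""
    rows = []
    for leak in parse_leaks(report):
        hit = next((f for f in leak.frames if marker in f), None)
        rows.append((leak.nbytes, leak.objects, hit or "<no %s frame>" % marker))
    return sorted(rows, key=lambda row: row[0], reverse=True)


# Two entries shortened from the sample output in this document.
SAMPLE = """\
Direct leak of 2414848 byte(s) in 572 object(s) allocated from:
    #0 0x7f7bacb38b60 in malloc (/usr/lib/x86_64-linux-gnu/libasan.so.5+0xedb60)
    #1 0x59c889  (/usr/bin/python3.5+0x59c889)

Direct leak of 1640 byte(s) in 1 object(s) allocated from:
    #0 0x7f7bacb3a970 in operator new[](unsigned long) (/usr/lib/x86_64-linux-gnu/libasan.so.5+0xef970)
    #1 0x7f7b53007bf0 in mxnet::profiler::Profiler::Profiler() /work/mxnet/src/profiler/profiler.cc:70
"""

if __name__ == "__main__":
    for nbytes, objects, frame in rank_leaks(SAMPLE):
        print(nbytes, objects, frame)
```

Entries with no MXNet frame, such as the allocation made inside the Python interpreter in the sample, are usually the first candidates to set aside.  Running the same workload with a larger loop count and comparing the ranked lists can also help separate one-time allocations from leaks that grow.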

MISC

Running in CLion

If you prefer debugging in an IDE, or you wish to break on ASAN errors, you can use CLion.  You'll need to install all the required dependencies manually (you can reference the CI Dockerfiles for help).  On Ubuntu 16.04 this is as simple as installing gcc-8 and setting it as the project compiler in CLion.  You can then add the -DUSE_ASAN build flag to your project to enable ASAN support.  Finally, you must add LD_PRELOAD to the run environment variables of your launch target and point it at the version of the ASAN library you have installed (libasan.so.5 for GCC 8).
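
If you're unsure where the ASAN library lives on your machine, GCC can be asked directly through its -print-file-name option.  The Python sketch below wraps that call.  The function name find_libasan, the default compiler name gcc-8, and the injectable run parameter are assumptions for illustration.  The path GCC reports may be a symlink and may differ from the /usr/lib/x86_64-linux-gnu/libasan.so.5 path used earlier, so check it before using it in LD_PRELOAD.

```python
import os
import subprocess


def find_libasan(compiler="gcc-8", soname="libasan.so", run=subprocess.check_output):
    """Ask the compiler where its ASAN runtime lives.

    Returns an absolute path, or None when the compiler does not know the file.
    The `run` parameter exists only so the lookup can be exercised without GCC.
    """
    out = run([compiler, "-print-file-name=" + soname])
    path = out.decode().strip()
    # GCC echoes the bare file name back when it cannot find the library.
    return path if os.path.isabs(path) else None
```

On a machine with gcc-8 installed, calling find_libasan() with no arguments should return the path to use in LD_PRELOAD, or None if that GCC was built without the ASAN runtime.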

Disabling Memory Leak Detection

With an ASAN build, leaks are reported when running almost any MXNet test.  If you want to focus on the possibly more important memory errors, such as buffer overflows, you can turn off leak detection by setting ASAN_OPTIONS=detect_leaks=0.
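
When a test is launched from a script rather than an interactive shell, the same settings can be applied to the child process environment.  This is a minimal Python sketch under stated assumptions: asan_options, asan_env and run_with_asan are hypothetical helper names, and ASAN_OPTIONS is treated as a colon-separated list so that any options you have already exported are kept.

```python
import os
import subprocess


def asan_options(existing, detect_leaks):
    """Return an ASAN_OPTIONS string with detect_leaks set and other options kept."""
    kept = [opt for opt in existing.split(":")
            if opt and not opt.startswith("detect_leaks=")]
    kept.append("detect_leaks=%d" % int(detect_leaks))
    return ":".join(kept)


def asan_env(libasan, detect_leaks=True, base=None):
    """Copy an environment and add the LD_PRELOAD and ASAN_OPTIONS entries."""
    env = dict(os.environ if base is None else base)
    # Preload ASAN so it is loaded before any other library allocates memory.
    env["LD_PRELOAD"] = libasan
    env["ASAN_OPTIONS"] = asan_options(env.get("ASAN_OPTIONS", ""), detect_leaks)
    return env


def run_with_asan(cmd, libasan, detect_leaks=True):
    """Run cmd (a list of arguments) under ASAN and return its exit code."""
    return subprocess.call(cmd, env=asan_env(libasan, detect_leaks))
```

For example, run_with_asan(["nosetests-3.4", "--verbose", "tests/python/unittest/test_rnn.py"], "/usr/lib/x86_64-linux-gnu/libasan.so.5", detect_leaks=False) would repeat the earlier test run with leak reports suppressed.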

Other Sanitizers

After enabling and addressing issues reported by ASAN we can enable other sanitizers following the same template.  The two most applicable sanitizers are described below.

TSAN

TSAN is a sanitizer that detects data races and other thread-safety errors in native libraries.  TSAN works in a similar fashion to ASAN in that it instruments builds at compile time.  It then uses this instrumented code to track memory accesses and synchronization between threads, and reports accesses that are not thread-safe.  TSAN supports C++11 atomics and other modern C++ features.  TSAN has more overhead (especially in memory usage) than ASAN.

MSAN

MSAN detects uninitialized memory accesses.  This could help us reduce errors in MXNet, especially difficult-to-reproduce, non-deterministic errors.  MSAN has a slowdown of roughly 3x when it instruments MXNet.



2 Comments

  1. Very useful feature for MXNet and a very nicely written doc.  I suggest sending it out to dev@.

    I have a few questions

    1) Since the leaks are detected only when code is exercised, do you think we should have a nightly build that runs Python examples against the ASAN binary?  We could start collecting the data and analyze as much as we can.  At least we'll know where we stand.

    2) In my previous life, using Valgrind was a nightmare and it threw many false positives.  How is it with ASAN?

  2. 1)  Yes we should run this nightly in the future.  The speed isn't actually that bad so we could even run it per PR while we work on these issues actively.  It would help us verify that a PR fixes a leak for example.

    2)  I agree Valgrind used to be not a great tool.  I've heard it works better these days, but I've switched to ASAN and haven't looked back.  I'm sure there are occasional ASAN false positives, but the maintainers are pretty quick to fix them.  So far I haven't seen anything that looked unreasonable when run against MXNet code.