Problem Statement

MXNet leverages multithreading on CPUs and GPUs in several areas. The main ones are the dependency engine, which runs operators in parallel; the implementation logic inside individual operators; and data loading through iterators. This design helps MXNet achieve great performance, but it adds some usability challenges. Below I demonstrate two scenarios where MXNet doesn't handle exceptions gracefully and causes the main thread to crash.

Example 1

Code Block
import mxnet as mx
mx.nd.random_normal(0, -1, (2,3))


Code Block
terminate called after throwing an instance of 'dmlc::Error'
  what():  [02:32:04] ../src/engine/./threaded_engine.h:359: [02:32:04] ../src/operator/random/./sample_op.h:301: Check failed: param.scale > 0 (-1 vs. 0) scale parameter in gaussian has to be positive
Stack trace returned 10 entries:
[bt] (0) /home/ubuntu/sparse_support/mxnet/python/mxnet/../../build/ [0x7eff0140bf5b]
[bt] (1) /home/ubuntu/sparse_support/mxnet/python/mxnet/../../build/ [0x7eff0140c242]
[bt] (2) /home/ubuntu/sparse_support/mxnet/python/mxnet/../../build/ [0x7eff01a56c8a]
[bt] (3) /home/ubuntu/sparse_support/mxnet/python/mxnet/../../build/ [0x7eff01a4e8ca]
[bt] (4) /home/ubuntu/sparse_support/mxnet/python/mxnet/../../build/ [0x7eff01606165]
[bt] (5) /home/ubuntu/sparse_support/mxnet/python/mxnet/../../build/ [0x7eff03d1732c]
[bt] (6) /home/ubuntu/sparse_support/mxnet/python/mxnet/../../build/ [0x7eff03e691f6]
[bt] (7) /home/ubuntu/sparse_support/mxnet/python/mxnet/../../build/ [0x7eff03e712f5]
[bt] (8) /home/ubuntu/sparse_support/mxnet/python/mxnet/../../build/ [0x7eff03c731fc]
[bt] (9) /home/ubuntu/sparse_support/mxnet/python/mxnet/../../build/ [0x7eff03c90f33]
A fatal error occurred in asynchronous engine operation. If you do not know what caused this error, you can try set environment variable MXNET_ENGINE_TYPE to NaiveEngine and run with debugger (i.e. gdb). This will force all operations to be synchronous and backtrace will give you the series of calls that lead to this error. Remember to set MXNET_ENGINE_TYPE back to empty after debugging.
Stack trace returned 10 entries:
[bt] (0) /home/ubuntu/sparse_support/mxnet/python/mxnet/../../build/ [0x7eff0140bf5b]
[bt] (1) /home/ubuntu/sparse_support/mxnet/python/mxnet/../../build/ [0x7eff0140c242]
[bt] (2) /home/ubuntu/sparse_support/mxnet/python/mxnet/../../build/ [0x7eff03c7cb44]
[bt] (3) /home/ubuntu/sparse_support/mxnet/python/mxnet/../../build/ [0x7eff03c878c9]
[bt] (4) /home/ubuntu/sparse_support/mxnet/python/mxnet/../../build/ [0x7eff03c85774]
[bt] (5) /home/ubuntu/sparse_support/mxnet/python/mxnet/../../build/ [0x7eff03c8a424]
[bt] (6) /home/ubuntu/sparse_support/mxnet/python/mxnet/../../build/ [0x7eff03c900f3]
[bt] (7) /home/ubuntu/sparse_support/mxnet/python/mxnet/../../build/ [0x7eff03c90066]
[bt] (8) /home/ubuntu/sparse_support/mxnet/python/mxnet/../../build/ [0x7eff03c8fefa]
[bt] (9) /home/ubuntu/sparse_support/mxnet/python/mxnet/../../build/ [0x7eff03c8fe4a]


Example 2

Code Block
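import mxnet as mx
data_path = 'manual_2.csv'
data_train = None
try:
    # Reconstructed call: CSVIter is inferred from the CSV error shown below.
    # The original batch_size did not survive; 1 is an assumed value.
    data_train = mx.io.CSVIter(data_csv=data_path, data_shape=(4,10),
                               batch_size=1)
    for batch in iter(data_train):
        print(data_train.getdata().asnumpy())
except mx.base.MXNetError:
    print('Exception handled')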


Code Block
terminate called after throwing an instance of 'dmlc::Error'
  what():  [02:08:14] ../src/io/ Check failed: row.length == shape.Size() (4 vs. 40) The data size in CSV do not match size of shape: specified shape=[4,10], the csv row-length=4
Stack trace returned 10 entries:
[bt] (0) /home/ubuntu/sparse_support/mxnet/python/mxnet/../../build/ [0x7febfb693f5b]
[bt] (1) /home/ubuntu/sparse_support/mxnet/python/mxnet/../../build/ [0x7febfb694242]
[bt] (2) /home/ubuntu/sparse_support/mxnet/python/mxnet/../../build/ [0x7febfe0d9832]
[bt] (3) /home/ubuntu/sparse_support/mxnet/python/mxnet/../../build/ [0x7febfe0d9312]
[bt] (4) /home/ubuntu/sparse_support/mxnet/python/mxnet/../../build/ [0x7febfe0653f3]
[bt] (5) /home/ubuntu/sparse_support/mxnet/python/mxnet/../../build/ [0x7febfe04bf98]
[bt] (6) /home/ubuntu/sparse_support/mxnet/python/mxnet/../../build/ [0x7febfe053473]
[bt] (7) /home/ubuntu/sparse_support/mxnet/python/mxnet/../../build/ [0x7febfe053797]
[bt] (8) /home/ubuntu/sparse_support/mxnet/python/mxnet/../../build/ [0x7febfe0512eb]
[bt] (9) /home/ubuntu/sparse_support/mxnet/python/mxnet/../../build/ [0x7febfe05a188]


Why is this a problem?


Crashes like the two above make for a really bad experience, particularly in non-terminal environments such as a Jupyter notebook or a Docker container.

Proper exception handling and propagation in MXNet is important for two types of use cases. The first is MXNet users who are using one of our APIs to build or test a model, and the second is MXNet service owners who are using MXNet in production for DL-enabled services.

From the perspective of an MXNet user (especially a casual or new user), the above makes for a poor user experience, and it is worse in a non-terminal environment such as a Jupyter notebook or a Docker container. Even if the user understands the error, they are unable to respond to it in a high-level language like Python, because MXNet currently doesn't allow users to handle exceptions and exit gracefully (or retry, or perform some other action).

From a service owner's perspective, when we don't properly propagate errors through our language bindings, it becomes extremely difficult to debug and support a service. As an example, crashing instead of propagating errors has a negative effect on highly available services, and many services would page on-call staff when this occurs. Additionally, although MXNet has a largely asynchronous API, it currently does not allow services to handle exceptions on a per-request basis. If we have one mis-shaped request in our queue to be processed and 50 other in-flight requests, the single mis-shaped request will crash the entire process. This currently requires services to implement logic to monitor and retry the other in-flight requests, which negatively affects latencies and may in turn violate SLAs.

We have had multiple customer requests to fix this:

Look at the community requests here:

Exception Handling for Iterators


MXNet uses a general IO processing pipeline based on the ThreadedIter class in dmlc-core.


As mentioned above, ThreadedIter supports a producer-consumer model in which the producer is a standalone thread. The PrefetchingIter uses ThreadedIter and provides an interface for parsing and loading data from a custom format or the RecordIO format. This parsing and loading logic may in turn spawn multiple threads. Any exception thrown in one of these threads should be caught and transported to the child thread (the producer thread), where it will be rethrown. The rethrown exception will then be caught and transported to the main thread.


The main thread maintains a queue of exception_ptrs and checks whether the queue is non-empty. If it is, the main thread pulls out an exception_ptr and rethrows the exception it points to.
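The two-level transport described above can be sketched in Python. This is only an analogue of the proposed C++ design, not the dmlc-core implementation: a Python exception object stands in for a std::exception_ptr, and the names parse_chunk, producer, check_and_rethrow, load and exception_queue are hypothetical.

```python
import queue
import threading

# Hypothetical stand-in for the main thread's queue of exception_ptrs.
exception_queue = queue.Queue()


def parse_chunk(row, expected_len, errors):
    """Parser thread body: validate one row and record any exception."""
    try:
        if len(row) != expected_len:
            raise ValueError(
                "row length %d does not match shape size %d"
                % (len(row), expected_len))
    except Exception as exc:
        # Catch in the thread that raised, then hand over to the producer.
        errors.append(exc)


def producer(rows, expected_len):
    """Producer (child) thread: runs parser threads and forwards failures."""
    try:
        errors = []
        workers = [threading.Thread(target=parse_chunk,
                                    args=(row, expected_len, errors))
                   for row in rows]
        for worker in workers:
            worker.start()
        for worker in workers:
            worker.join()
        if errors:
            raise errors[0]  # rethrow inside the producer thread
    except Exception as exc:
        exception_queue.put(exc)  # transport to the main thread


def check_and_rethrow():
    """Main thread: rethrow a pending exception if the queue is non-empty."""
    if not exception_queue.empty():
        raise exception_queue.get()


def load(rows, expected_len):
    """Run the producer to completion, then surface any transported error."""
    thread = threading.Thread(target=producer, args=(rows, expected_len))
    thread.start()
    thread.join()
    check_and_rethrow()
    return len(rows)
```

With this shape, a caller on the main thread can wrap load(...) in an ordinary try/except and decide whether to retry, skip the batch, or exit cleanly.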

Proof of Concept


Open Questions

1. Should we keep a queue of arbitrary size or a queue of size 1? One advantage of a queue of arbitrary size is that it can store the exceptions from every thread when multiple threads throw. One disadvantage is that it consumes extra memory.
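The trade-off in this question can be illustrated with a small Python sketch. It assumes that a full size-1 queue silently drops later exceptions, which is one possible policy rather than a decided behavior; record and drain are hypothetical helpers.

```python
import queue


def record(exc_queue, exc):
    """Store exc unless the queue is full; return True if it was stored."""
    try:
        exc_queue.put_nowait(exc)
        return True
    except queue.Full:
        return False  # assumed policy: later exceptions are dropped


def drain(exc_queue):
    """Return the messages of all pending exceptions in arrival order."""
    messages = []
    while not exc_queue.empty():
        messages.append(str(exc_queue.get_nowait()))
    return messages


# Three threads are assumed to have failed, in this order.
failures = [RuntimeError("thread-%d failed" % i) for i in range(3)]

single = queue.Queue(maxsize=1)  # queue of size 1
unbounded = queue.Queue()        # queue of arbitrary size
for exc in failures:
    record(single, exc)
    record(unbounded, exc)

single_messages = drain(single)
unbounded_messages = drain(unbounded)
```

Under that assumption, the size-1 queue reports only the first failure, while the unbounded queue keeps all three at the cost of holding every exception object until it is drained.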

Exception Handling for Operators


Approach 1

  • Add an exception_ptr member opr_ex to ThreadedOpr and an exception_ptr member var_ex to ThreadedVar.
  • Put a try-catch block in ExecuteOprBlock around the execution of the operator.
  • If an exception is thrown during the execution of the operator, catch it and make the ThreadedOpr's exception_ptr member point to the exception object. We explicitly call the callback in this case.
  • In the callback, set the exception_ptr member of every variable that the current operator will mutate.
  • In the callback, also set the exception_ptr member of the current operator to the one held by one of its dependencies. This way an old exception_ptr is propagated down the dependency chain.
  • Also set global_exc_ptr depending on whether there is an exception associated with a read var.
  • In WaitForVar, check whether threaded_var->var_ex is set. If it is, rethrow the exception. Since we are waiting for this var, an exception associated with it means that an exception was thrown somewhere on the dependency path leading to it.
  • In WaitForAll, rethrow the exception based on whether global_exc_ptr is set.
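The steps above can be sketched as a small synchronous Python model. It is not the threaded engine itself: Var, Opr and Engine are hypothetical stand-ins for ThreadedVar, ThreadedOpr and the engine, Python exception objects stand in for exception_ptr, and letting an inherited exception take precedence over the operator's own one is an assumption about ordering.

```python
class Var(object):
    """Hypothetical stand-in for ThreadedVar."""
    def __init__(self):
        self.var_ex = None  # plays the role of the exception_ptr member


class Opr(object):
    """Hypothetical stand-in for ThreadedOpr."""
    def __init__(self, name, fn, read_vars, write_vars):
        self.name = name
        self.fn = fn
        self.read_vars = read_vars
        self.write_vars = write_vars
        self.opr_ex = None


class Engine(object):
    def __init__(self):
        self.global_ex = None  # plays the role of global_exc_ptr
        self.attempted = []    # every operator is run, even after a failure

    def execute_opr_block(self, opr):
        self.attempted.append(opr.name)
        try:
            opr.fn()
        except Exception as exc:
            opr.opr_ex = exc
        self.on_complete(opr)  # the callback is called explicitly either way

    def on_complete(self, opr):
        # Inherit an older exception from a read dependency, if there is one.
        for var in opr.read_vars:
            if var.var_ex is not None:
                opr.opr_ex = var.var_ex
                self.global_ex = var.var_ex
                break
        # Mark every variable this operator mutates.
        if opr.opr_ex is not None:
            for var in opr.write_vars:
                var.var_ex = opr.opr_ex

    def wait_for_var(self, var):
        if var.var_ex is not None:
            raise var.var_ex

    def wait_for_all(self):
        if self.global_ex is not None:
            raise self.global_ex


def bad_sample():
    raise ValueError("scale parameter in gaussian has to be positive")


def add_one():
    pass  # placeholder for a downstream operator


def run_chain():
    engine = Engine()
    a, b = Var(), Var()
    engine.execute_opr_block(Opr("sample", bad_sample, [], [a]))
    engine.execute_opr_block(Opr("add_one", add_one, [a], [b]))
    return engine, a, b
```

In this model, waiting on b rethrows the exception raised while producing a, even though add_one itself ran without error.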

Proof of Concept for Approach 1


Approach 2

  • Add an exception_ptr member opr_ex to ThreadedOpr and an exception_ptr member var_ex to ThreadedVar.
  • Put a try-catch block in ExecuteOprBlock around the execution of the operator. Don't execute the operator if the threaded_opr already contains an exception.
  • Functions pushed using PushAsync will take three parameters instead of two: on_start, execute, on_complete.
  • The on_start callback will propagate the exception_ptr based on whether the read dependencies have an exception_ptr associated with them.
  • The on_complete callback will propagate the exception_ptr to the write vars based on whether the threaded_opr has an exception associated with it.
  • The logic to rethrow the exception in WaitForVar and WaitForAll should be the same as in Approach 1.
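A minimal Python sketch of this variant follows. It models on_start and on_complete as engine methods rather than as parameters of the pushed function, which is a simplification; Var, Opr and Engine are again hypothetical stand-ins, and Python exception objects stand in for exception_ptr.

```python
class Var(object):
    """Hypothetical stand-in for ThreadedVar."""
    def __init__(self):
        self.var_ex = None


class Opr(object):
    """Hypothetical stand-in for ThreadedOpr."""
    def __init__(self, name, fn, read_vars, write_vars):
        self.name = name
        self.fn = fn
        self.read_vars = read_vars
        self.write_vars = write_vars
        self.opr_ex = None


class Engine(object):
    def __init__(self):
        self.executed = []  # operators whose body actually ran to completion

    def on_start(self, opr):
        # Propagate an exception from any read dependency to the operator.
        for var in opr.read_vars:
            if var.var_ex is not None:
                opr.opr_ex = var.var_ex
                break

    def on_complete(self, opr):
        # Propagate the operator's exception to the variables it writes.
        if opr.opr_ex is not None:
            for var in opr.write_vars:
                var.var_ex = opr.opr_ex

    def execute_opr_block(self, opr):
        self.on_start(opr)
        if opr.opr_ex is None:  # skip operators that have already failed
            try:
                opr.fn()
                self.executed.append(opr.name)
            except Exception as exc:
                opr.opr_ex = exc
        self.on_complete(opr)

    def wait_for_var(self, var):
        if var.var_ex is not None:
            raise var.var_ex


def bad_sample():
    raise ValueError("scale parameter in gaussian has to be positive")


def noop():
    pass  # placeholder operator body


def run_graph():
    engine = Engine()
    a, b, c = Var(), Var(), Var()
    engine.execute_opr_block(Opr("sample", bad_sample, [], [a]))
    engine.execute_opr_block(Opr("dependent", noop, [a], [b]))
    engine.execute_opr_block(Opr("independent", noop, [], [c]))
    return engine, a, b, c
```

Here the dependent operator is never executed, while the independent one runs normally and its output can be waited on without error.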

Comparison of Approach 1 and Approach 2

  • Execution after a failed operator
      Approach 1: Operators are forced to execute even if previous operators failed. This can be a problem if operators that run after a failed operator throw exceptions other than dmlc::Error.
      Approach 2: Once an operator fails, none of the operators that depend on it are executed.
  • API changes
      Approach 1: Minimal API changes.
      Approach 2: The lambda closure expected by PushAsync has a different signature after adding the on_start callback.
  • Overhead when an exception is thrown
      Approach 1: There is an overhead from executing the subsequent operators.
      Approach 2: There is no overhead from executing the subsequent operators.
  • Performance when no exception is thrown
      Approach 1: Performance impact should be minimal.
      Approach 2: Performance impact needs to be investigated because of the additional overhead of the on_start callback, which is incurred even when no exception is thrown.


My initial recommendation was to take Approach 1, since it introduces minimal API changes and minimal performance impact when no exception is thrown.

However, since the performance impact of both approaches should be similar, and since Approach 2 has the advantage of not executing subsequent operators and thereby addresses the issue with Approach 1, the recommendation is to proceed with Approach 2.

Open Questions

1. How should the WaitForAll situation be handled?
