This page is a work in progress to document how MXNet operates at runtime. 

To get a first look at what happens at runtime, we can start by leveraging the MXNet profiler and looking at a sample model.

This code creates a 3x3 tensor, computes the diagonal, and then sums along the diagonal (to compute the “trace”). Using the MXNet profiler, we capture internal MXNet behavior, dump it to a string and print it (“dumps()”), and also dump it to a file (“dump()”). Then we can import that file in Chrome and view it graphically.

import mxnet as mx
import numpy as np

from mxnet import profiler

#configure the profiler
profiler.set_config(profile_all=True, aggregate_stats=True, filename='trace_profile.json')
#start the profiler collecting data
profiler.set_state('run')

###########################################################
#1. create our data
data = np.linspace(1,9,9).reshape((3,3))

#2. create an MXNet ndarray
a = mx.nd.array(data)

#3. compute on our data and produce results
b = mx.nd.diag(a)
c = mx.nd.sum(b,-1)

#4. wait for computation to finish
mx.nd.waitall()
###########################################################

#stop the profiler
profiler.set_state('stop')

#dump the profiling data as a string
print(profiler.dumps())
#dump the profiling data as a json file that can be viewed graphically
profiler.dump()

When running this code, the dumps function returns the profiling data as a string, which we promptly print. This statistical info is shown below.

Profile Statistics.
    Note that counter items are counter values and not time units.
Device Storage
=================
Name                          Total Count        Time (ms)    Min Time (ms)    Max Time (ms)    Avg Time (ms)
----                          -----------        ---------    -------------    -------------    -------------
Memory: cpu/0                           3           0.0520           0.0360           0.0520           0.0080

MXNET_C_API
=================
Name                          Total Count        Time (ms)    Min Time (ms)    Max Time (ms)    Avg Time (ms)
----                          -----------        ---------    -------------    -------------    -------------
MXNDArraySyncCopyFromCPU                1           0.1600           0.1600           0.1600           0.1600
MXNDArrayGetDType                       1           0.0010           0.0010           0.0010           0.0010
MXNet C API Calls                       8           0.0080           0.0010           0.0080           0.0035
MXImperativeInvokeEx                    2           0.2210           0.0720           0.1490           0.1105
MXNDArrayGetShape                       2           0.0020           0.0010           0.0010           0.0010
MXNet C API Concurrency                16           0.0000           0.0000           0.0010           0.0005
MXNDArrayWaitAll                        1          10.9030          10.9030          10.9030          10.9030
MXNDArrayCreateEx                       1           0.0200           0.0200           0.0200           0.0200

operator
=================
Name                          Total Count        Time (ms)    Min Time (ms)    Max Time (ms)    Avg Time (ms)
----                          -----------        ---------    -------------    -------------    -------------
sum                                     2           0.0530           0.0250           0.0280           0.0265
ResourceParallelRandomSetSeed           2           9.2730           4.6350           4.6380           4.6365
diag                                    2          12.8620           6.4300           6.4320           6.4310
WaitForVar                              2           0.0160           0.0060           0.0100           0.0080
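
(Note that diag and sum each show a Total Count of 2 even though the code invokes each operator only once; this double counting is a known profiler issue, listed at the bottom of this page.)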

The dump function writes the same data in a different format to a file that can be opened in Chrome's tracing viewer (navigate to chrome://tracing and load the trace_profile.json file) and displayed visually. This can be seen in the diagram below.

The profiling data has captured info about the interesting functions that executed while your program was running. Here are some explanations of what each one does.

The functions in the C API are:

Function Name                Description
-------------                -----------
MXImperativeInvokeEx         invokes an operator to perform the computation
MXNDArrayCreateEx            creates an ndarray
MXNDArrayGetDType            returns the data type of the ndarray
MXNDArrayGetShape            returns the shape of the ndarray (as a tuple where each element is the size of a dimension)
MXNDArraySyncCopyFromCPU     called when data initially resides outside of an MXNet data structure (i.e. in a numpy.ndarray rather than an MXNet NDArray); the data is copied into the MXNet data structure
MXNDArrayWaitAll             waits for all asynchronous operations in MXNet to finish. This function is only used in benchmarking to wait for work to happen. In a real program there is no waiting; data dependencies are evaluated and computation is executed as needed in an As Late As Possible (ALAP) fashion
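
To connect these names back to the Python code, here is the compute portion of the script again, annotated with the C API calls each line is expected to trigger. This is a best-effort mapping based on the trace above, not an exhaustive list:

a = mx.nd.array(data)   # MXNDArrayCreateEx, MXNDArrayGetShape,
                        # MXNDArrayGetDType, MXNDArraySyncCopyFromCPU
b = mx.nd.diag(a)       # MXImperativeInvokeEx (queues the diag operator)
c = mx.nd.sum(b,-1)     # MXImperativeInvokeEx (queues the sum operator)
mx.nd.waitall()         # MXNDArrayWaitAll (blocks until the engine is idle)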

The function in the Engine API is:

Function Name    Description
-------------    -----------
WaitForVar       takes a variable reference as input and waits until that variable has been computed before returning

Other API functions:

Function Name                    Description
-------------                    -----------
ResourceParallelRandomSetSeed    sets the random number generator seed
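
This call happens internally, but for reproducible runs you can also seed MXNet's random number generators explicitly from Python. This is a related, user-facing call shown as a sketch; it is not necessarily the exact code path the resource manager uses:

import mxnet as mx
mx.random.seed(42)  # seeds MXNet's random number generators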

Operators we intended to call in the code:

Operator Name    Description
-------------    -----------
sum              sums a tensor along a particular axis
diag             extracts the diagonal of the tensor
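
As a quick sanity check, you could print the result at the end of the script (a hypothetical addition, not part of the original code). For the 3x3 input [[1,2,3],[4,5,6],[7,8,9]], diag extracts [1, 5, 9] and sum reduces that to the trace:

print(c.asnumpy())  # the trace: 1 + 5 + 9 = 15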

What's happening here...

From the code, we can identify the major events in our test application:

  1. Initialize our input data
  2. Create a new MXNet ndarray using our existing data values
  3. Compute on our data
    1. Produce the diagonal of the input data
    2. Sum along the diagonal to compute the “trace” of the matrix
  4. Wait for computation to finish (only needed when profiling)

Nothing MXNet-related happens before #2; those calls are all regular NumPy functions. When #2 happens and we create an MXNet ndarray, quite a few things happen. The screenshot below shows a zoomed-in portion of the timeline.

Here, the four red arrows show the important events in this sequence.

  1. First, MXNDArrayCreateEx is called to physically allocate space to store the data and other necessary attributes in the ndarray class
  2. Then some support functions are called (MXNDArrayGetShape, MXNDArrayGetDType) while initializing the data structure
  3. Finally, the data is copied from the non-MXNet ndarray into the newly prepared MXNet ndarray by the MXNDArraySyncCopyFromCPU function (see the sketch after this list)
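
To see the copy step in isolation, here is a small sketch (using the same mx.nd namespace as the script above). An ndarray built from existing NumPy data goes through MXNDArraySyncCopyFromCPU, while one created directly on the MXNet side (for example with mx.nd.zeros) has no external buffer to copy from, so we would not expect that call to appear in its trace:

import numpy as np
import mxnet as mx

np_data = np.arange(1, 10, dtype='float32').reshape((3,3))

# copies the NumPy buffer into MXNet storage (MXNDArraySyncCopyFromCPU)
from_numpy = mx.nd.array(np_data)

# allocated and filled entirely inside MXNet; no NumPy buffer to sync from
native = mx.nd.zeros((3,3))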

Next, #3 begins the computing process to produce our output data. The screenshot below shows this behavior.

Here you can see the following sequence of events:

  1. MXImperativeInvokeEx is called the first time to launch the diag operator from #3
  2. Soon after that, the actual diag operator begins executing in another thread
  3. While that is happening, our main thread moves on and calls MXImperativeInvokeEx again to launch the sum operator. Just like before, this returns without actually executing the operator, and the main thread continues.
  4. Lastly, MXNDArrayWaitAll is called as the main thread has progressed to #4 in our app. It will wait here until all the computation finishes (a timing sketch of this launch-then-wait behavior follows this list)
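
You can observe this asynchronous launch behavior directly by timing an operator call against a waitall. This is a minimal sketch, independent of the profiling script above; the shape and operator are arbitrary choices:

import time
import mxnet as mx

x = mx.nd.random.uniform(shape=(2000, 2000))
mx.nd.waitall()                       # make sure setup work is finished first

start = time.time()
y = mx.nd.dot(x, x)                   # returns almost immediately; the matrix
                                      # multiply is queued on the engine
print("launch: %.4f s" % (time.time() - start))

mx.nd.waitall()                       # blocks until the computation finishes
print("total:  %.4f s" % (time.time() - start))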

Next, let's look at a view of the part of the timeline zoomed in to the actual operator execution.

Here there are 3 main events happening:

  1. The diag operator executes first
  2. Then ResourceParallelRandomSetSeed runs
  3. And finally the sum operator executes (for a very short time, as shown by the big red arrow)

The diag operator running makes sense (although it seems to take a little longer than we'd like). And at the end, the sum operator runs (very quickly!). But the weird part in the middle is this ResourceParallelRandomSetSeed thing running. This is part of the MXNet resource manager. The resource manager handles temporary space and random number generators needed by the operators. The sum operator requests temporary space in order to compute the sum, and therefore launches the resource manager (for the first time) here. As part of its startup sequence, the random number generator is initialized by setting the seed. So this is one-time initialization overhead. Let's try running the app again, running the compute twice, and look at the 2nd run to remove this initialization from our profiling.

Here's the modified code:

import mxnet as mx
import numpy as np

from mxnet import profiler

profiler.set_config(profile_all=True, aggregate_stats=True, filename='trace_profile.json')
profiler.set_state('run')

################
# first run
sdata = np.linspace(1,9,9).reshape((3,3))

sa = mx.nd.array(sdata)
sb = mx.nd.diag(sa)
sc = mx.nd.sum(sb,-1)

mx.nd.waitall()
################

################
# second run
data = np.linspace(1,9,9).reshape((3,3))

a = mx.nd.array(data)
b = mx.nd.diag(a)
c = mx.nd.sum(b,-1)

mx.nd.waitall()
################

profiler.set_state('stop')

print(profiler.dumps())
profiler.dump()

Notice that we renamed the variables and made a second copy of the computation after the first waitall call. This is so that MXNet doesn't have to worry about re-using variables, and so the 2nd run is cleanly segmented from the first-time initialization.

Here's an overview of the new timeline:

The first red box is the first run, and the 2nd smaller one is the 2nd run. First off, we can see how much smaller the 2nd one is now, without any of the initialization routines. Here is a zoomed-in view of just the 2nd run.

We still have the same sequence of events at the beginning to initialize the MXNet ndarray (MXNDArrayCreateEx, MXNDArrayGetShape, MXNDArrayGetDType, MXNDArraySyncCopyFromCPU). Then the diag operator runs, followed by the sum operator, and finally the waitall.

When you look at this, be careful about the assumptions you make. In this view of the timeline, the operator appears to execute after MXImperativeInvokeEx runs, which seems to imply an inherent ordering. But realize that there is no dependency between the diag operator finishing and the next MXImperativeInvokeEx launching the sum operator. In this case it just so happens that the diag operator finishes so quickly that it appears that way. In reality, the main thread is launching the operators and not waiting for them to finish. Lastly, keep in mind that in this case, by the time we hit MXNDArrayWaitAll everything is already done and we return immediately; in other circumstances it may sit there waiting for everything to finish (as we saw earlier in the first run).
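
As an aside, if you only need one result rather than the whole queue, NDArray also exposes a per-variable wait; this is what surfaces as WaitForVar in the engine trace. A minimal sketch, reusing the names from the second run above:

c.wait_to_read()    # blocks only until c has been computed (WaitForVar),
                    # rather than draining the whole queue like waitall
print(c.asnumpy())  # asnumpy() also waits on c before copying the data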

Summary

We ran the MXNet profiler on a simple application and tried to understand the runtime behavior of MXNet. We were able to see some underlying functionality that manages operator execution and data allocation and some unexpected initialization routines.

Known issues:

The number of times an operator is counted as executed in the profiler is 2x the real value - https://github.com/apache/incubator-mxnet/issues/10520
