This page is a work in progress to document how MXNet operates at runtime. 

To get a first look at what happens at runtime, we can start by leveraging the MXNet profiler and looking at a sample model.

This code creates a 3x3 tensor, computes the diagonal, and then sums along the diagonal to compute the “trace”. Using the MXNet profiler, we capture internal MXNet behavior, dump it to a string and print it (dumps()), and also dump it to a file (dump()). Then we can import that file into Chrome and view it graphically.

Code Block
languagepy
import mxnet as mx
import numpy as np

from mxnet import profiler

#configure the profiler
profiler.set_config(profile_all=True, aggregate_stats=True, filename='trace_profile.json')
#start the profiler collecting data
profiler.set_state('run')

###########################################################
#1. create our data
data = np.linspace(1,9,9).reshape((3,3))

#2. create an MXNet ndarray
a = mx.nd.array(data)

#3. compute on our data and produce results
b = mx.nd.diag(a)
c = mx.nd.sum(b,-1)

#4. wait for computation to finish
mx.nd.waitall()
###########################################################

#stop the profiler
profiler.set_state('stop')

#dump the profiling data as a string
print(profiler.dumps())
#dump the profiling data as a json file that can be viewed graphically
profiler.dump()

When running this code, the dumps function returns the profiling data as a string (which we promptly print). This statistical info is shown below.

Code Block
languagetext
Profile Statistics.
    Note that counter items are counter values and not time units.
Device Storage
=================
Name                          Total Count        Time (ms)    Min Time (ms)    Max Time (ms)    Avg Time (ms)
----                          -----------        ---------    -------------    -------------    -------------
Memory: cpu/0                           3           0.0520           0.0360           0.0520           0.0080

MXNET_C_API
=================
Name                          Total Count        Time (ms)    Min Time (ms)    Max Time (ms)    Avg Time (ms)
----                          -----------        ---------    -------------    -------------    -------------
MXNDArraySyncCopyFromCPU                1           0.1600           0.1600           0.1600           0.1600
MXNDArrayGetDType                       1           0.0010           0.0010           0.0010           0.0010
MXNet C API Calls                       8           0.0080           0.0010           0.0080           0.0035
MXImperativeInvokeEx                    2           0.2210           0.0720           0.1490           0.1105
MXNDArrayGetShape                       2           0.0020           0.0010           0.0010           0.0010
MXNet C API Concurrency                16           0.0000           0.0000           0.0010           0.0005
MXNDArrayWaitAll                        1          10.9030          10.9030          10.9030          10.9030
MXNDArrayCreateEx                       1           0.0200           0.0200           0.0200           0.0200

operator
=================
Name                          Total Count        Time (ms)    Min Time (ms)    Max Time (ms)    Avg Time (ms)
----                          -----------        ---------    -------------    -------------    -------------
sum                                     2           0.0530           0.0250           0.0280           0.0265
ResourceParallelRandomSetSeed           2           9.2730           4.6350           4.6380           4.6365
diag                                    2          12.8620           6.4300           6.4320           6.4310
WaitForVar                              2           0.0160           0.0060           0.0100           0.0080

The dump function writes the same data out to a file in the Chrome trace format, which can be opened in Chrome's trace viewer (chrome://tracing) and displayed visually. This can be seen in the diagram below.

[Image: Chrome trace view of the full profiling timeline]
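
If you would rather inspect the dumped file programmatically instead of loading it into Chrome, here is a minimal sketch (assuming the file follows the standard Chrome trace event format, with a top-level "traceEvents" array):

Code Block
languagepy
import json

# load the trace file written by profiler.dump()
with open('trace_profile.json') as f:
    trace = json.load(f)

# list the distinct event names captured in the trace
names = {event.get('name') for event in trace.get('traceEvents', [])}
print(sorted(name for name in names if name))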

The profiling data captures info about the interesting functions that executed while your program was running. Here are some explanations of what each one does.

The functions in the C_API are:

MXImperativeInvokeEx - invokes an operator to perform the computation

MXNDArrayCreateEx - creates an ndarray

MXNDArrayGetDType - returns the data type of the ndarray

MXNDArrayGetShape - returns the shape of the ndarray (as a tuple where each element is the size of a dimension)

MXNDArraySyncCopyFromCPU - called when the data initially resides outside of an MXNet data structure (i.e. in a numpy.ndarray rather than an MXNet NDArray); the data is copied into the MXNet data structure

MXNDArrayWaitAll - waits for all asynchronous operations in MXNet to finish. This function is only used in benchmarking, to wait for the queued work to happen. In a real program there is no such waiting; data dependencies are evaluated and computation executed as needed, in an As-Late-As-Possible (ALAP) fashion.
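
As a rough illustration, each of these C API entries is reachable from an ordinary front-end call (the exact mapping can vary between MXNet versions, so treat this as an approximation):

Code Block
languagepy
import mxnet as mx
import numpy as np

data = np.ones((3, 3))
a = mx.nd.array(data)   # MXNDArrayCreateEx + MXNDArraySyncCopyFromCPU
print(a.shape)          # MXNDArrayGetShape
print(a.dtype)          # MXNDArrayGetDType
b = mx.nd.sum(a)        # MXImperativeInvokeEx (launches the operator)
mx.nd.waitall()         # MXNDArrayWaitAll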

The function in the Engine API is:

WaitForVar - Takes a variable reference as input and waits until that variable has been computed before returning.
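
In front-end code, the closest analogue is waiting on a single array with wait_to_read() (this mapping is my assumption; as the trace above shows, the engine can also issue WaitForVar on its own):

Code Block
languagepy
import mxnet as mx

a = mx.nd.ones((3, 3))
b = mx.nd.sum(a)
b.wait_to_read()     # waits only for b's variable, not for all pending work
print(b.asscalar())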

Other API functions:

ResourceParallelRandomSetSeed - sets the random number generator seed

The operators we explicitly called in the code:

sum - sum a tensor along a particular axis

diag - compute the diagonal of the tensor
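
As a quick sanity check of what these two operators compute together, here is the same trace computed in plain NumPy, using the example data from above:

Code Block
languagepy
import numpy as np

data = np.linspace(1, 9, 9).reshape((3, 3))
print(np.trace(data))           # 15.0
print(np.sum(np.diag(data)))    # 15.0: diagonal extraction followed by a sum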


What's happening here...

From the code, we can identify the major events in our test application:

  1. Initialize our input data
  2. Create a new MXNet ndarray using our existing data values
  3. Compute on our data
    1. produce the diagonal of the input data
    2. sum along the diagonal to compute the “trace” of the matrix
  4. Wait for computation to finish (only needed when profiling)

Nothing MXNet-related happens before #2; those calls are all regular numpy functions. When #2 happens and we create an MXNet ndarray, quite a few things happen. The screenshot below shows a zoomed-in portion of the timeline.

[Image: zoomed-in timeline of the ndarray creation sequence]

Here, the four red arrows show the important events in this sequence.

  1. First, MXNDArrayCreateEx is called to physically allocate space to store the data and other necessary attributes in the ndarray class
  2. Then some support functions are called (MXNDArrayGetShape, MXNDArrayGetDType) while initializing the data structure
  3. Finally, the data is copied from the non-MXNet ndarray into the newly prepared MXNet ndarray by the MXNDArraySyncCopyFromCPU function

Next, #3 begins the computing process to produce our output data. The screenshot below shows this behavior.

[Image: zoomed-in timeline of the operator launches from the main thread]

Here you can see the following sequence of events:

  1. MXImperativeInvokeEx is called the first time to launch the diag operator from #3
  2. Soon after that, the actual diag operator begins executing in another thread
  3. While that is happening, our main thread moves on and calls MXImperativeInvokeEx again to launch the sum operator. Just like before, this returns without actually executing the operator, and the main thread continues (see the sketch after this list)
  4. Lastly, MXNDArrayWaitAll is called as the main thread progresses to #4 in our app. It waits here while all the computation finishes
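
To see this launch-and-continue behavior directly, you can time an operator call against the wait that follows it. This is a small sketch (absolute timings will vary by machine, and the operator and sizes here are arbitrary choices for illustration):

Code Block
languagepy
import time
import mxnet as mx

x = mx.nd.ones((1000, 1000))
mx.nd.waitall()                  # make sure setup work is finished first

start = time.time()
y = mx.nd.dot(x, x)              # returns almost immediately: the work is
                                 # queued on the engine, not done yet
print('dispatch: %.6f s' % (time.time() - start))

start = time.time()
y.wait_to_read()                 # blocks until y is actually computed
print('compute:  %.6f s' % (time.time() - start))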

Next, let's look at the part of the timeline zoomed in to the actual operator execution.

[Image: zoomed-in timeline of the operator execution]

Here there are 3 main events happening:

  1. The diag operator executes first
  2. Then ResourceParallelRandomSetSeed runs
  3. And finally the sum operator executes (for a very short time, as shown by the big red arrow)

The diag operator running makes sense (although it seems to take a little longer than we'd like), and at the end the sum operator runs (very quickly!). But the weird part in the middle is this ResourceParallelRandomSetSeed run. It is part of the MXNet resource manager, which handles the temporary space and random number generators needed by operators. The sum operator requests temporary space in order to compute the sum, and therefore launches the resource manager (for the first time) here. As part of its startup sequence, the resource manager initializes its random number generator by setting the seed. So this is one-time initialization overhead. Let's run the app again, running the compute twice, and look at the 2nd run to remove this initialization from our profiling.

Here's the modified code:

Code Block
import mxnet as mx
import numpy as np

from mxnet import profiler

profiler.set_config(profile_all=True, aggregate_stats=True, filename='trace_profile.json')
profiler.set_state('run')

################
# first run
sdata = np.linspace(1,9,9).reshape((3,3))

sa = mx.nd.array(sdata)
sb = mx.nd.diag(sa)
sc = mx.nd.sum(sb,-1)

mx.nd.waitall()
################

################
# second run
data = np.linspace(1,9,9).reshape((3,3))

a = mx.nd.array(data)
b = mx.nd.diag(a)
c = mx.nd.sum(b,-1)

mx.nd.waitall()
################

profiler.set_state('stop')

print(profiler.dumps())
profiler.dump()

Notice that we renamed the variables and made another copy after the waitall call. This is so that MXNet doesn't have to worry about re-using variables, and so that the 2nd run is cleanly segmented from the first-time initialization.

Here's an overview of the new timeline:

[Image: overview timeline showing both runs]

The first red box is the first run, and the 2nd, smaller one is the 2nd run. First off, we can see how much smaller the 2nd run is without any of the initialization routines. Here is a zoomed-in view of just the 2nd run.

[Image: zoomed-in timeline of the 2nd run]

We still have the same sequence of events at the beginning to initialize the MXNet ndarray (MXNDArrayCreateEx, MXNDArrayGetShape, MXNDArrayGetDType, MXNDArraySyncCopyFromCPU). Then the diag operator runs, followed by the sum operator, and finally the waitall.

When you look at this, be careful about the assumptions you make. In this version of the timeline, each operator appears to execute after its MXImperativeInvokeEx call returns, which seems to imply an inherent ordering. But there is no dependency between the diag operator finishing and the next MXImperativeInvokeEx launching the sum operator; in this case the diag operator just happens to finish so quickly that it appears that way. In reality, the main thread is launching the operators without waiting for them to finish. Lastly, keep in mind that in this case, by the time we hit MXNDArrayWaitAll everything is already done and we return immediately, but in other circumstances it may sit there waiting for everything to finish (as we saw earlier in the first run).

To summarize, we ran the MXNet profiler on a simple application and tried to understand the runtime behavior of MXNet. We were able to see some underlying functionality that manages operator execution and data allocation and some unexpected initialization routines.

Known issues:

  • The number of times an operator is counted as executed in the profiler is 2x the real value: https://github.com/apache/incubator-mxnet/issues/10520