...

  1. Provide a generic utility for executing operator benchmarks and performance tests (see the sketch after this list).
    1. It is responsible for creating input tensors of the required shape for a given dtype and context.
    2. It executes the provided operator - forward only, or forward + backward.
    3. This generic utility will be integrated with the MXNet profiler.
    4. It captures the profile output from the MXNet profiler - time and memory.
    5. It returns a dictionary of results.
  2. Input for the performance tests will be a key/value config.
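A minimal sketch of what such a utility could look like is shown below. The function name `run_performance_test` matches the usage examples later in this document, but the internals (the tuple-as-shape heuristic, positional argument passing, and the exact profiler calls) are illustrative assumptions, not a final implementation.

Code Block
languagepy
import mxnet as mx

def run_performance_test(F, ctx=mx.cpu(), warmup=10, runs=50, inputs=None):
    """Illustrative sketch - creates input tensors, runs the operator
    forward (and optionally backward) under the MXNet profiler, and
    returns a list of result dictionaries."""
    results = []
    for config in inputs:
        run_backward = config.get("run_backward", False)
        dtype = config.get("dtype", "float32")
        initializer = config.get("initializer", mx.nd.ones)
        # 1. Create input tensors of the required shape, dtype and context.
        # Simplification: any tuple value in the config is treated as a shape.
        tensors = [initializer(shape=value, dtype=dtype, ctx=ctx)
                   for value in config.values() if isinstance(value, tuple)]
        for t in tensors:
            t.attach_grad()
        mx.nd.waitall()
        # Warmup runs (not measured)
        for _ in range(warmup):
            with mx.autograd.record():
                res = F(*tensors)
            if run_backward:
                res.backward()
            mx.nd.waitall()
        # 2./3. Execute the operator under the MXNet profiler (timed runs)
        mx.profiler.set_config(profile_all=True, aggregate_stats=True)
        mx.profiler.set_state('run')
        for _ in range(runs):
            with mx.autograd.record():
                res = F(*tensors)
            if run_backward:
                res.backward()
            mx.nd.waitall()
        mx.profiler.set_state('stop')
        # 4./5. Capture profiler output and return a dictionary of results
        results.append({"operator": getattr(F, "__name__", str(F)),
                        "profile": mx.profiler.dumps(reset=True)})
    return results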

...

Code Block
languagepy
"""
MXNet operator performance benchmarks.

NOTE:
1. You can pass a list of input dictionaries to run benchmarks for an operator with different input configurations.
2. Results are a dictionary of time and memory measurements for the benchmark runs.
"""

# Run performance test for Add operator
results = run_performance_test(F=mx.nd.add, ctx=mx.cpu(), warmup=10, runs=50,
                               inputs=[{"lhs": (1024, 1024),
                                        "rhs": (1024, 1024),
                                        "initializer": nd.normal,
                                        "run_backward": True,
                                        "dtype": "float32"}])

# Run performance test for Conv2D operator
results += run_performance_test(F=mx.gluon.nn.Conv2D, ctx=mx.cpu(), warmup=10, runs=50,
                                inputs=[{"data": (32, 3, 256, 256),
                                         "data_initializer": nd.normal,
                                         "channels": 64,
                                         "kernel_size": (3, 3),
                                         "strides": (1, 1),
                                         "padding": (0, 0),
                                         "dilation": (1, 1),
                                         "layout": "NCHW",
                                         "activation": None,
                                         "run_backward": True,
                                         "dtype": "float32"}])

...

What does the backend profiling utility code look like?

Below, we take the example of profiling the Add operator.

Code Block
languagepy
import mxnet as mx
from mxnet import profiler

# Configurations
warmup = 25
runs = 50
run_backward = True

# Operator to benchmark
F = mx.nd.add

# Prepare data for the operator
lhs = mx.nd.ones(shape=(1024, 1024))
rhs = mx.nd.ones(shape=(1024, 1024))
lhs.attach_grad()
rhs.attach_grad()
mx.nd.waitall()

# Warmup
print("Warming up....")
for _ in range(warmup):
    with mx.autograd.record():
        res = mx.nd.add(lhs, rhs)
    res.backward()
    mx.nd.waitall()
print("Done warming up....")

# Run Performance Runs
print("Running performance runs....")
profiler.set_config(profile_all=True, aggregate_stats=True)
# Start Profiler
profiler.set_state('run')
for _ in range(runs):
    with mx.autograd.record():
        res = mx.nd.add(lhs, rhs)
    res.backward()
    mx.nd.waitall()

# Stop Profiler 
profiler.set_state('stop')

# Fetch Results from Profiler
# We will add a new API in Profiler - profiler.get_summary(reset=True)
# profiler.get_summary() => Return a JSON string representing the output as shown below.
#                        => Resets all the counters in the current profiler.

print("Done Running performance runs....")
print(profiler.dumps(reset=True))


Pros

  1. No need to write one class per operator to set up a performance test. Whenever a new operator is created, a developer only needs to add a `run_performance_test(..)` line with a list of inputs to run performance tests; the generic utility handles the execution.
  2. Less code, easy to maintain.
  3. More control for users - default inputs, random inputs, specific user defined inputs.
  4. Deterministic and better suited for performance benchmarks, reproducibility and CI integration.
  5. More accurate benchmark results (time and memory) because we use the MXNet profiler.
  6. With Python interface:
    1. Easy to maintain and develop.
    2. Reflects the performance as seen by users (the majority of users use the Python interface).
    3. Fastest way to get performance tests in place. We do not have any tests in place as of today.

Cons

  1. Different operators will have different input names. For example, as seen above, the add operator requires tensors named lhs and rhs, whereas the Conv2D operator requires a tensor named data. The base performance executor utility needs to understand this and create tensors appropriately; with one single executor, generalizing across operators may make the logic complex to manage. The sketch after this list illustrates the problem.
  2. Not easily extensible:
    1. Hard to integrate with property-based testing libraries like Hypothesis to randomly generate test cases with different tensor shapes.
  3. Ideally, performance should be captured close to the kernel. Calling from the Python operator APIs may hide performance regressions when the operator computation is small.
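For illustration, a generic executor would need to separate tensor-shape arguments from plain operator parameters and pass them by name. The helper below is hypothetical, and its naive heuristic shows exactly where generalization gets complex:

Code Block
languagepy
import mxnet as mx

def create_op_arguments(config, ctx=mx.cpu()):
    """Hypothetical helper - split a benchmark config into tensor
    arguments (values interpreted as shapes) and plain operator parameters."""
    control_keys = {"run_backward", "dtype", "initializer", "data_initializer"}
    tensors, params = {}, {}
    for key, value in config.items():
        if key in control_keys:
            continue
        # Naive heuristic: treat tuples as tensor shapes. Note this already
        # breaks for Conv2D, where kernel_size/strides/padding are also
        # tuples - exactly the generalization problem described above.
        if isinstance(value, tuple):
            tensors[key] = mx.nd.normal(shape=value,
                                        dtype=config.get("dtype", "float32"),
                                        ctx=ctx)
        else:
            params[key] = value
    return tensors, params

# The executor cannot know argument names like "lhs"/"rhs" vs. "data" up front.
tensors, _ = create_op_arguments({"lhs": (1024, 1024), "rhs": (1024, 1024)})
res = mx.nd.add(**tensors)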

Addition of new Module

We propose to add this utility as a new module (opperf) under incubator-mxnet/benchmark as "incubator-mxnet/benchmark/opperf". Note that this does not add any user-facing APIs; it is a utility under the incubator-mxnet/benchmark folder for general use by the community.

Addition of new API

We propose to add a new API to MXNet Profiler for easily fetching operator profile for processing programmatically.

1) mxnet.profiler.get_summary(reset=False)

Current Behavior:

Users can either use `mxnet.profiler.dump()` to output the profiler results as a JSON file, or use the `mxnet.profiler.dumps(reset=False)` API to print the summary on the console.
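For reference, a typical flow with the existing APIs looks like this (the workload between start and stop is elided):

Code Block
languagepy
from mxnet import profiler

profiler.set_config(profile_all=True, aggregate_stats=True,
                    filename='profile_output.json')
profiler.set_state('run')
# ... run the workload to be profiled ...
profiler.set_state('stop')

profiler.dump()                     # writes the full trace to profile_output.json
print(profiler.dumps(reset=False))  # prints the aggregate summary to the console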

Suggested Addition:

In order to enable easy programmatic usage of the MXNet profiler output, we propose to introduce a new API that returns the summary as a JSON string. This enables users to run the profiler, get the summary output, and perform analysis programmatically.


Code Block
languagepy
mxnet.profiler.get_summary(reset=False)
    """Gets the current profiler summary as a JSON string. If reset is True,
    resets all the aggregate statistics collected up to this point, i.e., it
    clears all the profiler counters.

    Parameters
    ----------
    reset : boolean
        If True, resets all profiler statistics collected up to this point.
    """

Output:

We can visualize the output of this API as a JSON representation of the output from the `mxnet.profiler.dumps(reset=False)` API, as shown below.

However, please note that the memory profile output below is not the total bytes allocated; the current output from dumps provides the number of memory allocation calls made.

In the newly suggested API, we will add an additional summary - Memory => Total Bytes Allocated (Per Device).

(Image: sample JSON output of the profiler summary)

...


API / User Experience

We can define two types of users of the library and describe the API interface for each of them.

...

  1. ~150 most commonly used operators will be tested on CPU (with and without MKL) and GPU, with FP32 and FP64. See Appendix 1 for the list of operators.
  2. Operators will be tested with the NDArray and Gluon interfaces only, i.e., the Symbol interface is not used for testing owing to deprecation plans.
  3. The Python interface is used, along with the MXNet profiler - the fastest way to get a check in place.
  4. Only timing is measured to start with.
  5. Statistics - mean of the metric.

...

  1. Cover the remaining operators left out of Phase 1.
  2. Support memory performance measurements by integrating with the MXNet profiler to capture time and memory metrics.
  3. Add more statistics - p50, p90, p99, min, max (see the sketch after this list).
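As an illustration, the additional statistics could be computed roughly as follows (a sketch, assuming the timings are collected as a list of per-run latencies in milliseconds):

Code Block
languagepy
import numpy as np

def compute_statistics(timings):
    """Sketch - summary statistics over per-run latencies (in ms)."""
    t = np.asarray(timings)
    return {"mean": float(t.mean()),
            "p50": float(np.percentile(t, 50)),
            "p90": float(np.percentile(t, 90)),
            "p99": float(np.percentile(t, 99)),
            "min": float(t.min()),
            "max": float(t.max())}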

Phase 3

  1. Explore and add C++ performance tests for the most commonly used operators. This will give true measurements compared to using the Python interface.
  2. Integrate with property-based testing libraries like Hypothesis to randomly generate test cases with different tensor shapes and inputs (a sketch follows this list).
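For illustration, Hypothesis-based shape generation might look like the following sketch (the strategy bounds are arbitrary and chosen here only for the example):

Code Block
languagepy
import mxnet as mx
from hypothesis import given, settings, strategies as st

# Randomly generated 2D shapes for an elementwise add test.
shapes = st.tuples(st.integers(min_value=1, max_value=1024),
                   st.integers(min_value=1, max_value=1024))

@settings(deadline=None)
@given(shape=shapes)
def test_add_random_shapes(shape):
    lhs = mx.nd.ones(shape=shape)
    rhs = mx.nd.ones(shape=shape)
    res = mx.nd.add(lhs, rhs)
    assert res.shape == shape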

...