...

  1. Provide a generic utility for executing operator benchmarks and performance tests.
    1. This utility is responsible for creating input tensors of the required shape, with a given dtype, on a given context.
    2. Execute the provided operator - forward only, or forward + backward.
    3. This generic utility will be integrated with the MXNet profiler.
    4. Capture the profile output from the MXNet profiler - time, memory.
    5. Return a dictionary of results.
  2. Input for the performance tests will be a key/value config.

...

Code Block
languagepy
"""
MXNet operator performance benchmarks.

NOTE:
1. You can pass a list of input dictionaries to run benchmarks for an operator with different input configurations.
2. Results are a dictionary of time and memory for the benchmark runs.
"""

# Run performance test for Add operator
results = run_performance_test(F=mx.nd.add, ctx=mx.cpu(), warmup=10, runs=50,
                               inputs=[{"lhs": (1024, 1024),
                                        "rhs": (1024, 1024),
                                        "initializer": nd.normal,
                                        "run_backward": True,
                                        "dtype": "float32"}])

# Run performance test for Conv2D operator
results += run_performance_test(F=mx.gluon.nn.Conv2D, ctx=mx.cpu(), warmup=10, runs=50,
                                inputs=[{"data": (32, 3, 256, 256),
                                         "data_initializer": nd.normal,
                                         "channels": 64,
                                         "kernel_size": (3, 3),
                                         "strides": (1, 1),
                                         "padding": (0, 0),
                                         "dilation": (1, 1),
                                         "layout": "NCHW",
                                         "activation": None,
                                         "run_backward": True,
                                         "dtype": "float32"}])

...


What does the backend profiling utility code look like?

Below, we take the example of profiling the Add operator.

Code Block
languagepy
import mxnet as mx
from mxnet import profiler

# Configurations
warmup = 25
runs = 50
run_backward = True

# Operator to benchmark
F = mx.nd.add

# Prepare data for the operator
lhs = mx.nd.ones(shape=(1024, 1024))
rhs = mx.nd.ones(shape=(1024, 1024))
lhs.attach_grad()
rhs.attach_grad()
mx.nd.waitall()

# Warmup
print("Warming up....")
for _ in range(warmup):
    with mx.autograd.record():
        res = mx.nd.add(lhs, rhs)
    res.backward()
    mx.nd.waitall()
print("Done warming up....")

# Run Performance Runs
print("Running performance runs....")
profiler.set_config(profile_all=True, aggregate_stats=True)
# Start Profiler
profiler.set_state('run')
for _ in range(runs):
    with mx.autograd.record():
        res = mx.nd.add(lhs, rhs)
    res.backward()
    mx.nd.waitall()

# Stop Profiler 
profiler.set_state('stop')

# Fetch Results from Profiler
# We will add a new API in Profiler - profiler.get_summary(reset=True)
# profiler.get_summary() => Return a JSON string representing the output as shown below.
#                        => Resets all the counter in the current profiler.

print("Done Running performance runs....")
print(profiler.dumps(reset=True))


Pros

  1. No need to write one class per operator to set up a performance test. Whenever a new operator is created, the developer only needs to add a `run_performance_test(..)` line with a list of inputs to run performance tests on. A generic utility will handle the execution.
  2. Less code, easy to maintain.
  3. More control for users - default inputs, random inputs, specific user-defined inputs.
  4. Deterministic and better suited for performance benchmarks, reproducibility and CI integration.
  5. More accurate benchmark results - time and memory - because we use the MXNet profiler.
  6. With Python interface:
    1. Easy to maintain and develop.
    2. Reflects the performance as seen by the users. (The majority of users use the Python interface.)
    3. Fastest way to get performance tests in place. We do not have any tests in place as of today.

Cons

  1. Different operators have different input names. For example, as seen above, the add operator requires tensors named lhs and rhs, whereas the Conv2D operator requires a tensor named data. The base performance executor utility needs to understand this and create tensors appropriately, i.e., with one single executor, generalizing across operators may make the logic complex to manage.
  2. Not easily extensible:
    1. Hard to integrate with property-based testing libraries like Hypothesis to randomly generate test cases with different tensor shapes.

Addition of new Module

We propose to add this utility as a new module (opperf) under incubator-mxnet/benchmark, i.e., as "incubator-mxnet/benchmark/opperf". Note that this does not add any user-facing APIs; it is a utility under the incubator-mxnet/benchmark folder for general use by the community.

Addition of new API

We propose to add a new API to MXNet Profiler for easily fetching operator profile for processing programmatically.

1) mxnet.profiler.get_summary(reset=False)

Current Behavior:

Users can either use the `mxnet.profiler.dump()` API to output the profiler results as a JSON file, or use the `mxnet.profiler.dumps(reset=False)` API to print the summary to the console.

Suggested Addition:

To enable easy programmatic usage of the MXNet profiler output, we propose to introduce a new API that returns the summary as a JSON string. This enables users to run the profiler, get the summary output, and perform analysis programmatically (see the usage sketch further below).


Code Block
languagepy
mxnet.profiler.get_summary(reset=False)
    """Gets the current profiler summary as a JSON string. If reset is True, resets all the
    aggregate statistics collected up to this point, i.e., clears all the profiler counters.

    Parameters
    ----------
    reset: boolean
        If True, resets all profiler statistics collected up to this point.
    """

Output:

We can visualize the output of this API as a JSON representation of the output from the `mxnet.profiler.dumps(reset=False)` API, as shown below.

However, please note that, in the output below, the memory profile is not the total bytes allocated; the current output from dumps provides the number of memory allocation calls made.

In the newly suggested API, we will add an additional summary - Memory => Total Bytes Allocated (Per Device).

(Image: sample JSON representation of the profiler summary output.)
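
For illustration, a possible programmatic usage of the proposed API is sketched below. This is hypothetical, since `get_summary` does not exist yet, and the `"operators"` key used in the loop is illustrative; the exact JSON schema will mirror the `profiler.dumps()` output.

Code Block
languagepy
import json
from mxnet import profiler

profiler.set_config(profile_all=True, aggregate_stats=True)
profiler.set_state('run')
# ... run the operator under test here ...
profiler.set_state('stop')

# Proposed API: returns the aggregate summary as a JSON string and resets the counters.
summary = json.loads(profiler.get_summary(reset=True))

# Example analysis: iterate over per-operator statistics.
# NOTE: "operators" is an illustrative key; the real schema follows profiler.dumps().
for op_name, op_stats in summary.get("operators", {}).items():
    print(op_name, op_stats)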

API / User Experience

We can define 2 types of users of the library and describe the API interface for each of these users.

  1. General User, Automated Nightly tests
    1. Run benchmarks on all the operators or on specific categories of operators. Use default inputs provided by the library.
  2. Power User, PR validation tests
    1. Run benchmark with customized Inputs

Use Case 1 - Run benchmarks for all the operators

A driver to run benchmarks for all MXNet operators (NDArray and Gluon) with default inputs and save the final results as JSON to the provided file.

Code Block
python incubator-mxnet/benchmark/opperf/run_all_mxnet_operator_benchmarks.py --output-format json --output-file mxnet_operator_benchmark_results.json

Other Driver Script CLI Options:

  1. output-format : json, md (markdown), or csv.
  2. ctx : By default, cpu on a CPU machine and gpu(0) on a GPU machine. You can override and set the global context for all operator benchmarks. Example: --ctx gpu(2).
  3. dtype : By default, float32. You can override and set the global dtype for all operator benchmarks. Example: --dtype float64. An example invocation using these options follows the sample output below.

Output for the above benchmark run, on a CPU machine, would look something like below:

Code Block
{
    "MX_Multiply_Forward_Backward_Time": 0.025911798477172853,
    "MX_Gluon_Imperative_RNN_Forward_Backward_Time": 0.011011338233947754,
    "MX_Gluon_Imperative_MaxPool2D_Forward_Backward_Time": 0.1580966854095459,
    "MX_Gluon_Imperative_Conv1D_Forward_Backward_Time": 0.03413449287414551,
    "MX_Ones_Forward_Time": 0.002405076026916504,
    "MX_Modulo_Forward_Backward_Time": 0.049943366050720216,
    "MX_Subtract_Forward_Backward_Time": 0.01635995864868164,
    "MX_ArgMin_Forward_Backward_Time": 0.01545732021331787,
    "MX_Logical_Xor_Forward_Backward_Time": 0.018084139823913575,
    "MX_Zeros_Like_Forward_Time": 0.0027973604202270507,
    "MX_Inplace_Multiply_Forward_Time": 0.005555639266967774,
    "MX_ArgSort_Forward_Time": 0.13972537994384765,
    "MX_Arange_Forward_Time": 0.00010946273803710938,
........
........
}
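
As an illustration of the CLI options above, a hypothetical invocation that overrides the context, dtype and output format (flag names as listed above) would look like:

Code Block
python incubator-mxnet/benchmark/opperf/run_all_mxnet_operator_benchmarks.py --ctx gpu(0) --dtype float64 --output-format md --output-file mxnet_operator_benchmark_results.md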

Use Case 2 - Power user - Run benchmarks for specific operator

As a power user, let us assume you want to run benchmarks on the Add operator with a float64 tensor instead of the default float32.
NOTE: Similarly, you could also specify the input tensors to use for benchmarking.

Use Case 2.1 - Customize Inputs for Operators

Code Block
results = run_performance_test(F=mx.nd.add, ctx=mx.cpu(), warmup=10, runs=50,
                               inputs=[{"lhs": (1024, 1024),
                                        "rhs": (1024, 1024),
                                        "initializer": nd.normal,
                                        "run_backward": True,
                                        "dtype": "float64"}])

Output for the above benchmark run, on a CPU machine, would look something like below:

Code Block
MX_Add_Forward_Backward_Time - 0.025401 seconds

Use Case 3 - Nightly CI Tests

  1. We will maintain a JSON file of expected performance for each operator under "incubator-mxnet/benchmark/opperf".
  2. These expected results are captured for different configurations such as FP32/64/16, MKL, no MKL, CUDA 10, and instance types (c5.16x, p3.8x).
  3. Run all the operator performance benchmarks and get the results JSON.
  4. Compare against the expected results within a +/- % threshold (a sketch of this comparison is shown below).
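
A minimal sketch of how such a nightly comparison could work is shown below. The expected-results file name and the threshold value are illustrative assumptions; only the output-file name comes from the driver example above.

Code Block
languagepy
import json

THRESHOLD_PCT = 10  # illustrative +/- threshold

# Expected results checked in under incubator-mxnet/benchmark/opperf (illustrative file name)
with open("expected_operator_benchmark_results.json") as f:
    expected = json.load(f)

# Results produced by the nightly benchmark run
with open("mxnet_operator_benchmark_results.json") as f:
    actual = json.load(f)

regressions = []
for op_name, expected_time in expected.items():
    actual_time = actual.get(op_name)
    if actual_time is None:
        continue  # operator not benchmarked in this run
    if actual_time > expected_time * (1 + THRESHOLD_PCT / 100.0):
        regressions.append((op_name, expected_time, actual_time))

for op_name, exp_t, act_t in regressions:
    print("Possible regression in %s: expected %.6fs, got %.6fs" % (op_name, exp_t, act_t))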

Development Plan / Milestones

Phase 1

  1. The ~150 most commonly used operators will be tested on CPU (with and without MKL) and GPU, with FP32 and FP64. See Appendix 1 for the list of operators.
  2. Operators will be tested with the NDArray and Gluon interfaces only, i.e., the Symbol interface is not used for testing owing to deprecation plans.
  3. The Python interface is used along with the MXNet profiler.
  4. Time and memory usage are measured to start with.
  5. Statistics - mean of the metric.

Phase 2

  1. Cover remaining operators left out from Phase 1.
  2. Add more statistics - p50, p90, p99, min, max.

Phase 3

  1. Explore adding C++ performance tests for the most commonly used operators. This will give truer measurements compared to using the Python interface.
  2. Integrate with property-based testing libraries like Hypothesis to randomly generate test cases with different tensor shapes and inputs.

Current Status

See this repo for more details - https://github.com/sandeep-krishnamurthy/dl-operator-benchmark

  1. 134 operators are supported:
    1. All Gluon Layers - Activation, Loss, Normalization, Basic like Dense, Convolutions, Recurrent (RNN, LSTM, GRU)
    2. NDArray operators like creation, random sampling, arithmetic, logical, comparison etc...
  2. Able to run individual operator benchmarks or use high level drivers to run all tests.
  3. Able to generate results as JSON.
  4. Timing metric - forward only, forward + backward operation.


Future Development and Ideas

  1. Integration with the MXNet profiler to capture time and memory usage.
  2. Integrate with property-based testing libraries like Hypothesis to randomly generate test cases with different tensor shapes and inputs.

Alternate Solutions

Alternate Solution 1 - Use Python Classes for each Operator instead of Config

...

Code Block
languagepy
class Add(MXNetOperatorBenchmarkBase):
    """Helps to Benchmark Tensor Add operation.

    By default benchmark both forward and backward element_wise tensor addition
    of 1024*1024 tensor of precision - 'float32'.

    """

    def __init__(self, ctx=mx.cpu(), warmup=10, runs=50, inputs=None):
        # Set the default Inputs
        default_parameters = {"lhs": (1024, 1024),
                              "rhs": (1024, 1024),
                              "initializer": nd.normal,
                              "run_backward": True,
                              "dtype": "float32"}

        super().__init__(ctx=ctx, warmup=warmup, runs=runs, default_parameters=default_parameters,
                         custom_parameters=inputs)

        self.lhs = get_mx_ndarray(ctx=self.ctx, in_tensor=self.inputs["lhs"],
                                  dtype=self.inputs["dtype"],
                                  initializer=self.inputs["initializer"],
                                  attach_grad=self.inputs["run_backward"])
        self.rhs = get_mx_ndarray(ctx=self.ctx, in_tensor=self.inputs["rhs"],
                                  dtype=self.inputs["dtype"],
                                  initializer=self.inputs["initializer"],
                                  attach_grad=self.inputs["run_backward"])

    def run_benchmark(self):
        # Warm up, ignore execution time value
        _, _ = nd_forward_backward_and_time(F=nd.add, runs=self.warmup, lhs=self.lhs, rhs=self.rhs)
        # Run Benchmarks
        exe_time, _ = nd_forward_backward_and_time(F=nd.add, runs=self.runs, lhs=self.lhs, rhs=self.rhs)

        self.results["MX_Add_Forward_Backward_Time"] = exe_time / self.runs

API / User Experience

We can define 2 types of users of the library and describe the API interface for each of these users.

  1. General User, Automated Nightly tests
    1. Run benchmarks on all the operators or on specific categories of operators. Use default inputs provided by the library.
  2. Power User, PR validation tests
    1. Run benchmarks with customized inputs.

USE CASE 1 - Run benchmarks for all the operators

A driver to run benchmarks for all MXNet operators (NDArray and Gluon) with default inputs and save the final results as JSON to the provided file.

Code Block
python dl-operator-benchmark/run_all_mxnet_operator_benchmarks.py --output-format json --output-file mxnet_operator_benchmark_results.json

Other Driver Script CLI Options:

  1. output-format : json, md (markdown), or csv.
  2. ctx : By default, cpu on a CPU machine and gpu(0) on a GPU machine. You can override and set the global context for all operator benchmarks. Example: --ctx gpu(2).
  3. dtype : By default, float32. You can override and set the global dtype for all operator benchmarks. Example: --dtype float64.

USE CASE 2 - Run benchmarks for all the operators in a specific category

For example, if you want to run benchmarks for all NDArray arithmetic operators, the library will provide drivers to easily run benchmarks on operators of a specific category.

Code Block
languagepy
from mxnet_benchmarks.nd import run_all_arithmetic_operations_benchmarks
# Run all Arithmetic operations benchmarks with default input values
run_all_arithmetic_operations_benchmarks()

Output for the above benchmark run, on a CPU machine, would look something like below:

Code Block
MX_Add_Forward_Backward_Time - 0.015201 seconds
MX_Multiply_Forward_Backward_Time - 0.021678 seconds
MX_Subtract_Forward_Backward_Time - 0.016154 seconds
MX_Divide_Forward_Backward_Time - 0.024327 seconds
MX_Modulo_Forward_Backward_Time - 0.045726 seconds
MX_Power_Forward_Backward_Time - 0.077152 seconds
MX_Negative_Forward_Backward_Time - 0.014472 seconds
MX_Inplace_Add_Forward_Time - 0.003824 seconds
MX_Inplace_Subtract_Forward_Time - 0.004137 seconds
MX_Inplace_Multiply_Forward_Time - 0.006589 seconds
MX_Inplace_Division_Forward_Time - 0.003869 seconds
MX_Inplace_Modulo_Forward_Time - 0.018180 seconds

Use Case 3 - Power user - Run benchmarks for specific operator

As a power user, if you want to run benchmarks for the nd.add operator in MXNet, you just run the following Python script.
Note that we maintain the same name and spec as the underlying MXNet operator. For example, to benchmark nd.add, we use mxnet_benchmarks.nd.Add().

USE CASE 3.1 - Default Inputs for Operators

Code Block
from mxnet_benchmarks.nd import Add
# Run the Add operator benchmark with default input values
add_benchmark = Add()
add_benchmark.run_benchmark()
add_benchmark.print_benchmark_results()

Output for the above benchmark run, on a CPU machine, would look something like below:

Code Block
MX_Add_Forward_Backward_Time - 0.015201 seconds

USE CASE 3.2 - Customize Inputs for Operators

As a power user, let us assume you want to run benchmarks on a float64 tensor instead of the default float32.
NOTE: Similarly, you could also specify the input tensors to use for benchmarking.

Code Block
languagepy
from mxnet_benchmarks.nd import Add
# Run the Add operator benchmark with a custom dtype (float64)
add_benchmark = Add(inputs={"dtype": "float64"})
add_benchmark.run_benchmark()
add_benchmark.print_benchmark_results()

Output for the above benchmark run, on a CPU machine, would look something like below:

Code Block
MX_Add_Forward_Backward_Time - 0.025405 seconds

NOTE: You can print the input parameters used for a benchmark as shown below.

Code Block
from mxnet_benchmarks.nd import Add
# Create the Add benchmark with a custom dtype and print the input parameters used
add_benchmark = Add(inputs={"dtype": "float64"})
print(add_benchmark.inputs)

Output


Code Block
{'lhs': (1024, 1024), 'rhs': (1024, 1024), 'initializer': <function normal at 0x117b607b8>, 'run_backward': True, 'dtype': 'float64'}

Pros

  1. More control for users - default inputs, random inputs, specific user-defined inputs.
  2. Deterministic and better suited for performance benchmarks, reproducibility and CI integration.
  3. With Python interface:
    1. Easy to maintain and develop.
    2. Reflects the performance as seen by the users. (The majority of users use the Python interface.)
    3. Fastest way to get performance tests in place. We do not have any tests in place as of today.
    4. Ability to run and compare benchmarks from other deep learning frameworks.
  4. Extensible:
    1. Can be integrated with property-based testing libraries like Hypothesis to randomly generate test cases with different tensor shapes.

Cons

  1. Need to write base tests for every new operator. If a new operator is added to MXNet, a new performance test class with default inputs for that operator needs to be added to this library.
  2. It is ideal to capture performance close to the kernel. Calls from Python operator APIs may hide performance regressions when the operator computation is small.

Alternate Solution 2 - Autogenerate test with Property Based Testing Technique

(Credits - Thanks to Pedro Larroy for this suggestion)

Approach

  1. Automatically query all operators registered with the MXNet engine.
  2. Infer the inputs and outputs for the operators.
  3. Use a property-based testing technique with a library such as Hypothesis to generate random inputs and run the tests (a sketch is shown below).
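
For illustration, a minimal sketch of what a Hypothesis-based input strategy could look like for an element-wise operator such as add (illustrative only; the strategy, shape limits and test name are assumptions, not part of the proposal):

Code Block
languagepy
import mxnet as mx
from hypothesis import given, settings, strategies as st

# Strategy: generate a random shape, then reuse the same shape for lhs and rhs so that
# element-wise add is always valid (broadcastable shapes would need a custom strategy).
shapes = st.lists(st.integers(min_value=1, max_value=64), min_size=1, max_size=4).map(tuple)

@settings(max_examples=20, deadline=None)
@given(shape=shapes)
def test_add_runs(shape):
    lhs = mx.nd.random.normal(shape=shape)
    rhs = mx.nd.random.normal(shape=shape)
    out = mx.nd.add(lhs, rhs)
    mx.nd.waitall()
    assert out.shape == shape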

Pros

  1. Any new operator added to MXNet will be automatically queried; hence, there is no need to write tests explicitly for every operator.
  2. Inputs are randomly generated; hence, this is better suited to catching performance regressions on corner cases.

Cons

  1. Non-deterministic inputs. Hence, this is better suited for functionality testing; it will be hard to use this technique for performance tests.
  2. Still requires us to write many custom strategies or conditional property files. Examples:
    1. For testing the Add operator, we need to set conditions on the input to generate the same shapes, or broadcastable shapes, for lhs and rhs.
    2. For the Convolution operator, we need to match kernel, padding and other parameter shapes appropriately.
  3. Querying operators and inferring the input conditions may require hard and complex logic.
    1. Example: add is an operator that takes 2 input tensors - lhs and rhs. We need to infer that the lhs and rhs tensors should be of the same size or broadcastable. The logic to handle such conditions may soon become complex enough to negate the advantage of auto-generated operator benchmarks.
    2. MXNet currently does not support a standard way of querying the registered operators. It would be ideal if MXNet could expose NNVM APIs for querying registered operators and their expected inputs, outputs, types and more.
  4. Complex and time-consuming. We do not have any operator performance tests for MXNet today; it would be ideal to revisit this approach as a future enhancement.

Alternate Solution 3 - Extend existing unit tests to cover performance parameters

<To add more details> In summary, it is hard and complex to modify all unit tests to measure performance along with currently designed way of writing tests which is designed towards - consistency across context, correctness, gradient checks.

Appendix

    1. Our objective is to capture the performance at the basic individual operator level.
    2. The Symbol API is planned to be deprecated soon for users.
    3. Users currently use NDArray operations or Gluon layers in imperative (NDArray) or hybrid (symbolic) mode.
    4. In Phase 2, we will benchmark Gluon hybrid layers individually, which should cover the symbolic operations exposed to users.
    5. Also, under the hood, the kernel is the same for NDArray and Symbol; hence, we are not missing any tests.

...

Phase 1

Functionality supported:

...