...

  1. Defaults => Common use cases should be extremely easy; customized, complex use cases should be possible.
    1. Example: I should be able to run Add operator benchmarks without specifying any inputs, and the library should provide benchmarks on valid default inputs. At the same time, as a power user, I should be able to provide my own inputs, such as tensor shapes and context, to run the benchmarks.
  2. Minimum Learning Curve => Keep APIs the same as, or close to, the native NDArray / Gluon operators being benchmarked.
    1. Example: If I am benchmarking the nd.add(lhs, rhs) operator, the interface in the benchmark utility should be similar, with zero learning curve.
  3. Modular and Reusable.
  4. Usable by a programmer or an automated system.
    1. Example: A developer using the library directly, or integration with CI/CD.

Proposed Approach

  1. This benchmark utility will be built on top of MXNet's NDArray and Gluon interfaces.
  2. A generic utility, `run_performance_test`, executes an operator benchmark / performance test:
    1. It creates input tensors of the required shape, with the given dtype, on the given context.
    2. It executes the provided operator - forward only, or forward + backward.
    3. It captures profile output - time and memory.
    4. It returns a dictionary of results.
  3. Input for the performance tests is a key/value config per operator.
  4. High-level drivers are provided to run operator benchmarks in bulk. Example: run_all_mxnet_operator_benchmarks(), run_all_arithmetic_operations_benchmarks(), etc.
  5. Results can be generated as a Python dictionary/JSON/CSV for consumption by upstream systems (e.g., CI, an automated performance monitoring system).

Below is an example of performance runs for operators. It uses a base utility `run_performance_test`.

Code Block
languagepy
"""
MXNet operator performance benchmarks.

NOTE:
1. You can pass a list of input dictionaries to run benchmarks for an operator with different input configurations.
2. Results are a dictionary of time and memory for the benchmark runs.
"""

# Run performance test for Add operator
results = run_performance_test(F=mx.nd.add, ctx=mx.cpu(), warmup=10, runs=50,
                               inputs=[{"lhs": (1024, 1024),
                                        "rhs": (1024, 1024),
                                        "initializer": nd.normal,
                                        "run_backward": True,
                                        "dtype": "float32"}])

# Run performance test for Conv2D operator
results += run_performance_test(F=mx.gluon.nn.Conv2D, ctx=mx.cpu(), warmup=10, runs=50,
                                inputs=[{"data": (32, 3, 256, 256),
                                         "data_initializer": nd.normal,
                                         "channels": 64,
                                         "kernel_size": (3, 3),
                                         "strides": (1, 1),
                                         "padding": (0, 0),
                                         "dilation": (1, 1),
                                         "layout": "NCHW",
                                         "activation": None,
                                         "run_backward": True,
                                         "dtype": "float32"}])
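To make the proposed utility concrete, below is a minimal sketch of what a generic `run_performance_test` could look like. This is an illustration under assumptions, not the proposed implementation: the shape-detection rule, the handling of the initializer key, and the result-dictionary keys are placeholders.

Code Block
languagepy
import time

import mxnet as mx
from mxnet import autograd, nd


def _as_input(value, ctx, dtype, run_backward):
    """Turn shape tuples into NDArrays; pass every other parameter through unchanged."""
    if isinstance(value, tuple):
        tensor = nd.normal(shape=value, ctx=ctx, dtype=dtype)
        if run_backward:
            tensor.attach_grad()
        return tensor
    return value


def run_performance_test(F, ctx=mx.cpu(), warmup=10, runs=50, inputs=None):
    # NOTE: Gluon blocks (e.g., Conv2D) would need to be instantiated and
    # initialized first; that handling is omitted here for brevity.
    results = []
    for config in inputs or []:
        run_backward = config.get("run_backward", False)
        dtype = config.get("dtype", "float32")
        # Keys like run_backward/dtype/initializer configure the run itself;
        # everything else is forwarded to the operator as an argument.
        kwargs = {key: _as_input(value, ctx, dtype, run_backward)
                  for key, value in config.items()
                  if key not in ("run_backward", "dtype", "initializer")}

        def _single_run():
            if run_backward:
                with autograd.record():
                    out = F(**kwargs)
                out.backward()
            else:
                out = F(**kwargs)
            nd.waitall()  # block until async execution finishes so timing is correct

        for _ in range(warmup):  # warm-up runs, timing ignored
            _single_run()
        start = time.time()
        for _ in range(runs):
            _single_run()
        avg_time = (time.time() - start) / runs

        results.append({"operator": getattr(F, "__name__", str(F)),
                        "avg_time_sec": avg_time,
                        "inputs": config})
    return results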



Pros

  1. No need to write one class per operator to set up a performance test. Whenever a new operator is created, the developer only needs to add a `run_performance_test(..)` call with a list of inputs; a generic utility handles the execution.
  2. Less code, easy to maintain.
  3. More control for users - default inputs, random inputs, specific user-defined inputs.
  4. Deterministic and better suited for performance benchmarks, reproducibility, and CI integration.
  5. With a Python interface:
    1. Easy to maintain and develop.
    2. Reflects the performance as seen by users. (The majority of users use the Python interface.)
    3. Fastest way to get performance tests in place; we do not have any such tests today.

Cons

  1. Different operators have different input names. For example, as seen above, the add operator requires tensors named lhs and rhs, whereas the Conv2D operator requires a tensor named data. The base performance executor utility needs to understand this and create tensors appropriately; a single generic executor that generalizes across operators may become complex to manage.
  2. Not easily extensible:
    1. Hard to integrate with property-based testing libraries like Hypothesis to randomly generate test cases with different tensor shapes.
  3. Ideally, performance should be captured close to the kernel. Calling through the Python operator APIs may hide performance regressions when the operator computation is small.

Addition of new Module

We propose to add this utility as a new module (opperf) under incubator-mxnet/benchmark, i.e., "incubator-mxnet/benchmark/opperf". Note that this does not add any user-facing APIs; it is a utility under the incubator-mxnet/benchmark folder for general use by the community.

API / User Experience

We define two types of users of the library and describe the API for each of them.

  1. General User, Automated Nightly tests
    1. Run benchmarks on all the operators or on specific categories of operators. Use default inputs provided by the library.
  2. Power User, PR validation tests
    1. Run benchmark with customized Inputs

Use Case 1 - Run benchmarks for all the operators

A driver runs all the MXNet operator (NDArray and Gluon) benchmarks with default inputs and saves the final results as JSON in the provided file.

Code Block
python incubator-mxnet/benchmark/opperf/run_all_mxnet_operator_benchmarks.py --output-format json --output-file mxnet_operator_benchmark_results.json

Other Driver Script CLI Options:

  1. output-format : json, md (markdown), or csv.
  2. ctx : By default, cpu on a CPU machine and gpu(0) on a GPU machine. You can override and set the global context for all operator benchmarks. Example: --ctx gpu(2).
  3. dtype : By default, float32. You can override and set the global dtype for all operator benchmarks. Example: --dtype float64.
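
For example, the options above can be combined in a single invocation (the output file name here is just an illustration; the context value is quoted so the shell does not interpret the parentheses):

Code Block
python incubator-mxnet/benchmark/opperf/run_all_mxnet_operator_benchmarks.py --ctx "gpu(0)" --dtype float64 --output-format md --output-file mxnet_operator_benchmark_results.md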

Output for the above benchmark run, on a CPU machine, would look something like below:

Code Block
{
    "MX_Multiply_Forward_Backward_Time": 0.025911798477172853,
    "MX_Gluon_Imperative_RNN_Forward_Backward_Time": 0.011011338233947754,
    "MX_Gluon_Imperative_MaxPool2D_Forward_Backward_Time": 0.1580966854095459,
    "MX_Gluon_Imperative_Conv1D_Forward_Backward_Time": 0.03413449287414551,
    "MX_Ones_Forward_Time": 0.002405076026916504,
    "MX_Modulo_Forward_Backward_Time": 0.049943366050720216,
    "MX_Subtract_Forward_Backward_Time": 0.01635995864868164,
    "MX_ArgMin_Forward_Backward_Time": 0.01545732021331787,
    "MX_Logical_Xor_Forward_Backward_Time": 0.018084139823913575,
    "MX_Zeros_Like_Forward_Time": 0.0027973604202270507,
    "MX_Inplace_Multiply_Forward_Time": 0.005555639266967774,
    "MX_ArgSort_Forward_Time": 0.13972537994384765,
    "MX_Arange_Forward_Time": 0.00010946273803710938,
........
........
}

Use Case 2 - Power user - Run benchmarks for a specific operator

As a power user, let us assume you want to run benchmarks for the nd.add operator with your own inputs - for example, on a float64 tensor instead of the default float32. You call `run_performance_test` with the operator and the customized input config.
NOTE: Similarly, you could also specify the input tensors to use for benchmarking.

Code Block
languagepy
results = run_performance_test(F=mx.nd.add, ctx=mx.cpu(), warmup=10, runs=50,
                               inputs=[{"lhs": (1024, 1024),
                                        "rhs": (1024, 1024),
                                        "initializer": nd.normal,
                                        "run_backward": True,
                                        "dtype": "float64"}])

Output for the above benchmark run, on a CPU machine, would look something like below:

Code Block
MX_Add_Forward_Backward_Time - 0.025401 seconds

Use Case 3 - Nightly CI Tests

  1. We will maintain a JSON file of expected performance for each operator under "incubator-mxnet/benchmark/opperf".
  2. These expected results are captured for different configurations, such as FP32/FP64/FP16, MKL / no MKL, CUDA 10, and instance types (c5.16x, p3.8x).
  3. The nightly job runs all the operator benchmarks and collects the results JSON.
  4. It compares the results against the expected results within a +/- % threshold, as sketched below.
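
The comparison step could be as simple as the following sketch. The file names and the threshold value are assumptions for illustration; the real expected-results file would be maintained under incubator-mxnet/benchmark/opperf.

Code Block
languagepy
import json

THRESHOLD_PCT = 10.0  # allowed +/- deviation, in percent (placeholder value)

with open("expected_operator_benchmark_results.json") as f:
    expected = json.load(f)
with open("mxnet_operator_benchmark_results.json") as f:
    actual = json.load(f)

regressions = []
for name, expected_time in expected.items():
    actual_time = actual.get(name)
    if actual_time is None:
        continue  # operator missing from this run; handled separately
    deviation_pct = (actual_time - expected_time) / expected_time * 100.0
    if deviation_pct > THRESHOLD_PCT:
        regressions.append((name, expected_time, actual_time, deviation_pct))

if regressions:
    for name, exp, act, dev in regressions:
        print("REGRESSION: {} expected {:.6f}s, got {:.6f}s (+{:.1f}%)".format(name, exp, act, dev))
    raise SystemExit(1)

print("All operator benchmarks are within +/-{}% of the expected results.".format(THRESHOLD_PCT))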


Future Development and Ideas

  1. Integration with the MXNet Profiler to capture time and memory usage (see the sketch below).
  2. Integration with property-based testing libraries like Hypothesis, to randomly generate test cases with different tensor shapes and inputs.
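
As an illustration of the first idea, MXNet's built-in profiler can already wrap an operator run as shown below; how exactly this hooks into the benchmark utility is still open, so treat this as a sketch.

Code Block
languagepy
import mxnet as mx
from mxnet import nd

# Record all events and aggregate per-operator statistics.
mx.profiler.set_config(profile_all=True, aggregate_stats=True, filename='add_operator_profile.json')

lhs = nd.normal(shape=(1024, 1024))
rhs = nd.normal(shape=(1024, 1024))

mx.profiler.set_state('run')   # start profiling
out = nd.add(lhs, rhs)
nd.waitall()                   # wait for async execution before stopping the profiler
mx.profiler.set_state('stop')  # stop profiling

# Aggregated statistics: per-operator execution time, memory pool usage, etc.
print(mx.profiler.dumps())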


Alternate Solutions

Alternate Solution 1 - Use Python Classes for each Operator instead of Config

Approach

  1. This benchmark utility would be built on top of MXNet's NDArray and Gluon interfaces.
  2. For each operator in NDArray and Gluon Block, there would be a corresponding benchmark class in the library, with a list of default inputs and functionality to process results. See the example below for Add operator benchmarks.
  3. High-level drivers are provided to run operator benchmarks in bulk. Example: run_all_mxnet_operator_benchmarks(), run_all_arithmetic_operations_benchmarks(), etc.
  4. Results can be generated as a Python dictionary/JSON/CSV for consumption by upstream systems (e.g., CI, an automated performance monitoring system).

Code Block
languagepy
class Add(MXNetOperatorBenchmarkBase):
    """Helps to Benchmark Tensor Add operation.

    By default benchmark both forward and backward element_wise tensor addition
    of 1024*1024 tensor of precision - 'float32'.

    """

    def __init__(self, ctx=mx.cpu(), warmup=10, runs=50, inputs=None):
        # Set the default Inputs
        default_parameters = {"lhs": (1024, 1024),
                              "rhs": (1024, 1024),
                              "initializer": nd.normal,
                              "run_backward": True,
                              "dtype": "float32"}

        super().__init__(ctx=ctx, warmup=warmup, runs=runs,
                         default_parameters=default_parameters,
                         custom_parameters=inputs)

        self.lhs = get_mx_ndarray(ctx=self.ctx, in_tensor=self.inputs["lhs"],
                                  dtype=self.inputs["dtype"],
                                  initializer=self.inputs["initializer"],
                                  attach_grad=self.inputs["run_backward"])
        self.rhs = get_mx_ndarray(ctx=self.ctx, in_tensor=self.inputs["rhs"],
                                  dtype=self.inputs["dtype"],
                                  initializer=self.inputs["initializer"],
                                  attach_grad=self.inputs["run_backward"])

    def run_benchmark(self):
        # Warm up, ignore execution time value
        _, _ = nd_forward_backward_and_time(F=nd.add, runs=self.warmup, lhs=self.lhs, rhs=self.rhs)
        # Run Benchmarks
        exe_time, _ = nd_forward_backward_and_time(F=nd.add, runs=self.runs, lhs=self.lhs, rhs=self.rhs)

        self.results["MX_Add_Forward_Backward_Time"] = exe_time / self.runs
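
For completeness, here is a minimal sketch of the base class and helpers the Add benchmark above relies on (MXNetOperatorBenchmarkBase, get_mx_ndarray, nd_forward_backward_and_time). Only the names come from the example above; the bodies shown are assumptions for illustration.

Code Block
languagepy
import time

import mxnet as mx
from mxnet import autograd, nd


def get_mx_ndarray(ctx, in_tensor, dtype, initializer, attach_grad=True):
    """Create an NDArray of shape `in_tensor` on `ctx`; optionally attach a gradient buffer."""
    tensor = initializer(shape=in_tensor, dtype=dtype, ctx=ctx)
    if attach_grad:
        tensor.attach_grad()
    return tensor


def nd_forward_backward_and_time(F, runs, **kwargs):
    """Time `runs` iterations of forward + backward for an NDArray operator."""
    start = time.time()
    for _ in range(runs):
        with autograd.record():
            out = F(**kwargs)
        out.backward()
        nd.waitall()  # wait for async execution so the timing is meaningful
    return time.time() - start, out


class MXNetOperatorBenchmarkBase:
    def __init__(self, ctx, warmup, runs, default_parameters, custom_parameters=None):
        self.ctx = ctx
        self.warmup = warmup
        self.runs = runs
        self.results = {}
        # Custom parameters, when provided, override the defaults.
        self.inputs = dict(default_parameters)
        if custom_parameters:
            self.inputs.update(custom_parameters)

    def run_benchmark(self):
        raise NotImplementedError()

    def print_benchmark_results(self):
        for key, value in self.results.items():
            print("{} - {:.6f} seconds".format(key, value))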

API / User Experience

We define two types of users of the library and describe the API for each of them.

  1. General User, Automated Nightly tests
    1. Run benchmarks on all the operators or on specific categories of operators. Use default inputs provided by the library.
  2. Power User, PR validation tests
    1. Run benchmarks with customized inputs.

USE CASE 1 - Run benchmarks for all the operators

A driver runs all the MXNet operator (NDArray and Gluon) benchmarks with default inputs and saves the final results as JSON in the provided file.

Code Block
python incubator-mxnet/benchmark/opperf/run_all_mxnet_operator_benchmarks.py --output-format json --output-file mxnet_operator_benchmark_results.json

Other Driver Script CLI Options:

  1. output-format : json, md (markdown), or csv.
  2. ctx : By default, cpu on a CPU machine and gpu(0) on a GPU machine. You can override and set the global context for all operator benchmarks. Example: --ctx gpu(2).
  3. dtype : By default, float32. You can override and set the global dtype for all operator benchmarks. Example: --dtype float64.

USE CASE 2 - Run benchmarks for all the operators in a specific category

For example, if you want to run benchmarks for all NDArray arithmetic operators, the library provides drivers to easily run benchmarks on all operators of a specific category.

Code Block
languagepy
from mxnet_benchmarks.nd import run_all_arithmetic_operations_benchmarks
# Run all Arithmetic operations benchmarks with default input values
run_all_arithmetic_operations_benchmarks()

Output for the above benchmark run, on a CPU machine, would look something like below:

Code Block
MX_Add_Forward_Backward_Time - 0.015201 seconds
MX_Multiply_Forward_Backward_Time - 0.021678 seconds
MX_Subtract_Forward_Backward_Time - 0.016154 seconds
MX_Divide_Forward_Backward_Time - 0.024327 seconds
MX_Modulo_Forward_Backward_Time - 0.045726 seconds
MX_Power_Forward_Backward_Time - 0.077152 seconds
MX_Negative_Forward_Backward_Time - 0.014472 seconds
MX_Inplace_Add_Forward_Time - 0.003824 seconds
MX_Inplace_Subtract_Forward_Time - 0.004137 seconds
MX_Inplace_Multiply_Forward_Time - 0.006589 seconds
MX_Inplace_Division_Forward_Time - 0.003869 seconds
MX_Inplace_Modulo_Forward_Time - 0.018180 seconds

Use Case 3 - Power user - Run benchmarks for specific operator

As a power user, if you want to run benchmarks for the nd.add operator in MXNet, you just run the following Python script.
Note that we maintain the same name and spec as the underlying MXNet operator. For example, to benchmark nd.add, we use mxnet_benchmarks.nd.Add().

USE CASE 3.1 - Default Inputs for Operators

Code Block
from mxnet_benchmarks.nd import Add
# Run the Add operator benchmark with default input values
add_benchmark = Add()
add_benchmark.run_benchmark()
add_benchmark.print_benchmark_results()

Output for the above benchmark run, on a CPU machine, would look something like below:

Code Block
MX_Add_Forward_Backward_Time - 0.015201 seconds

USE CASE 3.2 - Customize Inputs for Operators

As a power user, let us assume, you want to run benchmarks on a float64 tensor instead of a default float32.
NOTE: Similarly, you could also specify the input tensors to use for benchmarking.

Code Block
languagepy
from mxnet_benchmarks.nd import Add
# Run the Add operator benchmark with float64 inputs
add_benchmark = Add(inputs={"dtype": "float64"})
add_benchmark.run_benchmark()
add_benchmark.print_benchmark_results()

Output for the above benchmark run, on a CPU machine, would look something like below:


Code Block
MX_Add_Forward_Backward_Time - 0.025405 seconds


NOTE: You can print the input parameters used for a benchmark as shown below.

Code Block
from mxnet_benchmarks.nd import Add
# Create the Add benchmark with a custom dtype and print the inputs used
add_benchmark = Add(inputs={"dtype": "float64"})
print(add_benchmark.inputs)

Output


Code Block
{'lhs': (1024, 1024), 'rhs': (1024, 1024), 'initializer': <function normal at 0x117b607b8>, 'run_backward': True, 'dtype': 'float64'}

Pros

...

  1. More control for users - default inputs, random inputs, specific user-defined inputs.
  2. Deterministic and better suited for performance benchmarks, reproducibility, and CI integration.
  3. With a Python interface:
    1. Easy to maintain and develop.
    2. Reflects the performance as seen by users. (The majority of users use the Python interface.)
    3. Fastest way to get performance tests in place; we do not have any such tests today.
    4. Ability to run and compare benchmarks from other deep learning frameworks.
  4. Extensible:
    1. Can be integrated with property-based testing libraries like Hypothesis, to randomly generate test cases with different tensor shapes and inputs.

Cons

  1. Need to write a benchmark class for every new operator. If a new operator is added to MXNet, a new performance test class, with default inputs for that operator, needs to be added to this library.
  2. Ideally, performance should be captured close to the kernel. Calling through the Python operator APIs may hide performance regressions when the operator computation is small.

...