...

  1. We will maintain a JSON file of expected performance for each operator under "incubator-mxnet/benchmark/opperf".
  2. These expected results are captured for different configurations such as FP32/64/16, with and without MKL, CUDA 10, and instance types (c5.16x, p3.8x).
  3. The utility runs all the operator performance benchmarks and collects the results as JSON.
  4. The results are compared against the expected results within a +/- % threshold, as sketched below.
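
A minimal sketch of how such a comparison could work is shown below. The file layout, threshold value, and find_regressions() helper are illustrative assumptions, not the actual format.

Code Block
languagepy
import json

# Hypothetical layout of the expected-results file (illustrative only):
# {"add": {"c5.16x-fp32-mkl": {"forward_backward_time": 0.0152}}, ...}
THRESHOLD_PCT = 10.0  # assumed +/- threshold


def find_regressions(actual_results, expected_file, config_key):
    """Return operators whose measured time deviates beyond the threshold."""
    with open(expected_file) as f:
        expected = json.load(f)

    regressions = {}
    for op_name, actual_time in actual_results.items():
        expected_time = expected[op_name][config_key]["forward_backward_time"]
        deviation_pct = 100.0 * (actual_time - expected_time) / expected_time
        if abs(deviation_pct) > THRESHOLD_PCT:
            regressions[op_name] = deviation_pct
    return regressions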

Future Development and Ideas

  1. Integration with the MXNet Profiler to capture time and memory usage.
  2. Integrate with property-based testing libraries like Hypothesis to randomly generate test cases with different tensor shapes and inputs.

Development Plan / Milestones

Phase 1

  1. The ~150 most commonly used operators will be tested on CPU (with and without MKL) and GPU, with FP32 and FP64. See Appendix 1 for the list of operators.
  2. Operators will be tested with the NDArray and Gluon interfaces only, i.e., the Symbol interface is not used for testing owing to its planned deprecation.
  3. The Python interface is used - it is the fastest way to get a check in place.
  4. Only timing is measured to start with.
  5. Statistics - mean of the metric.

Phase 2

  1. Cover the remaining operators left out of Phase 1.
  2. Support memory performance measurements.
  3. Integrate with the MXNet Profiler to capture time and memory metrics.
  4. Add more statistics - p50, p90, p99 (see the sketch below).
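
A minimal sketch of how these statistics could be computed from per-run timings; the run_times values here are synthetic and purely illustrative.

Code Block
languagepy
import numpy as np

# Per-run timings would come from the benchmark runs; random data here for illustration.
run_times = np.random.uniform(0.014, 0.020, size=50)

stats = {"mean": float(np.mean(run_times)),
         "p50": float(np.percentile(run_times, 50)),
         "p90": float(np.percentile(run_times, 90)),
         "p99": float(np.percentile(run_times, 99))}
print(stats)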

Phase 3

  1. Explore and add C++ performance tests for the most commonly used operators. This will give measurements closer to the true kernel cost compared to using the Python interface.
  2. Integrate with property-based testing libraries like Hypothesis to randomly generate test cases with different tensor shapes and inputs.

Current Status

See this repo for more details - https://github.com/sandeep-krishnamurthy/dl-operator-benchmark

  1. 134 operators are supported:
    1. All Gluon layers - Activation, Loss, Normalization, basic layers such as Dense, Convolutions, and Recurrent (RNN, LSTM, GRU).
    2. NDArray operators such as creation, random sampling, arithmetic, logical, and comparison operators.
  2. Able to run individual operator benchmarks or use high-level drivers to run all tests.
  3. Able to generate results as JSON.
  4. Timing metrics - forward only, and forward + backward.

Alternate Solutions

Alternate Solution 1 - Use Python Classes for each Operator instead of Config

Approach

  1. This benchmark utility will be built on top of MXNet's NDArray and Gluon interfaces.
  2. For each operator in NDArray and Gluon Block, there will be a corresponding benchmarking operator in the library with a list of default inputs and functionality to process results. See the Add operator benchmark example below.
  3. High-level drivers are provided to run operator benchmarks in bulk. Example: run_all_mxnet_operator_benchmarks(), run_all_arithmetic_operations_benchmarks(), etc.
  4. Results can be generated as a Python dictionary/JSON/CSV for upstream system (e.g., CI, Automated Performance Monitoring System) consumption.
Code Block
languagepy
class Add(MXNetOperatorBenchmarkBase):
    """Helps to Benchmark Tensor Add operation.

    By default benchmark both forward and backward element_wise tensor addition
    of 1024*1024 tensor of precision - 'float32'.

    """

    def __init__(self, ctx=mx.cpu(), warmup=10, runs=50, inputs=None):
        # Set the default Inputs
        default_parameters = {"lhs": (1024, 1024),
                              "rhs": (1024, 1024),
                              "initializer": nd.normal,
                              "run_backward": True,
                              "dtype": "float32"}

        super().__init__(ctx=ctx, warmup=warmup, runs=runs, default_parameters=default_parameters,
                         custom_parameters=inputs)

        self.lhs = get_mx_ndarray(ctx=self.ctx, in_tensor=self.inputs["lhs"],
                                  dtype=self.inputs["dtype"],
                                  initializer=self.inputs["initializer"],
                                  attach_grad=self.inputs["run_backward"])
        self.rhs = get_mx_ndarray(ctx=self.ctx, in_tensor=self.inputs["rhs"],
                                  dtype=self.inputs["dtype"],
                                  initializer=self.inputs["initializer"],
                                  attach_grad=self.inputs["run_backward"])

    def run_benchmark(self):
        # Warm up, ignore execution time value
        _, _ = nd_forward_backward_and_time(F=nd.add, runs=self.warmup, lhs=self.lhs, rhs=self.rhs)
        # Run Benchmarks
        exe_time, _ = nd_forward_backward_and_time(F=nd.add, runs=self.runs, lhs=self.lhs, rhs=self.rhs)

        self.results["MX_Add_Forward_Backward_Time"] = exe_time / self.runs

API / User Experience

We can define two types of users of the library and describe the API interface for each of them.

  1. General User, Automated Nightly tests
    1. Run benchmarks on all the operators or on specific categories of operators, using the default inputs provided by the library.
  2. Power User, PR validation tests
    1. Run benchmarks on specific operators, optionally with customized inputs.

USE CASE 1 - Run benchmarks for all the operators

A driver runs benchmarks for all MXNet operators (NDArray and Gluon) with default inputs and saves the final result as JSON to the provided file.

Code Block
python dl-operator-benchmark/run_all_mxnet_operator_benchmarks.py --output-format json --output-file mxnet_operator_benchmark_results.json

Other Driver Script CLI Options:

  1. output-format : json, md (markdown), or csv.
  2. ctx : By default, cpu on a CPU machine and gpu(0) on a GPU machine. You can override this and set the global context for all operator benchmarks. Example: --ctx gpu(2).
  3. dtype : By default, float32. You can override this and set the global dtype for all operator benchmarks. Example: --dtype float64.
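
For example, a run that overrides the context, dtype, and output format might look like the following (the flag values are illustrative):

Code Block
python dl-operator-benchmark/run_all_mxnet_operator_benchmarks.py --ctx gpu(0) --dtype float64 --output-format md --output-file mxnet_operator_benchmark_results.md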

USE CASE 2 - Run benchmarks for all the operators in a specific category

For example, if you want to run benchmarks for all NDArray arithmetic operators, the library provides drivers to easily run benchmarks on operators of specific categories.

Code Block
languagepy
from mxnet_benchmarks.nd import run_all_arithmetic_operations_benchmarks
# Run all Arithmetic operations benchmarks with default input values
run_all_arithmetic_operations_benchmarks()

Output for the above benchmark run, on a CPU machine, would look something like below:

Code Block
MX_Add_Forward_Backward_Time - 0.015201 seconds
MX_Multiply_Forward_Backward_Time - 0.021678 seconds
MX_Subtract_Forward_Backward_Time - 0.016154 seconds
MX_Divide_Forward_Backward_Time - 0.024327 seconds
MX_Modulo_Forward_Backward_Time - 0.045726 seconds
MX_Power_Forward_Backward_Time - 0.077152 seconds
MX_Negative_Forward_Backward_Time - 0.014472 seconds
MX_Inplace_Add_Forward_Time - 0.003824 seconds
MX_Inplace_Subtract_Forward_Time - 0.004137 seconds
MX_Inplace_Multiply_Forward_Time - 0.006589 seconds
MX_Inplace_Division_Forward_Time - 0.003869 seconds
MX_Inplace_Modulo_Forward_Time - 0.018180 seconds

USE CASE 3 - Power User - Run benchmarks for a specific operator

As a power user, if you want to run benchmarks for the nd.add operator in MXNet, you just run the following Python script.
Note that we maintain the same name and spec as the underlying MXNet operator. For example, to benchmark nd.add, we use mxnet_benchmarks.nd.Add().

USE CASE 3.1 - Default Inputs for Operators

Code Block
from mxnet_benchmarks.nd import Add
# Run the Add operator benchmark with default input values
add_benchmark = Add()
add_benchmark.run_benchmark()
add_benchmark.print_benchmark_results()

Output for the above benchmark run, on a CPU machine, would look something like below:

Code Block
MX_Add_Forward_Backward_Time - 0.015201 seconds

USE CASE 3.2 - Customize Inputs for Operators

As a power user, let us assume you want to run benchmarks on a float64 tensor instead of the default float32.
NOTE: Similarly, you could also specify the input tensors to use for benchmarking.

Code Block
languagepy
from mxnet_benchmarks.nd import Add
# Run the Add operator benchmark with float64 inputs
add_benchmark = Add(inputs={"dtype": "float64"})
add_benchmark.run_benchmark()
add_benchmark.print_benchmark_results()

Output for the above benchmark run, on a CPU machine, would look something like below:


Code Block
MX_Add_Forward_Backward_Time - 0.025405 seconds


NOTE: You can print the input parameters used for a benchmark as shown below.

Code Block
from mxnet_benchmarks.nd import Add
# Create the benchmark with custom inputs and inspect the parameters it will use
add_benchmark = Add(inputs={"dtype": "float64"})
print(add_benchmark.inputs)

Output


Code Block
{'lhs': (1024, 1024), 'rhs': (1024, 1024), 'initializer': <function normal at 0x117b607b8>, 'run_backward': True, 'dtype': 'float64'}

Pros

  1. More control for users - default inputs, random inputs, specific user defined inputs.
  2. Deterministic and better suited for performance benchmarks, reproducibility and CI integration.
  3. With Python interface:
    1. Easy to maintain and develop.
    2. Reflects the performance as seen by users (the majority of users use the Python interface).
    3. Fastest way to get performance tests in place. We do not have any such tests in place today.
    4. Ability to run and compare against benchmarks from other deep learning frameworks.
  4. Extensible:
    1. Can be integrated with property-based testing libraries like Hypothesis to randomly generate test cases with different tensor shapes.

Cons

  1. Requires writing a benchmark test for every new operator. If a new operator is added to MXNet, a new performance test class with default inputs for that operator needs to be added to this library.
  2. Ideally, performance should be captured close to the kernel. Calls through the Python operator APIs may hide performance regressions when the operator computation is small.

Alternate Solution 2 - Autogenerate test with Property Based Testing Technique

(Credits - Thanks to Pedro Larroy for this suggestion)

Approach

  1. Automatically query all operators registered with the MXNet engine.
  2. Infer the inputs and outputs for the operators.
  3. Use property-based testing techniques and a library such as Hypothesis to generate random inputs and run the tests, as sketched below.
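
A minimal sketch of this idea, assuming Hypothesis drives an Add benchmark with randomly generated shapes; the function name, parameter ranges, and benchmark body below are illustrative assumptions, not part of the proposal.

Code Block
languagepy
from mxnet import nd
from hypothesis import given, settings, strategies as st

# Strategy for random 2-D shapes; lhs and rhs share the same shape in this sketch
# so that element-wise addition is always valid.
shapes = st.tuples(st.integers(min_value=1, max_value=1024),
                   st.integers(min_value=1, max_value=1024))


@settings(max_examples=20, deadline=None)
@given(shape=shapes)
def benchmark_add_with_random_shape(shape):
    lhs = nd.random.normal(shape=shape)
    rhs = nd.random.normal(shape=shape)
    out = nd.add(lhs, rhs)
    out.wait_to_read()  # force execution; timing hooks would go here


benchmark_add_with_random_shape()

Note that even this simple case needs a constraint (a shared shape) to keep the inputs valid, which illustrates the custom-strategy concern listed in the Cons below.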

Pros

  1. Any new operator added to MXNet will be automatically queried; hence, there is no need to write tests explicitly for every operator.
  2. Inputs are randomly generated; hence, this is better suited to catching performance regressions on corner cases.

Cons

  1. Non-deterministic inputs make this better suited for functionality testing; it will be hard to use this technique for performance tests.
  2. Still requires us to write many custom strategies or conditional property files. Example:
    1. For testing the Add operator, we need to set conditions on the inputs to generate the same or broadcastable shapes for lhs and rhs.
    2. For the Convolution operator, we need to match Kernel, Padding and other parameter shapes appropriately.
  3. Querying operators and inferring the input conditions may require hard and complex logic.
    1. Example: add is an operator that takes 2 input tensors - lhs, rhs. We need to infer that the lhs and rhs tensors should be of the same shape or broadcastable. The logic to handle such conditions may soon become complex enough to negate the advantage of auto-generated operator benchmarks.
    2. MXNet currently does not support a standard way of querying the registered operators. It would be ideal if MXNet could expose NNVM APIs for querying registered operators and their expected inputs, outputs, types and more.
  4. This approach is complex and time consuming, and we do not have any operator performance tests for MXNet today. It would be ideal to revisit this approach as a future enhancement.

Alternate Solution 3 - Extend existing unit tests to cover performance parameters

<To add more details> In summary, it is hard and complex to modify all unit tests to measure performance in addition to what they are currently designed for - consistency across contexts, correctness, and gradient checks.

Appendix

    1. Our objective is to capture the performance at the basic individual operator level.
    2. The Symbol API is planned to be deprecated soon for users.
    3. Users currently use NDArray operations or Gluon layers in imperative (NDArray) or hybrid (symbolic) mode.
    4. In Phase 2, we will benchmark Gluon hybrid layers individually, which should cover the symbolic operations exposed to users.
    5. Also, under the hood, the kernel is the same for NDArray and Symbol; hence, we are not missing any tests.

...

Phase 1

Functionality supported:

...