
WIP staging repo - https://github.com/sandeep-krishnamurthy/dl-operator-benchmark

Link to dev list discussion

<TODO>

Feature Shepherd

Lin Yuan (https://github.com/apeforest)

Problem Statement

A deep learning framework like MXNet supports hundreds of operators (~250). Benchmarking and profiling a standard neural network use-case, such as ResNet-50 based image classification, is not sufficient to maintain the health and performance of all supported operators under different settings (hardware, accelerator, data, etc.). We need an easy-to-use library/framework to benchmark and profile each operator individually. Operator-level benchmarks help us gain a fine-grained understanding of operator performance under different settings (hardware, accelerator, data, etc.), run automated CI/CD performance tests, plan performance optimization, and more. In this document, we present a library for MXNet operator benchmarking and profiling.

Motivation

A deep learning framework like MXNet supports hundreds of operators (~250). Some operators are used as layers in a neural network (ex: Conv2D), some operators combine to form a layer in a neural network (ex: dot, sum => Dense), and many more are used independently outside a neural network (ex: tensor creation/shape change/indexing, logical operators), mostly for data processing and tensor manipulation.

Operators are highly heterogeneous w.r.t. supported precision (fp32, fp64, int64 etc.), accelerators (MKL-DNN, CUDA, cuDNN, MXNet native only), and behavior that depends on the data (ex: broadcast sum behaves differently on a large square (1024, 1024) tensor than on a skewed (10, 10000) tensor). Below are a few reasons why we believe operator benchmarks are useful:

  1. Users use many operators that are not part of a standard network like ResNet. Example: tensor manipulation operators like mean, max, topk, argmax, sort etc.
  2. A standard network architecture like ResNet-50 is made up of many operators, ex: Convolution2D, Softmax, Dense, Pooling etc. Observing only end-to-end performance can hide individual operator regressions for a long time.
  3. We need to know how different operators perform on different hardware infrastructure (ex: CPU with MKL-DNN, GPU with NVIDIA CUDA and cuDNN). With these details, we can plan optimization work at the operator level, which can significantly boost end-to-end performance.
  4. Operator behavior varies based on the data load (see the timing sketch after this section):
    1. For example, MXNet's reduction operations work seamlessly on a balanced tensor like (1024, 1024); however, performance behavior changes when the input tensor is skewed (1024, 10). Similar observations can be made when comparing Int32 vs. Int64 indexing of tensors.
    2. See issue #14725, which describes a performance regression in the FC layer backward pass with CUDA 10 depending on input tensor shape - https://github.com/apache/incubator-mxnet/issues/14725#issuecomment-486016229
  5. We want nightly performance tests across all operators in a deep learning framework to catch regressions early.
  6. We can integrate this framework with a CI/CD system to run per-operator performance tests for PRs. Ex: when a PR modifies the kernel of TransposeConv2D, we can run benchmarks of the TransposeConv2D operator to verify performance.

Hence, in this utility, we build the functionality that allows users and developers of deep learning frameworks to easily run benchmarks for individual operators across varying settings.
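To make point 4 above concrete, here is a minimal, self-contained timing sketch (not part of the proposed library; time_op is a hypothetical helper) comparing a reduction on a balanced tensor versus a skewed tensor of the same element count:

import time
from mxnet import nd

def time_op(op, data, runs=50, warmup=10):
    # MXNet executes asynchronously; wait_to_read() blocks until the result
    # is ready, which makes the wall-clock timings meaningful.
    for _ in range(warmup):
        op(data).wait_to_read()
    start = time.time()
    for _ in range(runs):
        op(data).wait_to_read()
    return (time.time() - start) / runs

balanced = nd.random.normal(shape=(1024, 1024))
skewed = nd.random.normal(shape=(16, 65536))  # same element count, skewed shape
print("sum on balanced tensor: {:.6f} seconds".format(time_op(nd.sum, balanced)))
print("sum on skewed tensor  : {:.6f} seconds".format(time_op(nd.sum, skewed)))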

Requirements

  1. Benchmarks for Apache MXNet operators.
  2. Benchmarks for common operator combinations (fused operators), ex: Conv + Relu, Conv + BatchNorm.
  3. Individual operator benchmarks to capture - time for operator execution (speed), memory usage.
  4. Fine grained individual operator benchmarks to capture - time for forward pass, time for backward pass and both.
  5. Ability to run operator benchmarks with default inputs or customize with user specific inputs.
  6. Ability to run operator benchmarks on CPU/GPU with different flavors of MXNet (mxnet-mkl, mxnet-cu90mkl etc.)
  7. Benchmarks for operators with varying inputs to uncover any performance issues due to skewed input data. Ex: measuring operator performance on small and large input tensors, along with typical tensor sizes.
  8. Ability to run one, group or all operator benchmarks.
  9. Ability to extract results in multiple usable formats - Python dictionary, JSON, CSV, MD (see the export sketch below).
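To illustrate requirement 9, here is a minimal, hypothetical export sketch built on the get_benchmark_results() API shown later in this document (the actual driver-level export helpers may differ):

import json

from mxnet_benchmarks.nd import Add

# Run one benchmark and export its results (a plain Python dictionary) as JSON.
add_benchmark = Add()
add_benchmark.run_benchmark()
results = add_benchmark.get_benchmark_results()
print(json.dumps(results, indent=2))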

Design Tenets

  1. Defaults => Common use cases should be extremely easy, customized complex use cases should be possible.
    1. Example: I should be able to run Add operator benchmarks without specifying any inputs, and the library should provide benchmarks on valid default inputs. At the same time, as a power user, I should be able to provide my own inputs, such as tensor shapes and context, to run the benchmarks.
  2. Minimum Learning Curve => Keep APIs the same as, or close to, the native NDArray / Gluon operators being benchmarked.
    1. Example: If I am benchmarking the nd.add(lhs, rhs) operator, the interface in the benchmark utility should be similar, with zero learning curve.
  3. Modular and Reusable
  4. Usable by a programmer or an automated system
    1. Example: a developer using the library directly, or integration with CI/CD

Proposed Approach

  1. This benchmark library will be built on top of MXNet's ND and Gluon interface.
  2. For each operator in ND and Gluon Block, there will be a corresponding benchmarking class in the library, with default inputs and functionality to process results. See the example below for Add operator benchmarks.
  3. High-level drivers are provided to run operator benchmarks in bulk. Example: run_all_mxnet_operator_benchmarks(), run_all_arithmetic_operations_benchmarks() etc.
  4. Results can be generated as a python dictionary/JSON/CSV for upstream system (Ex: CI, Automated Performance Monitoring System) consumption.
from abc import ABC, abstractmethod

import mxnet as mx
from mxnet import nd

class MXNetOperatorBenchmarkBase(ABC):
    """Abstract base class for all MXNet operator benchmarks."""

    def __init__(self, ctx=mx.cpu(), warmup=10, runs=10, default_parameters=None, custom_parameters=None):
        self.ctx = ctx
        self.runs = runs
        self.warmup = warmup
        self.results = {}
        # Merge user-provided parameters over the operator's defaults.
        self.inputs = prepare_input_parameters(caller=self.__class__.__name__,
                                               default_parameters=default_parameters or {},
                                               custom_parameters=custom_parameters)

    @abstractmethod
    def run_benchmark(self):
        pass

    def print_benchmark_results(self):
        if not self.results:
            print("No benchmark results found. Run the benchmark before printing results!")
            return

        for key, val in self.results.items():
            print("{} - {:.6f} seconds".format(key, val))

    def get_benchmark_results(self):
        return self.results

class Add(MXNetOperatorBenchmarkBase):
    """Helps to benchmark the tensor Add operation.

    By default, benchmarks both forward and backward element-wise tensor addition
    on (1024, 1024) tensors of precision 'float32'.
    """

    def __init__(self, ctx=mx.cpu(), warmup=10, runs=50, inputs=None):
        # Set the default Inputs
        default_parameters = {"lhs": (1024, 1024),
                              "rhs": (1024, 1024),
                              "initializer": nd.normal,
                              "run_backward": True,
                              "dtype": "float32"}

        super().__init__(ctx=ctx, warmup=warmup, runs=runs, default_parameters=default_parameters,
                         custom_parameters=inputs)

        self.lhs = get_mx_ndarray(ctx=self.ctx, in_tensor=self.inputs["lhs"],
                                  dtype=self.inputs["dtype"],
                                  initializer=self.inputs["initializer"],
                                  attach_grad=self.inputs["run_backward"])
        self.rhs = get_mx_ndarray(ctx=self.ctx, in_tensor=self.inputs["rhs"],
                                  dtype=self.inputs["dtype"],
                                  initializer=self.inputs["initializer"],
                                  attach_grad=self.inputs["run_backward"])

    def run_benchmark(self):
        # Warm up, ignore execution time value
        _, _ = nd_forward_backward_and_time(F=nd.add, runs=self.warmup, lhs=self.lhs, rhs=self.rhs)
        # Run Benchmarks
        exe_time, _ = nd_forward_backward_and_time(F=nd.add, runs=self.runs, lhs=self.lhs, rhs=self.rhs)

        self.results["MX_Add_Forward_Backward_Time"] = exe_time / self.runs
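The code above references three helper utilities from the staging repo (prepare_input_parameters, get_mx_ndarray, nd_forward_backward_and_time). For readability, here is a minimal sketch of what they might look like; the actual implementations in the repo may differ:

import time

from mxnet import autograd, nd

def prepare_input_parameters(caller, default_parameters, custom_parameters=None):
    # Start from the operator's defaults and apply any user overrides.
    parameters = dict(default_parameters)
    if custom_parameters is not None:
        unknown = set(custom_parameters) - set(parameters)
        if unknown:
            raise ValueError("Invalid parameters {} for {}".format(unknown, caller))
        parameters.update(custom_parameters)
    return parameters

def get_mx_ndarray(ctx, in_tensor, dtype, initializer, attach_grad=True):
    # Create a tensor of the given shape using the initializer (ex: nd.normal).
    tensor = initializer(shape=in_tensor, dtype=dtype, ctx=ctx)
    if attach_grad:
        tensor.attach_grad()  # needed to run the backward pass
    return tensor

def nd_forward_backward_and_time(F, runs, **kwargs):
    # Time `runs` forward + backward executions of the NDArray operator F.
    start = time.time()
    for _ in range(runs):
        with autograd.record():
            res = F(**kwargs)
        res.backward()
        nd.waitall()  # MXNet is asynchronous; block until all work completes
    return time.time() - start, res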

Addition of new APIs/Modules

We propose to add this library as a new module (opperf) under incubator-mxnet/benchmark as "incubator-mxnet/benchmark/opperf".

API / User Experience

We can define two types of users of the library and describe the API interface for each.

  1. General user, automated nightly tests
    1. Run benchmarks on all the operators or on specific categories of operators, using the default inputs provided by the library.
  2. Power user, PR validation tests
    1. Run benchmarks for specific operators, optionally with custom inputs such as tensor shapes, dtype, and context.

USE CASE 1 - Run benchmarks for all the operators

A driver to run benchmarks for all MXNet operators (NDArray and Gluon) with default inputs, saving the final result as JSON to the provided file.

python dl-operator-benchmark/run_all_mxnet_operator_benchmarks.py --output-format json --output-file mxnet_operator_benchmark_results.json

Other driver script CLI options:

  1. output-format : json, md (markdown), or csv.
  2. ctx : By default, cpu on a CPU machine and gpu(0) on a GPU machine. You can override the global context for all operator benchmarks. Example: --ctx gpu(2).
  3. dtype : By default, float32. You can override the global dtype for all operator benchmarks. Example: --dtype float64.
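For example, a run combining these options (targeting GPU 0, float64 precision, CSV output) might look like:

python dl-operator-benchmark/run_all_mxnet_operator_benchmarks.py --ctx gpu(0) --dtype float64 --output-format csv --output-file mxnet_operator_benchmark_results.csv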

USE CASE 2 - Run benchmarks for all the operators in a specific category

For example, if you want to run benchmarks for all NDArray arithmetic operators, the library provides drivers to easily run benchmarks on operators of a specific category.

from mxnet_benchmarks.nd import run_all_arithmetic_operations_benchmarks
# Run all Arithmetic operations benchmarks with default input values
run_all_arithmetic_operations_benchmarks()

Output for the above benchmark run, on a CPU machine, would look something like below:

MX_Add_Forward_Backward_Time - 0.015201 seconds
MX_Multiply_Forward_Backward_Time - 0.021678 seconds
MX_Subtract_Forward_Backward_Time - 0.016154 seconds
MX_Divide_Forward_Backward_Time - 0.024327 seconds
MX_Modulo_Forward_Backward_Time - 0.045726 seconds
MX_Power_Forward_Backward_Time - 0.077152 seconds
MX_Negative_Forward_Backward_Time - 0.014472 seconds
MX_Inplace_Add_Forward_Time - 0.003824 seconds
MX_Inplace_Subtract_Forward_Time - 0.004137 seconds
MX_Inplace_Multiply_Forward_Time - 0.006589 seconds
MX_Inplace_Division_Forward_Time - 0.003869 seconds
MX_Inplace_Modulo_Forward_Time - 0.018180 seconds

USE CASE 3 - Power user - Run benchmarks for a specific operator

As a power user, if you want to run benchmarks for the nd.add operator in MXNet, you just run the following Python script.
Note that we maintain the same name and spec as the underlying MXNet operator. For example, to benchmark nd.add, we use mxnet_benchmarks.nd.Add().

USE CASE 3.1 - Default Inputs for Operators

from mxnet_benchmarks.nd import Add
# Run the Add operator benchmark with default input values
add_benchmark = Add()
add_benchmark.run_benchmark()
add_benchmark.print_benchmark_results()

Output for the above benchmark run, on a CPU machine, would look something like below:

MX_Add_Forward_Backward_Time - 0.015201 seconds

USE CASE 3.2 - Customize Inputs for Operators

As a power user, let us assume you want to run benchmarks on a float64 tensor instead of the default float32.
NOTE: Similarly, you could also specify the input tensors to use for benchmarking.

from mxnet_benchmarks.nd import Add
# Run the Add operator benchmark with a custom dtype (float64)
add_benchmark = Add(inputs={"dtype": "float64"})
add_benchmark.run_benchmark()
add_benchmark.print_benchmark_results()

Output for the above benchmark run, on a CPU machine, would look something like below:

MX_Add_Forward_Backward_Time - 0.025405 seconds

NOTE: You can print the input parameters used for a benchmark as shown below.

from mxnet_benchmarks.nd import Add
# Create the Add operator benchmark with custom inputs and print them
add_benchmark = Add(inputs={"dtype": "float64"})
print(add_benchmark.inputs)

Output


{'lhs': (1024, 1024), 'rhs': (1024, 1024), 'initializer': <function normal at 0x117b607b8>, 'run_backward': True, 'dtype': 'float64'}

Development Plan / Milestones

Current Status

See this repo for more details - https://github.com/sandeep-krishnamurthy/dl-operator-benchmark

  1. 134 operators are supported:
    1. All Gluon layers - Activation, Loss, Normalization, basic layers like Dense, Convolutions, Recurrent (RNN, LSTM, GRU)
    2. NDArray operators - creation, random sampling, arithmetic, logical, comparison etc.
  2. Timing metrics - forward only, and forward + backward.

Development Plan

We will be working in 3 Phases as described below.

In Phase 1, we will target benchmarking all important and commonly used MXNet NDArray and Gluon operators. We define "important and commonly used" as: operators that are part of image, text, and graph data-processing operations, and operators that are part of standard network architectures such as ResNet, BERT, and so on. Phase 1 covers NDArray operations and Gluon Blocks, i.e., all imperative operator benchmarks.

NOTE: Why not Symbol mode in Phase 1?

    1. Our objective is to capture performance at the basic individual operator level.
    2. The Symbol API is planned to be deprecated soon for users.
    3. Users currently use NDArray operations, or use Gluon layers in imperative (NDArray) or hybrid (symbolic) mode.
    4. In Phase 2 we will benchmark Gluon hybrid layers individually, which covers the symbolic operations exposed to users (see the hybridize sketch below).
    5. Also, under the hood, the kernel is the same for NDArray and Symbol; hence, we are not missing any tests.
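To illustrate point 4, here is a minimal sketch (using a Dense layer as an assumed example) of how hybridizing a Gluon block exercises the symbolic execution path:

import mxnet as mx
from mxnet import nd
from mxnet.gluon import nn

block = nn.Dense(256)
block.initialize(ctx=mx.cpu())
block.hybridize()  # subsequent calls run through the cached symbolic graph

data = nd.random.normal(shape=(32, 512))
out = block(data)  # first call builds the graph; later calls execute symbolically
nd.waitall()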


In Phase 2, we will cover the remaining operators left out of Phase 1 and also include Gluon hybrid blocks, which indirectly covers the Symbol APIs of prominent operators.

In Phase 3, based on requirements, we will add equivalent PyTorch operator benchmarks to provide a neutral baseline for the MXNet operator benchmarks.

We will run benchmarks on CPU (C5.18x) and GPU (P3.2x) instances. For GPU, we will use GPU(0) only. In Phase 1, we will run with FP32 and dense tensors only. In Phase 2, we will benchmark other precisions (INT64, FP64, etc.) and add support for sparse tensors.

Phase 1

  1. Hardware
    1. CPU - C5.18X
    2. GPU - P3.2X (Single GPU)
  2. NDArray Operations (Precision - FP32)
    1. Conversion Operations - Copy, CopyTo, as_in_context, asnumpy, asscalar, astype
    2. Creation Operations - zeros, zeros_like, ones, ones_like, full, arange
    3. Shape (view) change Operations - Transpose (T), shape_array, size_array, reshape, reshape_like, flatten, expand_dims, split, diag, tile, pad
    4. Reduction Operations - sum, nansum, prod, nanprod, mean, max, min, norm
    5. Sorting and Searching Operations - sort, argsort, topk, argmax, argmin, argmax_channel
    6. Arithmetic Operations - add, sub, neg, mul, div, mod, pow
    7. Inplace Arithmetic Operations - iadd (+=), isub (-=), imul (*=), idiv (/=), imod (%=)
    8. Comparison Operations - lesser, lesser_equal, greater, greater_equal, equal, not_equal
    9. Indexing Operations - get_item (x[i]), set_item (x[i]=), slice, slice_axis, take, batch_take, pick, one_hot
    10. Exponents and Logarithms - exp, log
    11. Powers Operations - sqrt, square
    12. Join and Split Operations - concat, split, stack
    13. GEMM - dot, batch_dot
    14. Random Sampling - normal, poisson, uniform, random, randn, randint, shuffle
    15. Others - clip, where, abs
    16. Neural Network Operations - covered under Gluon Layers below
  3. Gluon Layers (Neural Network Operations) - Mode: Imperative (Hybrid will be added in the next phase)
    1. Basic - Dense, Lambda, Flatten, Embedding, Dropout, BatchNorm
    2. Convolutions - Conv1D, Conv2D, Conv1DTranspose, Conv2DTranspose
    3. Pooling - MaxPool1D, MaxPool2D, AvgPool1D, AvgPool2D, GlobalMaxPool1D, GlobalMaxPool2D, GlobalAvgPool1D, GlobalAvgPool2D
    4. Activations - LeakyRelu, PRelu, Sigmoid, Softmax, Log_Softmax, Activation
    5. Recurrent Cells - RNNCell, LSTMCell, GRUCell, RecurrentCell, SequentialRNNCell, BiDirectionalCell
    6. Loss - L1Loss, L2Loss, SigmoidBinaryCrossEntropyLoss, SoftmaxCrossEntropyLoss, KLDivLoss, HuberLoss, HingeLoss, SquaredHingeLoss, LogisticLoss, TripletLoss, CTCLoss
  4. Custom Operator Benchmark

Phase 2

  1. Other DataTypes
    1. INT64, INT8, FP64, FP16
  2. NDArray Operations
    1. Sparse - tostype
    2. View Operations - swapaxes, flip, depth_to_space, space_to_depth
    3. Rounding Operations - round, rint, fix, floor, ceil, trunc
    4. Trigonometric Operations - sin, cos, tan, arcsin, arccos, arctan, degrees, radians, sinh, cosh, tanh, arcsinh, arccosh, arctanh
    5. Exponent and Logarithmic Operations - expm1, log10, log2, log1p
    6. Logical Operations - logical_and, logical_or, logical_xor, logical_not
    7. Powers Operations - rsqrt, cbrt, rcbrt, reciprocal
    8. Random Operations - exponential, gamma, generalized_negative_binomial, multinomial, negative_binomial
    9. Sequence Operations - SequenceLast, SequenceMask, SequenceReverse
    10. Others - unravel_index, ravel_multi_index
  3. Gluon - Mode: Hybrid mode for all layers covered in Phase 1, plus additional coverage of layers as below.
    1. Fused Operators - Conv + Relu, Conv + BatchNorm (more to be added when we start this work)
    2. Basic - HybridLambda, InstanceNorm, LayerNorm
    3. Convolutions - Conv3D, Conv3DTranspose
    4. Pooling - MaxPool3D, AvgPool3D, GlobalMaxPool3D, GlobalAvgPool3D
    5. Activations - Elu, Selu, Swish
    6. Recurrent - ZoneOutCell, ResidualCell, DropoutCell
    7. Loss - CosineEmbeddingLoss, PoissonNLLoss
    8. Important Contrib Layers
  4. Other Items to be explored
    1. Image APIs
    2. Data APIs
    3. Metric APIs
    4. Initializers and Optimizers

Phase 3

Benchmark PyTorch operators to provide a neutral baseline for comparing MXNet operator performance. This is yet to be discussed and finalized.
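As an illustration only (the PyTorch baseline design is not yet finalized), a comparable add benchmark in PyTorch might look like the sketch below:

import time

import torch

lhs = torch.randn(1024, 1024, dtype=torch.float32, requires_grad=True)
rhs = torch.randn(1024, 1024, dtype=torch.float32, requires_grad=True)

def forward_backward_and_time(runs=50, warmup=10):
    # Time forward + backward element-wise addition; gradients accumulate in
    # lhs.grad / rhs.grad, which is harmless for timing purposes.
    for _ in range(warmup):
        res = lhs + rhs
        res.backward(torch.ones_like(res))
    start = time.time()
    for _ in range(runs):
        res = lhs + rhs
        res.backward(torch.ones_like(res))
    return (time.time() - start) / runs

print("PT_Add_Forward_Backward_Time - {:.6f} seconds".format(forward_backward_and_time()))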

FAQs

Q1) Why not use the check_speed(..) utility in MXNet test_utils?
A) It supports the Symbol API only and does not support benchmarking NDArray or Gluon Blocks. It is a lightweight symbol executor that expects users to create a symbol graph to execute, along with its inputs. The library proposed in this document is more sophisticated: it supports benchmarking operators with NDArray and Gluon Blocks, provides sensible default inputs, provides high-level drivers, lets users specify custom inputs, and prepares results in different formats - Python dictionary, CSV, JSON.

Q2) Why not Symbol execution? Why only NDArray and Gluon?
A) MXNet users are encouraged to use, and mainly use, the NDArray and Gluon APIs. The MXNet community is moving toward deprecating the Symbol API as work proceeds on numpy-compatible operators. To measure what our users actually use and observe, we propose benchmarking operators through NDArray and Gluon Blocks in this library.

