...

A deep learning framework like MXNet supports hundreds of operators (~250). Some operators are used directly as a layer in a neural network (ex: Conv2D), some operators work in combination to form a layer (ex: dot, sum => Dense), and many more are used independently outside a neural network (ex: tensor creation/shape change/indexing, logical operators), mostly for data processing and tensor manipulation.
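
To make the composition point concrete, here is a minimal sketch (assuming the Python mxnet package and its NDArray API; the shapes are arbitrary, and the bias addition uses broadcast_add as a stand-in for the "sum" step) showing a Dense-style computation expressed once through the single FullyConnected operator and once composed from independent operators:

```python
import mxnet as mx

# A batch of 4 samples with 8 features, and a Dense layer with 16 units.
x = mx.nd.random.uniform(shape=(4, 8))
w = mx.nd.random.uniform(shape=(16, 8))   # FullyConnected expects (num_hidden, num_input)
b = mx.nd.random.uniform(shape=(16,))

# Single operator backing the Dense layer.
dense_out = mx.nd.FullyConnected(data=x, weight=w, bias=b, num_hidden=16)

# The same result composed from independent operators (dot + broadcast add).
composed_out = mx.nd.broadcast_add(mx.nd.dot(x, w, transpose_b=True),
                                   b.reshape((1, 16)))

# Differences should be near zero.
print(mx.nd.abs(dense_out - composed_out).max())
```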

An operator is highly heterogeneous w.r.t. supported precision (fp32, fp64, int64, etc.), accelerators (MKLDNN, CUDA, cuDNN, MXNet native only), and behavior based on input data (ex: broadcast sum behaves differently on a large square (1024, 1024) tensor than on a skewed (10, 10000) tensor, and more). Below are a few areas where we believe operator benchmarks are useful:

  1. Users use many operators that are not part of a standard network like ResNet. Example: tensor manipulation operators like mean, max, topk, argmax, sort, etc.
  2. A standard network architecture like ResNet-50 is made up of many operators, ex: Convolution2D, Softmax, Dense, Pooling, etc. Observing only the end-to-end performance can hide individual operator regressions for a long time.
  3. We need to know how different operators perform on different hardware infrastructure (ex: CPU with MKLDNN, GPU with NVIDIA CUDA and cuDNN). With these details, we can plan optimization work at the operator level, which could significantly boost end-to-end performance.
  4. Operator behavior varies with the input data:
    1. For example, MXNet's reduction operations work seamlessly with a balanced tensor like (1024, 1024); however, performance behavior changes when the input tensor is skewed (1024, 10), as illustrated by the timing sketch after this list. Similar observations can be made when comparing Int32 vs. Int64 indexing of tensors.
    2. See issue #14725, which discusses a performance regression in the FC layer backward pass with CUDA 10 depending on the input tensor shape - https://github.com/apache/incubator-mxnet/issues/14725#issuecomment-486016229
  5. We want to have nightly performance tests across all operators in a deep learning framework to catch regressions early.
  6. We can integrate this framework with a CI/CD system to run per-operator performance tests for PRs. Ex: when a PR modifies the kernel of TransposeConv2D, we can run benchmarks of the TransposeConv2D operator to verify performance.
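
As a concrete illustration of point 4.a above, here is a minimal hand-rolled timing sketch (not the proposed benchmark utility itself; the shapes, repeat count, and the choice of mx.nd.sum as the reduction operator are illustrative assumptions) comparing the same operator on a balanced vs. a skewed tensor:

```python
import time
import mxnet as mx

def time_operator(op, data, runs=100):
    """Crudely average the runtime of op(data) over several runs."""
    op(data).wait_to_read()   # warm-up so one-time initialization is excluded
    start = time.time()
    for _ in range(runs):
        op(data)
    mx.nd.waitall()           # MXNet executes asynchronously; block until all work completes
    return (time.time() - start) / runs

balanced = mx.nd.random.uniform(shape=(1024, 1024))
skewed = mx.nd.random.uniform(shape=(1024, 10))

print("sum over (1024, 1024): %.6f sec/run" % time_operator(mx.nd.sum, balanced))
print("sum over (1024, 10)  : %.6f sec/run" % time_operator(mx.nd.sum, skewed))
```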

...