PR: https://github.com/apache/incubator-mxnet/pull/15210

...

How Custom Operators Work

MXNet allows users to create custom operators when the existing NDArray operators cannot meet their needs. However, profiling custom operators is currently not well supported.
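For reference, here is a minimal sketch of such a custom operator (the operator name `my_sigmoid` and the class names are made up for illustration). It is written in pure Python via `CustomOp`/`CustomOpProp`, and its `forward()`/`backward()` internally dispatch NDArray sub-operators:

```python
import mxnet as mx

class MySigmoid(mx.operator.CustomOp):
    def forward(self, is_train, req, in_data, out_data, aux):
        # Pure Python code that dispatches NDArray sub-operators.
        y = 1.0 / (1.0 + mx.nd.exp(-in_data[0]))
        self.assign(out_data[0], req[0], y)

    def backward(self, req, out_grad, in_data, out_data, in_grad, aux):
        y = out_data[0]
        self.assign(in_grad[0], req[0], out_grad[0] * y * (1.0 - y))

@mx.operator.register("my_sigmoid")
class MySigmoidProp(mx.operator.CustomOpProp):
    def __init__(self):
        super(MySigmoidProp, self).__init__(need_top_grad=True)

    def list_arguments(self):
        return ['data']

    def list_outputs(self):
        return ['output']

    def infer_shape(self, in_shape):
        # output shape matches input shape; no auxiliary states
        return [in_shape[0]], [in_shape[0]], []

    def create_operator(self, ctx, shapes, dtypes):
        return MySigmoid()

# Invoke it like any other operator:
x = mx.nd.uniform(shape=(100, 100))
y = mx.nd.Custom(x, op_type='my_sigmoid')
```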

...


![image](https://user-images.githubusercontent.com/16669457/59465512-ede5e200-8ddf-11e9-9416-cc53409263be.png)
Also, in CustomOperator’s Push(), a special callback named “CustomOperator” (now renamed to “Dummy_Wait”; we will use this name below) is pushed to the engine. The idea is that “CustomOperator” has dependencies on the custom operator, so it gets executed last to make sure the custom operator event spans the execution of both the pure Python code and the sub-operators.

Issues and Motivation

Given the above, the current profiler has several issues:

...

To avoid confusion, those issues need to be fixed.

New Design

...

Main changes

...

Regarding custom operators, users care about the performance of both the pure Python code and the sub-operators. So, in our enhanced custom operator profiling, we should dissect custom operator calls into fine-grained events for both categories, as sketched below. Specifically, we will create a new domain called “custom operators”, which will contain:

1. Events that represent the execution of the pure Python code.
2. Events that represent the execution of the sub-operators.
3. Different namespace prefixes for events from different custom operators.
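As a rough illustration of what this means for users, the profiler is driven exactly as before, and the new “custom operators” domain simply shows up in the dumped trace. The sketch below uses the standard profiler API; `my_sigmoid` is the hypothetical operator registered in the earlier sketch:

```python
import mxnet as mx
from mxnet import profiler

profiler.set_config(profile_all=True,
                    aggregate_stats=True,
                    filename='custom_op_profile.json')
profiler.set_state('run')

x = mx.nd.uniform(shape=(100, 100))
y = mx.nd.Custom(x, op_type='my_sigmoid')  # hypothetical custom op from above
y.wait_to_read()

profiler.set_state('stop')
# The trace file now contains a "custom operators" domain with events for
# the pure Python forward() as well as for its sub-operators.
print(profiler.dumps())
```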

...

I have created a pull request at: https://github.com/apache/incubator-mxnet/pull/15210.

More discussions

...

With this enhanced custom operator profiling, we also want to get rid of profiling “Dummy_Wait” entirely. This is done by adding a check in ProfileOperator in profiler.h.
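With that check in place, a dumped profile should no longer contain the dummy event. A quick sanity check from Python might look like this (assuming aggregate stats are enabled and the event would otherwise appear under the name “Dummy_Wait” in the dump):

```python
from mxnet import profiler

# ... run a workload containing custom operators, then stop the profiler ...
profiler.set_state('stop')

stats = profiler.dumps()
# Before this change the aggregate stats listed a "Dummy_Wait" entry;
# after it, the name should be gone from the output entirely.
assert 'Dummy_Wait' not in stats
```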

Notice that because we are adding a call to GenerateDisplayName() in PushAsync(), we risk adding overhead to every operator call (we need to get the thread ID, and the function holds a lock). In practice, because this function is short and has early-return checks, the overhead is negligible. On my machine (2017 MacBook Pro 13", i7), for regular operator calls the overhead averages less than 1 microsecond (it appears as 0). For sub-operator calls, the overhead is always below 10 microseconds and averages under 5 microseconds. Compare this to the ~150 microseconds taken by scalar addition on a 100×100 matrix. Note that this relatively larger overhead only applies to sub-operators of custom operators.
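For context, the ~150 microsecond baseline can be reproduced with a rough timing loop like the one below (numbers will of course vary by machine):

```python
import time
import mxnet as mx

x = mx.nd.ones((100, 100))
mx.nd.waitall()  # finish setup work before timing

n = 1000
start = time.perf_counter()
for _ in range(n):
    y = x + 1.0   # scalar addition on a 100x100 matrix
mx.nd.waitall()   # wait for all asynchronously enqueued ops to complete
per_call_us = (time.perf_counter() - start) / n * 1e6
print('average per call: %.1f us' % per_call_us)
```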

Visualization

Below is the *new visualization* after my change:
![image](https://user-images.githubusercontent.com/16669457/59465592-1b329000-8de0-11e9-8e86-8a3cb70cd7b9.png)
This is to be compared with the *old visualization*:
![image](https://user-images.githubusercontent.com/16669457/59465613-22f23480-8de0-11e9-9ffe-59e007187827.png)
The two screenshots are produced with the same code:
```python
import mxnet as mx
from mxnet import nd
from mxnet import profiler
from mxnet import context
import threading

...