Credit to Zhennan for this proposal.

Problem

As we know, running many low-computation OPs in one thread during inference is inefficient. Although data parallelism is used in MXNet, its performance is not good enough for the less computationally intensive operators in the inference stage, especially for small batch sizes. This phenomenon widely exists in many popular models, such as GoogLeNet, Wide & Deep and Inception v3. For example, in the Wide & Deep model, 26 embedding OPs are executed in sequence and each one consumes only very little computing resource. So the model-level performance is sub-optimal because of the long execution path through these low-parallelism operators.

Goals/Use Cases

The primary goal is to improve the performance of inefficient and independent OPs by parallelizing them at the OP level. In this proposal, only the situation where the OPs come to/from one OP is covered; other hierarchical patterns will be considered in the future.
Another goal is that this modification is transparent to users: it does not change existing scripts or models, and activating one environment variable makes it work, whether on CPU, GPU, etc.
All we need to do is add a pass for the current backend.

The approach can work for all backends by sharing the same subgraph code path. But in practice some adjustments to the interfaces and implementations are still needed, because of differences in the hardware mapping: the CPU can assign the OPs to different cores, while the GPU needs multiple streams. Thus, in the first step, only the CPU and MKL-DNN backends are enabled.
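As a sketch of how small the hook can be: MXNet's subgraph API already provides a registration macro and the MXNET_SUBGRAPH_BACKEND environment variable. The backend name PARALLEL_OP and the property class SgParallelOpProperty (described below) are assumed names from this proposal, not finalized ones.

    // Hypothetical registration of the new pass (names are placeholders):
    MXNET_REGISTER_SUBGRAPH_PROPERTY(PARALLEL_OP, SgParallelOpProperty);

    // Users would then activate it with one environment variable, e.g.:
    //   export MXNET_SUBGRAPH_BACKEND=PARALLEL_OP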

Proposed Approach


Figure 1. Example for parallel embedding

Take the Wide & Deep model for example: after the split, the data flow is divided into 26 streams and each of them is handled by a single embedding OP. In the ordinary process, these 26 embedding OPs are executed one by one when running inference, and data parallelism is only used inside each kernel function. Now we replace the 26 OPs with one parallel OP which runs inference with OP-level parallelism.



Figure 2. Flowchart for subgraph replace.

As Fig. 2 shows, we implement the whole workflow based on the subgraph API. SgParallelOpSelector, inherited from SubgraphSelector, is used to find the parallel structure, and SgParallelOpProperty, inherited from SubgraphProperty, connects its input/output entries.
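A minimal C++ sketch of the two classes follows. The base classes and virtual methods come from MXNet's subgraph API (src/operator/subgraph/subgraph_property.h); the selection logic is simplified for illustration (the real pass must also walk through the shared split/concat node to reach sibling OPs), the whitelist content is illustrative, and namespace qualifiers are omitted.

    #include <memory>
    #include <string>
    #include <unordered_set>
    #include <utility>
    #include <vector>
    #include "operator/subgraph/subgraph_property.h"

    class SgParallelOpSelector : public SubgraphSelector {
     public:
      explicit SgParallelOpSelector(std::unordered_set<std::string> whitelist)
          : whitelist_(std::move(whitelist)) {}

      // Only a thread-safe (whitelisted) OP may seed a parallel subgraph.
      bool Select(const nnvm::Node &n) override {
        return !n.is_variable() && whitelist_.count(n.op()->name) != 0;
      }
      // Grow with neighbours of the same OP type, so independent OPs of one
      // kind end up grouped in one subgraph.
      bool SelectInput(const nnvm::Node &n, const nnvm::Node &new_node) override {
        return !new_node.is_variable() && new_node.op() == n.op();
      }
      bool SelectOutput(const nnvm::Node &n, const nnvm::Node &new_node) override {
        return !new_node.is_variable() && new_node.op() == n.op();
      }
      // Filter(): see the sketch in the Filter discussion below.
      std::vector<nnvm::Node *> Filter(
          const std::vector<nnvm::Node *> &candidates) override;

     protected:
      std::unordered_set<std::string> whitelist_;
    };

    class SgParallelOpProperty : public SubgraphProperty {
     public:
      SubgraphSelectorPtr CreateSubgraphSelector() const override {
        // "Embedding" covers the Wide & Deep case discussed above.
        return std::make_shared<SgParallelOpSelector>(
            std::unordered_set<std::string>{"Embedding"});
      }
      // CreateSubgraphNode() would build the fused parallel OP node and
      // reconnect the original input/output entries (omitted in this sketch).
    };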

The key block in Fig. 2 is Filter, which is used to check whether the found parallel structure meets the metrics. It must guarantee that the operator is thread safe; otherwise, it may fail during simultaneous execution by multiple threads. From MKL-DNN 1.0 on, all MKL-DNN operators will be thread safe and can be executed in parallel, but for now we need to maintain a whitelist of thread-safe operators. There are some other conditions used to fine-tune the performance, such as requiring the number of paralleled nodes to be >= a threshold, since parallelizing too few nodes will cause a performance drop. An environment variable may be added so that users can add/remove whitelist entries in a future release.
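A sketch of what this Filter check could look like, overriding SubgraphSelector::Filter from the class sketch above; the threshold value and its name are assumptions for illustration, not tuned numbers:

    // Sketch: reject candidate structures that would not pay off.
    std::vector<nnvm::Node *> SgParallelOpSelector::Filter(
        const std::vector<nnvm::Node *> &candidates) {
      static const size_t kMinParallelNodes = 8;  // assumed tuning threshold
      // Too few parallel nodes cannot amortize the threading overhead.
      if (candidates.size() < kMinParallelNodes) return {};
      for (const nnvm::Node *n : candidates) {
        // Every OP in the structure must be on the thread-safe whitelist.
        if (whitelist_.count(n->op()->name) == 0) return {};
      }
      return candidates;  // accept the parallel structure
    }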


We implement parallel_op based on the subgraph API. The main body of the parallel OP forward function is accelerated by OMP multithreading, as shown in Figure 3. This means the original OP forward function must be thread safe. As mentioned in the Filter step above, the OP whitelist is used to check whether an OP is thread safe, and the whitelist can be extended or reduced in the future by setting environment variables. When doing inference, several operators run in parallel: in the Wide & Deep model, for example, the 26 embedding forward functions are called simultaneously. This OP-level parallelism improves performance considerably.


Figure 3. Main body of parallel OP forward.

To get the best performance, we would need to support nested OMP and fine-tune its parameters. In the current version, we simplify this by disabling nested OMP. An environment variable may be added to support fine-tuning the performance in a future release.
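For reference, the main body in Figure 3 boils down to an OMP loop like the sketch below. The SubOp type and function names are placeholders, not MXNet API; only the OMP pattern, and the disabling of nested OMP described above, is the point.

    #include <omp.h>
    #include <vector>

    struct SubOp {
      void Forward();  // wraps one child OP's forward call
    };

    // Sketch of the parallel OP forward body (cf. Figure 3).
    void ParallelOpForward(std::vector<SubOp> &sub_ops) {
      omp_set_nested(0);  // current version: nested OMP disabled
      const int n = static_cast<int>(sub_ops.size());  // e.g. 26 embeddings
      #pragma omp parallel for
      for (int i = 0; i < n; ++i) {
        // Each child's forward runs on its own thread, so it must be
        // thread safe -- hence the whitelist check in Filter.
        sub_ops[i].Forward();
      }
    }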

The planned support for parallel structures is as follows:

  1. Support structure as Fig. 1.
  2. Support structure as Fig. 4. In this figure, all OPs to be replaced have outputs going to one OP X.

    Figure 4. Replacing OPs that come to one OP

  3. Support structure as Fig. 5. In this figure, all OPs to be replaced have inputs coming from one OP X.

Figure 5. Replacing OPs that come from one OP
