
Problem

As is well known, running many low-computation OPs in a single thread during inference is inefficient. Although MXNet uses data parallelism, its performance is suboptimal because machine resources are not fully utilized. This phenomenon is widespread in many popular models, such as GoogLeNet, Wide & Deep, and Inception V3. In our Wide & Deep model, the 26 embedding OPs cost much more time than necessary due to per-OP overhead when running inference.

Goals/Usecases

The primary goal is to improve the performance of inefficient OPs by parallelizing them at the OP level. Any group of OPs that fan out from, or fan into, a single OP can be parallelized if it benefits from this higher-level parallelism.
Another goal is that this modification should be transparent to users and should not require changes to existing scripts or models. Activating one environment variable will enable it, whether running on CPU, GPU, etc.
All we need to do is add a pass to the current backend.

Proposed Approach

Figure 1. Example for parallel embedding

Take our Wide & Deep model as an example. After the split, the data flow is divided into 26 branches, each handled by a single embedding OP. In the ordinary process, these 26 embedding OPs are executed one by one during inference, with data parallelism used only inside each kernel function. We replace the 26 OPs with one parallel OP that runs them in parallel at the OP level.

Figure 2. Flowchart for subgraph replace.

The flowchart is shown in Fig. 2.

  1. Read the current node.
  2. Find a parallel structure like Fig. 1 in the current subgraph. We customize an SgParallelOpSelector class, which inherits from SubgraphSelector, to do this. If no structure is found, go to step 3; if one is found, go to step 4.
  3. Get the next node and go to step 1.
  4. A filter checks whether the found parallel structure meets certain conditions: for example, whether the number of parallel nodes is >= a threshold, whether the OP is thread safe, and whether parallelizing the OPs would cause a performance drop given the parameters we collected. These condition parameters can be set by users. If the filter fails, return to step 3; otherwise go to step 5.
  5. Replace the currently selected structure with one parallel node and connect its inputs/outputs correctly, based on an SgParallelOpProperty class that inherits from SubgraphProperty. Then go to step 6.
  6. Check whether the current node is the last node. If so, end and exit; otherwise, go to step 3.
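The steps above can be sketched as a single pass over a toy graph. This is an illustrative simplification, not MXNet's actual SubgraphSelector/SubgraphProperty API: the node type, the grouping-by-input heuristic, and the function name `FuseParallelOps` are all assumptions made for the sketch.

```cpp
#include <cassert>
#include <map>
#include <set>
#include <string>
#include <utility>
#include <vector>

// Hypothetical node: an 'op' type and the id of the single input it reads from.
struct Node { std::string op; int input; };

// Walk the nodes (steps 1-3), group siblings that read the same input and
// share the same op type (the Fig. 1 shape, step 2), filter by whitelist and
// threshold (step 4), and fuse each surviving group into one "parallel_<op>"
// node (step 5), finishing after the last node (step 6).
std::vector<Node> FuseParallelOps(const std::vector<Node>& graph,
                                  const std::set<std::string>& whitelist,
                                  std::size_t threshold) {
  // Step 2: count candidate nodes per (input, op) pair -- a parallel structure.
  std::map<std::pair<int, std::string>, std::size_t> group_size;
  for (const Node& n : graph)
    ++group_size[{n.input, n.op}];

  std::vector<Node> out;
  std::set<std::pair<int, std::string>> fused;
  for (const Node& n : graph) {
    auto key = std::make_pair(n.input, n.op);
    // Step 4: filter -- the OP must be thread safe (whitelisted) and the
    // group large enough to benefit from OP-level parallelism.
    bool fuse = whitelist.count(n.op) && group_size[key] >= threshold;
    if (!fuse) {
      out.push_back(n);              // keep the node as-is
    } else if (!fused.count(key)) {  // Step 5: emit one parallel node per group
      out.push_back({"parallel_" + n.op, n.input});
      fused.insert(key);
    }                                // later members of a fused group are dropped
  }
  return out;
}
```

For example, 26 `Embedding` nodes reading the same input collapse into a single `parallel_Embedding` node, while a non-whitelisted OP on the same input is left untouched.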

We implement parallel_op based on the subgraph API. The main body of the parallel OP's forward function is accelerated with OMP multithreading, as shown in Figure 3. This requires the original OP's forward function to be thread safe. As mentioned in step 4, an OP whitelist is used to check whether an OP is thread safe, and entries can be added to or removed from the whitelist in the future via environment variables.
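A minimal sketch of such a forward body, in the spirit of Figure 3: the sub-OPs' forward calls are distributed across threads with one OMP parallel-for. Here each "sub-OP" is reduced to a toy embedding lookup; in the real pass it would be the original OP's (thread-safe) forward function, and the name `ParallelForward` is an assumption for the sketch.

```cpp
#include <cstddef>
#include <vector>

// One parallel OP standing in for N embedding sub-OPs: tables[i] is the i-th
// embedding table, indices[i] the row each sub-OP looks up. Each loop
// iteration runs one sub-OP's forward; iterations share no mutable state,
// which is exactly why the original forward function must be thread safe.
void ParallelForward(const std::vector<std::vector<float>>& tables,
                     const std::vector<int>& indices,
                     std::vector<float>* out) {
  out->resize(tables.size());
#pragma omp parallel for
  for (std::ptrdiff_t i = 0;
       i < static_cast<std::ptrdiff_t>(tables.size()); ++i) {
    (*out)[i] = tables[i][indices[i]];  // i-th embedding OP's lookup
  }
}
```

Without OpenMP enabled at compile time the pragma is ignored and the loop runs sequentially, so the result is the same either way.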

Figure 3. Main body of parallel OP forward.

To get the best performance, we would need to support nested OMP and fine-tune its parameters. In the current version, we simplify this by disabling nested OMP. Environment variables may be added to support performance fine-tuning in a future release.
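The simplification can be sketched as follows: before the parallel OP's outer OMP region runs, nested parallelism is switched off so that each sub-OP's own OMP-parallel kernel falls back to a single thread instead of oversubscribing cores. The helper names are illustrative, and the code is guarded so it still builds when OpenMP is not enabled.

```cpp
#ifdef _OPENMP
#include <omp.h>
#endif

// Disable nested OMP: inner parallel regions (the sub-OPs' kernels) will
// execute with one thread inside the parallel OP's outer team.
void DisableNestedOmp() {
#ifdef _OPENMP
  omp_set_nested(0);
#endif
}

// Returns true if nested parallelism is off (trivially true without OpenMP).
bool NestedDisabled() {
#ifdef _OPENMP
  return omp_get_nested() == 0;
#else
  return true;
#endif
}
```

The same effect can usually be obtained externally by setting the standard `OMP_NESTED=false` environment variable before launching the process.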

This method differs from setting the environment variable MXNET_CPU_WORKER_NTHREADS. With our method, parallelism is applied only to the selected OPs, while MXNET_CPU_WORKER_NTHREADS applies to all OPs.
