
Problem

Although data parallelism is used in MXNet, its performance is not good enough for operators with low computational cost during inference, especially at small batch sizes. This phenomenon exists in many popular models, such as GoogLeNet, Wide & Deep, and Inception V3. For example, in the Wide & Deep model, 26 embedding OPs are executed in sequence and each one consumes very little computing resource, so model-level performance is sub-optimal because of the long execution path through these low-parallelism operators.

Goals/Use Cases

The primary goal is to improve the performance of these inefficient OPs by parallelizing them at the OP level. Any group of OPs feeding into or out of a single OP can be parallelized if it benefits from this higher-level parallelism.
Another goal is that this modification should be transparent to users: it should not require changes to existing scripts or models. Activating a single environment variable enables it, whether running on CPU, GPU, etc.
All we need to do is add a pass to the current backend.

Proposed Approach

Figure 1. Example of parallel embedding

Take the Wide & Deep model as an example: after the split, the data flow is divided into 26 branches, each handled by a single embedding OP. In the ordinary process, these 26 embedding OPs are executed one by one during inference, with data parallelism used only inside each kernel function. We instead replace the 26 OPs with one parallel OP that handles inference with OP-level parallelism.

Figure 2. Flowchart for subgraph replacement

The flowchart is shown in Fig. 2:

  1. Read the current node.
  2. Search for a parallel structure like Fig. 1 starting from the current node. A custom SgParallelOpSelector class, which inherits from SubgraphSelector, performs this search (a sketch of this selector follows the list). If no structure is found, go to step 3; if one is found, go to step 4.
  3. Get the next node and go to step 1.
  4. A filter checks whether the discovered parallel structure meets certain conditions: for example, whether the number of paralleled nodes is >= a threshold, whether the OP is thread safe, and whether parallelizing the OPs would cause a performance drop given the parameters we collected. These condition parameters can be set by users. If the filter fails, return to step 3; otherwise go to step 5.
  5. Replace the selected structure with one parallel node and connect its inputs/outputs, using an SgParallelOpProperty class that inherits from SubgraphProperty. Then go to step 6.
  6. Check whether the current node is the last node. If so, end and exit; otherwise go to step 3.
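
For illustration, here is a minimal sketch of such a selector, assuming the SubgraphSelector interface from src/operator/subgraph/subgraph_property.h. The whitelist member and threshold are hypothetical, and the real logic for gathering sibling branches may differ:

```cpp
#include <string>
#include <unordered_set>
#include <vector>

#include "subgraph_property.h"  // mxnet::op::SubgraphSelector

namespace mxnet {
namespace op {

// Sketch only: select groups of whitelisted, same-typed OPs
// (e.g. the 26 Embedding branches in Fig. 1). Member names and the
// traversal rules below are hypothetical simplifications.
class SgParallelOpSelector : public SubgraphSelector {
 public:
  explicit SgParallelOpSelector(std::unordered_set<std::string> whitelist)
      : whitelist_(std::move(whitelist)) {}

  // Start a candidate subgraph at any whitelisted (thread-safe) OP.
  bool Select(const nnvm::Node &n) override {
    return !n.is_variable() && whitelist_.count(n.op()->name) > 0;
  }

  // Keep the shared producer (split / data node) outside the subgraph.
  bool SelectInput(const nnvm::Node &n, const nnvm::Node &new_node) override {
    return false;
  }

  // Grow the subgraph only across OPs of the same type as the seed.
  bool SelectOutput(const nnvm::Node &n, const nnvm::Node &new_node) override {
    return !new_node.is_variable() && new_node.op() == n.op();
  }

  // Step 4's filter: drop structures too small to benefit.
  std::vector<nnvm::Node *> Filter(
      const std::vector<nnvm::Node *> &candidates) override {
    if (candidates.size() < kMinParallelNodes) return {};
    return candidates;
  }

 private:
  static const size_t kMinParallelNodes = 2;  // hypothetical threshold
  std::unordered_set<std::string> whitelist_;
};

}  // namespace op
}  // namespace mxnet
```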

We implement parallel_op based on the subgraph API. The main body of the parallel OP forward function is accelerated with OMP multithreading, as Figure 3 shows. This means the original OP forward function must be thread safe. As mentioned in step 4, an OP whitelist is used to check whether an OP is thread safe, and entries can be added to or removed from the whitelist in the future through environment variables.

Figure 3. Main body of parallel OP forward.
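
As a standalone illustration of the pattern in Figure 3 (hypothetical names, not the actual parallel_op code), each branch's forward is dispatched from its own OMP thread:

```cpp
#include <omp.h>

#include <cstdio>
#include <functional>
#include <vector>

// Stand-in for one branch's forward computation (e.g. one Embedding).
using ForwardFn = std::function<void(int branch_id)>;

// Main body of the parallel forward: run the N captured sub-OPs
// concurrently instead of one by one. Valid only if every sub-OP
// forward is thread safe.
void ParallelForward(const std::vector<ForwardFn> &branches) {
#pragma omp parallel for
  for (int i = 0; i < static_cast<int>(branches.size()); ++i) {
    branches[i](i);  // each branch works on its own input/output slice
  }
}

int main() {
  std::vector<ForwardFn> branches(26, [](int id) {
    std::printf("branch %d ran on OMP thread %d\n", id, omp_get_thread_num());
  });
  ParallelForward(branches);
  return 0;
}
```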


To get the best performance, we would need to support nested OMP and fine-tune its parameters. In the current version, we simplify this by disabling nested OMP. An environment variable may be added to support fine-tuning the performance in a future release.
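
Assuming standard OpenMP runtime calls, disabling nesting could look like the following sketch (the actual mechanism used by the pass may differ):

```cpp
#include <omp.h>

int main() {
  // With nesting disabled, the OMP loop inside parallel_op runs
  // multithreaded, while any OMP region inside each sub-OP kernel
  // executes single-threaded instead of oversubscribing cores.
  omp_set_max_active_levels(1);  // OpenMP 3.0+; older code used omp_set_nested(0)
  return 0;
}
```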

This method is different from setting the environment variable MXNET_CPU_WORKER_NTHREADS. With our method, parallelism is applied only to the selected OPs, while MXNET_CPU_WORKER_NTHREADS applies to all OPs.

Addition of New APIs

No new APIs were added or modified.

Backward compatibility

We add a pass to the backend, which has no backward compatibility issue when deactivated. When active, we may need to consider compatibility across different passes.

Performance

In the Wide & Deep model, we replace 26 embedding OPs with one parallel_op, as in Fig. 1. Running inference on one socket of an SKX-8180 with batch size 1 and 28 OMP threads gives the performance in Table 1: the parallel OP achieves a 3.7X speedup.

OP             Time cost (ms)
embedding      51240.051
SgParallel_op  13763.959

Table 1. Performance of embedding vs. SgParallel_op

MKL-DNN OPs will be supported once MKL-DNN v1.0 makes Intel MKL-DNN primitives stateless and thread safe: the same primitive can be executed in multiple independent threads as long as different threads use different scratchpads. Then we can accelerate more models, such as Inception and GoogLeNet.

Test Plan

Tests need to cover two parts. The first is the graph conversion test. We need to ensure that:

  1. All OPs are partitioned into one or more subgraphs according to the executing mode.
  2. Desired patterns can be captured and the desired paralleled OPs are created.

The second part is unit tests for the OPs in the parallel OP whitelist. All these OPs should be thread safe. The tests should cover all supported OPs and make sure they produce accurate results.
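
For the thread-safety part, one possible pattern (a sketch with a hypothetical run_forward stand-in, not an actual MXNet test) is to compare a concurrent run of an OP against its sequential baseline:

```cpp
#include <omp.h>

#include <cassert>
#include <vector>

// Hypothetical stand-in for one whitelisted OP's forward on branch i;
// a real test would run the OP on branch-local inputs.
std::vector<float> run_forward(int branch_id) {
  return std::vector<float>(4, static_cast<float>(branch_id));
}

int main() {
  const int kBranches = 26;

  // Sequential baseline.
  std::vector<std::vector<float>> expected(kBranches);
  for (int i = 0; i < kBranches; ++i) expected[i] = run_forward(i);

  // Concurrent run: a thread-safe OP must reproduce the baseline.
  std::vector<std::vector<float>> actual(kBranches);
#pragma omp parallel for
  for (int i = 0; i < kBranches; ++i) actual[i] = run_forward(i);

  for (int i = 0; i < kBranches; ++i) assert(actual[i] == expected[i]);
  return 0;
}
```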

Milestones

  1. Support the structure in Fig. 1.
  2. Support the structure in Fig. 4, in which all OPs to be replaced send their output to one OP.

    Figure 4. Replacing OPs that come to one OP

  3. Support the structure in Fig. 5, in which all OPs to be replaced take their input from one OP X.

    Figure 5. Replacing OPs that come from one OP

  4. Support all MKL-DNN OPs; add all OPs that support parallel execution to the whitelist.

References

https://cwiki.apache.org/confluence/display/MXNET/Parallel+Inference+in+MXNet

https://cwiki.apache.org/confluence/display/MXNET/MXNet+Graph+Optimization+and+Quantization+based+on+subgraph+and+MKL-DNN

https://github.com/intel/mkl-dnn/tree/rfc-api-changes-v1.0/doc/rfc/api-v1.0

