Credit to Zhennan for this proposal.

Problem

Although data parallelism is used in MXNet, its performance is not good enough for less computationally intensive operators in the inference stage, especially at small batch sizes. This phenomenon is widespread in popular models such as googlenet, wide deep and inception v3. For example, in the wide deep model, 26 embedding OPs are executed in sequence, and each one consumes very little computing resource. As a result, model-level performance is sub-optimal because of the long execution path through these low-parallelism operators.

...

As shown in Fig. 2, we implement the whole workflow on top of the subgraph API. `SgParallelOpSelector`, which inherits from `SubgraphSelector`, is used to find the parallel structure, and `SgParallelOpProperty`, which inherits from `SubgraphProperty`, is used to connect its input/output entries.

The key block in Fig. 2 is the Filter, which checks whether the parallel structure that was found meets certain conditions. First, it must guarantee that every operator is thread safe; otherwise, the operator may fail when executed simultaneously by multiple threads. All MKL-DNN operators will be thread safe after version 1.0 and can then be executed in parallel, but for now we need to maintain a whitelist of thread-safe operators. Other conditions are used to fine-tune performance, such as requiring the number of paralleled nodes to be >= a threshold, or rejecting OPs whose parallelization would cause a performance drop based on the parameters we got. In a future release, an environment variable may be added so that users can add operators to or remove them from the whitelist.
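The Filter conditions described above can be sketched as a small predicate. This is only an illustration in Python, not the MXNet implementation: the function name `passes_filter`, the contents of `THREAD_SAFE_OPS`, and the value of `MIN_PARALLEL_NODES` are all hypothetical, since the text does not give the actual whitelist or threshold.

```python
# Hypothetical whitelist of thread-safe operators. The real list is not
# given in the text; "Embedding" is included only because the wide deep
# example runs its embedding OPs in parallel.
THREAD_SAFE_OPS = {"Embedding", "FullyConnected"}

# Hypothetical threshold on the number of nodes worth running in parallel.
MIN_PARALLEL_NODES = 2


def passes_filter(op_names, whitelist=THREAD_SAFE_OPS, min_nodes=MIN_PARALLEL_NODES):
    """Return True if a candidate group of operators may run in parallel.

    op_names: operator names of the nodes in the candidate parallel structure.
    """
    # Condition 1: enough nodes for parallel execution to be worthwhile.
    if len(op_names) < min_nodes:
        return False
    # Condition 2: every operator must be on the thread-safe whitelist.
    return all(name in whitelist for name in op_names)
```

With these assumed values, a group of 26 `Embedding` nodes is accepted, while a single node or a group containing an operator that is not whitelisted is rejected.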

The main body of the parallel op forward function is accelerated with OMP multithreading, as shown in Fig. 3. During inference, several operators run in parallel. For example, in the wide deep model, the 26 embedding forward functions are called simultaneously. This OP-level parallelism improves performance considerably.
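The structure of the parallel forward call can be illustrated with a short Python sketch. Note the substitution: the proposal uses OMP threads inside MXNet's C++ backend, whereas this sketch uses a `ThreadPoolExecutor` as a stand-in, so it shows only the shape of the dispatch and not the real speedup (pure-Python threads are limited by the interpreter lock). The names `embedding_forward` and `parallel_forward` and the toy tables are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def embedding_forward(table, indices):
    """Toy embedding lookup: return the rows of `table` selected by `indices`."""
    return [table[i] for i in indices]


def parallel_forward(ops, num_threads=4):
    """Run independent operator calls concurrently and keep their order.

    ops: list of (function, args) pairs with no data dependency on each other.
    """
    with ThreadPoolExecutor(max_workers=num_threads) as pool:
        futures = [pool.submit(fn, *args) for fn, args in ops]
        # Collect results in submission order, not completion order.
        return [future.result() for future in futures]


# 26 independent embedding lookups, mirroring the wide deep example.
tables = [[[float(t), float(r)] for r in range(3)] for t in range(26)]
ops = [(embedding_forward, (table, [2, 0])) for table in tables]
outputs = parallel_forward(ops)
```

Because the 26 lookups share no inputs or outputs, they can be dispatched together and their results gathered in the original order, which is the property the parallel op relies on.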

...