...

Although data parallelism is used in MXNet, performance is not good enough for operators with low computational intensity in the inference stage, especially at small batch sizes. This phenomenon widely exists in many popular models, such as GoogLeNet, Wide & Deep, and Inception v3. For example, in the Wide & Deep model, 26 embedding OPs are executed in sequence and each one consumes very little computing resource, so the model-level performance is sub-optimal due to the long execution path through these low-parallelism operators.

...

The primary goal is to improve performance by parallelizing inefficient and independent OPs. In this proposal, only the situation where OPs fan in to or fan out from a single OP is covered; other hierarchical patterns will be considered in the future. The change in this proposal guarantees that the modification is transparent to users and does not require changes to existing scripts or models.

The approach can work for all backends by sharing the same subgraph path, but in practice some adjustments to the interfaces and implementations are still needed. Thus, in the first step, only the CPU and MKLDNN backends are enabled.
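To make the fan-in pattern concrete, the following is a minimal standalone sketch (not MXNet internals) of the Wide & Deep example above: 26 independent embedding lookups, which a framework could dispatch in parallel instead of sequentially before the single downstream concat OP consumes their outputs. The table sizes, index shapes, and thread count are illustrative assumptions.

```python
import concurrent.futures
import numpy as np

# 26 independent embedding tables, as in the Wide & Deep example.
# Shapes are illustrative assumptions.
NUM_EMBEDDINGS = 26
rng = np.random.default_rng(0)
tables = [rng.standard_normal((100, 8)) for _ in range(NUM_EMBEDDINGS)]
indices = [rng.integers(0, 100, size=4) for _ in range(NUM_EMBEDDINGS)]

def lookup(table, idx):
    # Each lookup is cheap and has no dependency on the others,
    # so the 26 OPs form a parallelizable fan-in pattern.
    return table[idx]

# Sequential execution: the long, low-parallelism path described above.
seq_out = [lookup(t, i) for t, i in zip(tables, indices)]

# Parallel dispatch of the independent OPs.
with concurrent.futures.ThreadPoolExecutor(max_workers=8) as pool:
    par_out = list(pool.map(lookup, tables, indices))

# The single consumer OP (concat) sees identical inputs either way,
# which is why the optimization is transparent to users.
concat = np.concatenate(par_out, axis=1)
```

The key property the proposal relies on is visible here: because the parallel schedule produces bit-identical inputs for the consumer OP, the rewrite can be applied automatically in the subgraph pass without changing user scripts or model definitions.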

...