
Problem

Running many low-computation OPs sequentially in a single thread is inefficient for inference. Although MXNet uses data parallelism, performance is still not good enough for operators with low computation at inference time, especially at small batch sizes, because machine resources are not fully utilized. This phenomenon widely exists in many popular models, such as GoogLeNet, Wide&Deep and Inception V3. For example, in the Wide&Deep model, 26 embedding OPs are executed in sequence and each one consumes very little computing resource, so model-level performance is sub-optimal due to the long execution path through these low-parallelism operators.
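To illustrate the problem, here is a minimal, self-contained sketch (not MXNet code; the `embedding_lookup` helper and the thread-pool dispatch are illustrative stand-ins) showing why independent low-compute OPs like the 26 embeddings above are a natural target for OP-level parallelism: none depends on another's output, so they can run concurrently instead of in sequence.

```python
from concurrent.futures import ThreadPoolExecutor

def embedding_lookup(table, indices):
    # Toy stand-in for one low-computation embedding OP:
    # gather a few rows from a weight table.
    return [table[i] for i in indices]

# 26 independent embedding OPs, mirroring the Wide&Deep example.
tables = [{j: j * k for j in range(10)} for k in range(26)]
indices = [1, 3, 5]

# Sequential execution: the critical path is the sum of all 26 small
# kernels plus per-OP dispatch overhead.
sequential = [embedding_lookup(t, indices) for t in tables]

# OP-level parallelism: the 26 lookups share no data dependency,
# so they can be dispatched concurrently.
with ThreadPoolExecutor(max_workers=8) as pool:
    parallel = list(pool.map(lambda t: embedding_lookup(t, indices), tables))

# Same results either way; only the critical path changes.
assert parallel == sequential
```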

Goals/Usecases

The primary goal is to improve the performance of these inefficient OPs by parallelizing them at the OP level. Any set of OPs that take input from, or feed output to, a common OP can be parallelized if they benefit from this higher-level parallelism.
Another goal is that this modification should be transparent to users and should not require changing existing scripts or models. Enabling a single environment variable makes it work, whether on CPU, GPU, etc.
All we need to do is add a pass to the current backend.
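The pass described above can be sketched as follows. This is a hypothetical illustration, not MXNet's actual pass API: `find_parallel_groups` and the edge-list graph encoding are invented for clarity. The idea is simply to scan the dataflow graph for sibling OPs that all consume one producer's output, so the executor can dispatch each group concurrently once that producer finishes.

```python
from collections import defaultdict

def find_parallel_groups(edges):
    """edges: list of (producer, consumer) pairs in the dataflow graph.
    Returns, per producer, the groups of sibling consumers that can run
    concurrently once that producer's output is ready."""
    consumers = defaultdict(list)
    for producer, consumer in edges:
        consumers[producer].append(consumer)
    # Only groups with more than one consumer benefit from parallel dispatch.
    return {p: c for p, c in consumers.items() if len(c) > 1}

# Toy Wide&Deep-like graph: one input fans out to several embedding OPs,
# whose outputs are then concatenated.
graph = [("input", f"embedding_{i}") for i in range(4)] + [
    (f"embedding_{i}", "concat") for i in range(4)
]
groups = find_parallel_groups(graph)
# "input" fans out to 4 embeddings -> one parallel group of size 4;
# each embedding has only one consumer, so no group forms there.
```

A real backend pass would additionally check that the grouped OPs are cheap enough for parallel dispatch to pay off, but the grouping step is the core of it.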

...