...

The primary goal is to improve the performance of inefficient OPs by parallelizing them at the OP level. Any OPs that feed into, or read from, a single OP can be parallelized if they benefit from this higher-level parallelism.
Another goal is that this change should be transparent to users and should not require changes to existing scripts or models. Setting a single environment variable enables it on any device (CPU, GPU, etc.).
All we need to do is add a pass to the current backend.

Proposed Approach

Figure 1. Example for parallel embedding

Take our wide and deep model as an example. After the split, the data flow is divided into 26 branches, each handled by a single embedding OP. In the ordinary process, these 26 embedding OPs are executed one by one during inference, and data parallelism is used only inside each kernel function. We now replace the 26 OPs with one parallel OP that runs them in parallel at the OP level.
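The difference between the two execution modes can be sketched in a few lines of Python. This is only an illustration, not the backend implementation: a thread pool stands in for the OMP threads, plain lists stand in for tensors, and the names embedding, run_sequential and run_parallel_op are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def embedding(table, indices):
    """Return one row of `table` per index (a stand-in for one embedding OP)."""
    return [table[i] for i in indices]


def run_sequential(tables, batches):
    # Ordinary execution: the embedding OPs run one after another.
    return [embedding(t, b) for t, b in zip(tables, batches)]


def run_parallel_op(tables, batches, num_threads=4):
    # One "parallel OP": every branch is dispatched to a worker thread.
    # map() preserves input order, so outputs line up with the branches.
    with ThreadPoolExecutor(max_workers=num_threads) as pool:
        return list(pool.map(embedding, tables, batches))


# 26 branches, as in the example above; table sizes are made up.
NUM_BRANCHES = 26
tables = [[[float(b), float(r)] for r in range(8)] for b in range(NUM_BRANCHES)]
batches = [[(b + k) % 8 for k in range(3)] for b in range(NUM_BRANCHES)]
```

Both functions return the same 26 outputs in the same order; only the scheduling of the branches differs, which is why the replacement can stay transparent to the rest of the graph.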

Figure 2. Flowchart for subgraph replace.

The flowchart for subgraph replacement is shown in Fig.2.

...

We implement paralle_op on top of the subgraph API. The main body of the parallel OP's forward function is accelerated with OMP multithreading, as shown in Figure 3. This means the forward function of each original OP must be thread safe. As mentioned in step 4, an OP whitelist is used to check whether an OP is thread safe. In the future, OPs can be added to or removed from the whitelist by setting environment variables.
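A whitelist that can be adjusted through environment variables might look like the sketch below. The variable names PARALLEL_OP_WHITELIST_ADD and PARALLEL_OP_WHITELIST_REMOVE, the default entry, and the helper names are all assumptions made for illustration; the proposal does not define them.

```python
import os

# Hypothetical defaults and variable names, for illustration only.
DEFAULT_WHITELIST = {"Embedding"}
ADD_VAR = "PARALLEL_OP_WHITELIST_ADD"
REMOVE_VAR = "PARALLEL_OP_WHITELIST_REMOVE"


def _parse(value):
    """Split a comma-separated list of OP names, ignoring blanks."""
    return {name.strip() for name in value.split(",") if name.strip()}


def load_whitelist(environ=None):
    """Build the thread-safe OP whitelist from defaults plus overrides."""
    environ = os.environ if environ is None else environ
    whitelist = set(DEFAULT_WHITELIST)
    whitelist |= _parse(environ.get(ADD_VAR, ""))      # OPs the user vouches for
    whitelist -= _parse(environ.get(REMOVE_VAR, ""))   # OPs the user opts out
    return whitelist


def is_thread_safe(op_name, whitelist):
    # An OP is eligible for replacement only if it is on the whitelist.
    return op_name in whitelist
```

Passing the environment as an argument keeps the lookup easy to test; calling load_whitelist() with no argument reads the real process environment.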

Figure 3. Main body of parallel OP forward.

To get the best performance, we would need to support nested OMP and fine-tune its parameters. The current version simplifies this by disabling nested OMP. An environment variable may be added in a future release to support performance tuning.
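The trade-off can be made concrete with a small thread-budget calculation. This is a hedged sketch of one possible policy, not the scheduler used by the backend: plan_threads and its even split of threads are assumptions. The only fact relied on is that, with nested parallelism disabled, an inner OMP parallel region runs on a single thread.

```python
def plan_threads(num_ops, total_threads, nested=False):
    """Return (outer, inner): threads across OPs and threads inside each OP."""
    # The outer level never needs more threads than there are OPs to run.
    outer = max(1, min(num_ops, total_threads))
    if not nested:
        # Nested OMP disabled: each OP kernel runs on a single thread.
        return outer, 1
    # Nested OMP enabled (assumed policy): share the remaining threads evenly.
    return outer, max(1, total_threads // outer)
```

With 26 OPs and 8 threads the two modes coincide, since all threads are already used at the OP level. With only 4 OPs and 16 threads, disabling nesting leaves 12 threads idle, which is the case that nested OMP and parameter tuning would address.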

...

  1. Support the structure in Fig.1.
  2. Support the structure in Fig.4, where all OPs to be replaced send their output to one OP.

    Figure 4. Replace OPs that come to one OP

  3. Support the structure in Fig.5, where all OPs to be replaced take their input from OP X.

    Figure 5. Replace OPs that come from one OP
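The structures listed above can be detected with a simple scan over the graph. The sketch below is an assumption-laden illustration rather than the subgraph API itself: the graph is a plain dict mapping each node name to its OP type and input names, find_parallel_groups is a hypothetical helper, and it does not check that the grouped OPs are independent of each other, which a real pass would have to do.

```python
from collections import defaultdict


def find_parallel_groups(graph, whitelist, min_size=2):
    """Find whitelisted OPs that share one consumer or one producer.

    Returns (kind, anchor, ops) tuples, where kind is "to" for OPs that
    all feed `anchor` and "from" for OPs that all read from `anchor`.
    """
    consumers = defaultdict(list)  # node name -> nodes reading its output
    for name, node in graph.items():
        for src in node["inputs"]:
            consumers[src].append(name)

    def eligible(name):
        return name in graph and graph[name]["op"] in whitelist

    groups = []
    for name, node in graph.items():
        # OPs that all come to one OP.
        fan_in = sorted(s for s in set(node["inputs"]) if eligible(s))
        if len(fan_in) >= min_size:
            groups.append(("to", name, fan_in))
        # OPs that all come from one OP.
        fan_out = sorted(c for c in set(consumers[name]) if eligible(c))
        if len(fan_out) >= min_size:
            groups.append(("from", name, fan_out))
    return groups


# A three-branch version of the split -> embedding -> concat structure.
example_graph = {
    "data": {"op": "Input", "inputs": []},
    "split": {"op": "SliceChannel", "inputs": ["data"]},
    "emb0": {"op": "Embedding", "inputs": ["split"]},
    "emb1": {"op": "Embedding", "inputs": ["split"]},
    "emb2": {"op": "Embedding", "inputs": ["split"]},
    "concat": {"op": "Concat", "inputs": ["emb0", "emb1", "emb2"]},
}
example_groups = find_parallel_groups(example_graph, {"Embedding"})
```

In this example the same three embedding OPs are reported twice, once as reading from split and once as feeding concat, so a real pass would deduplicate groups before replacing them with one parallel OP.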

...