...
The graph below demonstrates the design and execution flow of the fused RNN operator in MXNet. Green blocks are already implemented for NVIDIA GPUs through the CuDNN interfaces. Yellow blocks were recently integrated by PR#9977, covering only the LSTM inference path. Blue blocks will be added to extend PR#9977 to training and to other RNN variants. Currently, PR#10104 is submitted for the fused LSTM implementation and PR#10311 for the fused GRU implementation. Vanilla RNN support is planned and will be provided in the future.
Operator Registration
Currently, `sym.RNN` is registered in MXNet through the legacy DMLC interfaces. We are trying to refactor this part of the code with NNVM interfaces. In the NNVM registration design, operator creation, caching, and workspace sharing need to be redesigned so that this information can be passed between the forward and backward paths and across iterations.
...
As described above, reusing forward intermediate results during backward computation reduces the amount of computation and improves backward performance significantly. A reserved workspace buffer is defined as a private member of the RNNOp class. This buffer stores the intermediate results of the forward computation and is reused during the backward computation. If the operator instance is cached and reused in later iterations, the workspace buffer is reused as well. The buffer is released when the operator instance is destructed.
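The mechanism can be sketched in Python. This is illustrative only: the class name, fields, and the stand-in math below are hypothetical, not MXNet's actual C++ implementation.

```python
class FusedRNNOp:
    """Toy sketch of an operator that keeps a reserved workspace.

    Stands in for the real RNNOp, which stores forward intermediate
    results so the backward pass can reuse them instead of recomputing.
    """

    def __init__(self):
        # Reserved buffer; lives as long as the (cached) operator instance.
        self._workspace = None

    def forward(self, x):
        # Stand-in for per-step gate activations; in the real operator this
        # is a raw memory block sized for all time steps.
        intermediates = [v * v for v in x]
        self._workspace = intermediates   # stash for backward
        return sum(intermediates)

    def backward(self, grad_out):
        # Reuse the forward intermediates instead of recomputing them.
        assert self._workspace is not None, "forward must run before backward"
        return [2 * grad_out * w ** 0.5 for w in self._workspace]

# The instance (and its workspace) is cached and reused across iterations.
op = FusedRNNOp()
y = op.forward([1.0, 2.0, 3.0])   # workspace is filled here
dx = op.backward(1.0)             # workspace is reused here
```

Because the instance owns the buffer, an iteration's backward pass can consume what its forward pass produced without reallocating, which is why the operator instance has to be cached and shared between the forward and backward paths.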
Performance
LSTM
To demonstrate the performance of the LSTM layer with our fused RNN operator, we use the layer sizes and parameters of the Deep Speech 2 model, in which:
seq_length = 300, batch_size = 20, input_size = 800, hidden_size = 800, single direction.
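As a back-of-envelope check on the amount of work these sizes imply, the per-layer parameter counts can be computed from the standard gate layouts (4 gates for an LSTM, 3 for a GRU). This arithmetic is illustrative and not part of the original benchmark.

```python
# Benchmark sizes from the DS2-based configuration above
input_size, hidden_size = 800, 800

# Standard fused-RNN weight layout: each gate has an input-to-hidden
# (I x H) and a hidden-to-hidden (H x H) weight matrix plus two biases.
def rnn_layer_params(num_gates, I, H):
    return num_gates * (I * H + H * H + 2 * H)

lstm_params = rnn_layer_params(4, input_size, hidden_size)  # LSTM: 4 gates
gru_params = rnn_layer_params(3, input_size, hidden_size)   # GRU: 3 gates

print(lstm_params)  # 5126400 parameters per single-direction LSTM layer
print(gru_params)   # 3844800, i.e. 3/4 of the LSTM count
```

The 3:4 gate ratio is one reason GRU layers show different absolute throughput than LSTM layers in the tables below.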
...
Single layer and single direction test as below:

| samples/sec | LSTMCell (SKX8180) | sym.RNN (SKX8180) | sym.RNN/LSTMCell | LSTMCell (P100) | sym.RNN (P100) | sym.RNN(8180)/sym.RNN(P100) | sym.RNN(8180)/LSTMCell(P100) |
|---|---|---|---|---|---|---|---|
| Inference | 187.09 | 394.73 | 210.99% | 399.76 | 1049.81 | 37.60% | 98.88% |
| Training(fwd+bwd) | 73.23 | 153.53 | 209.65% | 118.65 | 339.80 | 45.18% | 129.4% |

For 5-layer LSTM, we can get below performance and speedup on SKX8180 with 2 sockets:

| samples/sec | LSTMCell (SKX8180) | sym.RNN (SKX8180) | sym.RNN/LSTMCell | LSTMCell (P100) | sym.RNN (P100) | sym.RNN(8180)/sym.RNN(P100) | sym.RNN(8180)/LSTMCell(P100) |
|---|---|---|---|---|---|---|---|
| Inference | 37.24 | 107.13 | 287.67% | 86.45 | 329.78 | 32.48% | 123.92% |
| Training(fwd+bwd) | 12.93 | 32.29 | 249.73% | 25.45 | 124.13 | 26.01% | 126.85% |

GRU
As with the LSTM benchmark, the sizes and parameters of the GRU layer are also taken from the DS2 model:
seq_length = 300, batch_size = 20, input_size = 800, hidden_size = 800, single direction
Single layer performance on SKX8180 with 2 sockets, compared with P100:

| samples/sec | GRUCell (SKX-8180) | sym.RNN (SKX-8180) | sym.RNN/GRUCell | GRUCell (P100) | sym.RNN (P100) | sym.RNN(8180)/sym.RNN(P100) | sym.RNN(8180)/GRUCell(P100) |
|---|---|---|---|---|---|---|---|
| Inference | 128.21 | 392.16 | 306% | 180.18 | 952.38 | 41% | 218% |
| Training(fwd+bwd) | 80.32 | 171.91 | 214% | 126.58 | 338.98 | 51% | 137% |

For 5-layer GRU, performance on SKX8180 with 2 sockets, compared with P100:

| samples/sec | GRUCell (SKX-8180) | sym.RNN (SKX-8180) | sym.RNN/GRUCell | GRUCell (P100) | sym.RNN (P100) | sym.RNN(8180)/sym.RNN(P100) | sym.RNN(8180)/GRUCell(P100) |
|---|---|---|---|---|---|---|---|
| Inference | 26.67 | 88.9 | 333% | 40.57 | 357.14 | 25% | 219% |
| Training(fwd+bwd) | 15.04 | 39.2 | 261% | 27.62 | 140.85 | 28% | 142% |

Putting all fused variants together, the single layer, single direction results on SKX8180 are as below:

| T, N, I, H = 300, 20, 800, 800 layer=1 bidirection=False samples/sec | Non-FusedRNN | FusedRNN | FusedRNN/Non-FusedRNN |
|---|---|---|---|
| LSTM-Inference | 187.09 | 394.73 | 210.99% |
| LSTM-Training(fwd+bwd) | 73.23 | 153.53 | 209.65% |
| GRU-Inference | 128.21 | 392.16 | 305.87% |
| GRU-Training(fwd+bwd) | 80.32 | 171.91 | 214.03% |
| vRNN(Relu)-Inference | 518.13 | 1538.46 | 296.92% |
| vRNN(Relu)-Training(fwd+bwd) | 202.02 | 357.14 | 176.79% |
| vRNN(Tanh)-Inference | 492.61 | 952.38 | 193.33% |
| vRNN(Tanh)-Training(fwd+bwd) | 198.02 | 318.98 | 161.08% |

5-layers (vRNN/LSTM/GRU) with bi-direction test as below:

| T, N, I, H = 300, 20, 800, 800 layer=5 bidirection=True samples/sec | Non-FusedRNN | FusedRNN | FusedRNN/Non-FusedRNN |
|---|---|---|---|
| LSTM-Inference | 37.24 | 107.13 | 287.67% |
| LSTM-Training(fwd+bwd) | 12.93 | 32.29 | 249.73% |
| GRU-Inference | 26.67 | 88.9 | 333.33% |
| GRU-Training(fwd+bwd) | 15.04 | 39.2 | 260.64% |
| vRNN(Relu)-Inference | 40.73 | 134.23 | 329.53% |
| vRNN(Relu)-Training(fwd+bwd) | 22.60 | 35.97 | 159.17% |
| vRNN(Tanh)-Inference | 38.91 | 104.17 | 267.71% |
| vRNN(Tanh)-Training(fwd+bwd) | 22.73 | 34.01 | 149.66% |
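The Speedup columns in the tables above are plain ratios of the throughput columns. For example, taking the single-layer LSTM inference numbers:

```python
# Throughputs (samples/sec) from the single-layer LSTM table
lstmcell_skx8180, sym_rnn_skx8180 = 187.09, 394.73
lstmcell_p100, sym_rnn_p100 = 399.76, 1049.81

# sym.RNN/LSTMCell on SKX8180: fused vs. non-fused on the same CPU
fused_vs_nonfused = sym_rnn_skx8180 / lstmcell_skx8180   # ~2.11, i.e. ~211%

# sym.RNN(8180)/sym.RNN(P100): fused CPU vs. fused (CuDNN) GPU
cpu_vs_gpu = sym_rnn_skx8180 / sym_rnn_p100              # ~0.376, i.e. ~37.6%

print(round(fused_vs_nonfused, 3), round(cpu_vs_gpu, 3))
```

Ratios above 100% mean the numerator is faster; for instance, 37.60% means the fused CPU operator reaches a bit over a third of the CuDNN P100 throughput on this configuration.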
Upstream
- LSTM PR#10104, GRU PR#10311 and vRNN PR#11399: Merged
- PR#10104: This PR is for the fused LSTM operator, which also supports multi-layer and bidirectional computation. The code is complete and ready for review. When we tried to refactor the code, including the CuDNN implementation, with NNVM interfaces, a segfault was observed in the MXNet CI environment. The error cannot be reproduced on our local server, but it seems to be caused by the memory sharing mechanism between the forward and backward computation. So we removed the NNVM interfaces from this PR and kept both the CPU path and the GPU path with the legacy registration method.
- PR#10311: This PR is for the fused GRU operator. Multi-layer and bidirectional support is also implemented for the fused GRU operator. This PR's review and merging depend on the progress of PR#10104.
- TODO: Vanilla RNN support is still WIP.
MKL-DNN Integration
Intel MKL-DNN is an open-source performance library for deep learning applications. The library accelerates deep learning applications and frameworks on Intel architecture. Recently, MKL-DNN added RNN primitives to its master branch on GitHub. The RNN primitives are still experimental, and their performance is not yet good enough. The MKL-DNN team is collecting user feedback and continues to improve the performance of these primitives. Currently, vanilla RNN, LSTM and GRU, as well as their bidirectional and multi-layer computation, are supported by MKL-DNN.
...