https://docs.google.com/document/d/1XC_PmbSc7q6px22LIW3vwhbA_wmX8wRGLRnet3pMJrs/edit?usp=sharing
Problem Statement
Currently, RNN layers are widely used in neural networks for NLP and Seq2Seq learning because of their outstanding ability to handle temporal dependencies. MXNet already implements feature-rich and flexible RNN layers with which end users can easily build their NLP and Seq2Seq models. Besides that, MXNet also provides a fused RNN operator for users who care more about performance or have more fixed RNN architectures. Unfortunately, in MXNet the fused RNN operator is implemented only for GPU, through the cuDNN interface. This causes several problems for MXNet users and developers:
...
As described above, reusing forward intermediate results during the backward computation reduces the amount of computation and significantly improves backward performance. A reserved workspace buffer is defined as a private member of the RNNOp class. This buffer stores the intermediate results of the forward computation and is reused during the backward computation. If the operator instance is cached and reused in later iterations, the workspace buffer is reused as well. The buffer is released when the operator instance is destructed.
Performance
LSTM
To demonstrate the performance of the LSTM layer with our fused RNN operator, we use the sizes and parameters of the Deep Speech 2 model, in which:
seq_length = 300, batch_size = 20, input_size = 800, hidden_size = 800, single direction.
...
Single layer and single direction test as below:

T, N, I, H = 300, 20, 800, 800, layer=1, bidirection=False

samples/sec | LSTMCell (SKX8180) | sym.RNN (SKX8180) | sym.RNN/LSTMCell | LSTMCell (P100) | sym.RNN (P100) | sym.RNN(8180)/sym.RNN(P100) | sym.RNN(8180)/LSTMCell(P100)
---|---|---|---|---|---|---|---
LSTM-Inference | 187.09 | 394.73 | 210.99% | 399.76 | 1049.81 | 37.60% | 98.88%
LSTM-Training(fwd+bwd) | 73.23 | 153.53 | 209.65% | 118.65 | 339.80 | 45.18% | 129.4%

The same single-layer configuration for the GRU and vanilla RNN cells, on SKX8180 with 2 sockets:

samples/sec | Non-FusedRNN | FusedRNN | FusedRNN/Non-FusedRNN
---|---|---|---
GRU-Inference | 128.21 | 392.16 | 305.87%
GRU-Training(fwd+bwd) | 80.32 | 171.91 | 214.03%
vRNN(Relu)-Inference | 518.13 | 1538.46 | 296.92%
vRNN(Relu)-Training(fwd+bwd) | 202.02 | 357.14 | 176.79%
vRNN(Tanh)-Inference | 492.61 | 952.38 | 193.33%
vRNN(Tanh)-Training(fwd+bwd) | | |

For 5-layer LSTM, we can get below performance and speedup on SKX8180 with 2 sockets:

samples/sec | LSTMCell (SKX8180) | sym.RNN (SKX8180) | sym.RNN/LSTMCell | LSTMCell (P100) | sym.RNN (P100) | sym.RNN(8180)/sym.RNN(P100) | sym.RNN(8180)/LSTMCell(P100)
---|---|---|---|---|---|---|---
Inference | 37.24 | 107.13 | 287.64% | 86.45 | 329.78 | 32.48% | 123.92%
Training(fwd+bwd) | 12.93 | 32.29 | 249.73% | 25.45 | 124.13 | 26.01% | 126.85%
GRU
As with the LSTM benchmark, the sizes and parameters for the GRU layer are also taken from the DS2 model:
seq_length = 300, batch_size = 20, input_size = 800, hidden_size = 800, single direction
Single layer performance on SKX8180 with 2 sockets:

samples/sec | GRUCell (SKX8180) | sym.RNN (SKX8180) | sym.RNN/GRUCell | GRUCell (P100) | sym.RNN (P100) | sym.RNN(8180)/sym.RNN(P100) | sym.RNN(8180)/GRUCell(P100)
---|---|---|---|---|---|---|---
Inference | 128.21 | 392.16 | 306% | 180.18 | 952.38 | 41% | 218%
Training(fwd+bwd) | 80.32 | 171.91 | 216% | 126.26 | 338.98 | 51% | 137%

For 5-layer GRU, performance on SKX8180 with 2 sockets:

samples/sec | GRUCell (SKX8180) | sym.RNN (SKX8180) | sym.RNN/GRUCell | GRUCell (P100) | sym.RNN (P100) | sym.RNN(8180)/sym.RNN(P100) | sym.RNN(8180)/GRUCell(P100)
---|---|---|---|---|---|---|---
Inference | 26.67 | 88.9 | 333% | 40.57 | 357.14 | 25% | 219%
Training(fwd+bwd) | 15.22 | 39.34 | 261% | 27.62 | 140.85 | 28% | 142%

5-layers (vRNN/LSTM/GRU) with bi-direction test as below:

T, N, I, H = 300, 20, 800, 800, layer=5, bidirection=True

samples/sec | Non-FusedRNN | FusedRNN | FusedRNN/Non-FusedRNN
---|---|---|---
LSTM-Inference | 37.24 | 107.13 | 287.67%
LSTM-Training(fwd+bwd) | 12.93 | 32.29 | 249.73%
GRU-Inference | 26.67 | 88.9 | 333.33%
GRU-Training(fwd+bwd) | | |
vRNN(Relu)-Inference | 40.73 | 134.23 | 329.53%
vRNN(Relu)-Training(fwd+bwd) | 22.60 | 35.97 | 159.17%
vRNN(Tanh)-Inference | 38.91 | 104.17 | 267.71%
vRNN(Tanh)-Training(fwd+bwd) | | |
Upstream
...