
Problem Statement 

Currently, RNN layers are widely used in neural networks for NLP and Seq2Seq learning because of their outstanding ability to handle temporal dependencies. MXNet already provides feature-rich and flexible RNN layers with which end users can easily build their NLP and Seq2Seq models. Besides that, MXNet also provides a fused RNN operator for users who care more about performance or have more fixed RNN architectures. Unfortunately, the fused RNN operator in MXNet is implemented only for GPU, through CuDNN interfaces. This causes several problems for MXNet users and developers:

...

The graph below demonstrates the design and execution flow of the fused RNN operator in MXNet. Green blocks have already been implemented for NVidia GPU with CuDNN interfaces. Yellow blocks were recently integrated by PR#9977, for the LSTM inference path only. Blue blocks will be added to extend PR#9977 to training and to other RNN variants. Currently, PR#10104 is submitted for the fused LSTM implementation and PR#10311 for the fused GRU implementation. Vanilla RNN is planned and will be provided in the future.

[Figure: design and execution flow of the fused RNN operator in MXNet]

Operator Registration

Currently, `sym.RNN` is registered into MXNet through the legacy DMLC interfaces. We are refactoring this part of the code to use the NNVM interfaces. As part of the NNVM registration design, operator creation, caching, and workspace sharing need to be redesigned so that this information can be passed between the forward and backward passes and across iterations.
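As a reference point for the interface that this registration exposes, below is a minimal Python sketch of how the fused operator is invoked from the frontend; the variable names and shapes are illustrative, and the registration work itself happens in the C++ backend:

```python
import mxnet as mx

# Symbolic inputs to the fused RNN operator. MXNet's RNN operator
# expects TNC layout: (seq_length, batch_size, input_size).
data = mx.sym.Variable('data')      # (T, N, I)
params = mx.sym.Variable('params')  # all weights and biases, flattened
init_h = mx.sym.Variable('init_h')  # (num_layers, N, state_size)
init_c = mx.sym.Variable('init_c')  # (num_layers, N, state_size), LSTM only

# A single graph node covers the whole (possibly multi-layer) LSTM.
# This is the operator whose registration is being migrated from the
# legacy DMLC interfaces to NNVM.
out = mx.sym.RNN(data=data, parameters=params, state=init_h,
                 state_cell=init_c, state_size=800, num_layers=1,
                 mode='lstm', state_outputs=True)
```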

...

As described above, reusing forward intermediate results during backward computation reduces the amount of computation and significantly improves backward performance. A reserved workspace buffer is defined as a private member of the RNNOp class. This buffer stores the intermediate results during forward computation and is reused during backward computation. If the operator instance is cached and reused in later iterations, the workspace buffer is reused as well. The buffer is released when the operator instance is destructed.
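The caching pattern can be sketched schematically in Python as below. `FusedRNNOpSketch` is an illustrative stand-in for the C++ RNNOp class, shown here for a vanilla tanh RNN step rather than the full fused kernels:

```python
import numpy as np

class FusedRNNOpSketch:
    """Schematic model of the workspace-reuse pattern: intermediate
    results are stashed in a reserved buffer during forward and read
    back during backward, and the buffer lives as long as the op."""

    def __init__(self):
        self._reserve = None  # reserved workspace, allocated once and cached

    def forward(self, x, w):
        # x: (T, N, H) pre-projected inputs, w: (H, H) recurrent weights.
        if self._reserve is None:            # first iteration: allocate
            self._reserve = np.empty_like(x)
        h = np.zeros_like(x[0])
        for t in range(x.shape[0]):
            h = np.tanh(x[t] + h @ w)
            self._reserve[t] = h             # stash hidden states for backward
        return h

    def backward(self, w, grad_h):
        # Reuse the hidden states saved during forward instead of
        # recomputing them from the inputs.
        grad_x = np.empty_like(self._reserve)
        for t in reversed(range(self._reserve.shape[0])):
            h = self._reserve[t]
            dpre = grad_h * (1.0 - h * h)    # gradient through tanh
            grad_x[t] = dpre
            grad_h = dpre @ w.T              # propagate to previous step
        return grad_x
```

If the same operator instance runs another forward/backward pair, the reserved buffer is simply overwritten and reused, which is the iteration-to-iteration sharing described above.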

Performance

LSTM

To demonstrate the performance of the LSTM layer with our fused RNN operator, we use the sizes and parameters of the Deep Speech 2 (DS2) model, in which:

seq_length = 300, batch_size = 20, input_size = 800, hidden_size = 800, single direction.
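The comparison is between an explicitly unrolled `LSTMCell` stack (non-fused) and a single fused `sym.RNN` node. A minimal sketch of the two symbol graphs, with the timing harness omitted, might look as follows; `FusedRNNCell` is the frontend wrapper that emits one `sym.RNN` node:

```python
import mxnet as mx

T, N, I, H = 300, 20, 800, 800  # DS2-like sizes used above

data = mx.sym.Variable('data')  # layout (T, N, I)

# Non-fused baseline: LSTMCell explicitly unrolled over T time steps,
# producing one set of symbols per step in the graph.
cell = mx.rnn.LSTMCell(num_hidden=H)
non_fused, _ = cell.unroll(length=T, inputs=data, layout='TNC',
                           merge_outputs=True)

# Fused path: a single sym.RNN node, dispatching to CuDNN on GPU and
# to the fused CPU kernels added by the PRs above.
fused_cell = mx.rnn.FusedRNNCell(num_hidden=H, num_layers=1, mode='lstm')
fused, _ = fused_cell.unroll(length=T, inputs=data, layout='TNC',
                             merge_outputs=True)
```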

...


Single-layer, single-direction test (T, N, I, H = 300, 20, 800, 800; layer = 1; bidirection = False), measured in samples/sec:

| samples/sec | LSTMCell (SKX-8180) | sym.RNN (SKX-8180) | sym.RNN/LSTMCell | LSTMCell (P100) | sym.RNN (P100) | sym.RNN(8180)/sym.RNN(P100) | sym.RNN(8180)/LSTMCell(P100) |
| --- | --- | --- | --- | --- | --- | --- | --- |
| LSTM-Inference | 187.09 | 394.73 | 210.98% | 399.76 | 1049.81 | 37.60% | 98.88% |
| LSTM-Training (fwd+bwd) | 73.23 | 153.53 | 209.65% | 118.65 | 339.80 | 45.18% | 129.40% |

For 5-layer LSTM, we get the following performance and speedup on SKX-8180 with 2 sockets:


| samples/sec | LSTMCell (SKX-8180) | sym.RNN (SKX-8180) | sym.RNN/LSTMCell | LSTMCell (P100) | sym.RNN (P100) | sym.RNN(8180)/sym.RNN(P100) | sym.RNN(8180)/LSTMCell(P100) |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Inference | 37.24 | 107.13 | 287.64% | 86.45 | 329.78 | 32.48% | 123.92% |
| Training (fwd+bwd) | 12.93 | 32.29 | 249.73% | 25.45 | 124.13 | 26.01% | 126.85% |

GRU 

As with the LSTM benchmark, the sizes and parameters for the GRU layer are also taken from the DS2 model:

seq_length = 300, batch_size = 20, input_size = 800, hidden_size = 800, single direction 
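The harness is the same as in the LSTM sketch above, with the GRU cells swapped in:

```python
import mxnet as mx

H = 800
cell = mx.rnn.GRUCell(num_hidden=H)                         # non-fused baseline
fused_cell = mx.rnn.FusedRNNCell(num_hidden=H, mode='gru')  # fused sym.RNN path
```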

Single-layer performance on SKX-8180 with 2 sockets:

 

| samples/sec | GRUCell (SKX-8180) | sym.RNN (SKX-8180) | sym.RNN/GRUCell | GRUCell (P100) | sym.RNN (P100) | sym.RNN(8180)/sym.RNN(P100) | sym.RNN(8180)/GRUCell(P100) |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Inference | 128.21 | 392.16 | 306% | 180.18 | 952.38 | 41% | 218% |
| Training (fwd+bwd) | 80.32 | 171.91 | 214% | 126.26 | 338.98 | 51% | 137% |

For 5-layer GRU, performance on SKX-8180 with 2 sockets:

| samples/sec | GRUCell (SKX-8180) | sym.RNN (SKX-8180) | sym.RNN/GRUCell | GRUCell (P100) | sym.RNN (P100) | sym.RNN(8180)/sym.RNN(P100) | sym.RNN(8180)/GRUCell(P100) |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Inference | 26.67 | 88.9 | 333% | 40.57 | 357.14 | 25% | 219% |
| Training (fwd+bwd) | 15.07 | 39.34 | 261% | 27.62 | 140.85 | 28% | 142% |


Upstream

...