...

A full tutorial is provided here, but we'll summarize a simple use case below.

Installation

Installing MXNet with TensorRT integration is straightforward. First, ensure that you are running Ubuntu 16.04, that your video drivers are up to date, and that you have installed CUDA 9.0 or 9.2. You’ll need a Pascal or newer generation NVIDIA GPU. You’ll also need to download and install the TensorRT libraries (instructions here). Once these prerequisites are installed and up to date, you can install a special build of MXNet with TensorRT support enabled via PyPI and pip. Install the appropriate version by running:

...

Code Block
nvidia-docker run -ti mxnet/tensorrt bash

Model Initialization

Code Block
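import mxnet as mx
from mxnet.gluon.model_zoo import vision
import time
import os

# Input shape: (batch, channels, height, width)
batch_shape = (1, 3, 224, 224)
# Download a pretrained ResNet-18 v2 from the Gluon model zoo
resnet18 = vision.resnet18_v2(pretrained=True)
# Hybridize and run one forward pass so the symbolic graph is built before export
resnet18.hybridize()
resnet18.forward(mx.nd.zeros(batch_shape))
# Write resnet18_v2-symbol.json and resnet18_v2-0000.params, then load them back
resnet18.export('resnet18_v2')
sym, arg_params, aux_params = mx.model.load_checkpoint('resnet18_v2', 0)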
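
The export and load_checkpoint calls above communicate through two files on disk: a symbol file and a parameter file named after the prefix and epoch. As a minimal sketch of that naming convention, the helper below derives the expected file names and reports any that are missing. The helper names checkpoint_files and missing_checkpoint_files are hypothetical and not part of MXNet.

```python
import os


def checkpoint_files(prefix, epoch=0):
    """Return the (symbol, params) file names for a checkpoint prefix.

    Hypothetical helper for illustration only; it mirrors the
    '<prefix>-symbol.json' and '<prefix>-<epoch:04d>.params' convention.
    """
    return '%s-symbol.json' % prefix, '%s-%04d.params' % (prefix, epoch)


def missing_checkpoint_files(prefix, epoch=0):
    """List the expected checkpoint files that are not present on disk."""
    return [f for f in checkpoint_files(prefix, epoch) if not os.path.exists(f)]
```

Calling missing_checkpoint_files('resnet18_v2') before load_checkpoint is one way to fail early with a clear message if the export step was skipped.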


Baseline MXNet Network Performance


Code Block
# Create sample input
input = mx.nd.zeros(batch_shape)

# Execute with MXNet (TensorRT disabled)
os.environ['MXNET_USE_TENSORRT'] = '0'
executor = sym.simple_bind(ctx=mx.gpu(0), data=batch_shape, grad_req='null', force_rebind=True)
executor.copy_params_from(arg_params, aux_params)

# Warmup
print('Warming up MXNet')
for i in range(0, 10):
    y_gen = executor.forward(is_train=False, data=input)
    # Block until the asynchronous forward pass has finished
    y_gen[0].wait_to_read()

# Timing
print('Starting MXNet timed run')
# process_time() reports CPU time used by this process, not wall-clock time
start = time.process_time()
for i in range(0, 10000):
    y_gen = executor.forward(is_train=False, data=input)
    y_gen[0].wait_to_read()
print(time.process_time() - start)
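
The warmup and timing loops above follow a pattern that can be factored into a small helper. The following is only a sketch under stated assumptions: time_forward and run_once are hypothetical names, and the helper uses the wall-clock timer time.perf_counter() rather than the time.process_time() call used above, so its numbers are not directly comparable to the ones printed by the loops in this section.

```python
import time


def time_forward(run_once, warmup=10, iterations=10000):
    """Run a callable with warmup, then time it.

    Returns (total_seconds, mean_milliseconds, calls_per_second).
    `run_once` is any zero-argument callable, for example a function that
    calls executor.forward(...) and then wait_to_read() on the output.
    """
    # Warmup calls are executed but not timed
    for _ in range(warmup):
        run_once()

    start = time.perf_counter()  # wall-clock timer (assumption of this sketch)
    for _ in range(iterations):
        run_once()
    total = time.perf_counter() - start

    mean_ms = 1000.0 * total / iterations
    # Guard against a zero reading on very coarse timers
    per_second = iterations / total if total > 0 else float('inf')
    return total, mean_ms, per_second
```

With batch size 1, as in batch_shape above, the calls-per-second value is also the number of images processed per second.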


TensorRT Integrated Network Performance


Code Block
# Execute with TensorRT
print('Building TensorRT engine')
os.environ['MXNET_USE_TENSORRT'] = '1'
# tensorrt_bind takes a single dict holding both argument and auxiliary params
arg_params.update(aux_params)
all_params = {k: v.as_in_context(mx.gpu(0)) for k, v in arg_params.items()}
executor = mx.contrib.tensorrt.tensorrt_bind(sym, ctx=mx.gpu(0), all_params=all_params,
                                             data=batch_shape, grad_req='null', force_rebind=True)
# Warmup
print('Warming up TensorRT')
for i in range(0, 10):
    y_gen = executor.forward(is_train=False, data=input)
    # Block until the asynchronous forward pass has finished
    y_gen[0].wait_to_read()

# Timing
print('Starting TensorRT timed run')
# process_time() reports CPU time used by this process, not wall-clock time
start = time.process_time()
for i in range(0, 10000):
    y_gen = executor.forward(is_train=False, data=input)
    y_gen[0].wait_to_read()
print(time.process_time() - start)
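
Once both timed runs have printed their totals, the two numbers can be compared directly. The sketch below shows the arithmetic only; the function names are hypothetical, and the values in the usage lines are placeholders chosen for illustration, not measurements.

```python
def speedup(baseline_seconds, tensorrt_seconds):
    """Return how many times faster the TensorRT run was than the baseline."""
    if tensorrt_seconds <= 0:
        raise ValueError('tensorrt_seconds must be positive')
    return baseline_seconds / tensorrt_seconds


def percent_time_saved(baseline_seconds, tensorrt_seconds):
    """Return the reduction in run time as a percentage of the baseline."""
    if baseline_seconds <= 0:
        raise ValueError('baseline_seconds must be positive')
    return 100.0 * (baseline_seconds - tensorrt_seconds) / baseline_seconds


# Placeholder timings (NOT measured results), in seconds
baseline, trt = 30.0, 20.0
print('%.2fx speedup' % speedup(baseline, trt))
print('%.1f%% less time' % percent_time_saved(baseline, trt))
```

Both runs should use the same input, batch shape, and iteration count for the ratio to be meaningful.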


Benchmarking


Roadmap

Benchmarks

TensorRT is still an experimental feature, so benchmarks are likely to improve over time. As of Oct 11, 2018, we've measured the following improvements, all of which were run on networks with FP32 weights.

...