...

  1. Since the new accelerator API proposal (link) was only published a few days ago and its implementation is still on an MxNet fork, the current TensorRT integration doesn’t use that API yet, but it could be refactored to use it in a future commit; nothing in the current design would prevent adopting that API in the near future.

  2. Building the TensorRT engine takes a non-trivial amount of time, because TensorRT’s compiler evaluates the system’s hardware and the performance of candidate implementations before creating the fused layers on demand, and then has to actually compile them. For ResNet-50 this may be a few seconds; larger models exist which may take longer. TensorRT can serialize the engine it builds for a particular hardware platform: the result is called a TensorRT plan, i.e. the engine together with the ahead-of-time-compiled fused kernels for a given GPU. The first PR of the TensorRT integration will not provide TensorRT plan caching, so using TensorRT may incur a small start-up cost, but for long-running inference processes this shouldn’t be a problem. Caching the TensorRT plan will be addressed in a future commit (see the plan-caching sketch after this list).

  3. As mentioned before, the reproducibility of the build will be demonstrated with a Dockerfile that provides an easy way to evaluate it. This will be a CI Dockerfile, which could be re-used as an example for building on bare metal or for building non-CI Docker images. The Docker recipe was tested on Linux on x86_64, but not on the other platforms supported by TensorRT (Linux on 64-bit ARM (aarch64), Android on aarch64, QNX on aarch64). Supporting other platforms, e.g. Linux on aarch64 (such as L4T, Linux for Tegra, on the NVIDIA Jetson platform), is left for subsequent commits.

  4. The current commit supports many, but not all, of the operators that TensorRT supports. For example, this integration can run CNNs such as VGG or ResNet (see the inference sketch after this list), but not necessarily everything that TensorRT can handle. More operators will be covered in future commits.

  5. TensorRT supports plugins, which can be integrated into the graph pass. However, this was not a priority since the runtime TensorRT integration can always fall back to existing MxNet operators. Supporting plugins is possible, but will be added in future commits.

  6. The upcoming PR will support fp16 and fp32, but not int8. Since int8 support in MxNet is itself very new, figuring out calibration and other details is left for a future commit.

  7. TensorRT 4 is going to have a new feature called BYOM (bring your own memory). This means that instead of telling TensorRT how much memory it can use, the data/scratch space tensors can be provided by MxNet and re-used by MxNet when the forward pass is not running (see the device-memory sketch after this list). The memory in permanent use will then be limited to TensorRT storing its weights. Support for this feature will be added in a future commit.
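
Regarding item 2, a minimal sketch of what plan caching could look like, using the generic TensorRT Python API rather than anything in the MxNet integration; the plan file path and the build_engine() helper are hypothetical placeholders.

```python
# Sketch only: cache a serialized TensorRT plan on disk so later runs can skip
# the expensive engine build. Plans are specific to the GPU and TensorRT
# version they were built with.
import os
import tensorrt as trt

TRT_LOGGER = trt.Logger(trt.Logger.WARNING)
PLAN_PATH = "resnet50_fp32.plan"  # hypothetical cache location


def get_engine(build_engine):
    """Reuse a cached plan if one exists; otherwise build and serialize it.

    `build_engine` is a hypothetical callable wrapping the usual network
    definition + builder steps and returning a TensorRT engine.
    """
    if os.path.exists(PLAN_PATH):
        # Deserializing skips kernel selection and compilation entirely.
        runtime = trt.Runtime(TRT_LOGGER)
        with open(PLAN_PATH, "rb") as f:
            return runtime.deserialize_cuda_engine(f.read())
    engine = build_engine()
    with open(PLAN_PATH, "wb") as f:
        f.write(engine.serialize())  # persist the plan for this GPU
    return engine
```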
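For item 4, a hedged sketch of the inference path the integration targets: a stock ResNet-50 checkpoint run through the MxNet Module API. The checkpoint prefix is a placeholder, and how the TensorRT graph pass is switched on (environment variable, bind-time flag, etc.) is an implementation detail of the PR that this sketch does not assume.

```python
# Sketch only: standard MxNet symbolic inference. The TensorRT graph pass is
# expected to replace supported subgraphs transparently, while unsupported
# operators fall back to the existing MxNet implementations.
import mxnet as mx
import numpy as np

# Hypothetical prefix/epoch for an exported ResNet-50 checkpoint.
sym, arg_params, aux_params = mx.model.load_checkpoint("resnet-50", 0)

mod = mx.mod.Module(symbol=sym, context=mx.gpu(0), label_names=None)
mod.bind(data_shapes=[("data", (1, 3, 224, 224))], for_training=False)
mod.set_params(arg_params, aux_params)

batch = mx.io.DataBatch([mx.nd.array(np.random.rand(1, 3, 224, 224))])
mod.forward(batch, is_train=False)
prob = mod.get_outputs()[0].asnumpy()
```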
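For item 7, a rough sketch of what caller-provided execution memory looks like on the TensorRT side, using the names this capability takes in later TensorRT Python API releases (create_execution_context_without_device_memory, device_memory); whether TensorRT 4's BYOM feature and the MxNet integration expose exactly this shape is an assumption here.

```python
# Sketch only: instead of letting TensorRT allocate its own scratch space, the
# caller allocates engine.device_memory_size bytes and hands the pointer to
# the execution context. In the integration, MxNet could own this buffer and
# repurpose it between forward passes.
import pycuda.autoinit  # noqa: F401  (creates a CUDA context)
import pycuda.driver as cuda
import tensorrt as trt


def make_context_with_external_memory(engine):
    """Create an execution context whose scratch memory is caller-owned."""
    context = engine.create_execution_context_without_device_memory()
    scratch = cuda.mem_alloc(engine.device_memory_size)  # caller-owned buffer
    context.device_memory = int(scratch)
    # `scratch` must stay alive while the context uses it, but the framework
    # could reuse it whenever no forward pass is running.
    return context, scratch
```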

...