...

  1. Since TensorRT can only determine all possible optimizations once the tensor shapes are known, it is imperative that all shape information be provided before the TensorRT graph is constructed. This means that the best time to construct the TensorRT graph is bind time, when every input shape is known (a sketch of this follows the list). The upcoming PR can selectively apply the TensorRT optimization to inference-only graphs at symbol bind time. This is in fact consistent with the assumptions about TensorRT made on the MxNet Wiki here.

  2. Since, as mentioned in #1, building the TensorRT graph needs shape information that is only available at bind time, an important goal was not to disrupt any existing APIs. Even though C++ permits default function arguments, the Python bindings for symbol-related methods (e.g. simple bind) are exposed via a C, not C++, API, wired on the Python side using ctypes (e.g. see here for the simple bind integration). This precludes adding extra arguments without causing breaking changes in the C API. Adapting only the Python code to such changes would not be enough either, since all frontend languages use the C (not C++) API for the FFI. Fortunately, C API changes could be avoided by simply letting the user enable or disable the TensorRT pass using an environment variable ('USE_TENSORRT=1' to enable). This also does not diminish the flexibility of the integration, since the graph pass can read the environment variable each time a symbol is bound, which permits turning the pass on and off as needed. The ability to enable and disable the TensorRT pass at runtime also makes unit testing easier (a sketch follows the list).

  3. TensorRT requires that the workspace size be provided at graph construction time. This value is an upper limit on the amount of memory that TensorRT may use, not an amount that is allocated immediately. Since this amount can be hard for the user to know, the limit should be set to a reasonable value that the user need not concern themselves with. Given that the TensorRT integration is applied at bind time, and that the TensorRT engines wrapped in TensorRT nodes are constructed during the graph pass rather than the memory allocation pass, MxNet will only allocate the memory needed for the nodes remaining after the TensorRT subgraphs have been extracted. This means that no memory is doubly allocated - first for the complete MxNet subgraph and then for TensorRT. However, the question remains whether the memory used per TensorRT engine should be a configurable parameter, either as a method argument or an environment variable, or whether TensorRT should be allowed to use the maximum available GPU memory and reserve only what it needs. I would like to suggest the latter. Since a TensorRT subgraph will typically use less memory than the same subgraph in MxNet (due to more layer fusion), it is extremely unlikely that a model which runs purely as an MxNet graph would fail with an out-of-memory error when parts or most of the graph run inside TensorRT. Fewer knobs (in this case, not giving the user the ability to tweak the maximum amount of memory available to TensorRT) would simplify use.

  4. TensorRT can accept graphs constructed using two main approaches: (a) via the TensorRT graph API, (b) via ONNX. Approach (a) seems simple on the surface - one traverses the NNVM graph, finds subgraphs that TensorRT can execute, converts those subgraphs to TensorRT graphs, and substitutes them with TensorRT nodes, each of which contains the TensorRT engine corresponding to the subgraph. However, the approach taken by NVIDIA was to use ONNX as the IR (a sketch of this converter's use follows the list). The reason for this is twofold. First, ONNX is a very well-known IR, supported by the entire deep learning software community. This ensures that the design of the IR gets as much feedback as possible as to whether it is feature-complete and what its semantics are. NVIDIA already maintains an ONNX-to-TensorRT converter (link), and will continue to do so. Whatever changes apply to the TensorRT APIs or internal features can be hidden behind the well-established ONNX IR. Second, ONNX is growing beyond being merely an IR. As it becomes more of a standard, its adoption will bring other benefits, such as the ability to verify standard compliance.

  5. Despite the advantages of the ONNX route described in #4, there are some costs. The main one is the dependency on Protobuf. This is a valid criticism on the surface; however, since the TensorRT integration requires a build-time opt-in, adding one more dependency is not a problem as long as it is not mandatory. Moreover, the same Protobuf dependency already exists for the MxNet ONNX importer, which is now part of the MxNet source tree (link) rather than living in a separate repository (a sketch of the importer's use follows the list). Just as the use of the ONNX importer is optional and requires ONNX (and hence also Protobuf), the TensorRT build is optional.

  6. The optional integration of TensorRT will be guarded by a config.mk flag (USE_TENSORRT), which will function like existing flags such as USE_CUDA and USE_CUDNN. Needless to say, USE_TENSORRT will depend on CUDA and cuDNN.

  7. To simplify evaluating the TensorRT build, trying it out, and running the unit tests, the PR will come with a Dockerfile, which will allow anyone to build MxNet with TensorRT along with its dependencies, i.e. Protobuf and ONNX.
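
The sketches below illustrate several of the points above; they are illustrative only and not part of the proposed API surface. First, for #1: a minimal sketch of how all input shapes become known to the backend at simple_bind time in the existing Python API. The network, shapes, and GPU context are placeholders.

```python
import mxnet as mx

# A small inference-only symbol; the layer names and shapes are illustrative.
data = mx.sym.Variable('data')
net = mx.sym.FullyConnected(data, num_hidden=10, name='fc1')
net = mx.sym.softmax(net, name='out')

# simple_bind is the point at which every input shape is known to the backend,
# which is why bind time is the natural place to run the TensorRT graph pass.
executor = net.simple_bind(ctx=mx.gpu(0), data=(32, 256), grad_req='null')

# Inference then proceeds through the usual executor interface.
output = executor.forward(is_train=False,
                          data=mx.nd.ones((32, 256), ctx=mx.gpu(0)))[0]
```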
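
For #2, toggling the pass from Python might look like the following. It assumes the 'USE_TENSORRT' environment variable proposed above, read by the backend each time a symbol is bound; the helper function is hypothetical.

```python
import os
import mxnet as mx

def bind_for_inference(sym, use_tensorrt, batch_shape=(32, 3, 224, 224)):
    """Hypothetical helper: bind `sym` for inference with the TensorRT graph
    pass toggled via the environment variable described in #2."""
    os.environ['USE_TENSORRT'] = '1' if use_tensorrt else '0'
    return sym.simple_bind(ctx=mx.gpu(0), data=batch_shape, grad_req='null')

# A unit test could bind the same symbol twice and compare the outputs:
# executor_trt = bind_for_inference(sym, use_tensorrt=True)
# executor_ref = bind_for_inference(sym, use_tensorrt=False)
```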
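
For #4, NVIDIA's standalone ONNX-to-TensorRT converter can already be exercised from Python, which gives an idea of the path the graph pass would take internally. This sketch follows the onnx-tensorrt project's backend usage; the model path and input shape are placeholders.

```python
import numpy as np
import onnx
import onnx_tensorrt.backend as backend

# Load a serialized ONNX graph (the IR that an extracted subgraph would be
# lowered to) and hand it to NVIDIA's converter.
model = onnx.load('model.onnx')  # placeholder path
engine = backend.prepare(model, device='CUDA:0')

# Run the resulting TensorRT engine on a dummy input.
dummy_input = np.random.random_sample((32, 3, 224, 224)).astype(np.float32)
output = engine.run(dummy_input)[0]
```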
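
For #5, this is roughly how the in-tree MxNet ONNX importer is used today; it already pulls in ONNX (and hence Protobuf), so the optional TensorRT build introduces no new kind of dependency. The model path is a placeholder.

```python
from mxnet.contrib import onnx as onnx_mxnet

# The importer requires the onnx package (and therefore Protobuf) at runtime,
# just as the proposed TensorRT build would require them at build time.
sym, arg_params, aux_params = onnx_mxnet.import_model('model.onnx')  # placeholder path
```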

...