...

When a subgraph is found, the partitioning algorithm invokes SubgraphProperty::CreateSubgraphNode to create a new node for the subgraph and connects the new node back to the original graph to replace the subgraph. The subgraph passed to CreateSubgraphNode contains the shape/dtype/storage information of the input nodes. This is important because accelerators, such as TVM and TensorRT, need this information to compile the subgraph and select the best kernels. CreateSubgraphNode allows arbitrary customization of the subgraph and the node, which gives users many opportunities to optimize the subgraph. For example, TensorRT can optimize a computation graph based on its input shapes and data types in CreateSubgraphNode; some accelerators, such as TVM and MKLDNN, can perform another level of graph partitioning here for operator fusion.
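For illustration, a TensorRT-style property might look like the sketch below. The CreateSubgraphNode signature and the op name _TensorRT_subgraph_op are assumptions made here for concreteness, not the final API:

#include <nnvm/graph.h>
#include <nnvm/node.h>
#include <nnvm/op.h>
#include <nnvm/symbolic.h>
#include <string>

// Hypothetical accelerator property; the base-class signature is assumed.
class TensorRTProperty : public SubgraphProperty {
 public:
  nnvm::NodePtr CreateSubgraphNode(const nnvm::Symbol &sym,
                                   const int subgraph_id) const override {
    // sym carries the shape/dtype/storage info of the subgraph inputs,
    // so the engine can be built and the best kernels selected here.
    nnvm::NodePtr n = nnvm::Node::Create();
    n->attrs.op = nnvm::Op::Get("_TensorRT_subgraph_op");  // assumed op name
    n->attrs.name = "TensorRT_subgraph" + std::to_string(subgraph_id);
    n->attrs.parsed = sym;  // hand the subgraph to the operator as its state
    return n;
  }
};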

To perform graph partitioning, we attach a graph property (a class that implements SubgraphProperty) to the graph and invoke the PartitionGraph pass:

g.attrs["subgraph_property"] = std::make_shared<nnvm::any>(std::move(property));
g = ApplyPass(std::move(g), "PartitionGraph");

This should be done in bind for the Symbol executor and in CachedOp for Gluon hybridize. We only need to partition the forward graph; the backward graph will be generated from it accordingly.

Step 2: subgraph operator (function call)
Although there are two levels of graph partitioning, the executor only needs to handle one level of subgraphs, because the subgraphs in the second level are fused into operators. We can execute these subgraphs inside special operators, each of which is specific to its accelerator:

  • TVM execution operator: loads a subgraph from a TVM-compiled binary, a graph JSON file, and weight arrays, and executes the subgraph composed of fused operators. We can use the TVM executor to run the subgraph at first, but in the future we should switch to the MXNet executor, because MXNet executes operators in multiple threads, which is useful for task parallelism. The operator needs to convert all output NDArrays of the subgraph to the default format.
  • MKLDNN execution operator: gets a subgraph from the first step and runs its operators in the MXNet executors. Like the TVM operator, this operator also needs to convert all output NDArrays of the subgraph to the default format.
  • TensorRT execution operator: TensorRT has its own engine for executing the optimized subgraph.
  • nGraph execution operator: this is up to the Intel folks; it will most likely be similar to the MKLDNN operator.

To customize the subgraph execution, an accelerator needs to provide its own operator implementation and attach that operator to the subgraph node when SubgraphProperty::CreateSubgraphNode is called. The subgraph operator should be a stateful operator that contains a computation graph. We provide a default subgraph operator implementation (“_subgraph_op”) that executes the operators with the MXNet Executor.
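For concreteness, the state behind such an operator might look like the sketch below. SubgraphExecutor, BindSubgraph, and Run are hypothetical names standing in for MXNet's executor machinery:

#include <memory>
#include <vector>
#include <mxnet/ndarray.h>
#include <nnvm/symbolic.h>

// Sketch of the state kept by a default subgraph operator ("_subgraph_op").
class SubgraphOpState {
 public:
  explicit SubgraphOpState(const nnvm::Symbol &sym) : subgraph_sym_(sym) {}

  void Forward(const std::vector<mxnet::NDArray> &inputs,
               std::vector<mxnet::NDArray> *outputs) {
    // Bind the subgraph on the first call and reuse the executor afterwards.
    if (exec_ == nullptr)
      exec_ = BindSubgraph(subgraph_sym_, inputs);  // hypothetical helper
    // Run the subgraph's operators with the MXNet executor; an accelerator
    // operator would convert the outputs to the default format here.
    exec_->Run(inputs, outputs);
  }

 private:
  nnvm::Symbol subgraph_sym_;               // the partitioned subgraph
  std::unique_ptr<SubgraphExecutor> exec_;  // hypothetical executor type
};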

For fast inference in TVM and MKLDNN, the subgraph operators need to maintain a copy of the weight arrays (similar to the closure of a function). This way, we can convert the data format of the weight arrays once and cache the converted copies inside the subgraph operator to avoid redundant format conversions. The original weight arrays will still be part of the inputs of the subgraph operator. Even though the weight arrays are normally not modified, we still need to handle that case correctly. One solution is to maintain a version number on the var of an NDArray, which is incremented whenever the NDArray is modified in the execution engine. We can then use the version number to determine whether the weight arrays have been modified since the subgraph operator was last invoked.
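A sketch of this caching logic, assuming the proposed NDArray::version() counter and a hypothetical ConvertFormat helper for the accelerator-specific layout conversion:

#include <cstddef>
#include <vector>
#include <mxnet/ndarray.h>

// Caches format-converted weights inside the subgraph operator and
// reconverts only when the engine has modified the original array.
class WeightCache {
 public:
  const mxnet::NDArray &Get(const mxnet::NDArray &weight, size_t idx) {
    if (cached_.size() <= idx) {
      cached_.resize(idx + 1);
      versions_.resize(idx + 1, 0);
    }
    // weight.version() is the proposed per-var counter described above.
    if (cached_[idx].is_none() || versions_[idx] != weight.version()) {
      cached_[idx] = ConvertFormat(weight);  // hypothetical layout conversion
      versions_[idx] = weight.version();
    }
    return cached_[idx];
  }

 private:
  std::vector<mxnet::NDArray> cached_;  // converted weight copies
  std::vector<size_t> versions_;        // version seen at last conversion
};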

The benefit of invoking a subgraph inside an operator
Introducing a subgraph operator for TVM and MKLDNN may sound like unnecessary complexity; in fact, it significantly reduces the complexity of the integration. By using a subgraph operator, we can completely isolate TVM and MKLDNN operators from MXNet operators as well as from MXNet's default memory planning. Inside the subgraph operators, we don't need to deal with data format conversion and can use a completely different memory plan for the subgraph.

...