
MXNet can integrate with many different kinds of accelerators, including TVM, MKLDNN, TensorRT, Intel nGraph and more. These accelerators generally support a limited number of operators, so running computation in a model usually involves interaction between accelerator operators and MXNet operators.

These accelerators share some common requirements:

  • TVM, MKLDNN and nGraph use customized data formats. Interaction between these accelerators and MXNet requires data format conversion.
  • TVM, MKLDNN, TensorRT and nGraph fuse operators.

Integration with these accelerators should therefore happen at the granularity of subgraphs instead of individual operators. To fuse operators, we obviously need to divide a graph into subgraphs so that the operators in a subgraph can be fused into a single operator. To handle customized data formats, we should partition a computation graph into subgraphs as well, where each subgraph contains only TVM, MKLDNN or nGraph operators. In this way, MXNet converts data formats only when entering such a subgraph, and the operators inside a subgraph handle format conversion themselves if necessary. This makes the interaction of TVM and MKLDNN with MXNet much easier: neither the MXNet executor nor the MXNet operators need to deal with customized data formats.

As such, integration with these accelerators may result in two levels of graph partitioning. In the first level, a subgraph only contains the operators supported by the accelerator. In the second level, a subgraph only contains the operators that can be fused.

  • TVM requires two levels of partitioning because TVM searches for global scheduling among fused operators in order to achieve the best performance.
  • MKLDNN requires two levels of partitioning. We want to isolate MKLDNN operators from MXNet operators so that we know where to insert MKLDNN format conversion operators. MKLDNN also wants to fuse operators to achieve the optimal performance.
  • TensorRT also only supports a small set of operators and performs a graph transformation internally to fuse operators both vertically and horizontally for better performance.
  • nGraph probably has the same requirement as MKLDNN.

The partitioning and execution of these accelerators can be different. As such, we define the following interface for accelerators to customize graph partitioning and operator execution.

class SubgraphProperty {
 public:
  // the criteria of selecting the subgraph nodes.
  virtual SubgraphSelectorPtr CreateSubgraphSelector() const = 0;
  // create an nnvm node for a given subgraph. Here users can customize how to
  // execute the operators in the subgraph.
  virtual nnvm::NodePtr CreateSubgraphNode(const nnvm::Symbol &s) const = 0;
  // Create a subgraph operator for execution.
  virtual OpStatePtr CreateSubgraphOperator(const nnvm::Symbol &sym) const = 0;
  // The type of the subgraph.
  virtual std::string GetType() const = 0;
};
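
To make the interface more concrete, below is a minimal, hypothetical sketch of how an accelerator might implement SubgraphProperty. The class name, the supported operator set, and the "_my_subgraph_op" operator are illustrative only; ContainOpSelector is the selector described in Step 1 below, MyAcceleratorSubgraphOperator is the stateful operator sketched in Step 2, and SubgraphSelectorPtr is assumed to be a shared pointer to SubgraphSelector.

class MyAcceleratorProperty : public SubgraphProperty {
 public:
  // select only the operators that the accelerator supports (hypothetical set).
  SubgraphSelectorPtr CreateSubgraphSelector() const override {
    return std::make_shared<ContainOpSelector>(
        std::unordered_set<std::string>{"Convolution", "BatchNorm", "Activation"});
  }
  // wrap the whole subgraph in a single node. "_my_subgraph_op" is a
  // hypothetical operator registered to execute such subgraphs; second-level
  // fusion of `sym` could also be applied here.
  nnvm::NodePtr CreateSubgraphNode(const nnvm::Symbol &sym) const override {
    nnvm::NodePtr n = nnvm::Node::Create();
    n->attrs.op = nnvm::Op::Get("_my_subgraph_op");
    n->attrs.name = "my_accelerator_subgraph";
    n->attrs.parsed = sym;  // one possible way to keep the subgraph with the node
    return n;
  }
  // create the stateful operator that executes the subgraph (see Step 2).
  OpStatePtr CreateSubgraphOperator(const nnvm::Symbol &sym) const override {
    return OpStatePtr::Create<MyAcceleratorSubgraphOperator>(sym);
  }
  std::string GetType() const override { return "my_accelerator"; }
};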

Step 1: graph partition
Graph partitioning traverses a computation graph and groups operators into subgraphs based on certain rules. There already exists a TVM fuse pass in NNVM, which groups operators into subgraphs based on certain general rules (e.g., convolution followed by element-wise operations). This graph partitioner is TVM-specific and doesn't work for other accelerators, so we need more graph partitioners. For example, TensorRT and MKLDNN require a partitioner that finds subgraphs with specific patterns (e.g., convolution followed by batchnorm, followed by activation, etc).

Despite these diverse partitioning requirements, we assume all graph partitioning satisfies the following constraints:

  • all nodes in a subgraph should be connected to each other via incoming edges, outgoing edges, or both.
  • a node can't belong to two or more subgraphs.

Given these assumptions, we traverse from every node in a graph and explore its neighbor nodes with rules provided by users. The interface below defines the selection rules, and users can implement it to customize node selection; in particular, it lets users determine which edges to follow when growing a subgraph. Each time the selector is called, it sees a new node that connects to one of the nodes already in the subgraph and hasn't been selected before, and determines whether to add that node to the subgraph. A new selector is created whenever traversal starts from a new node, and the selector can change its own state as it sees new nodes.

class SubgraphSelector {
 public:
  // whether to add the given node to the subgraph.
  virtual bool Select(const nnvm::Node &n) = 0;
  // whether to grow the subgraph through incoming edges.
  virtual bool UseIncomingEdges() const = 0;
  // whether to grow the subgraph through outgoing edges.
  virtual bool UseOutgoingEdges() const = 0;
};

All of the accelerators need a selector that extracts a subgraph containing only the operators supported by the accelerator. As such, we provide a selector called ContainOpSelector for this purpose.
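
A minimal sketch of how such a selector could look is given below. The constructor argument and member names are assumptions for illustration, not necessarily the actual implementation; it simply selects variable nodes and nodes whose operator name appears in the supported set, and grows the subgraph along both incoming and outgoing edges.

class ContainOpSelector : public SubgraphSelector {
  std::unordered_set<std::string> op_names_;

 public:
  explicit ContainOpSelector(std::unordered_set<std::string> op_names)
      : op_names_(std::move(op_names)) {}

  // select a node if it is a variable or its operator is supported by the accelerator.
  bool Select(const nnvm::Node &n) override {
    return n.is_variable() || op_names_.count(n.op()->name) > 0;
  }
  // grow the subgraph through both incoming and outgoing edges.
  bool UseIncomingEdges() const override { return true; }
  bool UseOutgoingEdges() const override { return true; }
};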

To perform graph partitioning, we attach a graph property (a class that implements SubgraphProperty) to the graph and invoke the PartitionGraph pass:

g.attrs["subgraph_property"] = std::make_shared<nnvm::any>(std::move(property));
g = ApplyPass(std::move(g), "PartitionGraph");

Some of the accelerators, such as TVM, MKLDNN and TensorRT, need another level of graph partitioning for fused operators. TVM and TensorRT provide their own mechanisms for extracting a sequence of operators for fusion. The second-level partitioning can happen in different places for different accelerators. Typically, it happens when a node is created for a subgraph (i.e., in SubgraphProperty::CreateSubgraphNode). However, TensorRT optimizes a computation graph based on its input shapes and data types, so its optimization should happen during shape and data type inference in the subgraph operator, or even the first time the subgraph is executed.

Step 2: subgraph operator (function call)
Although there are two levels of graph partitioning, we only need to handle one level of subgraphs in the executor because the subgraphs in the second level are fused into operators. We can execute these subgraphs inside special operators, which are specific to each accelerator.

  • TVM execution operator: loads a subgraph from a TVM compiled binary, a graph JSON file and weight arrays, and executes the subgraph composed of fused operators. We can first use the TVM executor to execute the subgraph, but in the future we should use the MXNet executor because MXNet executes operators in multiple threads, which is useful for task parallelism. The operator needs to convert all output NDArrays of the subgraph to the default format.
  • MKLDNN execution operator: gets a subgraph from the first step and runs its operators in the MXNet executor. Like the TVM operator, this operator also needs to convert all output NDArrays of the subgraph to the default format.
  • TensorRT execution operator: TensorRT has its own engine for executing the optimized subgraph.
  • nGraph execution operator: this is up to the Intel folks; it will most likely be similar to the MKLDNN operator.

To customize subgraph execution, an accelerator needs to provide its own subgraph operator in SubgraphProperty::CreateSubgraphOperator. The subgraph operator is stateful and contains the computation graph. We provide a default subgraph operator implementation that executes operators with the MXNet Executor.

class SubgraphOperator {
  nnvm::Symbol subgraph_sym_;

 public:
  explicit SubgraphOperator(const nnvm::Symbol &sym) : subgraph_sym_(sym) {}

  virtual ~SubgraphOperator() {}

  const nnvm::Symbol &GetSubgraph() const {
    return subgraph_sym_;
  }

  virtual void Forward(const OpContext &ctx,
                       const std::vector<NDArray> &inputs,
                       const std::vector<OpReqType> &req,
                       const std::vector<NDArray> &outputs) = 0;
  virtual void Backward(const OpContext &ctx,
                        const std::vector<NDArray> &inputs,
                        const std::vector<OpReqType> &req,
                        const std::vector<NDArray> &outputs) = 0;
};
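
As an illustration, a hypothetical accelerator-specific subclass might look like the sketch below. MyAcceleratorSubgraphOperator, ConvertToAcceleratorFormat, ExecuteFusedSubgraph and ConvertOutputsToDefaultFormat are placeholders for accelerator-specific logic, not existing MXNet functions; an inference-only accelerator can simply fail in Backward.

// placeholder declarations for accelerator-specific logic (not real MXNet APIs).
std::vector<NDArray> ConvertToAcceleratorFormat(const std::vector<NDArray> &arrays);
void ExecuteFusedSubgraph(const nnvm::Symbol &sym, const OpContext &ctx,
                          const std::vector<NDArray> &inputs,
                          const std::vector<OpReqType> &req,
                          const std::vector<NDArray> &outputs);
void ConvertOutputsToDefaultFormat(const std::vector<NDArray> &outputs);

class MyAcceleratorSubgraphOperator : public SubgraphOperator {
 public:
  explicit MyAcceleratorSubgraphOperator(const nnvm::Symbol &sym)
      : SubgraphOperator(sym) {}

  void Forward(const OpContext &ctx,
               const std::vector<NDArray> &inputs,
               const std::vector<OpReqType> &req,
               const std::vector<NDArray> &outputs) override {
    // convert inputs to the accelerator's format, run the (possibly fused)
    // subgraph, and convert the outputs back to the default format.
    std::vector<NDArray> converted = ConvertToAcceleratorFormat(inputs);
    ExecuteFusedSubgraph(GetSubgraph(), ctx, converted, req, outputs);
    ConvertOutputsToDefaultFormat(outputs);
  }

  void Backward(const OpContext &ctx,
                const std::vector<NDArray> &inputs,
                const std::vector<OpReqType> &req,
                const std::vector<NDArray> &outputs) override {
    // this sketch targets inference; training support would require an
    // accelerator-specific backward pass.
    LOG(FATAL) << "Backward is not implemented for this subgraph operator";
  }
};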

For fast inference in TVM and MKLDNN, the subgraph operators need to maintain a copy of the weight arrays (similar to the closure of a function). In this way, we can convert the data format of the weight arrays and cache the converted arrays inside the subgraph operator to avoid redundant format conversion. The original weight arrays will still be part of the inputs of the subgraph operator. Even though the weight arrays are normally not modified, we still need to handle that case correctly. One solution is to maintain a version number for the var of an NDArray, which is increased by one whenever the NDArray is modified in the execution engine. We can then use the version number to determine whether the weight arrays have been modified whenever the subgraph operator is invoked.
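
As a sketch of this idea, the check below assumes a hypothetical NDArray::version() accessor returning the proposed per-var version counter (it does not exist in MXNet today); CachedWeight and ConvertWeightToAcceleratorFormat are likewise illustrative placeholders that the subgraph operator would own.

// hypothetical per-weight cache entry kept inside the subgraph operator.
struct CachedWeight {
  NDArray converted;   // weight converted to the accelerator's format
  size_t version = 0;  // NDArray version seen when the conversion was done
};

// placeholder declaration for accelerator-specific conversion of one array.
NDArray ConvertWeightToAcceleratorFormat(const NDArray &weight);

// reconvert the cached copy only if the original weight was modified in the
// execution engine since the last call (detected via the proposed version number).
void UpdateCachedWeight(const NDArray &weight, CachedWeight *cache) {
  if (cache->converted.is_none() || weight.version() != cache->version) {
    cache->converted = ConvertWeightToAcceleratorFormat(weight);
    cache->version = weight.version();
  }
}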

The benefit of invoking a subgraph inside an operator
Introducing a subgraph operator for TVM and MKLDNN may sound like unnecessary complexity. It actually significantly reduces the complexity of the integration. By using the subgraph operator, we can completely isolate TVM operators and MKLDNN operators from MXNet operators as well as the default MXNet memory planning. Inside the subgraph operators, we don't need to deal with data format conversion and can use a completely different memory plan for the subgraph.

