
Link to dev List discussion

https://lists.apache.org/thread.html/464712f0136fb51916ca9f1b702b99847e108dbdbd0b6a2b73fc91f1@%3Cdev.mxnet.apache.org%3E

Feature Shepherd

Need volunteer to help shepherd

Problem

Adding backend support for new accelerators currently requires modifying the MXNet source code. This is difficult and imposes a steep learning curve on accelerator vendors, who must come up to speed on MXNet internals before they can integrate their hardware.

Proposed Approach

In this project, we will create a set of abstractions, exposed through an API, that allows accelerator vendors to write external libraries interfacing their custom hardware to MXNet without modifying the MXNet code base. We'll streamline how MXNet interacts with processors and add a user-facing API to dynamically load accelerator libraries at runtime. This allows accelerator vendors to distribute their libraries separately from MXNet, decoupling MXNet releases from accelerator library releases.

Here is one example diagram showing the interaction of the accelerator library with MXNet:

As shown above, there are four interfaces:

  • Processor Info - a discovery API for the library to provide context info to MXNet
  • Supported Ops - a mechanism for MXNet to check if an operator and a particular set of inputs/outputs/attributes can be executed on the processor
  • Executor - a mechanism for giving a set of work to the processor's execution engine
  • Notify - a mechanism for the processor's execution engine to let MXNet know about completion of work

Each of these interfaces will have a set of APIs for the low-level operations required (i.e., transferring graphs, transferring input data, returning output data, etc.).
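As a rough illustration, these four interfaces might surface in the vendor-facing header as a handful of C entry points. Everything in the sketch below (function names, types, signatures) is an assumption for illustration only; the actual API is an open question addressed later in this proposal.

#include <cstddef>

// Hypothetical sketch of the four interfaces as C entry points exported
// by an accelerator library. All names, types, and signatures here are
// illustrative assumptions, not the final API.
extern "C" {
  // Processor Info: discovery API providing context info to MXNet.
  const char* getProcessorName();
  int getNumDevices();

  // Supported Ops: can this op, with these inputs/outputs/attributes,
  // execute on the processor?
  bool isOpSupported(const char* op_name,
                     const char* const* attr_keys,
                     const char* const* attr_vals,
                     std::size_t num_attrs);

  // Executor: hand a unit of work (an opaque op/subgraph handle) to the
  // processor's execution engine.
  int execute(void* work);

  // Notify: MXNet registers a callback that the execution engine invokes
  // when a unit of work completes.
  typedef void (*CompletionCallback)(void* work, int status);
  void registerNotify(CompletionCallback cb);
}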

User Experience

ML users will interact with this new feature by calling an API to load an accelerator library:

Python API
import mxnet as mx

# load accelerator library
acc = mx.context.load_acc("/path/to/libmyacc.so")


Accelerator vendors will interact with this new feature by creating a library that implements functions defined in a header file "mxnet_acc.h":

mxnet_acc.h
#include <string>

// Function type for an operator's compute implementation.
typedef int (FCompute)(int, void*);

// Discovery: returns this accelerator's name (used to name the MXNet context).
extern "C" std::string getAccName();

// Returns the compute function for the named operator, or nullptr if unsupported.
extern "C" FCompute* getFCompute(std::string);
myacc.cpp - example accelerator library implementation
#include "mxnet_acc.h"

std::string getAccName() {
  return std::string("myacc");
}

extern "C" FCompute* getFCompute(std::string) {
  // This stub supports no operators yet, so it always returns nullptr.
  return nullptr;
}
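Building on this stub, a vendor library would return a real compute function for the operators it supports. The sketch below is hypothetical: the "dot" operator name and the interpretation of FCompute's (int, void*) arguments are assumptions, since that contract is not yet specified.

#include "mxnet_acc.h"

// Hypothetical compute function for a "dot" operator. What the (int, void*)
// arguments carry is an assumption of this sketch.
static int myaccDot(int num_args, void* args) {
  // ... hand the work off to the accelerator's execution engine ...
  return 0;  // status code: 0 = success (assumed convention)
}

extern "C" FCompute* getFCompute(std::string op_name) {
  if (op_name == "dot")
    return myaccDot;   // this operator runs on the accelerator
  return nullptr;      // everything else falls back to MXNet's own backends
}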

Then, accelerator vendors compile their library into a shared object:

g++ -shared -fPIC myacc.cpp -o libmyacc.so -I ../../include/mxnet
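For context, here is a minimal sketch of how a host process can load such a shared object and resolve its symbols at runtime with dlopen/dlsym; the actual prototype implementation is linked in the section below, and error handling is abbreviated here.

#include <dlfcn.h>
#include <iostream>
#include <string>

// Matches the extern "C" declaration in mxnet_acc.h.
typedef std::string (GetAccName)();

int main() {
  void* handle = dlopen("/path/to/libmyacc.so", RTLD_LAZY);
  if (!handle) {
    std::cerr << "cannot load library: " << dlerror() << std::endl;
    return 1;
  }
  // Resolve the extern "C" symbol by name.
  GetAccName* getAccName =
      reinterpret_cast<GetAccName*>(dlsym(handle, "getAccName"));
  if (!getAccName) {
    std::cerr << "missing symbol: " << dlerror() << std::endl;
    dlclose(handle);
    return 1;
  }
  std::cout << "loaded accelerator: " << getAccName() << std::endl;
  dlclose(handle);
  return 0;
}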


Goals/Use Cases

As an ML user, I'd like a simple interface that allows me to use a custom accelerator for training and inference of deep learning models.

As an accelerator vendor, I'd like to create an MXNet interface for my accelerator without having to be an expert in how the MXNet backend works.

Open Questions

What should the set of APIs be for accelerators to hook into the MXNet backend?

Proposed Approach (Details)

A prototype of the front-end library loading that returns an MXNet context has been implemented here: https://github.com/samskalicky/incubator-mxnet/tree/accel_api

Accelerator libraries will implement the functions defined in the header file "mxnet_acc.h": https://github.com/samskalicky/incubator-mxnet/blob/95a7ab06b6ab30a014a497db0d98cf62fa35df84/include/mxnet/mxnet_acc.h

Here is an example library implementation: https://github.com/samskalicky/incubator-mxnet/blob/95a7ab06b6ab30a014a497db0d98cf62fa35df84/example/accel_api/myacc.cpp

For data allocation, MXNet already has an abstraction for managing storage, StorageManager. For this feature, we inherit from this class and forward the storage calls to functions from the accelerator library.

Memory Management (https://github.com/samskalicky/incubator-mxnet/blob/95a7ab06b6ab30a014a497db0d98cf62fa35df84/src/storage/acc_storage_manager.h#L68-L84):

  • Allocate on acc: extern "C" void* alloc(std::size_t size);
  • Free on acc: extern "C" void free(void*);
  • Direct-free on acc: extern "C" void directFree(void*);
  • Release all (free all): extern "C" void releaseAll();


For data movement, MXNet already has a templated Copy<to,from> mechanism. For this feature, we leverage it to call the corresponding functions from the accelerator library.

Data Movement (https://github.com/samskalicky/incubator-mxnet/blob/accel_api/src/ndarray/ndarray_function.cc#L54-L88):

  • Copy from host to acc: extern "C" int copyTo(void* dst, void* src, size_t size);
  • Copy from acc to host: extern "C" int copyFrom(void* dst, void* src, size_t size);
  • Copy within acc: extern "C" int copyBetween(void* dst, void* src, size_t size);


Addition of New APIs

Python - context.py
def load_acc(path_to_lib)

Inputs: path to accelerator library

Returns: a context for the first instance of the accelerator (device id 0)

C API - c_api.cc

int MXLoadAccLib(const char *path, int *id, char *name)

Inputs: path to the accelerator library; the integer pointed to by id (the dev_type) and the char array pointed to by name (the context name) are filled in by this function

Returns: success status of loading the library and initializing the context
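For illustration, a caller might use this C API as below; the buffer size and error handling are assumptions of this sketch.

#include <cstdio>

// Declaration of the proposed MXNet C API function (signature as above).
extern "C" int MXLoadAccLib(const char* path, int* id, char* name);

int main() {
  int dev_type = -1;
  char name[64];  // buffer size is an assumption of this sketch
  if (MXLoadAccLib("/path/to/libmyacc.so", &dev_type, name) != 0) {
    std::fprintf(stderr, "failed to load accelerator library\n");
    return 1;
  }
  std::printf("registered context '%s' with dev_type %d\n", name, dev_type);
  return 0;
}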


Backward compatibility

No issues; this is new functionality. Existing custom hardware backends (MKL/MKL-DNN/cuDNN/TensorRT) will continue working.

Performance Considerations

We will analyze the overhead introduced by using a dynamically loaded library by creating a test accelerator library that simply reuses the existing CPU and GPU operator implementations. Then we'll compare these "accelerators" against the current CPU and GPU contexts.

Test Plan

We will create a test accelerator library that simply reuses the existing CPU and GPU operator implementations and run all existing unit tests.

Alternative Approaches

Currently, custom accelerators like TensorRT must be implemented by modifying the MXNet backend and learning how MXNet works at the lowest level. The team that implemented TensorRT support in MXNet ran into many hurdles, and the lessons from that effort are being applied in this proposal.

Technical Challenges 

We'll need to version the MXNet operators against accelerator libraries so that, as operator implementations change, we catch mismatches with older accelerator libraries; one possible mechanism is sketched below.
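One plausible mechanism (an assumption, not part of the current prototype): the library exports the operator API version it was built against, and MXNet rejects mismatched libraries at load time.

#include <iostream>

// Version of the operator API this copy of MXNet implements.
// The constant and symbol names here are hypothetical.
const int kMXNetOpApiVersion = 3;

// Exported by the accelerator library: the API version it was built against.
extern "C" int getOpApiVersion();

bool checkLibVersion() {
  int lib_version = getOpApiVersion();
  if (lib_version != kMXNetOpApiVersion) {
    std::cerr << "accelerator library built for operator API v" << lib_version
              << ", but this MXNet implements v" << kMXNetOpApiVersion
              << std::endl;
    return false;  // refuse to load the mismatched library
  }
  return true;
}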

Milestones

TBD
