...

MXNet is a high-performance machine learning framework that leverages many high-performance tools and libraries in the backend, such as MKLDNN, cuDNN, TensorRT, and nGraph, among others. Some recent backend additions to MXNet are TensorRT (subgraph) and Elastic Inference. Adding each of these backends required modifying the MXNet source code, deep knowledge of how MXNet works, and months of time working with the community to add custom processor-specific changes.

...

“Bring your own Accelerator” is a set of Accelerator APIs that allow MXNet to interface with any custom ML accelerator chip or ML library. It will bring a new differentiator to MXNet that other ML frameworks lack.

The main problems with adding new backends to MXNet are having to add the new functionality to the MXNet code base, recompile MXNet, and upstream the changes (requiring community support/approval). The library approach we present will enable new backends to be compiled separately from the MXNet code base, without linking against all of MXNet's 3rd party dependencies (ie. TVM, NNVM, etc.). A single header file, mxnet_acc.h, will define the APIs between MXNet and accelerator libraries.

The accelerator library will be loaded dynamically in the MXNet backend via dlopen, and the APIs will be located in the library using dlsym (standard POSIX functions from dlfcn.h). Similar functions exist on Windows (LoadLibrary and GetProcAddress). We will use C types/structs to eliminate compiler version/compatibility issues. This removes the requirement for new backends to be compiled or linked against MXNet, or even to use the same compiler.
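
A minimal sketch of that loading flow is shown below (the function-pointer typedefs, the error handling, and the load_acc_library name are assumptions for illustration, not part of the proposed API):

#include <dlfcn.h>
#include <stdio.h>

/* Hypothetical function-pointer types matching the C entry points
   described later in this document (mxnet_acc.h). */
typedef int  (*initialize_t)(int version);
typedef void (*getAccName_t)(char *name);

int load_acc_library(const char *path, int mxnet_version) {
  /* Load the accelerator library at runtime; no link-time dependency on it. */
  void *handle = dlopen(path, RTLD_LAZY | RTLD_LOCAL);
  if (!handle) {
    fprintf(stderr, "dlopen failed: %s\n", dlerror());
    return -1;
  }

  /* Locate the required entry points by name. */
  initialize_t init = (initialize_t)dlsym(handle, "initialize");
  getAccName_t getName = (getAccName_t)dlsym(handle, "getAccName");
  if (!init || !getName) {
    fprintf(stderr, "missing required symbol: %s\n", dlerror());
    dlclose(handle);
    return -1;
  }

  /* Give the library a chance to reject an incompatible MXNet version. */
  if (init(mxnet_version) != 0) {
    dlclose(handle);
    return -1;
  }

  char name[64];
  getName(name);
  printf("loaded accelerator '%s'\n", name);
  return 0;
}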

In terms of operator coverage, we cannot expect that an accelerator supports every operator that MXNet has. Instead we will follow the same subgraphing/partitioning scheme that MXNet is already using, where the CPU context will be used for any operators not supported by the accelerator.

...

  • As a library with a set of APIs to execute individual operators (ie. cuDNN, MKLDNN). We'll call this the imperative execution mode.
  • As a library that pre-processes the whole graph first and then executes it via a LoadModel/Infer type of API (ie. TensorRT, nGraph, TVM/Neo, EIA). We'll call this the symbolic execution mode.

The core foundation of operator execution in MXNet is the imperative mode. Symbolic mode is already implemented in MXNet by executing individual operators imperatively. For accelerators that do not directly support imperative mode, it can be emulated by creating graphs with a single operator inside and executing each one symbolically. In this proposal, we will focus on the symbolic mode.

User experience for ML/DL scientists:

...

import mxnet as mx
from collections import namedtuple

#load accelerator library, returns a context with device id 0
ctx = mx.load_acc("/path/to/libmyacc.so")

#after loading library, accel context can also be created by
ctx = mx.acc()
ctx = mx.acc(0)

#can also list the available accelerators just like
#mx.test_utils.list_gpus(), returns [0, 1, ...]
ctx_list = []
acc_list = mx.test_utils.list_acc(mx.acc())
for i in acc_list:
    ctx_list.append(mx.acc(i))

#bind model
sym, arg_params, aux_params = mx.model.load_checkpoint(NAME, EPOCH)
mod = mx.mod.Module(symbol=sym, context=ctx)
mod.bind(data_shapes=[('data', (1,3,224,224))], label_shapes=mod._label_shapes)
mod.set_params(arg_params, aux_params, allow_missing=True)

#forward pass
mx_img = mx.nd.array(IMG, ctx=ctx)
Batch = namedtuple('Batch', ['data'])
data = Batch([mx_img])
mod.forward(data)

...

Accelerators will also introduce capabilities that are new to MXNet, such as compiling and producing accelerator-specific binaries. Users can then reuse these pre-compiled binaries and avoid recompiling when re-running inference at a later time.

Loading Accelerator Libraries

...

MXNet can bundle libraries with its installation (pip, jar, etc.) and can find those libraries during the init process (ie. import mxnet). This will create a better user experience that “just works” for specific use-cases like EIA or FPGA (F1) instances.

Environment Variable

Users can point to a directory of accelerator libraries by setting the MXNET_ACC_LIBRARIES environment variable. This will make it easier for users to generalize their MXNet code by removing environment-specific paths. This variable will be checked during MXNet's initialization process.
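
A minimal sketch of how that check could work during initialization is shown below (the directory scan, the ".so" filter, and the load_acc_libraries_from_env name are assumptions for illustration; load_acc_library refers to the loading sketch above):

#include <dirent.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

int load_acc_library(const char *path, int mxnet_version);  /* from the earlier sketch */

/* Scan the directory named by MXNET_ACC_LIBRARIES and hand every
   shared object found there to the dlopen-based loader sketched earlier. */
void load_acc_libraries_from_env(int mxnet_version) {
  const char *dir_path = getenv("MXNET_ACC_LIBRARIES");
  if (!dir_path) return;  /* variable not set, nothing to do */

  DIR *dir = opendir(dir_path);
  if (!dir) return;

  struct dirent *entry;
  while ((entry = readdir(dir)) != NULL) {
    const char *ext = strrchr(entry->d_name, '.');
    if (ext && strcmp(ext, ".so") == 0) {
      char full_path[4096];
      snprintf(full_path, sizeof(full_path), "%s/%s", dir_path, entry->d_name);
      load_acc_library(full_path, mxnet_version);
    }
  }
  closedir(dir);
}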

...

The main APIs that will be defined in mxnet_acc.h are categorized and described below. These APIs use only C (no C++) to avoid potential problems with using different compilers/STL/ABI.

  • Accelerator Identification
    • GetAccName - returns 3 letter name for the accelerator (ie. returns “eia” for mx.eia() context)
    • void getAccName(char *name);
    • GetNumAcc - returns number of accelerators in the system
    • int getNumAcc();
    • Initialize - MXNet calls this function when library is loaded, passes MXNet version to the library. This is the opportunity for the library to return an error if it cannot be used with a specific version of MXNet.
    • int initialize(int version);
  • NDArray format
    • We need a format to standardize on that does not require any other submodule dependencies between MXNet and the accelerator library. This is similar to how EIA is already implemented with MXNet: see code here
    • struct MXTensor - interface for ndarrays with accelerator library
    • struct MXTensor
      {
        void *data;
        uint64_t *shape;     // array of dimension sizes
        uint64_t shape_len;  // number of dimensions
        TensorDType dtype;
        int32_t isSparse;
        char *layout;        // ie. NCHW
      };

      struct SparseTensor
      { // CSR format
        void *data;          // flattened data array
        uint64_t *indices;   // column index for each element
        uint64_t *indptr;    // offset into data for each row
      };

      enum TensorDType
      {
        kFloat32 = 0,
        kFloat64 = 1,
        kFloat16 = 2,
        kUint8 = 3,
        kInt32 = 4,
        kInt8 = 5,
        kInt64 = 6,
      };
  • Storage & Memory management
    • alloc, free, releaseAll - MXNet memory management functions for array allocation
    • void* mx_alloc(size_t size);
      void mx_free(void* ptr);
      void mx_releaseAll();
    • CopyToAcc - Copies an array on CPU to local acc memory. src is a tensor in CPU memory, dst is a tensor allocated with the alloc function above on the acc.
    • int copyToAcc(MXTensor *dst, const MXTensor *src);
    • CopyFromAcc - Copies an array in local acc memory back to CPU. src is a tensor in local acc memory, dst is a tensor in CPU memory.
    • int copyFromAcc(MXTensor *dst, const MXTensor *src);
  • Execution Mode Support
    • SupportsImperative - returns true if AccGetExecutor can be called to execute in imperative mode and if direct data allocation (alloc, free, and releaseAll APIs) is supported by accelerator library. Otherwise loadModel/Infer will be used.
    • int supportsImperative();
  • Imperative Execution
    • AccGetExecutor - Pass in operator name and inputs/outputs/attrs, returns function pointer that will be called from the engine later as it begins execution. Returns nullptr (0) if operator is not supported.
    • void* accGetExecutor(const char *op_name,
      const MXTensor *inputs,
      const MXTensor *outputs,
      int num_in, int num_out,
      const char *attr_keys[], const char *attr_vals[],
      int num_attrs);
  • Symbolic Execution
    • SupportedOps - pass in a string json of the graph, returns a list of IDs of nodes/ops that can run on the accelerator. The json graph must be annotated with shapes/dtypes, so this API must be called after MXNet does shape/dtype propagation in order to provide the data types/sizes for each operator. Some accelerators will only support certain operators with certain data size limits, or only for certain data types, so this info is needed to determine whether an accelerator can support a particular op. (A minimal library-side sketch of these symbolic-mode entry points is shown after this list.)
    • void supportedOps(const char *json,
      const char *data_names[],
      const MXTensor *data,
      const int num_data,
      int *ids);
    • LoadModel - Pass in an ID, a string json of the graph, and a map of input data names to tensor data. This json graph is probably not the same graph passed to supportedOps above, since MXNet will perform graph partitioning based on the supported ops of the accelerator. dev_id is the ID of the accelerator in the system.
    • JSON data nodes contain:
    • dtype
    • shape
    • weight/input
    • Will also identify which inputs are weights/params versus input data.
    • int loadModel(const char *model_id,
      const char *json,
      const char *data_names[],
      const MXTensor *data,
      int num_data,
      int dev_id);
    • UnloadModel - Pass in an ID for a model loaded with LoadModel, tells accelerator library to free up any memory used for previously loaded model.
    • void unloadModel(const char *model_id);
    • Infer - pass in an ID for a model loaded with LoadModel and a map of input data names to tensor data for the data that has changed. Returns a map of data names to output tensor data. This is a blocking call.
    • int infer(const char *model_id,
      const char *in_names[], const char *out_names[],
      const MXTensor *in_data, MXTensor *out_data,
      int num_in, int num_out);
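
A minimal sketch of what the library side of these entry points could look like is shown below. The model table, its size, the version threshold, and the "acc" name are assumptions for illustration; a real accelerator library would parse the annotated json and drive its own runtime in supportedOps/loadModel/infer.

#include <string.h>
#include <stdio.h>
#include "mxnet_acc.h"  /* declares MXTensor and the API signatures above */

#define MAX_MODELS 16

/* Hypothetical per-model state kept by the accelerator library. */
typedef struct {
  char id[64];
  int  dev_id;
  int  in_use;
} AccModel;

static AccModel g_models[MAX_MODELS];

int initialize(int version) {
  /* Reject MXNet versions this library was not built for (assumed threshold). */
  return (version >= 10500) ? 0 : -1;
}

void getAccName(char *name) { strcpy(name, "acc"); }  /* 3 letter name */
int  getNumAcc() { return 1; }
int  supportsImperative() { return 0; }  /* symbolic mode only */

void supportedOps(const char *json, const char *data_names[],
                  const MXTensor *data, const int num_data, int *ids) {
  /* A real library would parse the shape/dtype-annotated json here and
     write the IDs of the nodes it can execute into ids. Left as a stub. */
  (void)json; (void)data_names; (void)data; (void)num_data; (void)ids;
}

int loadModel(const char *model_id, const char *json,
              const char *data_names[], const MXTensor *data,
              int num_data, int dev_id) {
  (void)json; (void)data_names; (void)data; (void)num_data;
  for (int i = 0; i < MAX_MODELS; i++) {
    if (!g_models[i].in_use) {
      snprintf(g_models[i].id, sizeof(g_models[i].id), "%s", model_id);
      g_models[i].dev_id = dev_id;
      g_models[i].in_use = 1;
      /* Compile/transfer the partitioned graph to accelerator dev_id here. */
      return 0;
    }
  }
  return -1;  /* no free model slots */
}

void unloadModel(const char *model_id) {
  for (int i = 0; i < MAX_MODELS; i++)
    if (g_models[i].in_use && strcmp(g_models[i].id, model_id) == 0)
      g_models[i].in_use = 0;
}

int infer(const char *model_id, const char *in_names[], const char *out_names[],
          const MXTensor *in_data, MXTensor *out_data,
          int num_in, int num_out) {
  (void)in_names; (void)out_names; (void)in_data; (void)out_data;
  (void)num_in; (void)num_out;
  for (int i = 0; i < MAX_MODELS; i++) {
    if (g_models[i].in_use && strcmp(g_models[i].id, model_id) == 0) {
      /* Blocking call: run the loaded model and fill out_data. */
      return 0;
    }
  }
  return -1;  /* unknown model id */
}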

...

  • Generic accelerator configuration
    • pass any keyword/arg mapping into an accelerator, and potentially get some outputs. Returns a status/error code; the number of out entries (in out_keys/out_vals) is written to num_out.
    • int configure(const char *in_keys[], char*in_vals[], int num_in,
      char *out_keys[], char *out_vals[], int *num_out);
    • called by the user via the configure function on an MXNet accelerator context in any language binding (python shown in the example below; a minimal library-side sketch follows the example):
    • ctx = mx.acc()
      status = ctx.configure(init=str(True), other='thing')
      status = ctx.configure(call='getStatus')
      drink = ctx.configure(call='getMeACoffee')
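
A minimal library-side sketch of configure is shown below (the handled keys and the assumption that out_keys/out_vals point to caller-provided buffers are illustrative only; ownership of the output strings would need to be pinned down in the final API):

#include <string.h>
#include <stdio.h>

/* Hypothetical handler for the generic configure API. */
int configure(const char *in_keys[], char *in_vals[], int num_in,
              char *out_keys[], char *out_vals[], int *num_out) {
  *num_out = 0;
  for (int i = 0; i < num_in; i++) {
    if (strcmp(in_keys[i], "call") == 0 && strcmp(in_vals[i], "getStatus") == 0) {
      /* Report a status entry back to the caller (assumes caller-provided buffers). */
      strcpy(out_keys[*num_out], "status");
      strcpy(out_vals[*num_out], "ok");
      (*num_out)++;
    } else if (strcmp(in_keys[i], "init") == 0) {
      /* Example of consuming a keyword/arg pair without producing output. */
      printf("init requested: %s\n", in_vals[i]);
    }
  }
  return 0;  /* status/error code */
}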

...